Reconstructability Analysis of Epistasis

Authors


Corresponding author: Martin Zwick, Systems Science Graduate Program, Portland State University, Portland OR 97207-0751. Tel.: (503) 725-4987; Fax: (503) 725-8489; E-mail: zwick@pdx.edu

Summary

The literature on epistasis describes various methods to detect epistatic interactions and to classify different types of epistasis. Reconstructability analysis (RA) has recently been used to detect epistasis in genomic data. This paper shows that RA offers a classification of types of epistasis at three levels of resolution (variable-based models without loops, variable-based models with loops, state-based models). These types can be defined by the simplest RA structures that model the data without information loss; a more detailed classification can be defined by the information content of multiple candidate structures. The RA classification can be augmented with structures from related graphical modeling approaches. RA can analyze epistatic interactions involving an arbitrary number of genes or SNPs and constitutes a flexible and effective methodology for genomic analysis.

Introduction

Reconstructability analysis (RA) is a modeling methodology developed in the systems community (Klir, 1976, 1985; Conant, 1981; Krippendorff, 1981, 1986; Broekstra, 1979; Cavallo, 1979; and others) based on the work of Ashby (Klir, 1986). It uses set theory or information theory to assess models whose structures are defined by graph theory. These hypergraph structures specify which relationships between variables satisfactorily model the data, and allow one to posit relationships that are not merely dyadic (two-way), but of arbitrary ordinality (triadic, tetradic, etc.). In the set-theoretic version of RA (SRA), the input data are a set theoretic relation, that is, a subset of the Cartesian product of the sets of values of the variables. In the information-theoretic version of RA (IRA), the input data are a frequency or probability distribution. In both, a model is the relation or distribution—henceforth the word “relation” will be used for either—that maximizes entropy subject to the constraints of the model's structure. IRA partially overlaps log-linear (LL) methods and logistic regression (LR) (Bishop et al., 1978; Knoke & Burke, 1980), and Bayesian networks and other graphical models (Lauritzen, 1996). In the areas of overlap, RA and these other methods are typically equivalent. For example, in many contexts the maximum entropy solutions of RA give identical results to the maximum likelihood solutions of other methods.

IRA, given a frequency distribution, is an information-theoretic approach to statistical multivariate analysis, but it can alternatively be given a probability distribution, with no sample size, for which the analysis is nonstatistical. Set-theoretic mappings can also be treated probabilistically (and nonstatistically) with IRA. In its “k-systems” version (Jones, 1985), IRA can analyze continuous functions of nominal variables by rescaling the functions and treating them as probability distributions. IRA also has a Fourier version (Zwick, 2004b), which, in conjunction with the k-systems approach, resembles regression. The possibility of additional versions of RA is also implicit in generalized information theory (Klir, 2005), which includes fuzzy distributions.

From the input data one can generate projections; for example, a relation ABZ generates projections AB, AZ, BZ, A, B, and Z. Projection drops variables by doing a logical “or” (in SRA) or a summation (in IRA) over the values of the projected variables. If data are decomposable without information loss into a set of projections (lower-dimensional distributions or set-theoretic relations) represented by a hypergraph, then the hypergraph—more precisely, the projections it indicates—determine a calculated distribution or relation that satisfactorily models the observed data. The possible hypergraphs define a lattice of structures, and a major concern of RA is how best to search this lattice for good models. Here, a “lattice” is a partially ordered graph that has a single upper node (structure) where all variables are included in one relation and a single lower node (structure) where variables or sets of variables are distributed into separate relations, that is, are independent of one another; this is explained more fully later in section “Methods”—Lattice of Structures. The RA lattice can also be augmented by graphical models related to but not encompassed by the RA formalism. By contrast, other methods (e.g., LR) that are similar or possibly equivalent to RA in the estimation of individual models often do not explicitly articulate this lattice of possible models or provide heuristics for searching it. When applied to genomic data, the RA lattice offers a taxonomy of epistasis types, where a type is defined coarsely by the simplest structure that fits the data without loss or defined in more detail by the vector of information values for different decompositions.
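As a minimal sketch of projection (not the software used in this paper), the fragment below stores a relation as a dictionary or set keyed by (A, B, Z) states. The function names `project_ira` and `project_sra` and the toy probabilities are assumptions made only for illustration.

```python
def project_ira(dist, keep):
    """IRA projection: sum probabilities over the variables not in `keep`."""
    out = {}
    for state, p in dist.items():
        key = tuple(state[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def project_sra(relation, keep):
    """SRA projection: a projected state is present if any full state yields it
    (a logical "or" over the dropped variables)."""
    return {tuple(state[i] for i in keep) for state in relation}


# Hypothetical toy data over (A, B, Z); positions 0, 1, 2 index the variables.
abz = {(0, 0, 0): 0.4, (0, 1, 1): 0.1, (1, 0, 1): 0.2, (1, 1, 1): 0.3}

az = project_ira(abz, keep=(0, 2))           # AZ projection
z = project_ira(abz, keep=(2,))              # Z projection
ab_rel = project_sra(set(abz), keep=(0, 1))  # AB projection of the relation
```

The same two functions yield every projection listed above (AB, AZ, BZ, A, B, Z) by changing `keep`.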

This paper amplifies the previous use of RA (Shervais et al., 2010) to detect epistasis in genomic data. This amplification consists of (i) a more complete description of RA methodology than offered in that earlier study, (ii) the use of RA to define a taxonomy of epistasis types, and (iii) new extensions of RA methodology. Because the focus of this paper is on RA methodology itself, no attempt is made to provide definitive expositions of the similarities and differences between RA and other methods (this is the subject of ongoing work); nonetheless, the relationships between RA and other methods are briefly considered in the final “Discussion” section.

Methods

The following is a compressed explanation of RA; for additional details, see the overview articles of Zwick (2001, 2004a).

Lattice of Structures

Consider two genes or SNPs, A and B, and a disease state, Z. We can think about this simple three variable ABZ relation in two ways: (1) we can regard A and B as inputs (independent variables) and Z as an output (a dependent variable), thus defining a “directed” system; or (2) we can abstain from making an input/output distinction among the variables, thus defining a “neutral” system. ABZ is the data, and it has projections (subrelations) AB, AZ, BZ, A, B, and Z. A structure is a nonredundant set of projections used to model (compress) data. For a three variable system, the RA lattice of structures is shown in Figure 1 for a neutral as well as for a directed system.

Figure 1.

Lattice of Specific Structures for a 2 input, 1 output system (The full set of structures is the lattice for neutral systems; the bold structures constitute the lattice for directed systems with inputs A & B and output Z. The three structures that model epistasis are boxed. The order of variables in a relation is arbitrary, e.g., AB = BA, and the order of relations in a structure is also arbitrary, e.g., AB:AZ:BZ = AZ:AB:BZ.).

The top structure is the data itself, the “saturated” model. AB:Z and A:B:Z are alternative bottom (independence) structures for the directed and neutral lattices, respectively. The colon, “:” means “independent of.” AB:Z, the bottom structure for the directed system lattice, means that Z is independent of AB; the AB relation allows for the possibility that A and B are associated. A:B:Z, the bottom structure for the neutral system lattice, means that all three variables are independent of one another. (In LL notation, one might write {AB}{Z} and {A}{B}{Z} instead of using colons.) The choice of bottom structure reflects the essential difference between the directed and neutral system lattices. The directed system lattice only depicts possible relations between the input variables and the output. It ignores associations between the inputs by allowing for all possible associations, and does so by including in every structure in the lattice one relation involving all the inputs. This allows the statistical testing of model differences that involve only input–output associations, and it insures that all models in the lattice are hierarchically related to its bottom model, AB:Z. By contrast, the neutral system lattice includes all structures hierarchically related to the bottom model, A:B:Z, where all variables are independent of one another, so this lattice considers not only input–output relations but also the presence or absence of relations among inputs. From another point of view, in the neutral system lattice every variable takes its turn as an output.

Structure AB:AZ allows for associations between A and B and between A and Z; using this structure to model data asserts that A is a predictor of Z; similarly for AB:BZ. Because in epistasis A and B are inputs and Z is an output, this suggests using the directed system lattice, but one structure in the neutral lattice, AZ:BZ, normally not considered for directed systems, is also of interest; this is commonly known as a “Naïve Bayes” model. In general, structures of the form PQ:QR mean that the PQ relation is independent of the QR relation. Because these relations overlap in Q, they are not completely independent; what the structure really says is that P and R are “conditionally independent” of one another, conditioned on Q. PQ:QR does not imply anything about whether P and R are nonconditionally independent of one another. For example, A and B might be independent of one another in data ABZ, but associated with one another in structure AZ:BZ. However, if A and B are independent of one another in data ABZ, they remain so in structure AB:AZ:BZ. Relations present in a structure indicate which relations in the data are imposed on and thus satisfied by the model.

Of the nine structures in Figure 1, one, namely AB:AZ:BZ, has a loop (is cyclic); to illustrate, this structure and the ones above and below it are shown in Figure 2. A structure has a loop if something remains after (i) removing variables that occur in only one relation, (ii) removing redundant relations (those contained in another relation), and iterating (i) and (ii) (Krippendorff, 1986). Structures with loops have greater computational requirements than those without loops; they also cannot be interpreted in terms of conditional independence. Structures are characterized by their complexity. For IRA this means degrees of freedom, which are discussed below; for SRA, complexity is a more complicated notion that is beyond the scope of this paper.
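The loop test just described can be sketched as follows. This is an illustrative reading of the two reduction steps, assuming single-letter variable names and treating a "redundant" relation as one that is empty or contained in another relation; `has_loop` is a hypothetical name.

```python
def has_loop(structure):
    """Return True if a structure such as 'AB:AZ:BZ' has a loop."""
    relations = [set(rel) for rel in structure.split(":")]
    changed = True
    while changed:
        changed = False
        # (i) remove variables that occur in only one relation
        for rel in relations:
            unique = {v for v in rel
                      if sum(v in other for other in relations) == 1}
            if unique:
                rel -= unique
                changed = True
        # (ii) remove empty relations and relations contained in another one
        kept = []
        for i, rel in enumerate(relations):
            redundant = not rel or any(
                rel < other or (rel == other and j < i)
                for j, other in enumerate(relations)
                if j != i
            )
            if redundant:
                changed = True
            else:
                kept.append(rel)
        relations = kept
    # Anything left after the reductions stop making progress is a loop.
    return bool(relations)


for s in ("ABZ", "AB:AZ:BZ", "AB:BZ", "AZ:BZ", "AB:Z"):
    print(s, has_loop(s))
```

Among the three-variable structures, only AB:AZ:BZ survives the reduction, which agrees with the statement that it is the single cyclic structure in Figure 1.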

Figure 2.

Examples of structures with and without loops (AB:AZ:BZ, which has a loop, is shown with ABZ above it and AB:BZ below it, which do not. Lines are variables and boxes are relations.).

The three structures at level 2 and level 3 just permute the variables, so there are five general structures (where permutations are not distinguished) and nine specific structures (where permutations are distinguished), of which five (shown in bold in Fig. 1) constitute the directed system lattice. Table 1 indicates how rapidly the numbers of structures scale up as the number of variables increases. The fraction of structures that have loops also markedly increases with the number of variables. For more than a handful of variables, exhaustive search of all structures becomes impossible, and one of the distinguishing features of RA is its explicit consideration of the problem of searching large lattices.

Table 1.  Numbers of structures.
Total number of variables                              3      4       5           6
Neutral: # General structures                          5     20     180      16,143
Neutral: # Specific structures                         9    114   6,894   7,785,062
Directed, 1 output: # Specific structures              5     19     167       7,580
Directed, 1 output, no loops: # Specific structures    4      8      16          32

“Epistasis” normally connotes a joint association of two or more input variables with an output variable (conceivably more than one output variable, but only one is considered here) that cannot be represented as the result of independent associations of the individual inputs with the output. For three variables, three RA structures of Figure 1, namely ABZ, AB:AZ:BZ, and AZ:BZ, can model associations of A and B with Z that do not result from independent AZ and BZ associations. ABZ always meets this criterion, regardless of how association is represented. AB:AZ:BZ meets this criterion for entropy reductions and model penetrance. AZ:BZ meets this criterion for these two measures if and only if A and B are not mutually independent in the calculated distribution for this model; more about this below. For four variables, there are 17 structures in which the output depends on all three inputs (Table 2). Eight of the nine four-variable directed lattice structures have loops (the data does not); one of the eight additional neutral lattice structures has a loop.

Table 2.  Epistatic structures for a three input, one output system. (The 17 structures are listed according to their level of decomposition; the actual lattice, i.e., the parent–child relationships, is not shown. Variable permutations are illustrated by the permutations of ABC:ABZ:BCZ, which are ABC:ABZ:ACZ and ABC:ACZ:BCZ. The additional structures from the neutral lattice are naïve-Bayes-like.)
Directed lattice structures (9 of 19)    Additional structures from neutral lattice (8)
ABCZ
ABC:ABZ:BCZ:ACZ
ABC:ABZ:BCZ + two permutations           ABZ:BCZ:ACZ
ABC:ABZ:CZ + two permutations
ABC:AZ:BZ:CZ                             ABZ:ACZ + two permutations
                                         ABZ:CZ + two permutations
                                         AZ:BZ:CZ

Table 2 illustrates the fact that RA can analyze epistasis involving an arbitrary number of inputs, but the discussion below is restricted to three variable structures, which suffices to explain the method. The next section discusses how structures are used to model data, after which subsequent sections give examples of the three epistasis types shown in Figure 1. The RA lattice is then augmented with structures from related graphical modeling formalisms, and a refined version of RA that is state-based is introduced.

Analysis of Data

One of the ways that RA differs from other methods that are similar to it is the variety of types of data that it can analyze. In SRA, the data are a set-theoretic relation or mapping. In IRA, it is either a probability distribution, a frequency distribution, or a function. A set-theoretic relation is a set of observed nominal (Ai, Bj, Zk) states which is a subset of all possible (Ai, Bj, Zk); in a mapping, A ⊗ B → Z, for every (Ai, Bj) there is only one Zk. An IRA distribution allows all (Ai, Bj, Zk) states to occur but assigns a probability or frequency (possibly 0) to each. Having no sample size, a probability distribution, p(Ai, Bj, Zk) is nonstatistical, while a frequency distribution f(Ai, Bj, Zk) = N p(Ai, Bj, Zk) is statistical. Instead of joint probabilities, p(Ai, Bj, Zk), the data may be given in terms of conditional probabilities, p(Zk|Ai, Bj). When Zk is a state of disease, this conditional probability is “penetrance,” and given such data, IRA analyzes a joint distribution that assumes uniform p(Ai, Bj), that is, a distribution where all joint probabilities are equal. In k-systems IRA (Jones, 1985), the data are a function A ⊗ B → Z, where A and B are discrete, but Z is continuous; this variant of RA does a linear transformation of Z so it can be treated as a probability.

Applied to particular data, a structure yields a model, m, which has a calculated relation ABZm, whose entropy, H, is maximum subject to constraints of the projections included in the structure. The subscript, m, indicates a structure in the lattice applied to some data; unsubscripted terms refer to the data itself, the “saturated” model. The entropy is the Hartley entropy for SRA or the Shannon entropy for IRA:

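$$H(ABZ_m) = \log_2 |ABZ_m| \quad \text{(SRA)}, \qquad H(ABZ_m) = -\sum p(ABZ_m)\, \log_2 p(ABZ_m) \quad \text{(IRA)}$$

where $|ABZ_m|$ is the cardinality of the calculated set-theoretic relation, $p(ABZ_m)$ is the calculated probability distribution for model m (which could be written equivalently as $p_m(ABZ)$), and the sum runs over all $(A_i, B_j, Z_k)$ states.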

For example, $ABZ_{AB:AZ:BZ}$ is the relation that has maximum entropy (in IRA, the distribution that is maximally uniform; in SRA, the relation with the maximum number of $(A_i, B_j, Z_k)$ states) while having its AB, AZ, and BZ projections agree with the AB, AZ, and BZ projections of the data. Given data, when one speaks of the AB:AZ:BZ model, one means the calculated relation for the AB:AZ:BZ structure. Another way of thinking about the AB:AZ:BZ model is that it indicates which projections of the data we know. (This is the same in LL modeling.) The number of independent parameters needed to specify these projections is the degrees of freedom (df) of the model. The parameters are obtained from the data directly and are not fitted. In IRA, for the AB:AZ:BZ model, they are the smallest subset of $p(A_i, B_j)$, $p(A_i, Z_k)$, and $p(B_j, Z_k)$ values sufficient to specify these three projections. The computation of $ABZ_m$ is algebraic if the model does not have loops, but requires iteration if it does; time and space requirements vary with sample size in the former situation, but pose a greater burden in the latter.

Generating and assessing an RA model involves three steps: (i) projection, in which projections of the data specified by a structure are obtained, for example, AB, AZ, BZ are obtained from ABZ; (ii) composition, in which the calculated model relation is generated by maximizing entropy subject to the projection constraints, for example, $ABZ_{AB:AZ:BZ}$ is obtained from AB, AZ, and BZ; and (iii) evaluation, in which the calculated relation is assessed by being compared to the data, for example, $ABZ_{AB:AZ:BZ}$ is compared to ABZ.
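The three steps can be sketched for IRA as below. Composition here uses iterative proportional fitting started from the uniform distribution, a standard way to obtain the maximum-entropy distribution with given projections; it is offered as an assumption-laden illustration, not as the algorithm implemented in OCCAM. The data values, the function names, and the fixed number of sweeps are all hypothetical.

```python
from itertools import product
from math import log2


def project(dist, keep):
    """Step (i): sum probabilities over the variables not in `keep`."""
    out = {}
    for state, p in dist.items():
        key = tuple(state[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def compose(data, structure, cards, sweeps=200):
    """Step (ii): iterative proportional fitting from a uniform start."""
    states = list(product(*(range(c) for c in cards)))
    model = {s: 1.0 / len(states) for s in states}
    targets = [(keep, project(data, keep)) for keep in structure]
    for _ in range(sweeps):
        for keep, target in targets:
            current = project(model, keep)
            for s in states:
                key = tuple(s[i] for i in keep)
                if current[key] > 0:
                    model[s] *= target.get(key, 0.0) / current[key]
    return model


def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def information(data, structure, independence, cards):
    """Step (iii): normalized information, 0 at the bottom and 1 at the top."""
    h_data = entropy(data)
    h_ind = entropy(compose(data, independence, cards))
    h_model = entropy(compose(data, structure, cards))
    return (h_ind - h_model) / (h_ind - h_data)


A, B, Z = 0, 1, 2
cards = (2, 2, 2)
# Hypothetical joint distribution over binary A, B, Z (sums to 1).
data = {(0, 0, 0): 0.20, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.15,
        (1, 0, 0): 0.05, (1, 0, 1): 0.20, (1, 1, 0): 0.10, (1, 1, 1): 0.15}

loop_model = [(A, B), (A, Z), (B, Z)]   # AB:AZ:BZ
bottom = [(A, B), (Z,)]                 # AB:Z, the directed-system bottom
abz_model = compose(data, loop_model, cards)
info = information(data, loop_model, bottom, cards)
print(round(info, 3))
```

For structures without loops the same function converges after a single sweep, which mirrors the remark that composition is algebraic in that case and iterative otherwise.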

In the evaluation step, model m can be characterized by Im, the normalized information that it captures, which is 0 for the bottom (independence) structure and 1 for the top (data):

$$I_m = \frac{H(ABZ_{ind}) - H(ABZ_m)}{H(ABZ_{ind}) - H(ABZ)}$$

(This is a variation on the Kullback–Leibler information distance.) $ABZ_{ind}$ is the calculated relation for the independence model, which is either A:B:Z (for neutral systems) or AB:Z (for directed systems). For directed systems, it is useful to quantify how predictable Z is if one knows A and B; this is expressed as the reduction of the entropy of Z, knowing A and B, in the calculated distribution for model m:

$$\%\Delta H_m(Z|AB) = 100\,\frac{H(Z) - H_m(Z|AB)}{H(Z)} = I_m\;\%\Delta H(Z|AB)$$

where $\%\Delta H(Z|AB)$ is the % entropy reduction in Z for the data. Reduction of entropy is the nominal data analog of variance explained. For IRA, entropy reduction is related to the model's conditional probability distribution, that is, to penetrance. For models without loops, this relation is algebraic:

image

Both the information and entropy reduction measures do not involve a sample size, so they are nonstatistical. For IRA, given a sample size, the likelihood-ratio χ2,

$$L^2_m = 2N \ln 2\,\left[H(ABZ_{ind}) - H(ABZ_m)\right] \quad \text{(entropies in bits)}$$

allows one to assess the p-value for the entropy reduction, given the difference in degrees of freedom between the model and independence. Degrees of freedom of a structure is the sum of the degrees of freedom of its relations, corrected for overlaps (Krippendorff, 1986); equivalently (the third equality in the equation that follows), it is the sum, over all relations and subrelations in the structure, of the product of the cardinalities of the variables minus one (Knoke & Burke, 1980):

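$$df(AB{:}AZ{:}BZ) = df(AB) + df(AZ) + df(BZ) - df(A) - df(B) - df(Z)$$
$$= (|A||B| - 1) + (|A||Z| - 1) + (|B||Z| - 1) - (|A| - 1) - (|B| - 1) - (|Z| - 1)$$
$$= (|A|-1)(|B|-1) + (|A|-1)(|Z|-1) + (|B|-1)(|Z|-1) + (|A|-1) + (|B|-1) + (|Z|-1)$$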
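The second characterization of degrees of freedom (the sum, over all relations and subrelations, of the product of the variable cardinalities minus one) is easy to compute directly. The sketch below assumes single-letter variable names; `degrees_of_freedom` and the three-state cardinalities are illustrative choices, not part of the original analysis.

```python
from itertools import combinations
from math import prod


def degrees_of_freedom(structure, cards):
    """df of a structure such as 'AB:AZ:BZ'; `cards` maps variable -> #states."""
    subrelations = set()
    for rel in structure.split(":"):
        # Every non-empty subset of a relation is a relation or subrelation;
        # the set removes subrelations shared by overlapping relations.
        for size in range(1, len(rel) + 1):
            subrelations.update(combinations(sorted(rel), size))
    return sum(prod(cards[v] - 1 for v in sub) for sub in subrelations)


cards = {"A": 3, "B": 3, "Z": 3}   # e.g., three genotypes, three phenotypes
for s in ("ABZ", "AB:AZ:BZ", "AZ:BZ", "AB:Z", "A:B:Z"):
    print(s, degrees_of_freedom(s, cards))
```

With three states per variable, the saturated model has 26 degrees of freedom and AB:AZ:BZ has 18, so a test of AB:AZ:BZ against the data uses a difference of 8.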

Calculating a p-value is not the only way to trade off information and complexity. One can alternatively use the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), which linearly combine these two factors. This is quite different from the way these factors are traded off in the χ2 calculation of p-values. AIC and BIC also do not require that the models being compared be hierarchically nested. The above $L^2_m$ takes the bottom (independence) model as the reference, but one could also choose the top (data) as the reference and assess the error in the model relative to the data with

$$L^2_m = 2N \ln 2\; T_m = 2N \ln 2\,\left[H(ABZ_m) - H(ABZ)\right] \quad \text{(entropies in bits)}$$

where $T_m$ is the information-theoretic transmission for the model. $T_m$ is the difference between the entropy of the model and the entropy of the data, that is, the error in the model. When m is the independence model, $T_m$ is also known as “mutual information.”

It is sometimes convenient to write equations in terms of transmissions rather than entropies. For example, the equation for $L^2$ given above for testing the difference between a model and independence can be written equivalently as

$$L^2_m = 2N \ln 2\,\left(T_{ind} - T_m\right)$$

T can be thought of not only as the error in a model but also as a measure of association between variables in the data. For example, $T_{AB:Z}$ is the error in the AB:Z model; equivalently, it is the association between AB and Z (which also means the entropy reduction in Z given the A and B inputs). $T_{AB:Z} - T_{AB:AZ}$ is the difference between the errors of the independence model, AB:Z, and the AB:AZ model, and this transmission difference equals $T_{A:Z}$, the association between A and Z, ignoring B. Transmission, like entropy, can be conditioned on variables. For example, $T_{A:Z|B}$ is the association between A and Z, conditioned on B; it equals $T_{AB:BZ}$. Assume that A is only indirectly associated with Z via B, which in turn is directly associated with Z. $T_{A:Z|B}$ will then be zero, but $T_{B:Z|A}$ will not. RA can analyze data where inputs are associated and can distinguish between direct associations with a disease variable and indirect associations due to linkage disequilibrium.
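The distinction between direct and indirect association can be illustrated numerically. In the sketch below, the joint distribution is built as a hypothetical chain in which A influences B and B influences Z (all probabilities are invented for illustration); transmissions of loopless models are computed from projection entropies.

```python
from itertools import product
from math import log2

# Hypothetical chain A -> B -> Z over binary variables.
p_a = [0.5, 0.5]
p_b_given_a = [[0.9, 0.1], [0.2, 0.8]]
p_z_given_b = [[0.8, 0.2], [0.3, 0.7]]
abz = {(a, b, z): p_a[a] * p_b_given_a[a][b] * p_z_given_b[b][z]
       for a, b, z in product(range(2), repeat=3)}

A, B, Z = 0, 1, 2


def h(*keep):
    """Shannon entropy (bits) of the projection of abz onto `keep`."""
    marg = {}
    for state, p in abz.items():
        key = tuple(state[i] for i in keep)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)


t_ab_z = h(A, B) + h(Z) - h(A, B, Z)                    # error of AB:Z
t_ab_az = h(A, B) + h(A, Z) - h(A) - h(A, B, Z)         # error of AB:AZ = T(B:Z|A)
t_ab_bz = h(A, B) + h(B, Z) - h(B) - h(A, B, Z)         # error of AB:BZ = T(A:Z|B)
t_a_z = h(A) + h(Z) - h(A, Z)                           # T(A:Z), ignoring B

print(round(t_a_z, 4), round(t_ab_bz, 4), round(t_ab_az, 4))
```

Here A is associated with Z when B is ignored, yet the association vanishes once B is conditioned on, while the association of B with Z persists when A is conditioned on.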

For epistasis involving two inputs and an output, Figure 1 indicates three possible models: (1) ABZ, the data itself, (2) AB:AZ:BZ, and (3) AZ:BZ, the third of which is normally relevant only for neutral systems. These structures define a taxonomy of epistasis, as follows. If the data cannot be decomposed without information loss to (2), it is here called Type 1 epistasis. If it can be decomposed without loss to (2) but not to any lower structure, it is Type 2; if it can be decomposed without loss to (3) but not to a lower structure, it is Type 3. This is summarized in Table 3.

Table 3.  Observed and calculated relations in three types of epistasis.
Type 1: $ABZ \neq ABZ_{AB:AZ:BZ}$
Type 2: $ABZ = ABZ_{AB:AZ:BZ} \neq ABZ_{AZ:BZ}$
Type 3: $ABZ = ABZ_{AB:AZ:BZ} = ABZ_{AZ:BZ} \neq$ any calculated relation for a lower structure

Examples of all three types of epistasis are found in the literature. The first type is straightforward. When ABZ does not have any lossless decomposition, it inherently has a triadic relation (interaction effect) involving inputs A and B and output Z. This is epistasis in its strongest form. By contrast, AB:AZ:BZ and AZ:BZ do not have any triadic relation, but only two dyadic input–output relations; AB:AZ:BZ specifies an additional relation between the inputs.

The strength of Type 1 epistasis is quantified by $T_{AB:AZ:BZ}$. (The “Discussion” section below mentions a different information-theoretic measure that has been incorrectly used to quantify a triadic interaction.) To test for the significance of such epistasis, one computes a χ2 p-value from $L^2_{AB:AZ:BZ}$ and $\Delta df = df_{ABZ} - df_{AB:AZ:BZ}$; if the difference is significant, AB:AZ:BZ does not fit the data. The other hypotheses are similarly tested. If the data can be decomposed still further (to AB:AZ, AB:BZ, or AB:Z) without loss, then at least one of the inputs is not associated with the output.

Type 2 epistasis means that the data can be decomposed to AB:AZ:BZ but not lower. Structure AB:AZ:BZ does not actually assert nonzero associations between A and B, A and Z, and B and Z; it merely allows for such associations. Similarly, structure AZ:BZ does not actually assert a zero association between A and B, despite the absence of the AB relation. Rather, these structures indicate which projections of the data are used to generate the calculated relation, that is, which projections constrain the composition (entropy maximization) step of RA. There can exist data that must be modeled by AB:AZ:BZ because A and B are not in fact associated, where the $ABZ_{AZ:BZ}$ distribution would incorrectly show them to be associated, so the AB projection of the data is needed in the model to impose the nonexistence of an association. This is discussed further in Example #2 and also when Bayesian networks are introduced.

Although the effects of A and B on Z, expressed in terms of entropy, cannot be separated in Type 1 epistasis because of the three-way interaction effect, and in Type 2 epistasis because models with loops have no closed form algebraic solution, these effects can be separated in Type 3 epistasis, the weakest type, if A and B are mutually independent in the calculated distribution for this model. In this case, the entropy reduction in Z due to A and B together that is achieved by AZ:BZ is simply the sum of the entropy reductions due to A and B separately. The general expression is the following:

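$$\Delta H_{AZ:BZ}(Z|AB) = \left[H(Z) - H(Z|A)\right] + \left[H(Z) - H(Z|B)\right] - \left[H(A) + H(B) - H_{AZ:BZ}(AB)\right]$$

If A and B are mutually independent in the model distribution, the rightmost term in brackets drops out. Because $H(Z) - H(Z|A) = \Delta H_{AB:AZ}(Z|A)$ and $H(Z) - H(Z|B) = \Delta H_{AB:BZ}(Z|B)$, this gives $\Delta H_{AZ:BZ}(Z|AB) = \Delta H_{AB:AZ}(Z|A) + \Delta H_{AB:BZ}(Z|B)$. Using the proportionality of entropy reduction and normalized information noted earlier, when the effects of A and B on Z can be separated,

$$I_{AZ:BZ} = I_{AB:AZ} + I_{AB:BZ}$$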

The aforementioned analysis suggests an information theoretic definition of epistasis as involving entropy reduction in an output that is not the sum of the entropy reductions of the inputs. (This would be a nominal data analog of defining epistasis in terms of nonadditivity of variance reductions.) This inequality is inherently true for ABZ; its negation (the additivity of entropy reductions) cannot be derived for AB:AZ:BZ because this model has a loop; the inequality is true also for AZ:BZ if A and B are not independent in the model distribution.

Additive independence (or lack of it) for entropy reduction corresponds to multiplicative independence (or lack of it) for penetrance. For structure ABZ, p(Z|AB) cannot be written in terms of a product of p(Z|A) and p(Z|B) because this misses the triadic interaction effect. For AB:AZ:BZ, this cannot be done because there is no closed form expression for calculated probabilities in a model with loops. For AZ:BZ, however, which does not have a loop, one has

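$$p_{AZ:BZ}(Z_k|A_i B_j) = \frac{p(Z_k|A_i)\,p(Z_k|B_j)}{p(Z_k)}\;\frac{p(A_i)\,p(B_j)}{p_{AZ:BZ}(A_i B_j)}$$

This equation does not exhibit multiplicative independence of i- and j-terms (or additive independence if one takes the logarithms of both sides), but if $p_{AZ:BZ}(A_i B_j) = p(A_i)\,p(B_j)$, that is, A and B are mutually independent in the model distribution, it becomes

$$p_{AZ:BZ}(Z_k|A_i B_j) = \frac{p(Z_k|A_i)\,p(Z_k|B_j)}{p(Z_k)}$$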

which shows multiplicative independence (or, taking logarithms, additive independence). So AZ:BZ might or might not exhibit epistasis, in the strict sense of the term, depending on the presence or absence of association between the inputs. If IRA data are given as penetrance values, that is, as a conditional and not a joint distribution, the absence of association between inputs is in effect assumed by default; this is true also for SRA data in the form of a mapping. This means that in these cases, AZ:BZ does not manifest epistasis, in the strict sense of the term, as the discussion of Example 1 below indicates.

Results

Calculations for the examples below were done by the IRA program, OCCAM (Willett & Zwick, 2004; Fusion et al., 2010; Zwick, 2010), and by separate SRA and state-based IRA (Johnson, 2005) programs also developed at Portland State University.

Example #1: Type 3 Epistasis

Table 4 presents data from Table 1 of Cordell (2002); B and G there are here renamed A and B. The data are a genotype-to-phenotype mapping, that is, not a frequency or probability distribution. Mappings are naturally analyzed with SRA, but IRA can be used instead by assigning equal probability to all (Ai, Bj, Zk) states that occur and zero probability to states that do not occur. For this data, IRA and SRA decompose to the same structure; in other data, IRA sometimes decomposes data further than SRA.

Table 4.  Example #1: Data (Cordell (2002), Table 1: “Example of phenotypes (e.g., hair colour) obtained from different genotypes at two loci interacting epistatically, under Bateson (1909) definition of epistasis.” Coding a/a, a/A, and A/A as states 1, 2, and 3, and similarly for B, and coding phenotypes White, Grey, and Black as 1, 2, 3 gives the A ⊗ B → Z mapping on the right, where genotype AB maps onto phenotype Z.)
Genotype      Genotype at locus B                 Coded mapping (entries are Z)
at locus A    b/b      b/B      B/B               B=1    B=2    B=3
a/a           White    Grey     Grey       A=1    1      2      2
a/A           Black    Grey     Grey       A=2    3      2      2
A/A           Black    Grey     Grey       A=3    3      2      2

IRA gives the results at the top of Table 5. Because RA is here analyzing a probability distribution based on a mapping, there are no p-values. The simplest structure that fits the data is AZ:BZ. Example #1 thus illustrates epistasis of Type 3. The SRA analysis of Example #1 is shown at the bottom of Table 5. AZ:BZ is again identified as the simplest model that fits the data, but other models do not have the same information values as in the IRA analysis.

Table 5.  Example #1: IRA & SRA results (The model is followed by its normalized information content. For IRA, the information about Z in A and B in model AZ:BZ is the sum of the information about Z in A in model AB:AZ and the information about Z in B in model AB:BZ.).
IRA
ABZ 1.00
AB:AZ:BZ 1.00
AB:AZ 0.25      AB:BZ 0.75      AZ:BZ 1.00
AB:Z 0.00       AZ:B 0.25       BZ:A 0.75
A:B:Z 0.00
SRA
ABZ 1.00
AB:AZ:BZ 1.00
AB:AZ 0.37      AB:BZ 0.74      AZ:BZ 1.00
AB:Z 0.00       AZ:B 0.37       BZ:A 0.74
A:B:Z 0.00

In the IRA results in Table 5, $I_{AZ:BZ} = I_{AB:AZ} + I_{AB:BZ}$, which means that entropy reductions due to A and B are additive, so these data do not exhibit epistasis in the strict sense of the term. The example is, however, included in this paper as “Type 3 epistasis” because Cordell (2002) mentions it as an example of Bateson's definition of epistasis.
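The IRA values for Example #1 can be checked with a few lines of code. The sketch below is an independent illustration, not the OCCAM computation: it assigns equal probability to the nine occurring states of the Table 4 mapping, takes AB:Z as the independence model, and uses the algebraic entropy expressions available for loopless structures. All names are hypothetical.

```python
from math import log2

# Genotype-to-phenotype mapping of Table 4, (A, B) -> Z, states coded 1..3.
mapping = {(1, 1): 1, (1, 2): 2, (1, 3): 2,
           (2, 1): 3, (2, 2): 2, (2, 3): 2,
           (3, 1): 3, (3, 2): 2, (3, 3): 2}
# IRA treatment of a mapping: equal probability for each state that occurs.
data = {(a, b, z): 1 / 9 for (a, b), z in mapping.items()}
A, B, Z = 0, 1, 2


def project(dist, keep):
    out = {}
    for state, p in dist.items():
        key = tuple(state[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def h(*keep):
    """Shannon entropy (bits) of the projection of the data onto `keep`."""
    return -sum(p * log2(p) for p in project(data, keep).values() if p > 0)


# Entropies of loopless models follow algebraically from projection entropies.
h_data = h(A, B, Z)
h_ind = h(A, B) + h(Z)                       # AB:Z
h_model = {
    "AB:AZ": h(A, B) + h(A, Z) - h(A),
    "AB:BZ": h(A, B) + h(B, Z) - h(B),
    "AZ:BZ": h(A, Z) + h(B, Z) - h(Z),
}
info = {m: (h_ind - hm) / (h_ind - h_data) for m, hm in h_model.items()}

# Calculated AZ:BZ distribution, p(AZ) p(BZ) / p(Z), compared with the data.
az, bz, z = project(data, (A, Z)), project(data, (B, Z)), project(data, (Z,))
states = [(a, b, k) for a in (1, 2, 3) for b in (1, 2, 3) for k in (1, 2, 3)]
model = {(a, b, k): az.get((a, k), 0.0) * bz.get((b, k), 0.0) / z[(k,)]
         for a, b, k in states}
max_error = max(abs(model[s] - data.get(s, 0.0)) for s in states)

print({m: round(v, 2) for m, v in info.items()}, max_error < 1e-12)
```

The computed information values agree with the IRA entries of Table 5 (0.25, 0.75, and 1.00), and the calculated AZ:BZ distribution coincides with the data, so the additivity noted above holds for this example.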

Additivity of entropy reductions is likely to be related to the distinction between interactions that are absent, that are removable, and that are essential and not removable (Wu et al., 2009). A removable interaction is one that is not significant on an odds ratio (OR) or log(OR) or some other risk scale. The example given by Wu of a removable interaction—revealed by the absence of an interaction in the F-test of the log(OR) scale—is classified by RA as Type 3 epistasis.

Example #2: Type 2 Epistasis

Cordell's (2002) second example of epistasis, shown in Table 6(a), is a penetrance table, but because the penetrance values are only 0 or 1, it could also be considered a set-theoretic genotype-to-phenotype mapping. With either interpretation, the table does not provide frequencies for the different genotypes. Assuming p(a) = p(A) = p(b) = p(B) = 0.5 and Hardy–Weinberg equilibrium between independent loci A and B, one obtains from the conditional probabilities of Table 6(a) the joint probabilities of Table 6(b), where p(ABZ) = p(A) p(B) p(Z|AB).
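The conversion from penetrance to a joint distribution can be sketched as follows, using the penetrance values of Table 6(a). The variable names are hypothetical, and the coding of Z (state 1 receives the mass p(AB) times the penetrance, state 2 the remainder) is an assumption read off the layout of Table 6(b).

```python
q = 0.5  # allele frequency assumed in the text for both loci

# Genotype frequencies under Hardy-Weinberg equilibrium, states 1, 2, 3.
genotype = {1: (1 - q) ** 2, 2: 2 * q * (1 - q), 3: q ** 2}

# Penetrance p(Z|AB) from Table 6(a), keyed by (A, B).
penetrance = {(1, 1): 0, (1, 2): 0, (1, 3): 0,
              (2, 1): 0, (2, 2): 1, (2, 3): 1,
              (3, 1): 0, (3, 2): 1, (3, 3): 1}

joint = {}
for (a, b), pen in penetrance.items():
    p_ab = genotype[a] * genotype[b]       # independent loci
    joint[(a, b, 1)] = p_ab * pen          # Z = 1
    joint[(a, b, 2)] = p_ab * (1 - pen)    # Z = 2

for a in (1, 2, 3):
    print([joint[(a, b, 1)] for b in (1, 2, 3)],
          [joint[(a, b, 2)] for b in (1, 2, 3)])
```

Changing `q`, or replacing `genotype[a] * genotype[b]` with an observed AB table, would give the joint distribution under other assumptions about genotype frequencies.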

Table 6.  Example #2: Penetrance data (a) (Cordell (2002), Table 2: “Example of a penetrance table for two loci interacting epistatically in a general sense.” a/a, a/A, A/A and b/b, b/B, B/B are recoded as 1, 2, 3.) (b) Table converted to a joint ABZ distribution (Left: ABZ. Right: its AB projection.).
(a)
        B=1      B=2      B=3
A=1     0        0        0
A=2     0        1        1
A=3     0        1        1
(b)
        Z=1                         Z=2                         AB projection
        B=1      B=2      B=3       B=1      B=2      B=3       B=1      B=2      B=3
A=1     0.00     0.00     0.00      0.0625   0.125    0.0625    0.0625   0.125    0.0625
A=2     0.00     0.25     0.125     0.125    0.00     0.00      0.125    0.25     0.125
A=3     0.00     0.125    0.0625    0.0625   0.00     0.00      0.0625   0.125    0.0625

IRA results on Table 6 are shown at the top of Table 7. The simplest structure that fits the data with no information loss is AB:AZ:BZ. This illustrates epistasis of Type 2. SRA applied to Table 6 gives different and inferior results, shown at the bottom of Table 7. IRA decomposes the data of Example #2 further than SRA. IRA analysis of the conditional (penetrance) distribution of Table 6 (as opposed to the joint distribution) gives results very similar but not identical to Table 7 (top).

Table 7.  Example #2: IRA & SRA results (The structure is followed by its normalized information content.).
IRA
ABZ 1.00
AB:AZ:BZ 1.00
AB:AZ 0.382     AB:BZ 0.382     AZ:BZ 0.764
AB:Z 0.00       AZ:B 0.382      BZ:A 0.382
A:B:Z 0.00
SRA
ABZ 1.00
AB:AZ:BZ 0.47
AB:AZ 0.26      AB:BZ 0.26      AZ:BZ 0.47
AB:Z 0.00       AZ:B 0.26       BZ:A 0.26
A:B:Z 0.00

Because we assumed that A and B are independent in constructing Table 6(b), Table 7 shows that $I_{AZ:BZ} = I_{AB:AZ} + I_{AB:BZ}$, as in Example #1. However, in this case, AZ:BZ does not fit the data. We need the AB relation in AB:AZ:BZ to guarantee that AB exhibits no association. If we had modeled the data in Table 6 with AZ:BZ, we would have obtained the $ABZ_{AZ:BZ}$ distribution shown in Table 8. Its AB projection is shown there on the right, and it differs from the AB projection of the data shown on the right of Table 6(b). This illustrates the point made earlier that if data are accurately fit by AB:AZ:BZ but not by AZ:BZ, this does not mean that A and B are associated; in the present situation, it means the opposite: that A and B are not associated, and the AB relation in the model is needed to assure this.
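The spurious AB association produced by AZ:BZ can be reproduced directly. The sketch below rebuilds the joint distribution of Example #2 from the penetrance values and genotype frequencies stated above, composes the loopless AZ:BZ model algebraically as p(AZ) p(BZ) / p(Z), and compares AB projections. Names are hypothetical, and Z = 1 is assumed to be the state that receives the penetrance mass.

```python
genotype = {1: 0.25, 2: 0.5, 3: 0.25}
penetrance = {(1, 1): 0, (1, 2): 0, (1, 3): 0,
              (2, 1): 0, (2, 2): 1, (2, 3): 1,
              (3, 1): 0, (3, 2): 1, (3, 3): 1}

# Joint ABZ distribution of the data.
data = {}
for (a, b), pen in penetrance.items():
    data[(a, b, 1)] = genotype[a] * genotype[b] * pen
    data[(a, b, 2)] = genotype[a] * genotype[b] * (1 - pen)


def project(dist, keep):
    out = {}
    for state, p in dist.items():
        key = tuple(state[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


# Calculated distribution for AZ:BZ (no loop, so composition is algebraic).
az, bz, z = project(data, (0, 2)), project(data, (1, 2)), project(data, (2,))
model = {(a, b, k): az[(a, k)] * bz[(b, k)] / z[(k,)] for (a, b, k) in data}

ab_data = project(data, (0, 1))
ab_model = project(model, (0, 1))
for a in (1, 2, 3):
    print([round(ab_data[(a, b)], 4) for b in (1, 2, 3)],
          [round(ab_model[(a, b)], 4) for b in (1, 2, 3)])
```

In the data the AB projection is the product of the genotype frequencies, whereas the AZ:BZ model assigns, for example, about 0.1429 rather than 0.0625 to the state A = 1, B = 1, which is the discrepancy that the AB relation in AB:AZ:BZ removes.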

Table 8.  Example #2: ABZ_{AZ:BZ} probability distribution (The AB projection of ABZ_{AZ:BZ} on the right differs from the AB projection of the data.).
        Z=1                       Z=2                       AB projection
        B=1     B=2     B=3       B=1     B=2     B=3       B=1     B=2     B=3
A=1     0       0       0         0.1429  0.0714  0.0357    0.1429  0.0714  0.0357
A=2     0       0.25    0.125     0.0714  0.0357  0.0179    0.0714  0.2857  0.1429
A=3     0       0.125   0.0625    0.0357  0.0179  0.0089    0.0357  0.1429  0.0714
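
The comparison between Table 6(b) and Table 8 can be checked with a short calculation. The sketch below assumes only that the loopless model AZ:BZ has the closed form p(AZ) p(BZ) / p(Z); the variable names and the dictionary layout are illustrative choices, not part of the original analysis.

```python
from itertools import product

# Joint ABZ distribution of Table 6(b). Indices 0..2 stand for genotypes 1..3;
# index 0/1 on the last axis stands for Z = 1 / Z = 2.
Z1 = [[0.00, 0.00, 0.00], [0.00, 0.25, 0.125], [0.00, 0.125, 0.0625]]
Z2 = [[0.0625, 0.125, 0.0625], [0.125, 0.00, 0.00], [0.0625, 0.00, 0.00]]
p = {(a, b, z): (Z1, Z2)[z][a][b]
     for a, b, z in product(range(3), range(3), range(2))}

def margin(dist, axes):
    """Sum a distribution over every variable not listed in `axes`."""
    out = {}
    for cell, value in dist.items():
        key = tuple(cell[i] for i in axes)
        out[key] = out.get(key, 0.0) + value
    return out

p_az, p_bz, p_z = margin(p, (0, 2)), margin(p, (1, 2)), margin(p, (2,))

# Loopless model AZ:BZ: A and B are conditionally independent given Z.
model = {(a, b, z): p_az[a, z] * p_bz[b, z] / p_z[(z,)] for a, b, z in p}

ab_data = margin(p, (0, 1))      # AB projection of the data (right of Table 6(b))
ab_model = margin(model, (0, 1))  # AB projection of the model (right of Table 8)

for a in range(3):
    print([round(ab_model[a, b], 4) for b in range(3)])
# [0.1429, 0.0714, 0.0357]
# [0.0714, 0.2857, 0.1429]
# [0.0357, 0.1429, 0.0714]
```

The AB projection of the data factorizes into the genotype frequencies (0.25, 0.5, 0.25), whereas the AB projection of the AZ:BZ model does not; this is the association that AZ:BZ introduces and that the AB relation in AB:AZ:BZ removes.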

Cordell (2002) gives another table—Table 3—to illustrate epistasis (a heterogeneity model), but this table is equivalent to Table 2 (epistasis “in a general sense”) if states are suitably relabeled, and need not be discussed.

Example #3: Type 1 Epistasis (Synthetic Probability Data)

Example #3 comes from an RA study of epistasis (Shervais et al., 2010). The simulated penetrance data of Model 5 from Table 1 of that study are shown here in Table 9 (left). Because the penetrance values are continuous rather than only 0 or 1, these data can be analyzed by IRA but not by SRA. This penetrance table was constructed so that there is no main effect of either A or B on Z, and the construction assumed p(a) = p(A) = p(b) = p(B) = 0.5 and Hardy–Weinberg equilibrium. With these assumptions, the joint distribution is given in Table 9 (right). As in Example #2, A and B are mutually independent.

Table 9.  Example #3: Penetrance data and its joint distribution (synthetic data (Shervais et al., 2010); a/a, a/A, A/A and b/b, b/B, B/B are again coded as 1, 2, and 3. Left: Penetrance table (heritability = 0.008). Right: ABZ joint distribution with above assumptions.).
        Penetrance                Z=1                       Z=2
        B=1     B=2     B=3       B=1     B=2     B=3       B=1     B=2     B=3
A=1     0.00    0.04    0.08      0.00    0.005   0.005     0.0625  0.12    0.0575
A=2     0.06    0.04    0.02      0.0075  0.01    0.0025    0.1175  0.24    0.1225
A=3     0.04    0.04    0.04      0.0025  0.005   0.0025    0.06    0.12    0.06
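
The construction of the joint distribution from the penetrance table can be sketched as follows. The function name and data layout are illustrative; the only assumptions are those stated in the text (allele frequencies of 0.5, Hardy–Weinberg equilibrium, independent loci), together with the reading that the Z = 1 column of Table 9 carries the penetrance-weighted probability.

```python
# Genotype frequencies for genotypes 1, 2, 3 under Hardy-Weinberg equilibrium
# with allele frequency 0.5 (the assumption stated for Example #3).
GENO = (0.25, 0.5, 0.25)

# Penetrance table of Table 9 (left): PEN[a][b] = p(Z = 1 | A = a + 1, B = b + 1).
PEN = [[0.00, 0.04, 0.08],
       [0.06, 0.04, 0.02],
       [0.04, 0.04, 0.04]]

def joint_from_penetrance(pen, freq_a, freq_b):
    """Return joint[a][b] = (p(a, b, Z=1), p(a, b, Z=2)) with A and B independent."""
    return [[(freq_a[a] * freq_b[b] * pen[a][b],
              freq_a[a] * freq_b[b] * (1.0 - pen[a][b]))
             for b in range(3)] for a in range(3)]

joint = joint_from_penetrance(PEN, GENO, GENO)

# Marginal penetrance of each single-locus genotype: p(Z=1 | A=a) and p(Z=1 | B=b).
pen_given_a = [sum(joint[a][b][0] for b in range(3)) / GENO[a] for a in range(3)]
pen_given_b = [sum(joint[a][b][0] for a in range(3)) / GENO[b] for b in range(3)]

print([round(x, 4) for x in pen_given_a])  # [0.04, 0.04, 0.04]
print([round(x, 4) for x in pen_given_b])  # [0.04, 0.04, 0.04]
```

Every single-locus genotype has the same marginal penetrance, so neither A nor B alone carries any information about Z; this is what "no main effect" means for these data.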

IRA results on the joint distribution of Table 9 (right) are shown in Table 10. Because the data were constructed with no main effects, every decomposition has no information at all. This is the strongest possible example of Type 1 epistasis. If IRA is instead done on the conditional distribution of Table 9 (left), in effect assuming the AB frequencies are uniform, similar but not identical results are obtained.

Table 10.  Example #3: IRA results on penetrance table (Table 9) (The model is followed by its normalized information content.).
ABZ 1.00
AB:AZ:BZ 0.00
AB:AZ 0.00     AB:BZ 0.00     AZ:BZ 0.00
AB:Z 0.00      AZ:B 0.00      BZ:A 0.00
A:B:Z 0.00

Example #3 was one of five synthetic datasets evaluated in the Shervais et al. (2010) study. In all of these datasets there was only one epistatic pair, and this fact was assumed to be known in earlier work on these datasets and in the Shervais study. All five datasets showed Type 1 epistasis and the same results as Table 10. RA performance on these datasets is summarized in Table 11. When eight or 50 noise SNPs were added, RA detected the correct epistatic pair 100% of the time for models 1–4, and close to that for model 5. By contrast, previous work using only eight noise genes found one of the two active genes 47% of the time (Ritchie et al., 2004) and both genes only 19% of the time (Hahn et al., 2003); Hahn used multifactor dimension reduction and Ritchie used neural nets. The Shervais et al. study also evaluated results in terms of p-values: a gene pair was counted a false positive if it was not the correct pair yet had a p-value less than 0.000, and a false negative if it was correct yet did not meet this criterion; Table 11 gives the error rates in 1350 tests. The Ritchie et al. and Hahn et al. studies did not report information on false positives.

Table 11.  Effectiveness of RA in identifying gene–gene interactions (synthetic data) (The RA results below are from Shervais et al. (2010); Example #3 is model 5 in that paper. Compare these to the results of multifactor dimension reduction (MDR) and neural nets (NN): with eight noise SNPs, NN detected at least one correct SNP 47% of the time (Ritchie et al., 2004), and MDR detected the two correct SNPs 19% of the time (Hahn et al., 2003). FP = false positives; FN = false negatives; Error rate = (FP+FN)/#tests; the Shervais paper has typos in the error rates for models 4 and 5, as the FP and FN numbers given there indicate.).
Genetic model   Heritability   % both active genes in top RA model (8 noise SNPs), n = 30   % both active genes in top RA model (50 noise SNPs), n = 30   Error rate (8 noise SNPs), n = 30
1               0.053          100%                                                         100%                                                          0
2               0.051          100%                                                         100%                                                          0
3               0.026          100%                                                         100%                                                          0
4               0.012          100%                                                         100%                                                          0.007
5               0.008          93%                                                          80%                                                           0.005

Example #4: Type 1 Epistasis (Real Frequency Data)

Examples #1–3 were nonstatistical because sample size is meaningless for set-theoretic relations and undefined for probability distributions. For real data in the form of frequencies, not probabilities, statistical considerations enter. Table 12 is from Shervais et al. (2010), which replicated prior evidence (Cox et al., 1999) for epistatic interactions in type 2 noninsulin-dependent diabetes between SNPs on chromosomes 2 and 15.

Table 12.  Example #4: joint frequency distribution, diabetes data (Cox et al., 1999) (Left: a frequency distribution for SNPs A35 and B47 and disease, Z: Z = 1 controls; Z = 2 cases. Right: the AB projection of the data and its independence model distribution.).
        Z=1               Z=2               AB                A:B
        B=1   B=2   B=3   B=1   B=2   B=3   B=1   B=2   B=3   B=1   B=2   B=3
A=1     18    32    5     27    30    13    45    62    18    49.1  60.2  15.8
A=2     27    13    2     9     21    5     36    34    7     30.2  37.1  9.7
A=3     2     6     2     1     1     0     3     7     2     4.7   5.8   1.5

The IRA analysis of this joint frequency distribution is shown in Table 13. p-values of the models relative to the data (not to independence) are calculated from their L² (likelihood-ratio chi-square) and Δdf (degrees of freedom) values. The analysis shows that Table 12 exemplifies Type 1 epistasis because the identity of AB:AZ:BZ with the ABZ data can be rejected with confidence (p = 0.013). The data cannot be decomposed without significant loss of constraint. The strength of the constraint in the data is %ΔH(Z|AB) = 8.52%, which is sizeable because entropy involves a logarithm term. Note that the entropy reduction of AB:AZ:BZ is much smaller, namely 4.27%; this difference shows the strength of the purely triadic interaction. In the Shervais et al. (2010) study, all of the 36 candidate epistatic SNP pairs showed Type 1 epistasis.

Table 13.  Example #4: IRA results (Model, [I_m for neutral lattice; I_m and %ΔH_m(Z|AB) for directed lattice], Δdf and p-value relative to the data, ABZ, not relative to independence. The big difference between %ΔH_{ABZ}(Z|AB) = 8.52 and %ΔH_{AB:AZ:BZ}(Z|AB) = 4.27 indicates the strength of the triadic interaction.).
ABZ [1.00; 1.00, 8.52] 0 1.0
AB:AZ:BZ [0.569; 0.502, 4.27] 4 0.013
AB:AZ [0.418; 0.327, 2.79] 6 0.009    AB:BZ [0.281; 0.169, 1.44] 6 0.002    AZ:BZ [0.429,-] 8 0.033
AB:Z [0.135,0.00] 8 0.001             AZ:B [0.283,-] 10 0.021               BZ:A [0.146,-] 10 0.005
A:B:Z [0.00,-] 12 0.004
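
The directed-lattice entries of Table 13 for the loopless models can be recomputed from the counts of Table 12. The sketch below assumes that %ΔH(Z|AB) is the percentage reduction of H(Z) achieved by a model, and that L² relative to the data equals 2N ln 2 times the entropy difference between model and data; the Δdf values are taken from Table 13 rather than derived. The helper names are illustrative, and the closed-form chi-square tail is valid only for even df.

```python
import math

# Counts of Table 12: COUNTS[a][b] = (n with Z = 1, n with Z = 2).
COUNTS = [[(18, 27), (32, 30), (5, 13)],
          [(27, 9), (13, 21), (2, 5)],
          [(2, 1), (6, 1), (2, 0)]]
N = sum(z1 + z2 for row in COUNTS for z1, z2 in row)

def cond_entropy(group_of):
    """H(Z | group) in bits, pooling AB cells that share a group label."""
    pooled = {}
    for a in range(3):
        for b in range(3):
            tally = pooled.setdefault(group_of(a, b), [0, 0])
            tally[0] += COUNTS[a][b][0]
            tally[1] += COUNTS[a][b][1]
    h = 0.0
    for z1, z2 in pooled.values():
        for n in (z1, z2):
            if n > 0:
                h -= n / N * math.log2(n / (z1 + z2))
    return h

def chi2_sf_even(x, df):
    """Upper tail of the chi-square distribution for even df (closed form)."""
    assert df > 0 and df % 2 == 0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= (x / 2) / k
        total += term
    return math.exp(-x / 2) * total

# H(Z|AB) implied by each directed loopless model.
H = {"ABZ": cond_entropy(lambda a, b: (a, b)),   # Z predicted by A and B jointly
     "AB:AZ": cond_entropy(lambda a, b: a),      # Z predicted by A only
     "AB:BZ": cond_entropy(lambda a, b: b),      # Z predicted by B only
     "AB:Z": cond_entropy(lambda a, b: 0)}       # Z predicted by nothing
DDF = {"ABZ": 0, "AB:AZ": 6, "AB:BZ": 6, "AB:Z": 8}  # quoted from Table 13

results = {}
for name, h in H.items():
    pct = 100 * (H["AB:Z"] - h) / H["AB:Z"]
    l2 = 2 * N * math.log(2) * (h - H["ABZ"])
    p = chi2_sf_even(l2, DDF[name]) if DDF[name] else 1.0
    results[name] = (pct, l2, p)
    print(f"{name:6s} %dH(Z|AB) = {pct:5.2f}   L2 = {l2:6.2f}   p = {p:.3f}")
```

The model AB:AZ:BZ is omitted because it contains a loop and has no closed form; its entries require an iterative fit.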

Unlike Example #3, information here does not immediately drop to zero upon decomposition to AB:AZ:BZ; this model retains 50% of the information relative to AB:Z. Possible constraint in AB is suggested by I_{AB:Z} = 0.135 relative to A:B:Z; this is conceivable because the data were not filtered to avoid linkage disequilibrium. However, the p-value for A:B relative to AB (these two distributions are shown in Table 12 on the right) is 0.415, so the nonidentity of A:B and AB cannot be asserted.

A Finer Information-Vector Taxonomy

Examples #3 and #4 are both Type 1 epistasis, but the difference between their RA results shows that, within this type, one can distinguish instances of epistasis by how rapidly information declines as one goes down the lattice of structures. Reading the IRA results of Table 10 and Table 13 from top to bottom and left to right gives two vectors of information values, as follows:

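Example #3: (1.00, 0.00, {0.00, 0.00}, 0.00, 0.00, {0.00, 0.00}, 0.00)
Example #4: (1.00, 0.569, {0.418, 0.281}, 0.429, 0.135, {0.283, 0.146}, 0.00)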

where information values are here relative to A:B:Z and values within the curly brackets, {}, are for models that merely permute the input variables. The information vector characterizes the data by how rapidly information is lost as ABZ is decomposed. From this perspective, Example #3 is clearly the most extreme case of epistasis: all the interaction between A, B, and Z is triadic. There is no main effect due to A or to B and no AB association, that is, there are no dyadic constraints at all. The ABZ relation is maximally nondecomposable.

Any particular vector of values defines an equivalence class which one can consider a type of epistasis. A taxonomy based on these classes makes finer discriminations than those that just note how far down the lattice one can go and still have 100% information. The idea of such a finer taxonomy based on equivalence classes of the information vector can be illustrated by applying it to the classification of Li and Reich (2000), who showed that for two-locus penetrance tables having only 0 and 1 values, there are 50 types of epistasis, that is, 50 different ways (considering symmetries) that the nine values in the penetrance table can be assigned to either 0 or 1. An IRA analysis of the penetrance tables of these 50 types yields the taxonomy shown in Table 14. There are 11 equivalence classes (a–k) of the information vector. Ignoring k, the last of these, which is not actually epistatic because Z depends on only one input, there are five equivalence classes including 26 models within Type 1 epistasis (ABZ), and five equivalence classes including 22 models within Type 2 epistasis (AB:AZ:BZ). This information vector approach does not depend on penetrance values being only 0 or 1; it could classify tables of continuous penetrance values, as done by Hallgrímsdottir and Yuster (2008).

Table 14.  RA taxonomy of the Li–Reich penetrance tables (EC = equivalence class based on information vector.).
                   Information vector
EC   Type          ABZ   AB:AZ:BZ   AB:AZ   AB:BZ   AZ:BZ   AB:Z   AZ:B   BZ:A   A:B:Z
a    ABZ           1     0          0       0       0       0      0      0      0
b    ABZ           1     0.16       0.07    0.07    0.15    0      0.07   0.07   0
c    ABZ           1     0.33       0.33    0.00    0.33    0      0.33   0      0
d    ABZ           1     0.42       0.20    0.20    0.40    0      0.20   0.20   0
e    ABZ           1     0.55       0.38    0.07    0.46    0      0.38   0.07   0
f    AB:AZ:BZ      1     1          0.33    0.33    0.67    0      0.33   0.33   0
g    AB:AZ:BZ      1     1          0.38    0.38    0.76    0      0.38   0.38   0
h    AB:AZ:BZ      1     1          0.69    0.07    0.76    0      0.69   0.07   0
i    AB:AZ:BZ      1     1          0.39    0.39    0.78    0      0.39   0.39   0
j    AB:AZ:BZ      1     1          0.60    0.20    0.79    0      0.60   0.20   0
k    AZ:B          1     1          1       0       1       0      1      0      0

EC   Li–Reich Models
a    84, 98
b    78, 85, 86, 94, 99, 106, 113, 114, 170
c    14, 21, 97, 28, 42, 70
d    10, 12, 17, 68
e    29, 30, 43, 101, 108
f    11, 13, 19, 26, 41, 69
g    27, 45, 186
h    15, 23, 57, 58, 59, 61
i    1, 2, 16
j    3, 5, 40, 18
k    7, 56
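
An information vector of the kind listed in Table 14 can be computed for any 0/1 penetrance table with a small iterative proportional fitting (IPF) routine. The sketch below is not the software used for the paper; it assumes uniform AB frequencies, information normalized relative to A:B:Z, and a fixed number of IPF sweeps. The two input tables are hypothetical illustrations chosen to land in the first and last equivalence classes, and no claim is made about which Li–Reich model numbers they correspond to.

```python
import math
from itertools import product

CELLS = list(product(range(3), range(3), range(2)))   # (A, B, Z) index triples

# Each structure is a list of relations; a relation lists variable indices (A=0, B=1, Z=2).
MODELS = {
    "ABZ": [(0, 1, 2)],
    "AB:AZ:BZ": [(0, 1), (0, 2), (1, 2)],
    "AB:AZ": [(0, 1), (0, 2)],
    "AB:BZ": [(0, 1), (1, 2)],
    "AZ:BZ": [(0, 2), (1, 2)],
    "AB:Z": [(0, 1), (2,)],
    "AZ:B": [(0, 2), (1,)],
    "BZ:A": [(1, 2), (0,)],
    "A:B:Z": [(0,), (1,), (2,)],
}

def margin(dist, axes):
    out = {}
    for cell, value in dist.items():
        key = tuple(cell[i] for i in axes)
        out[key] = out.get(key, 0.0) + value
    return out

def entropy(dist):
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)

def fit(data, relations, sweeps=200):
    """IPF: maximum-entropy distribution whose listed margins match those of `data`."""
    q = {c: 1.0 / len(CELLS) for c in CELLS}
    targets = [(axes, margin(data, axes)) for axes in relations]
    for _ in range(sweeps):
        for axes, target in targets:
            current = margin(q, axes)
            for c in CELLS:
                key = tuple(c[i] for i in axes)
                q[c] = q[c] * target[key] / current[key] if current[key] > 0 else 0.0
    return q

def information_vector(pen):
    """pen[a][b] in {0, 1}, not constant; AB frequencies assumed uniform (1/9 each)."""
    data = {(a, b, z): (pen[a][b] if z == 0 else 1 - pen[a][b]) / 9.0
            for a, b, z in CELLS}
    h_top = entropy(data)
    h_bottom = entropy(fit(data, MODELS["A:B:Z"]))
    return {name: round((h_bottom - entropy(fit(data, rel))) / (h_bottom - h_top), 2) + 0.0
            for name, rel in MODELS.items()}

one_per_row_and_column = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]   # hypothetical example
depends_on_a_only = [[0, 0, 0], [0, 0, 0], [1, 1, 1]]        # hypothetical example
print(information_vector(one_per_row_and_column))
print(information_vector(depends_on_a_only))
```

The first table yields zero information for every decomposition, as in equivalence class a; the second yields the vector of class k, in which every structure containing AZ retains all the information. For tables whose fitted AB:AZ:BZ model contains zeros that are not forced by a zero margin, IPF can converge slowly, so the fixed sweep count may need to be increased.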

The Li–Reich classification, with its 50 types, is a refined classification which allows for biological interpretation, but this number of types is somewhat large, and will scale up greatly with additional input variables. For two inputs, the three-type classification of ABZ, AB:AZ:BZ, and AZ:BZ is perhaps too coarse, but the first two types expand into 10 equivalence classes, so RA provides both coarse and medium classifications. (State-based RA, discussed below, provides a fine classification that is comparable to that of Li–Reich.) For three inputs, the coarse RA classification gives 17 types (Table 2), and the approach of Li and Reich would give too many.

Extending the RA Lattice

Extending the RA lattice with related formalisms

In Example #4, the null hypothesis that A and B are independent could not be rejected, and in Example #3, the inputs were independent by construction. Despite this fact, both examples represent Type 1 epistasis: the triadic interaction between A, B, and Z makes it impossible to decompose ABZ without loss. Still, the ABZ model cannot express the fact that A and B might be independent. Another class of models, Bayesian networks (BN), also known as recursive models, includes a model that can assert this, namely that p_m(ABZ) = p(A) p(B) p(Z|AB) is close enough to the data, for which p(ABZ) = p(AB) p(Z|AB). This BN model is here written as AB_{A:B}:ABZ, where the second term means Z conditioned on AB. This model asserts independence between A and B, but in the p(Z|AB) term it also asserts a triadic interaction. This model is not encompassed in the RA lattice, although one might regard it as a multi-lattice RA model: it is ABZ on the three-variable lattice, but A:B on the two-input lattice.
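
The Bayesian network model p(A) p(B) p(Z|AB) can be built directly from the diabetes counts of Table 12. Because this model shares p(Z|AB) with the data, its likelihood-ratio statistic relative to the data reduces to that of A:B relative to AB. The sketch below illustrates this under the usual definition L² = 2N Σ p ln(p/q); the closed-form chi-square tail applies to df = 4 only, and all names are illustrative.

```python
import math

# Counts of Table 12: COUNTS[a][b] = (n with Z = 1, n with Z = 2).
COUNTS = [[(18, 27), (32, 30), (5, 13)],
          [(27, 9), (13, 21), (2, 5)],
          [(2, 1), (6, 1), (2, 0)]]
N = sum(z1 + z2 for row in COUNTS for z1, z2 in row)

p = {(a, b, z): COUNTS[a][b][z] / N
     for a in range(3) for b in range(3) for z in range(2)}
p_ab = {(a, b): p[a, b, 0] + p[a, b, 1] for a in range(3) for b in range(3)}
p_a = [sum(p_ab[a, b] for b in range(3)) for a in range(3)]
p_b = [sum(p_ab[a, b] for a in range(3)) for b in range(3)]

# BN model: independent inputs, full conditional of Z on both inputs.
bn = {(a, b, z): p_a[a] * p_b[b] * p[a, b, z] / p_ab[a, b] for (a, b, z) in p}
indep_ab = {(a, b): p_a[a] * p_b[b] for (a, b) in p_ab}

def l2(data, model):
    """Likelihood-ratio chi-square of `model` relative to `data`."""
    return 2 * N * sum(v * math.log(v / model[k]) for k, v in data.items() if v > 0)

l2_bn = l2(p, bn)            # BN model relative to the ABZ data
l2_ab = l2(p_ab, indep_ab)   # A:B relative to AB

# Upper tail of the chi-square distribution with df = 4 (closed form).
p_value = math.exp(-l2_ab / 2) * (1 + l2_ab / 2)
print(round(l2_bn, 3), round(l2_ab, 3), round(p_value, 3))
```

The two statistics coincide, and the resulting p-value is close to the 0.415 reported for A:B relative to AB, so the BN model cannot be distinguished from the data at conventional significance levels.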

Although the RA lattice does not encompass this BN structure, the BN lattice is also incomplete: it does not include the RA structure AB:AZ:BZ because standard Bayesian networks do not allow loops. RA and BN thus augment one another. There is yet another structure, applicable to epistasis, which RA does not encompass. Recall the point in the discussion of Example #2 that model AB:AZ:BZ might fit the data better than AZ:BZ because the AB relation in AB:AZ:BZ imposes the non-association present in the data that AZ:BZ does not impose. One can thus also consider an AB_{A:B}:AZ:BZ model, a hybrid between RA and BN. This model's calculated relation has maximum entropy subject to the constraints that its AZ and BZ projections agree with those of the data, but its AB projection must agree with AB_{A:B}, and not the AB of the data. This class of models is known as recursive hierarchical and block recursive models (Lauritzen, 1996).

If the RA lattice of Figure 1 is reduced to the three epistatic structures defined earlier as well as these two additional model structures, this gives Figure 3, which defines five epistatic types. With this altered taxonomy, Example #2 should be reclassified as Type 5 instead of Type 2, and Example #3 should be reclassified as Type 4 instead of Type 1. Types 1 & 4 and Types 2 & 5 each constitute a pair, where the second of each pair asserts the absence of association between the inputs, and the first permits linkage disequilibrium. Example #4 might also be regarded as being Type 4 instead of Type 1, because the independence of A and B could not be rejected.

Figure 3.

RA, BN, and hybrid RA-BN 2 input, 1 output epistatic structures (Five epistasis types are now indicated: 3 from RA plus 2 new types.).

Extending the lattice with state-based RA

The RA lattice of Figure 1 can be extended within the RA formalism itself using “state-based” RA. This extension was first introduced by Jones (1985), as part of his k-systems analysis, and was later integrated into the mainstream IRA formalism (Johnson & Zwick, 2000; Zwick & Johnson, 2004). RA, as discussed so far, which can now be labeled “variable-based” RA, uses structures that are subsets of variables, for example, AB:AZ:BZ. State-based RA uses structures that can also specify particular states of one or more variables, for example, AB:A1Z2:B2Z. It identifies which variable states are salient, that is, informationally rich, not merely which variables are salient. State-based RA resembles the Li & Reich (2000) and Hallgrímsdottir & Yuster (2008) epistasis classifications, and, like the latter, can analyze continuous penetrance values.

Within variable-based RA, models with loops can make finer discriminations than models without loops; state-based RA makes still finer discriminations, as depicted in Figure 4. Adding loops to a variable-based model or using state-based RA may allow one to choose a more complex—and thus more predictive—model. (In the figure dotted lines represent models too complex to be statistically significant; a thick solid line represents the most complex model that is significant in each of the three model types.) There is another way a state-based model can be superior to a variable-based model: it may have more information than a variable-based model (without or with loops) having the same or even greater complexity (df), as is illustrated below.

Figure 4.

Degrees of refinement of RA models (This is a general scheme for many variables; for 3 variables only one model has loops.).

The downside of the additional refinement of the state-based approach is that, as the number of variables increases, its lattice of structures grows even more explosively than the variable-based lattice (Table 1). A variable-based structure is defined independently of variable cardinalities, so the number of structures in the variable-based lattice is also independent of these cardinalities. But state-based structures are defined in terms of individual states, so the number of structures in the state-based lattice expands greatly with higher variable cardinalities. Computationally, this is handled by doing a greedy search that successively adds single parameters (Δdf = 1) to a starting model, which is typically variable-based and often the bottom reference model.

State-based RA is illustrated in Table 15 by its application to Example #3 (the synthetic data of Table 9), which exhibited Type 1 (or Type 4) epistasis. In Table 15 variable-based models are interspersed among the state-based models and shown in italics. Because the data are simulated probabilities, there is no sample size, and one must select a model by some other means of trading off information and complexity. The table shows that while variable-based RA indicates that any decomposition loses all the information in the data, state-based RA indicates that a simplification of five degrees of freedom (from Δdf of 8 to 3) still retains about 91% of the information.

Table 15.  Example #3: State-Based RA results (nonstatistical) (Unlike Table 13, Δdf here is given relative to the directed system independence model. Results for variable-based models (italics) are from Table 10. Variable-based models without loops are written with extra spaces.).
I_m     Δdf (rel. AB:Z)   Structure
1.000   8                 A B Z
1.000   4                 AB:Z:A1B1Z:A2B3Z:A1B3Z:A2B1Z
0.0     4                 AB:AZ:BZ
0.906   3                 AB:Z:A1B1Z:A2B3Z:A1B3Z
0.756   2                 AB:Z:A1B1Z:A2B3Z
0.0     2                 A B : A Z or A B : B Z
0.535   1                 AB:Z:A1B1Z
0.000   0                 A B : Z
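
The greedy search described above can be illustrated on Example #3 with a deliberately restricted candidate set. The sketch assumes that only single AB cells (terms of the form AiBjZ) are candidates, and it uses the fact that, for such terms added to AB:Z, the maximum-entropy model keeps the observed p(Z|AB) in each selected cell and gives all remaining cells one pooled p(Z). A full state-based search would also consider other kinds of state terms; the function names are illustrative.

```python
import math

GENO = (0.25, 0.5, 0.25)                   # genotype frequencies (HWE, allele freq 0.5)
PEN = [[0.00, 0.04, 0.08],                 # Table 9 (left): p(Z = 1 | A, B)
       [0.06, 0.04, 0.02],
       [0.04, 0.04, 0.04]]
AB_CELLS = [(a, b) for a in range(3) for b in range(3)]
P_AB = {(a, b): GENO[a] * GENO[b] for a, b in AB_CELLS}

def h2(q):
    """Binary entropy in bits."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def cond_entropy(fixed):
    """H(Z|AB) of the model AB:Z plus one state term for each AB cell in `fixed`."""
    free = [c for c in AB_CELLS if c not in fixed]
    h = sum(P_AB[c] * h2(PEN[c[0]][c[1]]) for c in fixed)
    mass = sum(P_AB[c] for c in free)
    if mass > 0:
        pooled = sum(P_AB[c] * PEN[c[0]][c[1]] for c in free) / mass
        h += mass * h2(pooled)
    return h

H_REF = cond_entropy(set())                # bottom reference model AB:Z
H_DATA = cond_entropy(set(AB_CELLS))       # the data, ABZ

def information(fixed):
    return (H_REF - cond_entropy(fixed)) / (H_REF - H_DATA)

def greedy_search(steps):
    """Add, at each step, the single cell (delta df = 1) that raises information most."""
    fixed, path = set(), []
    for _ in range(steps):
        best = max((c for c in AB_CELLS if c not in fixed),
                   key=lambda c: information(fixed | {c}))
        fixed.add(best)
        path.append((f"A{best[0] + 1}B{best[1] + 1}Z", information(fixed)))
    return path

for term, info in greedy_search(4):
    print(term, round(info, 3))
```

Under these assumptions the search adds A1B1Z, A2B3Z, A1B3Z, and A2B1Z in that order, with information values that agree with the state-based rows of Table 15 to within rounding.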

Applying state-based RA to Example #4 (diabetes data of Table 12), also an instance of Type 1 epistasis, gives the results shown in Table 16. The left-most column provides an arbitrary index for each model. The column to its right gives I_m. After Δdf relative to AB:Z, two p-values are given: p_cum and p_incr. p_cum is the cumulative p-value relative to the constant reference of the independence model (m = 0). p_incr is the incremental p-value relative to the model indexed by m_incr; for state-based models this is the next lower state-based model. Variable-based models, in italics, are added to the table for purposes of comparison. Considering only variable-based loopless models (in italics with extra spaces), there are no models between Δdf = 2 and Δdf = 8. Considering also the variable-based model with loops, AB:AZ:BZ, there are no models between Δdf = 2 and Δdf = 4 and between Δdf = 4 and Δdf = 8. (For two inputs and one output there is only one model with loops, but as the number of inputs increases, most models do have loops, as was shown dramatically in Table 1.) As Table 16 indicates, state-based models offer intermediate, more refined, options. For three variables, this effect is modest, but for more variables it is more substantial, as schematically suggested by Figure 4.

Table 16.  Example #4: SB-RA results (statistical) (Results for variable-based models are in italics; variable-based models without loops are written with extra spaces; # is the best state-based model; * is the best variable-based model; m = model index; m_incr = reference model for p_incr.).
m     I_m     Δdf (rel. AB:Z)   p_cum    p_incr   m_incr   Structure
11    1.000   8                 0.0014   0.9684   10       A B Z
10    1.000   7                 0.0007   0.8736   9        AB:Z:A2B1Z:A3Z:A1B2Z:A3B3Z:B3Z:A3B1Z:A2B2Z
9     0.999   6                 0.0003   0.5036   8        AB:Z:A2B1Z:A3Z:A1B2Z:A3B3Z:B3Z:A3B1Z
8     0.981   5                 0.0002   0.3022   7        AB:Z:A2B1Z:A3Z:A1B2Z:A3B3Z:B3Z
7     0.939   4                 0.0001   0.3695   5        AB:Z:A2B1Z:A3Z:A1B2Z:A3B3Z
6     0.502   4                 0.0129   0.1098   3        AB:AZ:BZ
                                         0.0148   2
5     0.907   3                 0.0000   0.0575   4        AB:Z:A2B1Z:A3Z:A1B2Z
4#    0.765   2                 0.0001   0.0045   1        AB:Z:A2B1Z:A3Z
3*    0.327   2                 0.0161   0.0161   0        A B : A Z
2     0.169   2                 0.1188   0.1188   0        A B : B Z
1     0.445   1                 0.0008   0.0008   0        AB:Z:A2B1Z
0     0.000   0                 1.00     1.00     0        A B : Z

If one requires for model acceptance that both cumulative and incremental p-values be less than 0.05, then the structure in Table 16, AB:Z:A2B1Z:A3Z, marked by a #, is the best state-based model, and the model, AB:AZ, marked by a *, is the best variable-based model. This state-based model identifies a triadic effect resulting from the interaction of specific states A2, B1, and Z (because Z is dichotomous, a specific Z state does not need to be specified), and also identifies a main effect due to genotype A3. This model has the same Δdf as the best variable-based model but captures far more information (0.765 vs. 0.327). One might also note that the state-based model with Δdf = 3 (m = 5) almost meets the p ≤ 0.05 standard and captures as much as 0.907 of the information. In summary, Table 16 shows that state-based RA considerably augments and refines the variable-based analysis of epistasis.
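
The information values and p-values quoted for the lower state-based models can be approximated from the counts of Table 12. The sketch below assumes that each state term in these particular models singles out a disjoint set of AB cells (A2B1Z one cell, A3Z the A = 3 row, A1B2Z one cell), so that the maximum-entropy model gives each set its own pooled p(Z) and pools all remaining cells together. It also assumes L² = 2N ln 2 times the entropy difference, with closed-form chi-square tails for df = 1 and df = 2. Names are illustrative.

```python
import math

# Counts of Table 12, keyed by genotype codes: (n with Z = 1, n with Z = 2).
COUNTS = {(1, 1): (18, 27), (1, 2): (32, 30), (1, 3): (5, 13),
          (2, 1): (27, 9),  (2, 2): (13, 21), (2, 3): (2, 5),
          (3, 1): (2, 1),   (3, 2): (6, 1),   (3, 3): (2, 0)}
N = sum(z1 + z2 for z1, z2 in COUNTS.values())

def cond_entropy(groups):
    """H(Z|AB) in bits: each listed group of AB cells gets its own pooled p(Z);
    all cells not listed share one further pooled p(Z)."""
    listed = {c for g in groups for c in g}
    rest = [c for c in COUNTS if c not in listed]
    h = 0.0
    for g in list(groups) + ([rest] if rest else []):
        z1 = sum(COUNTS[c][0] for c in g)
        z2 = sum(COUNTS[c][1] for c in g)
        for n in (z1, z2):
            if n > 0:
                h -= n / N * math.log2(n / (z1 + z2))
    return h

A3_ROW = [(3, 1), (3, 2), (3, 3)]
H = {
    "m0": cond_entropy([]),                              # AB:Z
    "m1": cond_entropy([[(2, 1)]]),                      # AB:Z:A2B1Z
    "m4": cond_entropy([[(2, 1)], A3_ROW]),              # AB:Z:A2B1Z:A3Z
    "m5": cond_entropy([[(2, 1)], A3_ROW, [(1, 2)]]),    # AB:Z:A2B1Z:A3Z:A1B2Z
    "data": cond_entropy([[c] for c in COUNTS]),         # ABZ
}

def l2(h_simpler, h_richer):
    return 2 * N * math.log(2) * (h_simpler - h_richer)

def p_df1(x):
    return math.erfc(math.sqrt(x / 2))    # chi-square upper tail, df = 1

def p_df2(x):
    return math.exp(-x / 2)               # chi-square upper tail, df = 2

info = {m: (H["m0"] - h) / (H["m0"] - H["data"]) for m, h in H.items()}
p_cum_m1 = p_df1(l2(H["m0"], H["m1"]))
p_cum_m4 = p_df2(l2(H["m0"], H["m4"]))
p_incr_m4 = p_df1(l2(H["m1"], H["m4"]))
p_incr_m5 = p_df1(l2(H["m4"], H["m5"]))

print({m: round(v, 3) for m, v in info.items()})
print(round(p_cum_m1, 4), round(p_cum_m4, 4), round(p_incr_m4, 4), round(p_incr_m5, 4))
```

Under these assumptions, the model with Δdf = 2 passes both the cumulative and the incremental 0.05 criterion, while adding A1B2Z gives an incremental p-value just above 0.05, in line with the selection of the model marked # in Table 16.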

These facts illustrate the capacity of state-based RA to capture more information with simpler models than variable-based RA. Because of their refinement, state-based models are likely to have greater power and give fewer false positives than variable-based models; similarly, within variable-based RA, using models with loops is likely to have greater power and give fewer false positives than using loopless models. However, models with loops take longer to compute than those without loops (because calculated relations must be generated iteratively if there are loops), and state-based models take longer to search than variable-based models (because there are many more state-based models).

Discussion

This paper describes the use of reconstructability analysis to study epistasis. It does not discuss all variations of RA, for example, the k-systems method (Jones, 1985) that decomposes continuous functions (not necessarily distributions) of discrete arguments or the Fourier version of RA (Zwick, 2004b) that minimizes square error rather than maximizing entropy in the composition step. When both of these variations are combined, if an output is a sum of quantitative functions of subsets of the inputs, RA will identify the subsets, that is, the variables participating in interaction effects without requiring any assumptions about the mathematical form of the interactions. For example, if Z = f1(A, B) + f2(B,C), this variant of RA will give model AB:BC. Here, independence is additive not multiplicative, as it normally is in probabilistic systems, and RA comes to resemble regression and thus standard models of epistasis (Cordell, 2002). Or, the k-systems method can be used with standard maximum entropy composition to analyze continuous phenotypes; when this is done, the lattice of structures is defined only by the input variables. Nor does this paper describe other possible augmentations of RA offered by the graphical models literature. For example, there is another class of graphical models, known as simset models (Studeny, 2005), which could further augment the RA lattice of structures; simset models allow multiple simultaneous structural hypotheses.

As noted in the Introduction, IRA is graph theory plus information theory, where graph theory defines structures and information theory (for SRA, set theory) evaluates them with data. Although information theoretic methods have been used in genomic analysis (Tsalenko et al., 2003; Dawy et al., 2006; Chanda et al., 2007; Dong et al., 2008; Kang et al., 2008), these methods have not yet fully exploited the capacities of information theory, nor have they involved searches through large lattices of models. For example, Tsalenko et al. (2003) used an information theoretic approach that involves only single variable loopless models. Information theory is also not always properly applied. For example, Chanda et al. (2007) use two information theoretic measures to quantify interaction:

  • (a) −H(A) − H(B) − H(Z) + H(AB) + H(AZ) + H(BZ) − H(ABZ).
  • (b) H(A) + H(B) + H(Z) − H(ABZ).

The first of these measures equals T_{A:Z|B} − T_{A:Z} or equivalently T_{B:Z|A} − T_{B:Z} and can be either positive or negative. Both positive and negative values can reflect the presence of an ABZ triadic interaction because one input alters the association of the other input with the output. One might perhaps consider taking the absolute value of this quantity, but an interaction may be present even if this quantity is zero. For example, assume that B has two states. Then T_{A:Z|B} − T_{A:Z} = [p(B1) T_{A:Z|B1} + p(B2) T_{A:Z|B2}] − T_{A:Z}. T_{A:Z|B} is an average; T_{A:Z|B1} might be bigger than T_{A:Z} while T_{A:Z|B2} might be smaller, or vice versa, so the weighted deviations might cancel and the difference might be zero even though the association of A and Z is affected by both states of B. There is thus no value of this measure that is a definitive indication that a triadic interaction is absent, hence this measure does not properly quantify such an interaction. The strength of the triadic interaction needs to be measured instead by H(ABZ_{AB:AZ:BZ}) − H(ABZ), the entropy of the model minus the entropy of the data, which is never negative. (H(ABZ_{AB:AZ:BZ}) cannot be written algebraically in terms of the entropy of the data and its projections because the model has a loop.) To test for the significance of this entropy difference, a p-value is calculated from L²_{AB:AZ:BZ} and Δdf = df_{ABZ} − df_{AB:AZ:BZ}. This issue, the correct way to quantify interactions, has been elucidated by Krippendorff (2009). The second of the Chanda et al. (2007) measures is T_{A:B:Z}, which measures the total constraint in ABZ, not only the constraint involving Z. T_{A:B:Z} will be nonzero even when Z is independent of both A and B if A and B are mutually associated (in linkage disequilibrium). It is T_{AB:Z} that measures the constraint that involves Z, and that is why AB:Z, not A:B:Z, is taken as the bottom reference for directed systems. So measure (b) also does not properly quantify the triadic interaction.
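
The behavior of measures (a) and (b) can be illustrated on three small binary distributions. These toy distributions are hypothetical and are not taken from any of the cited studies: in the first, Z is the exclusive-or of independent inputs; in the second, A, B, and Z are identical copies; in the third, A and B are perfectly associated and Z is independent of both.

```python
import math
from itertools import product

A, B, Z = 0, 1, 2

def H(dist, *axes):
    """Shannon entropy (bits) of the projection of `dist` onto the given variables."""
    proj = {}
    for cell, value in dist.items():
        key = tuple(cell[i] for i in axes)
        proj[key] = proj.get(key, 0.0) + value
    return -sum(v * math.log2(v) for v in proj.values() if v > 0)

def measure_a(d):
    """Measure (a): T(A:Z|B) - T(A:Z)."""
    return (-H(d, A) - H(d, B) - H(d, Z)
            + H(d, A, B) + H(d, A, Z) + H(d, B, Z) - H(d, A, B, Z))

def measure_b(d):
    """Measure (b): T(A:B:Z), the total constraint among all three variables."""
    return H(d, A) + H(d, B) + H(d, Z) - H(d, A, B, Z)

def t_ab_z(d):
    """T(AB:Z), the constraint that involves the output Z."""
    return H(d, A, B) + H(d, Z) - H(d, A, B, Z)

xor = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}
copies = {(v, v, v): 0.5 for v in (0, 1)}
linked_inputs = {(v, v, z): 0.25 for v, z in product((0, 1), repeat=2)}

for name, d in (("xor", xor), ("copies", copies), ("linked inputs", linked_inputs)):
    print(name, measure_a(d), measure_b(d), t_ab_z(d))
```

Measure (a) is positive for the exclusive-or case and negative for the copies case, and measure (b) is nonzero in the third case even though T(AB:Z) is zero there, which is the situation the text describes for inputs in linkage disequilibrium.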

Readers will note similarities between RA and other methods for studying epistasis, for example logistic regression (LR). LR applied to nominal input variables, where dummy variables code variable states, is the same as log-linear (LL) modeling; where these formalisms overlap, RA, LL, and LR are equivalent. Still, RA employs entropy and transmission measures not normally reported in LL or LR, and these measures are useful and intuitively easy to understand. LR does not normally evaluate the structure AZ:BZ, which can model epistasis, because it is not hierarchically related to AB:Z. RA is also different from LR as implemented in the PLINK software (Purcell et al., 2007), which has been employed for the analysis of epistasis. PLINK regresses against allele dose, that is, treats variables as quantitative rather than nominal, and is inappropriate when the dependence of disease on allele dose is not monotonic (or if monotonic, not linear). When genotypes are coded nominally, inputs with three or more states are sometimes recoded with two or more binary variables, and with this type of dummy variable coding, LR resembles state-based RA. Whether the two are equivalent is under investigation, but even if they are, there remain differences, at least computational ones: LR maximizes likelihood, typically with the Newton–Raphson algorithm, while RA maximizes entropy with Iterative Proportional Fitting. LR software sometimes uses the Wald test instead of the more robust likelihood-ratio test. More critically, LR software is usually not designed for exploratory purposes and is sometimes unable to handle interactions between many variables. As already noted, LR as a methodology does not explicitly articulate the lattice of possible models or provide heuristics for searching it. More generally, RA's fusion of information theory and graph theory connects it strongly with the graphical models literature. In its graph theory aspects RA explicitly considers the lattice of possible structures and offers heuristics for searching this lattice. Also, RA has a set-theoretic version, can analyze continuous outputs, and has a Fourier-based variation. In summary, while RA and LR (and LL) may be identical where they overlap, RA has distinctive features, both theoretical and computational, which make it useful for the study of epistasis.

RA and other nominal data methods are inherently more appropriate to studying genomic data than other approaches such as neural nets (Ritchie et al., 2003) or support vector machines (Chen et al., 2008) that presuppose metric information. The predictive relation in an RA (or LL, LR, or BN) model is precisely the conditional probability of the discrete output, given the discrete inputs. Because conditional probability of the diseased output state is penetrance, information theory is a natural and transparent way to represent relations between genotype and phenotype. Also, the entropy of the nominal output variable is analogous to variance for continuous variables, and %entropy reduced is analogous to %variance explained, so the core concepts of RA are intuitively grasped. By contrast, a neural network fits data via hard-to-interpret weights, and usually does not include statistical assessment. Also, neural networks are designed for deterministic input–output relations, and often do not perform well when relations are stochastic, which is typically the case for genotype–phenotype relations.

An earlier study utilizing both simulated and real data (Shervais et al., 2010) showed that RA can be used as a tool in genomics research. In that study, RA performed better than two other multivariate methods (multifactor dimension reduction and neural nets) in detecting epistasis in simulated data and was also applied successfully to detecting epistasis in type 2 noninsulin-dependent diabetes data. This paper follows up that previous study (i) by putting the methods used there in a more encompassing framework, (ii) by showing that RA has additional capacities not used in that study, and (iii) by introducing innovations in RA methodology that enhance its potential value for genomic research. The models that RA offers at different levels of refinement, as shown in Figure 4, and the variations in RA methodology discussed earlier make RA a very flexible methodology, suitable for studies of genome-wide association, gene-expression, disease risk factors, and other biomedical applications. For example, one could, in a GWAS, shift from fast searches of coarse variable-based loopless models to slow searches of fine variable-based models with loops to slower searches of ultra-fine state-based models while progressively reducing the number of SNPs under consideration. Results of a GWAS of two interacting loci using LR (Marchini et al., 2005) suggest that variable-based loopless RA models would also have the power to analyze an initially very large number of SNPs because ABZ models in RA are equivalent to fully saturated LR models. By first reducing the number of SNPs with loopless model analysis, it would then be possible to examine epistatic interactions involving many more than two SNPs, using variable-based models with loops and state-based models. Having this spectrum of models and having multiple methodological variants is a distinctive asset of RA.

Acknowledgements

I thank Patricia Kramer and Steve Shervais for the enjoyable and productive research collaboration that suggested this study, Rajesh Venkatachalapathy for his investigations of non-RA graphical models, and Joe Fusion for his OCCAM software development work. I also thank the reviewers of this paper and especially the editor of this volume, Heather Cordell, for valuable comments on earlier drafts.
