Estimating the proportions in a mixed sample using transcriptomics

Often, biomedical researchers take a tissue sample from a tumour that unintentionally contains tumour cells and stromal cells so that gene expression from the sample represents two distinct kinds of cells. This is tolerated because techniques for separating tumour and stromal cells are expensive and time‐consuming and the act of separation can alter gene expression. So, it is desirable to have a technique for estimating the proportion of tumour cells in a mixed sample to improve detection of differential expression in cancer cells. © 2014 The Authors. Stat Published by John Wiley & Sons Ltd.


Introduction
Metagenomics is defined as the study of multiple genomes either from different species or from the same species but different states. It has become an important focus for researchers in several biomedical fields. The reason is that outside unusually well-controlled experiments, genetic material or measurements usually represent two or more distinct biological sources, for example, cells, tissue types, or organisms. Accordingly, methods for analysing metagenomic samples that take into account multiple sources must be developed.
Specifically, physical samples containing two or more kinds of tissue are frequently obtained from a subject even though a single tissue type is sought. When this happens, and the sample is used to generate gene expression data, it is unclear how much of the expression is due to the tissue of interest and how much is due to the contaminating tissue. This is common in cancer research: the part of the tissue sample of interest is often the tumour; however, the tissue also contains stromal cells. So, when a whole genome expression array is carried out for such a sample, we must correct for the presence of stromal cells to permit determination of whether the expression of a gene is significantly higher or lower in the cancer cells as compared with stromal cells or with a separate normal reference. This problem is often referred to as computational deconvolution or purification (Quon et al., 2013). We assume that expression values from a second sample of pure stromal cells are available, and we want to estimate the proportion p of stromal cells in the mixed sample. It will be seen that one way to do this is to transform the raw data, thereby introducing a parameter˛and using the transformed data to estimate p. This will necessitate estimating˛as well. Our procedure for estimating˛and p was presented by Clarke et al. (2010) where details on the implementation and evidence that it works well in practice in numerous and diverse contexts can be found. The estimators are based on ratios (Gosink et al., 2007). The estimator for the transformation parameter comes from graphing the mean of a ratio of gene expressions as a function of a multiplicative parameter in a logarithmic transform and looking for the minimum radius of curvature. The estimator for the proportion is based on minimizing a similar ratio.
Here, we present a theoretical justification for our approach to computational deconvolution; to the best of our knowledge, no other technique for estimating p has a formal consistency proof. Other approaches to deconvolution are discussed by Quon et al. (2013) and Ahn et al. (2013); hence, we mention them briefly here. One is due to Lu et al. (2003). The idea is that genes known to be expressed exclusively in one cell type can be used to estimate the proportion of expression coming from that type. This method depends on known tissue-specific or cell-specific genes and technology that can detect their expression with little or no cross-hybridization. A second approach requires expression profiles from each cell type. For instance, Wang et al. (2006) use a method similar to that of Lu et al. (2003) to generate estimates by obtaining solutions to linear equations via simulated annealing. Alternatively, Quon & Morris (2009) use a latent Dirichlet allocation model with [estimation via the expectation-maximization algorithm] in place of the linear equations and simulated annealing. These methods are not applicable in our context because we do not assume pure expression profiles.
There are other methods for expression deconvolution that do not require estimation of p (or˛). Several researchers have used expression data from purified reference tissue types to determine the expression of each tissue type in heterologous samples (Venet et al., 2004;Lahdesmaki et al., 2005). Another method uses proportions of each sample or cell type, assessed by pathologists, to establish tissue-specific expression. Stuart et al. (2004) use linear regression models, regressing expression on fractional content of each cell type, to estimate the expected cell type expression as the regression coefficients. Shen-Orr et al. (2010) extended this method to detect differential expression. Ghosh (2004) developed a method to determine differential expression in the presence of mixed cell populations in which a pathologist's assessments of the proportions of each cell type were used in a hierarchical mixture model. More recently, Quon et al. (2013) used sets of tumour and healthy tissue expression profiles to generate a purified cancer profile for each tumour sample, and an estimate of the proportion of RNA originating from cancerous cells. Finally, Ahn et al. (2013) developed an approach that estimates both tissue proportions and tissue-specific expressions using raw measured data from groups of samples. However, none of these approaches have a formal theory.
The structure of this paper is as follows. In Section 2, we formulate a model in which both p and˛are well defined as population quantities and provide estimators for them. In Section 3, we establish consistency of these estimators. In Section 4, we discuss several issues related to the proof and the problem.
2 Defining the setting The gene expression from two samples -one mixed tumour and stroma and the other pure stroma -can be regarded as a collection of bivariate two-stage experiments. The first stage is the selection of a gene from a population of genes, and the second stage is the measurement of the expression of the gene in the stromal sample and the mixed sample. Thus, we have a distribution of the form in which U j is the gene expression from the pure stromal sample for gene j D 1, : : : , m, while V j is the gene expression from the mixed sample. Note that J is regarded as a random variable, so if the subjects are independent and identical, (1) holds for each of them. Later, m will be allowed to increase slowly as a mathematical convenience for proofs. In fact, m is usually large, so limits over m are realistic.
Let us suppose that the gene expression from the mixed sample can be expressed in terms of the gene expression from samples of pure stroma and pure tumour. So, for each j, where p 2[ 0, 1] is the proportion of stromal tissue expression in the sample and W j is what the expression for the j-th gene in a pure tumour sample would be. It is understood that V j , U j , and W j are all taken per unit tissue. In (2), we know V j and U j , and we want to find p so we can determine W j . (An extension to three or more components is possible; see Remark 3 in Section 4.) Heuristically, we can look at V j =U j D p C .1 p/.W j =U j / and minimize over j in the hopes that min j .W j =U j / will effectively give zero so that min j .V j =U j / will be a good estimate of p. However, it seems hard to estimate p directly from a function of the ratios of U j 's and V j 's. The problem seems to be that the variability in U j and V j is so high that ratios are unstable (Gosink et al., 2007). Here, we use the same heuristic argument but apply it to stabilized forms of U j and V j .
Specifically, we stabilize U j and V j by taking logarithms. Write where˛> 0 is a scaling parameter; as˛increases, the degree of stabilization on the log scale decreases.
assumed to have finite moments j .˛/ D ER j .˛/ for all j with K 1 Ä j Ä K 2 , for finite K 1 and K 2 , over a large range of˛'s (to be specified shortly).
Clearly, R j is well defined as long as U j ¤ 0, that is, U j ¤ 0, meaning that no genes from the stromal sample are unexpressed. If this fails, one may ignore such genes (provided they are not too numerous) or use the symmetry of the problem to estimate .1 p/. Similarly, if there are cases where E.R j / D 1, some of the mixed tissue genes may have to be ignored to satisfy the moment conditions. In practice, these limitations have not been a problem because the difference in expression between stromal and mixed tissue samples is one of degree, not presence or absence.
For now, consider the value of˛ where the derivative on the right is with respect to˛and the bar on R indicates the mean over the m genes. This maximization is equivalent to minimizing the radius of curvature of the curve .˛, R.˛//.
Given this, the true value of p can be taken as The reason is that V j =U J D p C .1 p/W j . So, the smallest ratio of V j =U j occurs when W j =U j is smallest; the smallest value of W j =U j in real data sets is typically zero. Transforming to V j =U j does not change the reasoning as can be seen by expressing U j and V j in terms if U j and V j from (3) and (4). Therefore, assuming p satisfies (6) is reasonable. Now, given a good estimate of˛, say Ǫ, we can estimate p by So, an independent and identical (IID) sample of size n, .U ij , V ij / for i D 1, : : : , n, gives values .U ij , V ij / for i D 1, : : : , n for a given˛and hence ratios can be used as estimators of˛ and p .

Consistency
Consistency of Ǫ and O p will be seen to follow from a uniform law of large numbers as n ! 1 (Lemma 3.1 and Proposition 3.1) and from a large deviation principle as m ! 1 (Theorem 3.1).
The form in which the uniform law of large numbers occurs is encapsulated in conditions C1 and C2. Let Readily verifiable conditions for C1 can be found in Andrews (1987); see also Pötscher & Prucha (1989). An excellent discussion of the strategy of proof can be found in Bierens (2005a, Chapter 6), and a particularly clear statement of sufficient conditions for C1 can be found in Theorem 3 of Bierens (2005b). Essentially, aside from technical conditions, C1 holds when expected local suprema and infima of any marginal, for example, R 1 , are bounded away from 1 and 1, respectively.
for some > 0 small enough, where R 1 .˛/ merely means R i .˛/ for i D 1 because the R i 's are IID. Apart from , expression (9)  Informally, this means that the inverse of the mean converges to the inverse of its expectation, uniformly inˇeven though .R/ 1 . / is a random function.
Because sufficient conditions for C1 are available, we give further conditions under which C1 implies C2. Taken together, the resulting conditions will enable us to state easily the other hypotheses required for consistency of O p and Ǫ. Because R.˛/ is a random function of the form R.!,˛/ where ! 2 , the underlying measure space, we have the following.
We consider only (11) as the proof for (12) First, consider (13) (14) and (15) are similar, and the statement for slowly increasing m is obvious: merely choose a large enough n for a given and m.
We remark that the hypotheses of Lemma 3.1 are little more than what is minimally required for the result. However, the hypotheses are difficult to verify, specifically that the R's are uniformly continuous on compact sets uniformly over n. On the other hand, the R's are means (over i) of second derivatives of functions which are themselves means over genes, that is, over j. Because means of finitely many uniformly continuous functions are uniformly continuous, R.˛/ is uniformly continuous for each n and hence for any finite number of n's. To ensure that the uniform continuity of R.˛/ is uniform in n, it is enough to assume R ! ER as n ! 1 uniformly in˛and the limit is uniformly continuous. However, this, too, is hard to verify. In practice, one looks at the curve .˛, R.˛/ for˛2 [ Á, 1=Á] in the unit speed parametrization and verifies it looks smooth with a unique minimum radius of curvature (Clarke et al., 2010).
Given Lemma 3.1, we can establish the consistency of Ǫ for˛ . Given the consistency of the selection of the transformation in Proposition 3.1, we can now establish the consistency of O p for p . Let > 0 and for j D 1, : : : , m consider the local supremum where B.˛ , / is the interval with half-length centred at˛ . Also, let M j be the moment-generating function of R 1j .˛ / where R 1j merely means R ij for i D 1 as the R ij 's are IID over i. We have the following.

Remark 3.2
Assuming independence across the genes, that is, for different j's, is physically unrealistic. However, the patterns of dependence are poorly understood. Consequently, it is common to pre-process gene expression data to minimize dependence across genes (Qiu et al., 2005;Klebanov & Yakovlev, 2007). Also, it is (unfortunately) typical to assume weak or minimal dependence to permit statistical analyses to proceed; see Heller & Qin (2007) for but one example. Conceptually, this also holds if the unit of interest is changed from genes to pathways.
It is also well known that both the independence and identicality assumptions can be relaxed while still maintaining a large deviation principle. This is seen in Dembo & Zeitoni (1998, Chapter 2.3), and in the large deviation principles that can be proved under various mixing conditions (e.g. Mohri & Rostamizadeh, 2010). However, the arguments are very specialized.
We begin with (22). For all j and˛2 B.˛ , /, (21) gives By standard reasoning, this leads to Because each .j, / ! 0 in probability, the maximum over j D 1, : : : , m also goes to zero, for each . Therefore, For (25), we give a version of the argument used to prove the upper bound of Theorem 1.2.6 in Deuschel & Stroock (1989); see also Stroock (1984). We see that Consider a fixed value of m. We show that for each m, there is an N so that for n > N, P.Y j Ä EY j C =2/ ! 1 uniformly over j D 1, : : : , m. To see this, observe that for each j, Y j is an average of n independent and identically distributed random variables; write j D EY j . Then for any positive , because M j is the moment-generating function of any R ij . Therefore, any factor in (26) satisfies P.Y j Ä EY j C =2/ 1 min e n. j C =2/ .M j . // n 1 e n , uniformly in j for some > 0 using the hypotheses on the moment-generating functions. Hence, Q j P.Y j Ä EY j C =2/ ! 1 and (25)! 0.
Consider a sequence j k such that EY j k ! max j EY j . Then for any k, There also exists K so that k, k 0 > K implies P Y j k < EY j k 0 =2 Ä P Y j k < EY j k =4 . So there 9 > 0 for which for k > K, by the lower-bound portion of Theorem 1.2.6 in Deuschel & Stroock (1989). (For brevity, we do not give this argument here, but see Stroock, 1984, for an accessible version.) The bound in (28)

Concluding remarks
The contribution of this paper is to identify a set of models and prove the consistency of the estimators for parameters in those models for the deconvolution of gene expression from a mixed sample. Although expressed in the context of cancer and metagenomics, our techniques should apply to any deconvolution setting using array data. To conclude, we discuss four aspects of the overall strategy.
(1) Relation of the formal hypotheses to the physical problem Note that in our proof, given m, , and , one lets n increase first and then chooses another triple .m, , /, again letting n increase. While this is reasonable for formal results, actual data are the reverse: it is m that is large and n that is small. Despite the discrepancy between mathematical hypotheses and the actual data, statistical techniques give good performance (Speed, 2010).
(2) Correcting estimates of gene expression: Suppose the U ij 's represent the pure stromal gene expression and the V ij 's represent the mixed sample gene expression. Then, we can use our estimates (7) and (8) to estimate˛