Supported by: Swedish Cancer Society, the Cancer Research Funds of Radiumhemmet in Stockholm, the King Gustav V Jubilee Clinic Cancer Foundation in Gothenburg, the Swedish State under the ALF agreement in Gothenburg and in Stockholm, the West Health Care Region in Sweden.
Susceptibility to diseases, including cancer in humans, may depend on common genetic variations (Stranger et al., 2007). Chromosomal aberrations such as deletions, amplifications, or other forms of structural rearrangements have a major impact on tumor development (Vogelstein and Kinzler, 2004). Generally, chromosome fragments exhibit a high degree of intra- and intertumor heterogeneity. A chromosome fragment may show no genuine genomic change in some tumors, while in others the same fragment may be significantly changed due to deletion or varying degrees of amplification.
Deletions and amplifications, commonly known as DNA copy number aberrations (CNA), frequently contribute to altered gene expression and, consequently, altered messenger RNA (mRNA) levels (Pinkel and Albertson, 2005). Chromosome aberrations often reflect the occurrence of DNA aberrations in specific genes. However, not all DNA aberrations result in altered gene expression profiles and, ultimately, altered gene products.
For a better understanding of the effect of CNA on the studied phenotype, we need to integrate the two biological levels (DNA and RNA) and seek to elucidate the relationship between them. There is no general consensus in the literature as to which statistical method best captures the nature of the relationship between gene and gene product. Researchers seek to understand CNA and mRNA levels concomitantly. In general, studies determine CNA and aberrations in mRNA levels separately and then match the identified aberrations to determine if alterations in mRNA levels may be attributed to CNA (Heidenblad et al., 2005; van Wieringen et al., 2006; Bicciato et al., 2009; Oudejans et al., 2009). Considering CNA and mRNA in a bivariate approach represents an advance over the two-step procedure. Currently, correlation dominates applied integrative genomic analyses (Bussey et al., 2006; Gu et al., 2008; Lee et al., 2008; Tayrac et al., 2009; Horlings et al., 2010; Salari et al., 2010; Green et al., 2011). Correlation gives insight into the strength of the relationship between CNA and relative mRNA levels. Similarly, regression analysis finds its place in integrative genomic analysis (Pollack et al., 2002; Stranger et al., 2007; Menezes et al., 2009; Peng et al., 2010; Asimit et al., 2011). Some authors choose to combine the two-step analysis with an assessment of relationship strength (Pollack et al., 2002; Heidenblad et al., 2005; van Wieringen et al., 2006; Tsukamoto et al., 2008; Bicciato et al., 2009). This combined approach first identifies abnormal chromosome regions and gene expression patterns and then assesses the strength of association between CNA and mRNA levels by means of correlation or regression analysis, among others. Schäfer et al. (2009) took this idea a step further and derived a modified correlation coefficient to measure equally directed deviations of CNA and mRNA from the median values in the reference samples.
Multivariate methods emerged as tools of integrative genomic analysis as well (Soneson et al., 2010). Jornsten et al. (2011) proposed a holistic approach that combines genome-wide DNA and RNA data in causal networks and provides prognostic scores.
The method set forth here continues the tradition of bivariate analyses in a regression analysis framework. Our framework approaches integrative genomic analysis from a slightly different angle, as we aim primarily to describe patterns in the relationship between aberrant DNA copy numbers and abnormal gene expression. Here, we ask whether the gene expression pattern changes over the range of CNA values. Statistically, this change in gene expression pattern can be depicted as a change in regression slopes. The framework set forward here assumes the existence of one (or more) identifiable point(s) where the relationship between CNA and mRNA levels (e.g., the slope of the regression line) changes. We propose segmented regression as an exploratory tool to achieve the above-stated goal. We are unaware of any previous work using segmented regression as a tool in integrative genomic analysis. Segmented regression has previously been applied to link CGG repeats to mRNA levels in patients with trinucleotide repeat disorders (Garcia-Alegria et al., 2007; Minguez et al., 2009) and to study nonmonotonic expression levels driven by gene compactness in eukaryotic cells (Carmel and Koonin, 2009).
We illustrate the implementation and applicability of the proposed model in integrative genomic analysis of primary tumors from 97 breast cancer patients. First, we identify chromosome regions where relative mRNA levels follow changes in DNA copy number and assess if the relationship can be depicted satisfactorily with a single straight line. If not, we approximate the nonlinear relationship with two lines that meet at the change-point, estimate the location of the change-point, and then validate the results. We proceed by demonstrating the finer details of the method on two randomly chosen genes.
MATERIALS AND METHODS
Patients and Data
Primary invasive tumors from 97 diploid breast carcinoma patients were selected from the fresh-frozen tissue bank at the Sahlgrenska University Hospital Oncology Lab (Gothenburg, Sweden). Parris et al. (2010) describe the patient data and study design.
Array-CGH and Gene Expression Profiling
The integrative genomic profiling experiments were conducted as previously described by Parris et al. (2010). In brief, CNA were assessed by whole-genome tiling array-based comparative genomic hybridization (array-CGH) with 38,043 reporters. Data preprocessing and pin-based Lowess normalization were performed using the BioArray Software Environment system (BASE), and the data were segmented into regions of genomic gain and loss using the Rank Segmentation algorithm in Nexus Copy Number Professional 4.1 software (BioDiscovery). Gene expression profiling was performed using Illumina HumanHT-12 Whole-Genome Expression Beadchips containing ~49,000 reporters. Data preprocessing and quantile normalization were performed using BASE, and further analysis was performed with Nexus Expression 2.0 (BioDiscovery) using log2-transformed, normalized values and a variance filter. Relative gene expression values were calculated using normalized values from five normal breast samples (Lim et al., 2009). Probes from the two platforms that had 100% nucleotide sequence similarity and spanned recurrent genomic aberrations (smoothed log2 ratio ± 0.2) were paired.
After appropriate preprocessing to counteract the effect of missing array-CGH data (Troyanskaya et al., 2001), we proceeded with the regression analysis. For a given chromosome fragment, we denote by $x_{i,j}$ the log2 ratio normalized CNA measurement at probe $j$ for individual $i$, and we let $y_{i,j}$ denote the corresponding log2 ratio normalized mRNA measurement. We assume that the pairs $(x_{i,j}, y_{i,j})$ are ordered, so that $x_{1,j} \le x_{2,j} \le \dots \le x_{n,j}$. For each probe $j$, we build a linear model for the relationship between CNA and relative mRNA levels

$$y_{i,j} = \alpha_j + \beta_j x_{i,j} + \varepsilon_{i,j},$$
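where, for any given $j$, the $\varepsilon_{i,j}$ are i.i.d. normal errors with mean zero. We assume that for some probes the linear model is not adequate, and we approximate the unknown nonlinear function by a sequence of joined linear submodels

$$y_{i,j} = \sum_{k=1}^{K+1} \left(\alpha_{j,k} + \beta_{j,k} x_{i,j}\right) I_k(x_{i,j}) + \varepsilon_{i,j},$$

where $I_k(x_{i,j}) = 1$ for $\tau_{k-1} < x_{i,j} \le \tau_k$ and otherwise 0, the $\tau_k$'s are unknown change-points, and $-\infty = \tau_0 < \tau_1 < \dots < \tau_K < \tau_{K+1} = \infty$.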
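The model coefficients, $\theta$, are estimated by minimizing the residual sum of squares,

$$\mathrm{RSS}(\theta, \tau) = \sum_{i=1}^{n} \left(y_{i,j} - \hat{y}_{i,j}(\theta, \tau)\right)^2,$$

where $\hat{y}_{i,j}(\theta, \tau)$ is the fitted value for coefficients $\theta$ and change-point(s) $\tau$. We used a grid search over $\tau$ and at each step of the grid estimated $\theta$ by the usual linear regression methods (Lerman, 1980; Chen et al., 2011a). The online Supporting Information gives further details about parameter estimation. We estimated variances for the regression coefficients based on results by Hinkley and Feder (Hinkley, 1969, 1971; Feder, 1975). For inference about $\beta$ ($H_0$: $\beta = 0$; $H_1$: $\beta \neq 0$), we used the Wald statistic, $W = \hat{\beta} / \widehat{\mathrm{SE}}(\hat{\beta})$.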
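The grid search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fit_two_segment`, the hinge parametrization, the single change-point, and the toy data are all hypothetical, and only the Python standard library is used.

```python
def fit_two_segment(x, y, grid):
    """Grid search for a continuous two-segment regression.

    Assumed (hypothetical) parametrization:
        y = a + b_left * x + delta * max(x - tau, 0)
    The right-hand slope is b_left + delta, and the two lines meet at tau.
    Returns the best fit as a dict, or None if no grid point is usable.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    best = None
    for tau in grid:
        # Hinge term: zero left of tau, linear right of tau.
        h = [max(xi - tau, 0.0) for xi in x]
        h_bar = sum(h) / n
        shh = sum((hi - h_bar) ** 2 for hi in h)
        sxh = sum((xi - x_bar) * (hi - h_bar) for xi, hi in zip(x, h))
        shy = sum((hi - h_bar) * (yi - y_bar) for hi, yi in zip(h, y))
        det = sxx * shh - sxh ** 2
        if shh == 0.0 or det <= 1e-9 * sxx * shh:
            continue  # hinge is constant or collinear with x: skip this tau
        # Closed-form least squares for two predictors (x and the hinge).
        b_left = (shh * sxy - sxh * shy) / det
        delta = (sxx * shy - sxh * sxy) / det
        a = y_bar - b_left * x_bar - delta * h_bar
        rss = sum((yi - (a + b_left * xi + delta * hi)) ** 2
                  for xi, hi, yi in zip(x, h, y))
        if best is None or rss < best["rss"]:
            best = {"tau": tau, "intercept": a, "slope_left": b_left,
                    "slope_right": b_left + delta, "rss": rss}
    return best


# Noise-free toy data: slope 2 up to x = 0.5, then a plateau.
x = [i / 10 for i in range(11)]
y = [1 + 2 * xi if xi <= 0.5 else 2.0 for xi in x]
fit = fit_two_segment(x, y, grid=[i / 10 for i in range(1, 10)])
```

For each candidate change-point the coefficients come from ordinary least squares, and the candidate with the smallest residual sum of squares is kept. A finer grid trades computation time for resolution in the estimated change-point.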
Addition of extra parameters will inevitably improve the model fit; however, we have to avoid overfitting. We have to choose the best working model and find a trade-off between complexity and interpretability. Intuitively, one could expect two change-points: one at the boundary between loss and no CNA and one at the boundary between no CNA and gains. As our data set contains mainly gains and the sample size is relatively small, we fitted only two-segment regressions. Consequently, we defined three competing models: the null model with only an intercept, $y_{i,j} = \alpha_j$; the simple linear model, $y_{i,j} = \alpha_j + \beta_j x_{i,j}$; and a two-segment regression with one change-point, $y_{i,j} = (\alpha_{j,L} + \beta_{j,L} x_{i,j}) I(x_{i,j} \le \tau_{j1}) + (\alpha_{j,R} + \beta_{j,R} x_{i,j}) I(x_{i,j} > \tau_{j1})$, with the two lines meeting at the change-point, where $\beta_{j,R}$ is the regression slope for the segment on the right side of the change-point $\tau_{j1}$, and $\beta_{j,L}$ is the regression slope for the segment on the left side of the change-point. The indicator functions $I(x_{i,j} \le \tau_{j1})$ and $I(x_{i,j} > \tau_{j1})$ take the value 1 if the condition is met, otherwise 0. Statistical literature on inference in segmented regression focuses mostly on the number of change-points and implicitly assumes a relationship between the predictor and outcome. However, integrative genomic data violate this assumption, and one main issue is weeding out significant mRNA–CNA relationships. With the assumption that a nonlinear relationship has a weaker but still significant linear component, we used the following sequential procedure. First, we tested the simple linear model against the intercept-only model with the F-test based on the residual sum of squares. After excluding nonsignificant relationships, we assessed whether the addition of a change-point improves the model fit. We based our assessment on the F-test (Worsley, 1983), defined as

$$F = \frac{(\mathrm{RSS}_{LM} - \mathrm{RSS}_{SR})/2}{\mathrm{RSS}_{SR}/(n - 4)},$$
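The two F-statistics of the sequential procedure can be sketched in a few lines. The function names are hypothetical, and the degrees of freedom for the change-point test (two extra parameters, four parameters in the segmented model) are an assumption of this sketch, not a value confirmed by the text.

```python
def f_linear_vs_null(rss_null, rss_lm, n):
    """F statistic for the simple linear model against the intercept-only model.

    One extra parameter (the slope); residual degrees of freedom n - 2.
    """
    return (rss_null - rss_lm) / (rss_lm / (n - 2))


def f_segmented_vs_linear(rss_lm, rss_sr, n, extra_params=2, sr_params=4):
    """F statistic for the two-segment model against the simple linear model.

    Assumption: the segmented model has four parameters (intercept, two
    slopes, change-point), i.e. two more than the linear model.
    """
    numerator = (rss_lm - rss_sr) / extra_params
    denominator = rss_sr / (n - sr_params)
    return numerator / denominator


# Illustrative residual sums of squares for n = 97 tumors (made-up numbers).
step_one = f_linear_vs_null(rss_null=20.0, rss_lm=12.0, n=97)
step_two = f_segmented_vs_linear(rss_lm=12.0, rss_sr=10.0, n=97)
```

Only the first statistic can be referred to a standard F-distribution. When the change-point is estimated from the data, the second statistic needs a simulated, bootstrapped, or permutation-based reference distribution.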
where $n$ is the sample size and $\mathrm{RSS}_{LM}$ and $\mathrm{RSS}_{SR}$ are the residual sums of squares for the one-line and segmented regression models, respectively. However, if the change-point(s) have to be estimated from the data, the test statistic no longer follows an F-distribution under the null hypothesis, and the P-values have to be based on the bootstrapped (Hinkley, 1988) or simulated distribution of the test statistic (Julious, 2001; Lund and Reeves, 2002), or on permutations (Kim et al., 2000). Because of computational constraints, we chose to simulate the distribution of the F-statistic. Under the null hypothesis of no change-point, the regression parameters for the two sides must agree, namely, the differences in intercepts and slopes, $\alpha_L - \alpha_R$ and $\beta_L - \beta_R$, should be close to zero, with $\alpha_L = \alpha_R = \alpha_{LM}$ and $\beta_L = \beta_R = \beta_{LM}$, respectively. The null model was a linear model with parameters $\alpha_{LM} = -0.578$, $\beta_{LM} = 0.775$, and $\sigma_{\varepsilon}^2 = 0.158$, corresponding to the mean parameter values from the significant linear relationships, and with CNA values drawn as $x \sim N(\mu, \sigma^2)$, with $\mu = 0.248$ and $\sigma^2 = 0.047$ being the averages over the means and variances of CNA values in the significant linear relationships. We fitted a simple linear model for $\mathrm{RSS}_{LM}$ and a two-segment model for $\mathrm{RSS}_{SR}$ to the simulated data and extracted the F-statistics. The simulation was repeated 5,000 times. As a validation procedure, we used leave-one-out cross-validation (Burman, 1989). P-values were adjusted to control the false-discovery rate with the Benjamini and Hochberg method (1995), and adjusted P-values smaller than 0.05 were considered statistically significant.
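Two pieces of the inference step lend themselves to a short sketch: turning a simulated null distribution into a P-value and applying the Benjamini and Hochberg adjustment. The function names are hypothetical, and the add-one correction in the empirical P-value is a common convention assumed here, not something the text specifies.

```python
def empirical_p_value(observed, simulated):
    """Share of simulated null statistics at least as large as the observed one.

    The +1 in numerator and denominator (an assumed convention) keeps the
    estimate away from exactly zero.
    """
    exceed = sum(1 for s in simulated if s >= observed)
    return (exceed + 1) / (len(simulated) + 1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy inputs; in practice `simulated` would hold the 5,000 simulated F-statistics.
p = empirical_p_value(3.0, [1.0, 2.0, 3.0, 4.0])
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.05 for q in adjusted]
```

Each fragment's observed F-statistic is compared with the same simulated reference distribution, and the resulting P-values are adjusted jointly before the 0.05 threshold is applied.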
RESULTS
Of the 1,161 chromosome fragments examined, 341 showed significant associations between CNA and relative mRNA levels after adjustment for multiple testing. For 269 of the 341 significant relationships (79%), the addition of a change-point and subsequent segmented regression provided no genuine improvement over linear regression, while for the remaining 72 chromosome fragments, the two-segment regression had a significantly better fit. The mean-squared error assessed with cross-validation was significantly smaller for the two-segment regression than for a simple linear model for the 72 chromosome fragments (MSESR = 0.177, MSELM = 0.184, paired t-test = 3.72, df = 71, P-value = 0.0003). At 9 of 72 chromosome fragments, cross-validation showed lower MSE values for the linear model than for the segmented regression. Cross-validation also indicated that slope estimates for the segmented regression were robust; the average deviation from the estimated parameter was 0.0009 for the left side and 0.0008 for the right side. The mean change-point over the 72 significant segmented regression equations was log2ratio 0.332 (±1 SD: 0.19) with a median of 0.336. For 59 of 72 chromosome fragments (82%), we observed an initial increase in mRNA levels with increasing CNA. After the change-point was passed, the mRNA levels reached a plateau, and a further increase in DNA copy numbers did not induce further elevation in mRNA levels. For 13 chromosome fragments, the change-point marked the point where mRNA production accelerated, and accumulation was faster than DNA levels suggested.
For the RNASEL gene, the relationship between CNA and relative mRNA levels was best described by a two-segment regression line (P-value = 0.03, MSESR = 0.108, MSELM = 0.111). DNA copy numbers under log2ratio 0.31 (95% CI: 0.08–0.51) on the normalized CNA scale did not satisfactorily explain the observed relative mRNA levels (βL = 0.048, P = 0.8). Above this change-point, the observed relative mRNA levels increased with accumulated DNA copy numbers; a one-unit increase in CNA coincided approximately with a one-unit increase in mRNA (βR = 1.169, P-value = 0.0004; Fig. 1). For the RNASEL gene, relative mRNA levels were determined by factors other than CNA under the threshold of 0.31 on the normalized CNA scale, while above 0.31 mRNA levels were related to CNA.
Similarly, for the HDGF gene, the relationship between CNA and relative mRNA levels was best described by a two-segment regression line (P-value = 0.03, MSESR = 0.074, MSELM = 0.076). Contrary to the RNASEL gene, relative mRNA levels initially increased as a function of CNA; a one-unit increase in CNA coincided with an almost two-unit increase in mRNA (βL = 1.760, P = 0.01). Further increases in DNA copy numbers above the estimated change-point (0.235; 95% CI: 0.162–0.404) were not followed by subsequent increases in mRNA levels (βR = 0.432, P-value = 0.07). Instead, the observed relative mRNA levels remained approximately at the level observed at the change-point. Concomitant inspection of these two genes revealed additional insights. Not only did the two genes exhibit different gene expression patterns, but they genuinely differed in the amounts of mRNA in spite of similar genomic profiles (Fig. 1).
DISCUSSION
In this note, we aimed to present a framework based on regression analysis that is flexible enough to capture different patterns in CNA and mRNA relationships while also giving easy-to-use and biologically meaningful effect sizes. The accelerating growth of genetic data and the rising number of studies and methods identifying the implications that CNA have for abnormal gene expression patterns raise the need for an exploratory tool that is easy to use and interpret.
Mileyko et al. (2008) concluded that some CNA and mRNA relationships might not be linear. The segmented regression approach advocated in this note not only handles nonlinear relationships, but it also gives insight into the relationship of DNA copy number and gene expression and helps formulate a hypothesis that could lead to important discoveries. Alternative methods, such as nonparametric regression or rank correlation, can easily circumvent the problem of nonlinearity. However, both approaches have drawbacks when applied to integrative genomic data. Nonparametric regression describes the pattern of changes in mRNA levels due to CNA, but fails to identify possible change-points. Moreover, assessment of relationship strength and statistical inference is not straightforward. Rank correlation, on the other hand, offers straightforward statistical inference and a measure of association but fails at depicting the pattern of changes in mRNA levels due to CNA.
The segmented regression approach provides an easy exploratory tool for integrative analysis of DNA copy number and relative mRNA levels. Like correlation analysis, it helps identify the association between copy number and gene expression, but it takes a further step and allows detection of changing patterns. The identified change-points and the pattern of relationship between DNA copy number and gene expression can generate further hypotheses. Researchers could investigate what causes swift changes in mRNA production, what causes overproduction of mRNA, or what causes the leveling out of mRNA levels in spite of accelerated DNA accumulation in tumors. Tumors characteristically consist of a mixture of cells: cells with aberrations at particular chromosome fragments, cells with high degrees of genomic instability, and normal cells without CNA. In some cases, the change-point could simply divide the sample into subsets of tumors with high levels of normal cells without CNA and tumors with substantial CNA. As Chen et al. (2011b) demonstrated, segmented regression can simultaneously classify observations and provide statistical inference. More importantly, change-points could represent the degree of genomic instability that causes disruptions in negative-feedback mechanisms. Defects in feedback mechanisms might enhance proliferative signaling, thus inducing an ever-increasing gene expression (Hanahan and Weinberg, 2011).
The motivation for identifying potential driver genes will not abate, and the methodological aspects are bound to become more and more refined. We believe that the segmented regression approach could function as a useful tool in integrative analysis. We do not propose segmented regression as an alternative to already proven methodologies, but as an aid, perhaps an enhancement. Indeed, weeding out genes whose expression is seemingly not associated with CNA can be done with a number of methods. Conversely, our approach, like similar tools of integrative genomic analysis, will fail to identify key tumor suppressor genes or oncogenes that are partially or completely regulated by mechanisms other than CNA. Regression analysis in general and segmented regression in particular can consider subject-specific covariates; however, the practical implementation still needs to be clarified.
In conclusion, we believe that the segmented regression approach could work as a useful component in integrative analysis of genetic material and provide important insights into the biology of tumors.
We thank the anonymous reviewers for constructive criticism.