Landscape of somatic allelic imbalances and copy number alterations in human lung carcinoma



Lung cancer is the worldwide leading cause of death from cancer and has been shown to be a heterogeneous disease at the genomic level. To delineate the genomic landscape of copy number alterations, amplifications, loss-of-heterozygosity (LOH), tumor ploidy and copy-neutral allelic imbalance in lung cancer, microarray-based genomic profiles from 2,141 tumors and cell lines including adenocarcinomas (AC, n = 1,206), squamous cell carcinomas (SqCC, n = 467), large cell carcinomas (n = 37) and small cell lung carcinomas (SCLC, n = 88) were assembled from different repositories. Copy number alteration differences between lung cancer histologies were confirmed in 285 unrelated tumors analyzed by BAC array comparative genomic hybridization. Tumor ploidy patterns were validated by DNA flow cytometry analysis of 129 unrelated cases. Eighty-nine recurrent copy number alterations (55 gains, 34 losses) were identified harboring genes with gene expression putatively driven by gene dosage through integration with gene expression data for 496 cases. Thirteen and 26 of the identified regions discriminated AC/SqCC and AC/SqCC/SCLC, respectively, while 48 regions harbored recurrent (n > 15) high-level amplifications comprising established and putative oncogenes, differing in frequency and coamplification patterns between histologies. Lung cancer histologies displayed differences in patterns/frequency of copy number alterations, genomic architecture, LOH, copy-neutral allelic imbalance and tumor ploidy, with AC generally displaying fewer copy number alterations and less allelic imbalance. Moreover, a strong association was demonstrated between different types of copy number alterations and allelic imbalances with tumor aneuploidy. In summary, these analyses provide a comprehensive overview of the landscape of genomic alterations in lung cancer, highlighting differences but also similarities between subgroups of the disease.

Lung cancer constitutes a heterogeneous group of lesions with differences in clinical presentation, pathological features and biological behavior. Because of high incidence and poor survival, lung cancer is the worldwide leading cause of death from cancer with cigarette smoking as the major pathogenic factor.1 Lung cancer may be broadly divided into small cell lung cancer (SCLC), accounting for about 15% of all diagnoses, and non-small cell lung cancer (NSCLC), constituting the majority of cases and primarily including adenocarcinoma (AC), squamous cell carcinoma (SqCC) and large cell carcinoma (LCC). Microarray-based studies have demonstrated that lung cancer constitutes a biologically heterogeneous disease regarding both gene expression patterns and somatic copy number alterations (SCNAs).2–7 SCNAs have been the main focus of most lung cancer studies using genomic microarrays, such as array comparative genomic hybridization (aCGH)8, 9 or SNP microarrays,2, 3, 5, 7, 10 reported to date. Although SNP microarrays allow simultaneous detection of both SCNAs and allelic imbalances, including loss-of-heterozygosity (LOH), a fully driven integration of these measurements in a large collection of lung cancer cases has not yet been conducted.

With the aim to characterize the landscape of copy number alterations and allelic imbalances in lung cancer, we assembled data from different repositories comprising 2,141 tumors and cell lines representing the four major tumor histologies. Through bioinformatical methods for SNP and aCGH microarrays combined with quantitative DNA flow cytometry, we delineate the pattern of copy number alterations, allelic imbalances and tumor ploidy across various lung cancer subgroups. Taken together, these analyses provide a comprehensive overview of the complex landscape of genomic alterations in lung cancer, identifying subgroup specific patterns of genomic alterations as well as patterns shared between major subgroups of the disease.


AC: adenocarcinoma; aCGH: array-based CGH; CAAI: complex arm aberration indexes; CN-FGA: fraction of the genome affected by copy number alterations; CNN-AI: copy number neutral allelic imbalance; CNN-LOH: copy number neutral LOH; FGA: fraction of the genome altered; GAP: genome alteration print; GISTIC: genomic identification of significant targets in cancer; LCC: large cell carcinoma; LOH: loss of heterozygosity; NSCLC: nonsmall cell lung cancer; SCLC: small cell lung cancer; SCNA: somatic copy number alteration; SqCC: squamous cell carcinoma

Material and Methods

Tumor material

Genomic profiles of >3,400 lung tumors and cell lines analyzed by Agilent aCGH, ROMA aCGH, Illumina SNP beadchips or Affymetrix SNP arrays were combined from public repositories for 20 studies, from which 2,141 cases with sufficient data quality were extracted (Additional file 1). Sample uniqueness was assured using name/CEL file matching and correlation analysis of genomic profiles, and all cases were analyzed in an unmatched fashion without the usage of a matched normal control (Additional file 1). Of 1,970 tumors, only 10 were annotated as metastases (all SCLC, GSE21468). Patient and tumor characteristics are summarized in Table 1 on a data set level based on information from public records. Specifically, never-smoking status was inferred based on either a present never-smoking annotation or by an annotation of zero pack-years when available. Associations between smoking and copy number alterations were analyzed primarily in AC, as 95% of never-smokers were of AC histology. Gene expression profiles were available for 496 cases from public repositories21, 24 or author's websites,22 including profiles for 440 tumors from the Chitale, GSE28572, Weir (GSE12667), TCGA-SqCC and GSE19804 data sets and profiles for 56 lung cancer cell lines obtained from GSE4824.25 Gene expression data were processed as described in Additional file 1.

Table 1. Characteristics of individual data sets stratified by microarray platform

SNP and aCGH preprocessing

Affymetrix SNP array data for 1,673 lung carcinomas were obtained from public repositories14, 24, 26 or author's web sites. CEL files were normalized using CRMAv227 and ACNE28 for generation of copy number and B-allele frequency estimates as described,29 with exception of the TCGA and parts of the Weir et al.2 data set (see Additional file 1). Copy number and B-allele frequency estimates for 40 cases analyzed by Illumina SNP beadchips were obtained from E-TABM-1169.9, 26 Normalized Agilent 44K, 105K, 180K and 244K copy number data were obtained for 262 cases from public repositories24, 26 or Chitale et al.22 Normalized ROMA 85K data for 166 cases were obtained from Gene Expression Omnibus as series GSE31586.24 Genomic profiles from all array platforms except the TCGA data were partitioned using GLAD30 and centralized similarly as described.31 Probe annotations for all array platforms were updated to the hg18 genome build.32 Partitioned genomic profiles from different array platforms were merged to a common 3,000-bp probe set similarly as described.33 Data processing steps are further described in Additional file 1.

Identification of allelic imbalance, focal copy number alterations and measurements of genomic architecture

A modified version of the BAFsegmentation29, 34 software was used to partition B-allele frequency estimates from SNP arrays, which were subsequently integrated with copy number data as described in Additional file 1. Given the complexity of the 2,141-sample set, including, for instance, different microarray platforms, a modified version of Genomic Identification of Significant Targets in Cancer,35 referred to as mGISTIC hereon, was derived for the identification of focal SCNAs and recurrent high-amplitude copy number gains (referred to as amplifications hereon; see Additional file 1 for explicit definitions). The fraction of the genome altered by copy number alterations (CN–FGA) was defined as described.29 Measurements of genomic architecture in the form of whole arm-level copy number events (gain and loss) and complex arm aberration indexes (CAAI)31 were calculated as described (Additional file 1).

GAP analysis

Integrated B-allele frequency and copy number data for tumors and cell lines analyzed by SNP arrays (n = 1,169) merged to the common 3,000-bp probe set were subjected to Genome Alteration Print (GAP)36 analysis for the estimation of allele-specific copy numbers, fraction of aberrant cells and an in silico tumor ploidy (referred to as GAP-ploidy hereon) as described29 (Additional file 1). Frequencies of gains and losses from GAP profiles, relative to the GAP-ploidy, as well as the frequency of copy number neutral allelic imbalance (CNN–AI), were calculated as described.37 LOH, copy number neutral LOH (CNN–LOH) and estimates of the fraction of the genome altered by LOH (LOH–FGA), CNN–AI (CNN–FGA) and CNN–LOH (CNN–LOH–FGA) were defined as described.29

BAC aCGH validation lung cancer data set

A BAC aCGH reference data set comprising 285 primary lung tumors was created by combination of normalized data from GSE3179838 with genomic profiles from 62 patients with AC or SqCC tumors analyzed by 32K BAC aCGH microarrays39 at the SCIBLU Genomics Resource Center, Lund University, Sweden, available through Gene Expression Omnibus as series GSE29065.24 The 62 tumors were collected between 1989 and 2007 at the Skåne University Hospital, Sweden, under the approval of the Regional Ethical Review Board in Lund, Sweden (Registration no. 2004/762 and 2008/702). Written informed consent was obtained from all patients diagnosed after 2004, whereas for patients diagnosed earlier than 2004, study inclusion was approved by the Regional Ethical Review Board if patients (or their family members/survivors) had not stated otherwise when informed about the study in 2006. Labeling, hybridization, scanning and image analysis were performed as described,39 and data were normalized as described.40 The combined 285-sample set was partitioned using GLAD and centralized as previously described. SCNAs for both GSE31798 and GSE29065 were identified using sample adaptive thresholds40 (Additional file 1). In total, the 285-sample set comprised 195 AC and 90 SqCC cases.

DNA flow cytometry analysis

DNA histograms obtained from flow cytometry analysis of a cohort of 129 unrelated lung cancer patients, comprising 48 AC, 13 LCC, 63 SqCC and 5 SCLC cases, were evaluated using ModFitLT (Verity Software House, Topsham, ME) as described.29 An experimental tumor ploidy (FCM-ploidy) was determined as the largest aneuploid fraction if ≥1 aneuploid population was present, otherwise set to two (diploid).
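The FCM-ploidy rule stated above can be sketched in a few lines. This is a minimal illustration, not the ModFitLT workflow: the function name `fcm_ploidy` and the input layout (a list of `(ploidy, percent_of_cells)` pairs, one per aneuploid population) are assumptions, as is the reading of "largest aneuploid fraction" as the ploidy of the aneuploid population containing the most cells.

```python
from typing import Sequence, Tuple


def fcm_ploidy(aneuploid_populations: Sequence[Tuple[float, float]]) -> float:
    """Return an FCM-ploidy value for one tumor.

    aneuploid_populations: hypothetical input, one (ploidy, percent_of_cells)
    pair per aneuploid population found in the DNA histogram.
    """
    # No aneuploid population detected: the tumor is set to diploid (2).
    if not aneuploid_populations:
        return 2.0
    # Otherwise take the ploidy of the aneuploid population with most cells.
    ploidy, _percent = max(aneuploid_populations, key=lambda pop: pop[1])
    return ploidy
```

For example, a histogram with aneuploid populations at 3.1N (40% of cells) and 4.2N (15% of cells) would be assigned an FCM-ploidy of 3.1 under this reading.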


Significant copy number alterations in lung cancer

Overall, the 2,141 lung cancers and cell lines displayed complex DNA copy number profiles, with numerous recurrent copy number alterations observed in all chromosomes, also when stratified by NSCLC status and individual tumor histology (Fig. 1). With the aim to identify important copy number alterations in lung cancer, we performed mGISTIC analysis of the entire cohort of 2,141 cases, which identified 89 regions (55 gains and 34 losses) distributed across all autosomes (Fig. 1, Supporting Information Table S1 in Additional file 2). mGISTIC permutation analysis (see Additional file 1) using random subsets of 1,606 cases (75% of 2,141) showed that (i) overall, 62 and 80% of the 89 regions were detected in >90% or >70% of permutations, respectively, and (ii) that when stratified by gain and loss status 71% of gain regions and 94% of loss regions were detected in >70% of permutations (Supporting Information Table S1 in Additional file 1). No region was detected in <46% of permutations. Regions with the lowest permutation detection rates showed the lowest g-scores,35 consistent with the presence/amplification in a small subset of the 2,141 cases. Correlation analysis was performed for 496 cases with matched gene expression data to identify genes in mGISTIC regions with mRNA expression potentially driven by copy number change (see Additional file 1), both in the overall setting, as well as when stratified into AC, SqCC and LCC histology groups (Supporting Information Table S1 in Additional file 2).
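The subsampling logic behind the permutation analysis can be illustrated as follows. This is a sketch under stated assumptions, not the mGISTIC implementation: `detect_regions` is a hypothetical caller-supplied function standing in for a region-calling run on a subset of cases, and the default of 100 permutations is an arbitrary placeholder because the number used in the study is given only in Additional file 1.

```python
import random
from typing import Callable, Dict, Iterable, List, Sequence


def detection_rates(
    sample_ids: Sequence[int],
    detect_regions: Callable[[List[int]], Iterable[str]],
    reference_regions: Sequence[str],
    n_permutations: int = 100,   # assumed value, not from the paper
    fraction: float = 0.75,      # 75% of cases per random subset
    seed: int = 0,
) -> Dict[str, float]:
    """Fraction of random subsets in which each reference region is re-detected."""
    rng = random.Random(seed)
    subset_size = round(fraction * len(sample_ids))  # 1,606 for 2,141 cases
    counts = {region: 0 for region in reference_regions}
    for _ in range(n_permutations):
        # Draw a random subset without replacement and re-run detection on it.
        subset = rng.sample(list(sample_ids), subset_size)
        found = set(detect_regions(subset))
        for region in reference_regions:
            if region in found:
                counts[region] += 1
    return {region: n / n_permutations for region, n in counts.items()}
```

A region driven by a small number of cases will be missed whenever those cases fall outside the subset, which is consistent with the observation that regions with the lowest detection rates also had the lowest g-scores.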

Figure 1.

Pattern of copy number alterations in lung cancer. Frequency of copy number gain (above zero base line) and loss (below zero base line) across NSCLC and SCLC using log2ratio ± 0.12 as thresholds for the identification of copy number gain and loss, respectively. Probes matched to copy number variations are excluded. Black regions indicate positions of significant mGISTIC regions identified from analysis of the entire 2141 sample cohort. (a) 1885 NSCLC cases; (b) 1206 AC cases; (c) 467 SqCC cases; (d) 37 LCC cases; (e) 88 SCLC cases.
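The per-probe frequencies plotted in Figure 1 follow from thresholding each sample's log2 ratios at ± 0.12. The sketch below shows that calculation on a plain list-of-lists layout; the function name, the data layout and the use of strict inequalities at the thresholds are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

GAIN_THRESHOLD = 0.12    # log2 ratio above this is called a gain
LOSS_THRESHOLD = -0.12   # log2 ratio below this is called a loss


def alteration_frequency(
    profiles: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Per-probe gain and loss frequencies across samples.

    profiles: one sequence of log2 ratios per sample, all on the same probe set.
    """
    n_samples = len(profiles)
    n_probes = len(profiles[0])
    gains = [0] * n_probes
    losses = [0] * n_probes
    for sample in profiles:
        for i, value in enumerate(sample):
            if value > GAIN_THRESHOLD:
                gains[i] += 1
            elif value < LOSS_THRESHOLD:
                losses[i] += 1
    gain_freq = [g / n_samples for g in gains]
    loss_freq = [l / n_samples for l in losses]
    return gain_freq, loss_freq
```

Probes matched to copy number variations would be dropped before this step, as stated in the caption.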

Differences in the gross amount of copy number alterations (fraction of the genome altered by copy number gain and loss, CN–FGA) were observed for (i) tumors stratified by histology with AC showing lower fractions (p = 4 × 10^−40, ANOVA), (ii) AC tumors stratified by smoking history with never-smokers showing slightly lower fractions (p = 0.04, Student's t-test) and (iii) between NSCLC and AC tumors stratified by stage with stage I tumors showing lower fractions (p = 0.002 and p = 0.01, respectively, ANOVA), but not for SqCC tumors stratified by stage (Supporting Information Fig. S1a in Additional file 3). Moreover, CN–FGA estimates in the 285-sample validation set corroborated the overall trend of fewer copy number alterations in AC compared to SqCC tumors (p = 0.07, Student's t-test). Finally, comparison of the frequency of copy number gain and loss for mGISTIC regions across individual AC and SqCC data sets showed a high degree of consistency between data sets within the same histology, despite differences in, for example, tumor size, stage, sex and patient ethnicity and the fact that the mGISTIC regions were not defined on a histology-specific basis (Supporting Information Figs. S1d and S1e in Additional file 3).

Genomic architecture in lung cancer

The genomic architecture in the histological subgroups of lung cancer was explored using estimates of whole-arm copy number alterations and CAAI-estimates, where the latter aim to highlight complex architectural distortions characterized by physically tight clusters of break points with large amplitude changes.31 Stratification of samples into high and low-level complexity groups based on CAAI-estimates (using the definition by Russnes et al.,31 Additional file 1) revealed significant overall differences between lung cancer histologies (p = 6 × 10^−18, Chi-square test, Supporting Information Fig. S2a in Additional file 4). Specifically, AC displayed fewer high-complexity cases compared to the other histological groups, corroborated by similar results in the 285-sample validation data set (AC vs. SqCC, p = 0.01, Fisher's exact test). Stratification of AC and SqCC cases by stage revealed no significant differences between stages, although a trend was observed toward fewer high-complexity cases in the stage I groups (data not shown).

The frequency and pattern of whole arm copy number gain and loss events also differed between lung cancer histology groups (Supporting Information Figs. S2b–S2d in Additional file 4). Overall, SCLC cases generally displayed more whole arm-level events compared to the other histology groups, while AC displayed the fewest (Supporting Information Figs. S2b and S2c in Additional file 4). Across histologies, the relatively most frequent arm-level alterations were gains of chromosomes 5p and 20 (excluding AC) and loss of 3p, 13q (excluding SCLC) and 17p (Supporting Information Fig. S2d in Additional file 4). Several arm-level alterations were relatively common between combinations of histology groups, for example, losses of 4q and 5q (SCLC and SqCC), loss of 10p (SCLC, LCC) and gain of 7p (AC, LCC). Finally, certain arm-level alterations appeared more frequent in specific histology groups, for example, gains of chromosomes 18 and 19 (SCLC), loss of 22q and 10q (SCLC) and gain of 22q (SqCC). Although the definition of whole arm-level events is relatively strict (>98% probe alteration frequency and low variability of copy number estimates required, see Additional file 1), the identified alterations account for a notable part of the overall frequency of copy number gain and loss for several chromosomes in SCLC, but also for specific chromosomes in other histological subgroups. The latter is illustrated by, for example, gain of 20q in LCC and loss of 10p in LCC and SqCC (Supporting Information Fig. S2d in Additional file 4 compared to Fig. 1). Taken together, these analyses indicate that the histological subgroups display differences in complexity of genomic alterations and that the architectural type of rearrangements affecting certain chromosomes differs, similar to the findings in breast cancer.31
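The strictness of the whole arm-level definition can be made concrete with a small sketch. Only the >98% probe alteration requirement comes from the text; the variability cutoff `max_sd`, the use of the population standard deviation as the variability measure and the reuse of the ± 0.12 log2 ratio thresholds from Figure 1 are assumptions, since the explicit definition is given in Additional file 1.

```python
from statistics import pstdev
from typing import Sequence


def call_arm_event(
    log2_ratios: Sequence[float],
    gain_thr: float = 0.12,       # assumed: same threshold as in Fig. 1
    loss_thr: float = -0.12,      # assumed: same threshold as in Fig. 1
    min_fraction: float = 0.98,   # >98% of probes on the arm must be altered
    max_sd: float = 0.1,          # hypothetical variability cutoff
) -> str:
    """Classify one chromosome arm in one sample as 'gain', 'loss' or 'none'."""
    n = len(log2_ratios)
    if n == 0:
        return "none"
    # Require low variability so that the arm behaves as a single segment.
    if pstdev(log2_ratios) > max_sd:
        return "none"
    frac_gain = sum(v > gain_thr for v in log2_ratios) / n
    frac_loss = sum(v < loss_thr for v in log2_ratios) / n
    if frac_gain > min_fraction:
        return "gain"
    if frac_loss > min_fraction:
        return "loss"
    return "none"
```

Under such a rule, an arm with a focal amplification superimposed on a low-level gain would typically fail the variability check and would not be counted as a whole-arm event.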

Copy number alterations stratifying lung cancer histology groups

Exploratory analysis identified 13 mGISTIC regions, involving six gains at chromosome 3q and seven losses distributed across 3p, 4p, 4q and 5q that discriminated AC from SqCC, all with higher alteration frequency in SqCC (Bonferroni adjusted Fisher's exact test p < 0.01 and frequency difference >25%; Supporting Information Fig. S3a, Additional file 5). In addition, 26 mGISTIC regions including 12 gains distributed across chromosomes 1p, 2p, 3q, 18q, 19q and 22q and 14 losses distributed across chromosomes 2q, 3p, 4p, 4q, 5q, 9p, 10q, 13q and 22q were found to discriminate between AC, SqCC and SCLC (Supporting Information Fig. S3b, Additional file 5). Notably, all 13 regions discriminating AC from SqCC were confirmed in the independent aCGH validation set (p < 0.05, Fisher's exact test; Supporting Information Fig. S3c, Additional file 5).
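The selection rule used for the AC versus SqCC comparison (Bonferroni adjusted Fisher's exact test p < 0.01 and a frequency difference >25%) can be sketched as below. The two-sided Fisher's exact test is written out with the standard library so that the example is self-contained; the function names, the input layout and any counts passed to it are illustrative assumptions, not data from the study.

```python
from math import comb
from typing import Dict, List, Tuple


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables (with fixed margins) that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    total = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p_x = prob(x)
        if p_x <= p_obs * (1 + 1e-7):  # small tolerance for float ties
            total += p_x
    return min(1.0, total)


def discriminating_regions(
    regions: Dict[str, Tuple[int, int, int, int]],
    n_tests: int,
    alpha: float = 0.01,
    min_diff: float = 0.25,
) -> List[str]:
    """regions maps a name to (altered_AC, total_AC, altered_SqCC, total_SqCC)."""
    hits = []
    for name, (x1, n1, x2, n2) in regions.items():
        p = fisher_exact_two_sided(x1, n1 - x1, x2, n2 - x2)
        p_adj = min(1.0, p * n_tests)       # Bonferroni adjustment
        diff = abs(x1 / n1 - x2 / n2)       # difference in alteration frequency
        if p_adj < alpha and diff > min_diff:
            hits.append(name)
    return hits
```

Requiring both criteria means that a region with a highly significant but small frequency difference, which is easily obtained with more than a thousand cases, is not reported as discriminating.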

Pattern of recurrent amplification in lung cancer

Genes residing in regions subjected to genomic amplification represent obvious oncogene candidates, and, in lung cancer, certain amplifications have been reported to be highly selective for specific histological subgroups and harbor cancer-causing genes.2, 3, 13, 22 To characterize the pattern of recurrent amplifications in lung cancer, we focused on the 55 mGISTIC regions of gain. Recurrent amplification (n > 15) involving 874 unique samples (41%) was identified in 48 of 55 regions, with frequencies between 0.75 and 7.47% in the total cohort (Supporting Information Table S2 in Additional file 6). Integration of gene expression data with recurrent amplifications for 496 cases identified several genes with putative amplification driven expression when analyzed on both a data set level and grouped by histology (Supporting Information Table S2 in Additional file 6). Overall, recurrent amplifications were more frequent in (i) SqCC and SCLC compared to AC (p = 5 × 10^−31, Chi-square test), (ii) in higher stage NSCLC cases (stage II or III) compared to stage I cases (p = 0.0008, Chi-square test), (iii) in higher stage tumors of AC and SqCC histology compared to stage I tumors of these subtypes (p = 0.008 and 0.06, respectively, Fisher's exact test), (iv) in SqCC stage I compared to AC stage I tumors (p = 4 × 10^−13, Fisher's exact test) and (v) in NSCLC current or former smokers compared to NSCLC never-smokers (p = 0.03, Chi-square test). However, the higher frequency of recurrent amplification among NSCLC smokers appears dependent on SqCC cases present in this group, because in AC, no significant difference in overall amplification frequency was observed after stratification for smoking history (p > 0.05, Chi-square test).

A number of amplifications differed in frequency for different clinicopathological variables, including tumor histology (n = 30) and patient smoking history in AC cases (n = 4; Fig. 2 and Supporting Information Table S2 in Additional file 6). For instance, stratification by tumor histology revealed that amplification at 3q24–q29, 8p12 (FGFR1/WHSC1L1) and 11q13.2–q13.3 (CCND1) were more frequent in SqCC, whereas amplification at 14q13.3 (including NKX2-1) was the only amplification more prevalent in AC (Fig. 2a). Moreover, amplifications at 1p34.2, 2p24.3 and 8q24.21 were clearly more prevalent in SCLC cases, all harboring MYC family genes (MYCL1, MYCN and MYC, respectively; Fig. 2a). Stratification of the AC cohort based on smoking history revealed that amplifications at 7p15.3, 7p11.2 (EGFR) and 12q15 (MDM2) were more common in never-smokers compared to smokers, while amplification of 11q13.2–q13.3 (CCND1) was more common in smokers (Fig. 2b). In addition, well-known lung cancer oncogenes such as PIK3CA (3q26.32) and TP63 (3q28), not included in any mGISTIC region, were amplified in notable fractions (6.6% and 4.6%, respectively) in primarily SqCC (22.5% and 16.1%, respectively), whereas, for instance, ALK (2p23.2–p23.1) amplifications were uncommon across all investigated groups (0.14% total rate). In comparison, of the considerable number of genes recently reported as amplified oncogenes in lung cancer based on analyses of smaller studies, including AKT1, CDK5, SHC1, BCL11B, MAFB, FUS, TRAF6 and BRF2,9, 38, 41, 42 only SHC1 and BRF2 were found amplified in >15 cases in the current study.

Figure 2.

Pattern of recurrent amplifications in lung cancer. (a) Recurrent amplifications in mGISTIC regions with significantly different frequency between AC (black), SqCC (gray) and SCLC (white; Fisher's exact test, p < 0.05). (b) Recurrent amplifications with significantly different frequency between AC never-smokers (black) and smokers (current and former, white; Fisher's exact test, p < 0.05).

In addition, we also investigated the pattern of co-occurrence of identified recurrent amplifications in NSCLC and SCLC cases (Supporting Information Table S3 in Additional file 7). As expected, the highest co-occurrence was generally observed for amplifications located on the same chromosomal arm. Moreover, several coamplifications observed in NSCLC are explained by the strong association between specific amplifications and tumor histology. For instance, amplifications at the 3q26–q29 region are predominantly observed in SqCC cases, while the co-occurrence of 14q13.3 (NKX2-1) with other amplifications is directed by the high frequency of this amplification in AC (Supporting Information Tables S2 and S3 in Additional files 6 and 7).

Pattern of allelic imbalance in lung cancer

The pattern of allelic imbalance in lung cancer was delineated through GAP-analysis of 1,169 cases (1,046 tumors) analyzed by SNP arrays. For NSCLC cases (n = 1,119), LOH was most frequent (>40%) in regions of copy number loss, for example, 3p, 5q, 8p, 9p, 13q, 17p and 19p (Figs. 1a and 3a). Stratification of LOH into copy-neutral LOH (CNN–LOH) revealed an overall low prevalence in NSCLC (generally ≤10%), with the highest frequency on chromosome 17p (Fig. 3b). Similarly, the frequency of copy number neutral allelic imbalance (CNN–AI) was also relatively even and lower across chromosomes in NSCLC (generally ≤25%), with the highest prevalence on chromosomes 2, 6p, 9qter, 11p, 12q and 17 (Fig. 3c).

Figure 3.

Frequency of LOH and copy number neutral allelic imbalance in subgroups of lung cancer. (a) Genome-wide frequency of LOH for 1119 NSCLC cases analyzed by GAP. (b) Genome-wide frequency of copy number neutral LOH (CNN-LOH) for NSCLC cases. (c) Genome-wide frequency of copy number neutral allelic imbalance (CNN-AI) for NSCLC cases. (d) Fraction of the genome affected by LOH for 1,046 lung tumors stratified by different clinicopathological variables. (e) Fraction of the genome affected by CNN-LOH for 1,046 lung tumors stratified by different clinicopathological variables. (f) Fraction of the genome affected by CNN-AI for 1,046 lung tumors stratified by different clinicopathological variables. Top-axis indicates number of samples in each group in subpanels. p-values were calculated using Student's t-test or ANOVA for indicated groups.

Stratification of the 1,169 cases by lung cancer histology confirmed the expected overall association of LOH with copy number loss, with certain genomic regions showing very high LOH frequency (>60%) in individual histology groups, for example, 3p, 5q, 9p, 13q and 17p in SqCC, 3p, 9p, 13q, 17p and 19p in LCC and 3p, 4q, 5q, 13q and 17p in SCLC (Supporting Information Fig. S4 in Additional file 8). Across histology groups, a generally low frequency of copy-neutral LOH (10–20%) was observed for most chromosomes, with chromosome 17p consistently being among the most affected. Differences between histology groups included, for example, higher variability in SqCC compared to AC and markedly elevated frequencies of copy-neutral LOH on 5q, 13q and 17q in SCLC (Supporting Information Fig. S4 in Additional file 8). For copy-neutral allelic imbalance (CNN–AI), AC and SqCC resembled NSCLC in that CNN–AI was more evenly distributed across chromosomes and less frequent, while the pattern varied more in LCC and SCLC (Supporting Information Fig. S4 in Additional file 8). Stratification of the 1,046 lung tumors analyzed by GAP into 11 clinicopathological subgroups revealed differences in the gross amount of allelic imbalances between histology groups (Figs. 3d–3f). Specifically, AC tumors generally displayed less LOH compared to other histology groups, while LCC and SCLC tumors displayed more copy-neutral allelic imbalance compared to AC and SqCC tumors. In contrast, no differences in copy-neutral LOH were observed for any tested group, and no significant differences in allelic imbalance estimates were observed for AC or SqCC when stratified by stage.

DNA ploidy and fraction of aberrant tumor cells in lung cancer as estimated by GAP and flow cytometry

To investigate the pattern of tumor ploidy across subgroups of lung cancer, we first analyzed GAP-ploidy estimates for the 1,169 cases profiled by SNP arrays. Stratification of cases by histology revealed differences in the distribution of GAP-ploidy estimates between subgroups (Fig. 4a). For instance, AC and SqCC cases showed (i) the highest proportions of GAP-ploidy values close to 2N (most often diploid), (ii) a second more prominent peak around 3N (triploid) and (iii) an overall similar distribution of ploidy estimates (Fig. 4a). In contrast, LCC and SCLC cases showed a more mixed distribution, with, for example, SCLC displaying the most prominent peak around 3N (Fig. 4a). To validate the GAP-ploidy pattern across tumor histologies, we analyzed flow cytometry-derived DNA histograms from 129 external lung cancer cases, finding a similar overall tumor ploidy pattern for AC, SqCC, SCLC and, to some extent, LCC (Fig. 4b). Neither AC nor SqCC showed differences in GAP-ploidy estimates when stratified for stage (ANOVA; p > 0.05).

Figure 4.

Tumor ploidy and percentage of aberrant tumor cells in lung cancer as determined by GAP and flow cytometry. (a) Distribution of GAP-ploidy for 1,169 cases (tumors and cell lines) grouped by lung cancer histology. (b) Distribution of FCM-ploidy across 129 external lung tumors grouped by tumor histology. (c) Variation of fraction of the genome altered (FGA) estimates versus GAP-ploidy for 1,119 NSCLC cases for copy number (black), copy number neutral allelic imbalance (CNN-AI; red), LOH (blue) and copy number neutral LOH (CNN-LOH; light blue). GAP-ploidy estimates were binned in bins of size 0.1 represented by tick marks on the x-axis. Bin limits are indicated by brackets and parentheses; for example, (1.96, 2.06] corresponds to GAP-ploidy >1.96 but ≤2.06. For each bin, the median FGA value of included samples is plotted (points) for copy number, LOH, CNN-LOH and CNN-AI. Bins contain different numbers of samples as indicated by the top axis. CN-FGA is based on gain and loss calls from GAP-analysis. (d) Distribution of percentage of aberrant cells estimated by GAP-analysis for 1,046 tumors analyzed by GAP grouped by NSCLC status and individual tumor histology. A significant difference in aberrant cell estimates is observed between the individual tumor histologies (p = 1 × 10^−11, ANOVA), but not between AC, SqCC and LCC cases (p = 0.46, ANOVA). (e) Size of the aneuploid population (% of all counted cells) differs across tumor histology groups for 129 external lung tumors analyzed by DNA flow cytometry (p = 0.002, ANOVA). In (a) and (b), curves were generated by application of an Epanechnikov smoothing kernel with 0.1 smoothing bandwidth to the data.
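The smoothed ploidy curves in panels (a) and (b) can be approximated by a kernel density estimate with the Epanechnikov kernel, K(u) = 0.75 (1 − u²) for |u| ≤ 1 and 0 otherwise. The sketch below is a minimal illustration; the function name is hypothetical, and treating the stated 0.1 smoothing bandwidth as the kernel half-width h is an assumption, since plotting software may scale the bandwidth differently.

```python
from typing import Sequence


def epanechnikov_kde(data: Sequence[float], x: float, bandwidth: float = 0.1) -> float:
    """Kernel density estimate at x: (1 / (n * h)) * sum K((x - x_i) / h)."""
    total = 0.0
    for xi in data:
        u = (x - xi) / bandwidth
        if abs(u) <= 1.0:
            # Epanechnikov kernel: parabolic weight, zero outside [-1, 1].
            total += 0.75 * (1.0 - u * u)
    return total / (len(data) * bandwidth)
```

Evaluating this function on a grid of ploidy values (for example from 1.0 to 6.0 in steps of 0.01) for each histology group gives one curve per group. With a half-width of 0.1, peaks near 2N and 3N remain clearly separated.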

In relation to tumor ploidy, NSCLC cases with lower GAP-ploidy (estimated to be diploid or near diploid) showed lower fractions of LOH, copy number alterations and copy-neutral allelic imbalances compared to cases with higher GAP-ploidy (Fig. 4c). Similar results were found for AC and SqCC cases, while corresponding patterns in LCC and SCLC were more variable potentially due to the lower number of investigated cases (Supporting Information Fig. S4 in Additional file 8). In contrast, fractions of copy-neutral LOH did not increase with increasing GAP-ploidy in NSCLC or individual histologies.
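The relation between FGA estimates and GAP-ploidy in Fig. 4c rests on grouping cases into ploidy bins of width 0.1 and taking the median FGA per bin. A minimal sketch of that grouping follows, assuming half-open bins that exclude the lower limit and include the upper limit; the function name and the explicit `edges` argument are illustrative choices.

```python
from bisect import bisect_left
from statistics import median
from typing import Dict, List, Sequence, Tuple


def median_fga_by_ploidy_bin(
    ploidy: Sequence[float],
    fga: Sequence[float],
    edges: Sequence[float],
) -> Dict[Tuple[float, float], float]:
    """Median FGA per ploidy bin (lower, upper], with sorted bin edges."""
    groups: Dict[Tuple[float, float], List[float]] = {}
    for p, f in zip(ploidy, fga):
        # bisect_left gives i with edges[i - 1] < p <= edges[i].
        i = bisect_left(edges, p)
        if i == 0 or i == len(edges):
            continue  # value lies outside all bins
        groups.setdefault((edges[i - 1], edges[i]), []).append(f)
    return {bin_: median(values) for bin_, values in sorted(groups.items())}
```

The same function can be applied separately to CN-FGA, LOH-FGA, CNN-FGA and CNN-LOH-FGA values to obtain one series per alteration type. Bins with few samples give unstable medians, which may contribute to the more variable patterns noted for LCC and SCLC.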

Besides differences in tumor ploidy, lung cancer histologies also displayed differences in aberrant cell estimates derived from GAP analysis. Notably, SCLC tumors showed significantly higher aberrant cell estimates compared to tumors in other groups (p = 1 × 10^−11, ANOVA; Fig. 4d), corroborated by flow cytometry data for the five analyzed SCLC tumors. In these tumors, the size of the aneuploid fraction, in the form of percentage of counted cells, was significantly larger compared to the other histology groups (p = 0.002, ANOVA; Fig. 4e).


Lung cancer is a heterogeneous disease at the molecular level due to a spectrum of genetic alterations, including mutations, translocations and copy number alterations, that have an impact on disease development and progression. Investigation of focal copy number alterations in lung cancer has led to the identification of several cancer-causing genes with potential therapeutic implications.2, 3, 22 Consequently, further characterization of lung cancer at the gene level may have implications for improved future clinical management of the disease as well as for increased understanding of the biology behind the major histological groups of lung cancer.

With the aim to characterize the genomic landscape of somatic copy number alterations, amplifications and allelic imbalances in lung cancer, we analyzed 2,141 tumors and cell lines representing cases from all four major tumor histologies. mGISTIC analysis of the total cohort pinpointed 89 genomic regions (55 gains and 34 losses) as relevant SCNAs in lung cancer (Supporting Information Table S1 in Additional file 2). Moreover, 48 of the 55 gain regions harbored recurrent amplifications with differences in prevalence and coamplification patterns across histological subgroups (Fig. 2 and Supporting Information Table S3 in Additional file 7). Notably, the identified amplifications included several known or putative oncogenes in lung cancer, for example, FGFR1, WHSC1L1, SOX2, NKX2-1, NKX2-8, PAX9, CDK4, MYCN, MYCL1, KRAS, MDM2, MYC, MET, CCND1, CCNE1, EGFR, ERBB2, ID1, BCL2L1, CRKL and TERT, of which several showed a potential copy number driven expression and a lineage-specific appearance (Supporting Information Table S2 in Additional file 6). The mGISTIC loss regions included besides loci harboring established tumor suppressor genes, for example, PTEN, RB1, CDKN2A/CDKN2B, STK11 and CHD5, also regions involving genes with a large genomic footprint, for example, FHIT, LRP1B, CSMD1, PTPRD and WWOX, suggested to represent fragile sites in the genome.43

Overall, the pattern of copy number alterations for NSCLC and individual lung cancer histology groups observed in the current study is well in line with the reports in the literature (see, e.g., Ref.44 and references therein). In line with the current study, certain chromosomal imbalances, for example, gain of regions on chromosome 1q, 5p, 7p and 8q and losses on 3p and 9p, have been reported at fairly high rates in AC.2, 7, 22 However, in the mGISTIC regions found to significantly differ between AC and SqCC and AC/SqCC/SCLC, respectively, no region was more frequently altered in AC. In the literature, a wide set of genomic alterations stratifying AC from SqCC have been reported, including imbalances at 3q26-qter, 20p13, 12p, 6, 3p, 4q, 5p, 5q, 7q and 13q,17, 45–47 although these findings are based on analyses of smaller data sets. The 13 regions identified and subsequently validated in the current study highlight genomic alterations on 3q26–q29 (gains/amplifications), 3p (losses), 4p15.31 (loss), 4q (losses) and 5q14.3–q15 (loss) as significantly altered between AC and SqCC. However, for the identified regions discriminating between AC/SqCC and AC/SqCC/SCLC, it should be noted that (i) the majority of regions were identified due to lower alteration frequency in AC, (ii) several regions show similar prevalence in SqCC and SCLC and (iii) that notable fractions of AC cases (20–30%) do in fact display copy number alterations in these regions indicating heterogeneity within the group (Supporting Information Fig. S3 in Additional file 5).

Taken together, comparison of different features of genomic alteration, including alteration frequency, amplification patterns, fraction of the genome affected by copy number alterations and genomic architecture, across lung cancer histology groups revealed (i) a generally lower frequency of copy number alterations in AC compared to other histologies, (ii) a strong association of certain alterations, for example, amplifications, with individual histologies, (iii) differences in the architectural types and complexity of rearrangements and (iv) several features/copy number alterations that are markedly shared between certain or all histology groups (summarized in Table 2). Examples of the latter include the frequently altered mGISTIC regions at chromosomes 1q, 5p and 8q observed in all histological subtypes and the highly frequent alterations at 3q26–q29 in SqCC (80–84%), SCLC (60–67%) and LCC (60–68%). Notably, mGISTIC regions on 5p include a region at 5p15.33 harboring the TERT gene, which has been reported as a major susceptibility locus in NSCLC.48 However, this region also displayed copy number gain in 75% of SCLC cases, including frequent amplification, pointing to the 5p15.33 region as relevant to the development of all types of lung cancer (Supporting Information Tables S1 and S2 in Additional files 2 and 6).

Table 2. Differences and similarities in genomic alterations and allelic imbalances between lung cancer histological subgroups

Similar to the case of copy number alterations, patterns of different forms of allelic imbalance, tumor ploidy and aberrant cell fractions also varied between major subgroups of lung cancer. In NSCLC as well as individual lung cancer histology groups, the pattern of LOH was, as expected, strongly associated with copy number loss and moreover reached high frequencies (>70%) in SqCC, LCC and SCLC (Figs. 3a and S4 in Additional file 8). In contrast, copy-neutral allelic imbalances, including copy-neutral LOH, were overall more evenly distributed across chromosomes in lung cancer and occurred at lower frequencies, although differences in variability and targeted regions were observed between histologies (Supporting Information Fig. S4 in Additional file 8). These findings suggest that copy-neutral allelic imbalances are, overall, less frequent additive events in the majority of lung cancers. Furthermore, lung cancer histology groups also displayed differences in tumor ploidy patterns and aberrant cell estimates, demonstrated by both GAP and DNA flow cytometry analysis (Fig. 4). Interestingly, the higher aberrant cell estimates seen for GAP-analyzed SCLC tumors are well in line with the reported dense growth pattern for these tumors.49 Moreover, the often-triploid (3N) appearance of SCLC cases and, to some extent, also LCC in the current study provides an apparent explanation for the higher occurrence of copy-neutral allelic imbalance in these groups, as the centering step of genomic profiles in the data analysis automatically infers a high degree of this type of allelic imbalance for cases with an uneven dominant copy number. These observations, in combination with findings of an overall higher genomic complexity for aneuploid lung cancer, demonstrated by higher fractions of allelic imbalances and copy number alterations in NSCLC, AC and SqCC cases with increasing GAP-ploidy (Fig. 4c and Supporting Information Fig. S4 in Additional file 8), emphasize the overall connection between different types of genomic alterations and tumor ploidy.
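The centering argument above can be illustrated with a minimal sketch. It is not the GAP method or the pipeline used in this study; it is a simplified model that assumes diploid normal cells, a single heterozygous SNP per segment, and profiles centered so that the dominant tumor copy number has a log ratio of zero. The function names and the parameter `aberrant_fraction` are hypothetical and introduced only for illustration.

```python
import math


def expected_signals(total_cn, minor_cn, dominant_cn, aberrant_fraction=1.0):
    """Expected centered log2 ratio and B-allele fraction for a heterozygous SNP.

    Assumptions (illustrative only): normal cells are diploid with one copy
    of each allele, and the profile is centered on the dominant copy number.
    """
    p = aberrant_fraction
    # Average number of copies per cell in the tumor/normal mixture
    segment_copies = p * total_cn + (1 - p) * 2
    baseline_copies = p * dominant_cn + (1 - p) * 2
    log_ratio = math.log2(segment_copies / baseline_copies)
    # Fraction of copies carrying the less abundant allele
    baf = (p * minor_cn + (1 - p) * 1) / segment_copies
    return log_ratio, baf


def is_copy_neutral_imbalance(total_cn, minor_cn, dominant_cn,
                              aberrant_fraction=1.0, tol=1e-9):
    """True if the segment looks copy-neutral after centering but is allelically imbalanced."""
    log_ratio, baf = expected_signals(total_cn, minor_cn, dominant_cn,
                                      aberrant_fraction)
    return abs(log_ratio) < tol and abs(baf - 0.5) > tol


# Triploid tumor, segment at the dominant copy number 3 with alleles 2+1
triploid = is_copy_neutral_imbalance(total_cn=3, minor_cn=1, dominant_cn=3)
# Diploid tumor, segment at the dominant copy number 2 with alleles 1+1
diploid = is_copy_neutral_imbalance(total_cn=2, minor_cn=1, dominant_cn=2)
```

Under these assumptions, a segment at the dominant copy number of a triploid tumor cannot be split evenly between the two alleles, so it is centered to a log ratio of zero yet still shows allelic imbalance, whereas a balanced 1+1 segment in a diploid tumor does not.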

In summary, the current study provides an important overview of genomic alterations occurring in the major subgroups of lung cancer, confirming several findings from smaller studies but also extending the knowledge about patterns of genomic alterations within and between major subgroups of the disease. Intriguingly, the pattern of different aspects of genomic alteration across lung cancer histologies revealed in this study suggests that genome instability may affect AC to a lesser extent than other histology groups (Table 2). In contrast, mutation-driven tumorigenesis may be of greater importance in AC, as illustrated by the frequent EGFR, STK11, TP53 and KRAS mutations observed in this subtype.50 It remains to be investigated whether the heterogeneity of genomic alterations observed within lung cancer histology groups in general, and AC in particular, may be explained by the existence of molecular subtypes with clinical relevance. In contrast to several reports of molecular subgroups of AC based on gene expression patterns,4, 6 only a few whole-genome profiling studies have to date attempted to derive molecular subtypes in AC based on genomic alterations.9, 22 However, the overlap and validity of these subtypes remain to be confirmed.


The authors thank David Lindgren, Department of Molecular Tumor Biology, Lund University, Sweden for critical reading of the manuscript.