Applying gene expression microarrays to pulmonary disease



    1. Section of Computational Biomedicine, Department of Medicine and Pulmonary Center, Boston University Medical Center
    2. Bioinformatics Program, Boston University, Boston, Massachusetts, USA

  • The Authors: Joshua D. Campbell, BA is a PhD candidate in the bioinformatics programme at Boston University and is interested in applying novel computational methods to gene expression data for the study of human disease. Marc E. Lenburg, PhD is an Associate Professor of Medicine in the section of Computational Biomedicine at Boston University School of Medicine. His research interests include using genomic data to gain insights into pulmonary disease and to develop biomarkers for the clinical management of these diseases. Avrum Spira, MD is an Associate Professor of Medicine, Pathology and Bioinformatics and is chief of the Division of Computational Biomedicine in the Department of Medicine at Boston University. His research interests focus on exploring the complex interaction of genes and the environment in human lung disease in order to develop genomic biomarkers that can directly impact management of these conditions.

  • Conflict of Interest Statement: M.E.L. and A.S. are shareholders in and have acted as paid consultants to Allegro Diagnostics Inc.


Marc E. Lenburg, Section of Computational Biomedicine, Boston University Department of Medicine, 72 East Concord Street, Boston, MA 02118, USA. Email:


Gene expression microarrays are high throughput technologies that can simultaneously measure the expression levels of most known genes in the human genome within a biological sample. The study of gene expression has improved our understanding of the biological complexities of the cell and can impact the field of medicine by providing new insights into disease. Examining gene expression in samples from patients with pulmonary disease can elucidate molecular mechanisms responsible for disease pathogenesis or uncover novel molecular subtypes within a disease. Gene expression signatures of disease pathogenesis can further be used to suggest novel therapeutic compounds. Biomarkers can be developed from gene expression data that can aid clinicians in diagnosing disease or can guide clinicians in tailoring therapeutic strategies to individual patients. To demonstrate the applications of gene expression microarray technology, we review several studies in pulmonary disease that utilize gene expression profiling techniques to gain biological insights into disease or to develop clinically relevant biomarkers for disease management.


High throughput genomic technologies have transformed our understanding of the complexities of human biology by providing a means by which we can gain considerable amounts of information about a sample at relatively little cost in time and money. One such technology, the DNA microarray, is a versatile tool that can be used to explore multiple aspects of the cell including gene and microRNA expression, single nucleotide polymorphisms, DNA methylation and DNA copy number variation.1–3 In this review, we focus mainly on the gene expression microarray, which has the ability to measure the level of expression for thousands of genes in one sample. This approach to studying gene expression has been used extensively in molecular biology to understand processes of development, examine cellular responses to a variety of stimuli, identify transcriptional regulatory modules and explore a host of other biological questions. In the field of medicine, studying gene expression patterns in samples from patients can add to our knowledge of disease by revealing underlying causes of disease pathogenesis, molecular differences between histologically similar diseases, completely novel subclasses within a disease and mechanisms of drug resistance. Gene expression signatures of disease pathogenesis can further be used to suggest novel therapeutic compounds. In addition, new tools for disease diagnosis can be developed from gene expression patterns. More specifically, biomarkers can be developed that identify disease when pathological diagnosis is difficult or inconclusive, predict disease outcome in order to identify which patients are likely to require more aggressive treatment strategies and predict which therapeutic strategies will be most efficacious.

This review briefly covers scientific concepts and milestones that led to the development of gene expression microarray technologies followed by a short introduction to how different types of microarray data analysis strategies can be used to answer various clinical and biological questions. The focus of this review will be to highlight several studies in the field of pulmonary medicine that utilize gene expression profiling techniques. Many pulmonary diseases are heterogeneous and characterized by complex environmental and genetic interactions. Gene expression profiling offers the unprecedented ability to gain biological insights into pulmonary diseases, develop clinically relevant biomarkers for the clinical management of these diseases and aid in the development of novel therapeutics for treatment.



The foundations of microarray technology were laid by breakthroughs in molecular biology and computer science. The ability to perform rapid DNA sequencing led to an explosion of genomic information and opened a new window into understanding human disease. One of the most notable landmarks was the advent of the Human Genome Project, which sought to derive a complete DNA sequence for the human genome.4 The project was initiated in 1990 as a joint effort between the U.S. Department of Energy and the National Institutes of Health and lasted 13 years. During the height of the project, centres from multiple locations around the world were producing raw sequence data at a rate of about 1000 nucleotides per second, 24 h a day, 7 days a week.5 The completion of this project revealed immense challenges in understanding how incredible biological complexity can be achieved from the human genetic blueprint and how abnormalities in this process can lead to human disease. Our ability to make progress towards this fundamental challenge has been facilitated by increased computational capacity to analyse and store the large volume of genomic data that have been generated. Key computational advances have included increases in data storage capacity, the advent of advanced database software that allows for efficient information storage and retrieval, and increases in computing power that allow for the implementation of novel algorithms and statistical methods designed to transform raw data into biological understanding.

Gene expression profiling

One of the most important pieces of biological information that emerged from the Human Genome Project was the identification and characterization of the 20 000–25 000 protein-coding genes in the human genome. With this catalogue of sequences for all protein-coding genes, it became possible to develop assays to measure the expression level of any gene in the genome. Methods that measure gene expression levels generally take advantage of DNA hybridization, the principle that a strand of DNA or RNA will bind to a complementary strand of DNA via Watson-Crick base-pairing. Northern blotting was one of the first approaches developed based on this idea, and is a technology that can measure the expression of a single gene in a small number of samples at a time. In this process, RNA from a biological sample is isolated, separated by size using gel electrophoresis and fixed to a membrane surface. A probe with a complementary DNA sequence that incorporates a detection reagent (such as a fluorescent dye or radioactive tracer) is then hybridized to the membrane and will bind to the RNA transcript of the gene of interest. The level of hybridized probe will then correspond to the amount of the RNA transcript in the sample. Gene expression microarrays use a reversed paradigm of this same principle by fixing the probe to a solid surface and labelling all RNA transcripts from a biological sample. Advances in rapid DNA synthesis technology allowed thousands to millions of probes to be printed side by side on a single chip with each probe interrogating a single gene. Several other methods such as Serial Analysis of Gene Expression (SAGE) were also developed to globally measure gene expression levels. However, the low cost and high throughput of microarrays has made them one of the most popular platforms for performing this task. 
The gene expression microarray is simply the amalgamation of these three concepts: a catalogue of sequences for all human genes, the principle of hybridization and a collection of probes on a single platform. However, this union offers something more than the sum of the individual parts. It presents the ability to simultaneously examine the expression of every gene in the human genome within a small amount of sample, such as that which can be obtained from a routine clinical biopsy sample, thus giving us a more complete snapshot of biology than what we have been able to previously ascertain.

Processing gene expression microarray data

Gene expression microarrays can produce millions of data points for each sample and require a series of data analysis steps to take the raw hybridization intensities for each probe on the array and ultimately arrive at conclusions regarding which genes may be involved in the disease of interest. While many procedures for analysing microarray data exist and may be specific to the type of microarray platform used, a key step in each procedure is data normalization and summarization. The goal of this step is to generate numerical gene expression estimates from the raw probe-level hybridization data while attempting to minimize biases such as technical variation between arrays. Commonly, fluorescent signals corresponding to the amount of material hybridized to each probe on the array are captured using an optical imaging system that produces an image of the entire array. The intensity per pixel is converted to an intensity per probe using feature-extraction algorithms. Common approaches for normalizing these probe-level hybridization intensities involve linear scaling, locally weighted scatterplot smoothing and quantile normalization. For array technologies that include multiple probes per transcript, these per-probe hybridization intensities are then summarized to a per-transcript expression estimate, often taking into account the degree of hybridization to negative-control probes and estimates of the different hybridization efficiencies of each probe. Various types of microarrays and the corresponding methods of normalization and summarization have been reviewed previously.6,7 Quality control is another important procedure that must be tackled before data analysis. Several problems can occur during the processes of sample preparation, hybridization and scanning, or defects can exist in the microarray chip itself, any of which will result in data that do not accurately reflect gene expression levels. 
Affected arrays need to be identified and excluded so that they do not adversely affect downstream data analyses. Although specifics about quality control are beyond the scope of this review, see Reimers (2010) for a set of general guidelines.8
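As an illustration of one of the normalization approaches named above, the following is a minimal sketch of quantile normalization in plain Python. It assumes each array is a list of probe intensities of equal length with no tied values; the function name `quantile_normalize` and the toy intensities are hypothetical and are not taken from any particular microarray software package.

```python
def quantile_normalize(arrays):
    """Force every array to share the same empirical intensity distribution.

    arrays: list of lists, one list of probe intensities per array,
            all of the same length (assumption: no tied values).
    """
    n_arrays = len(arrays)
    n_probes = len(arrays[0])

    # Reference distribution: the mean intensity at each rank across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [
        sum(s[rank] for s in sorted_arrays) / n_arrays
        for rank in range(n_probes)
    ]

    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            # Each probe receives the reference value for its rank.
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities for three probes on two arrays.
raw = [[2.0, 4.0, 6.0], [30.0, 10.0, 20.0]]
result = quantile_normalize(raw)
```

After normalization both arrays contain the same set of values, and only the rank order of probes within each array is preserved. This is why the method removes array-wide technical shifts in intensity but also assumes that the overall distribution of expression is comparable between samples.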

Methods of gene expression microarray analysis

In data analysis, the principal goal is to identify differences in the array measurements that are due to biologically driven differences in gene expression. The method used to accomplish this usually depends on the types of clinical questions that are being examined. Techniques for identifying gene expression profiles related to disease generally fall into two broad categories: supervised and unsupervised methods (Fig. 1). Supervised methods rely on prior knowledge about the biological condition from which the clinical samples originated. An example of a supervised method is one that takes into account which samples in the experiment were from diseased versus non-diseased tissue. Statistical tests can use this information to identify genes that are significantly differentially expressed between tissues from patients with and without pulmonary disease or find gene expression patterns that correlate with continuous clinical features such as survival time. Almost all of the statistical approaches for analysing continuous variables can be applied in this type of data analysis, and are based on comparing the magnitude of the gene expression difference associated with the phenotype to the magnitude of the variability in gene expression (e.g. the variability in gene expression between biologically similar samples). These methods generally give a P-value for each gene representing the probability of observing at least the measured amount of phenotype-associated differential expression for that gene under the null hypothesis that gene expression does not truly vary with that phenotype. All genes with a P-value below a specified threshold (discussed further below) are considered significant and thus differentially expressed. In example A of Figure 1, the expression levels of a particular gene are significantly lower in the cases (blue boxplot) than in the controls (orange boxplot) by a Welch's t-test (P = 0.002).
As gene expression data generally represent a snapshot of the complex processes occurring within a biological sample, genes found to be differentially expressed may have mechanistic roles in the disease, may correspond to downstream consequences of disease pathogenesis or may reflect host response to the disease. Further experiments are often needed to distinguish between these possible causes of differences in gene expression.
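The comparison of a between-group difference with within-group variability can be made concrete with Welch's t-test, the test used in example A of Figure 1. The sketch below computes the Welch t statistic and its approximate degrees of freedom for a single gene using only the Python standard library. The expression values are invented for illustration and the function name `welch_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(cases, controls):
    """Return (t, df) for Welch's unequal-variance t-test on one gene."""
    n1, n2 = len(cases), len(controls)
    v1, v2 = variance(cases), variance(controls)  # sample variances (n - 1)

    # Squared standard error of the difference in group means.
    se2 = v1 / n1 + v2 / n2

    # Signal (difference in means) relative to noise (standard error).
    t = (mean(cases) - mean(controls)) / sqrt(se2)

    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Hypothetical log-scale expression values for one gene.
cases = [5.1, 4.8, 5.0, 4.7]
controls = [6.2, 6.0, 6.5, 6.1]
t_stat, dof = welch_t(cases, controls)
```

A negative t statistic here indicates lower expression in cases than in controls. Converting the statistic to a P-value requires the t distribution with `dof` degrees of freedom, which is not part of the standard library; in practice a statistics package would normally be used (for example, SciPy's `scipy.stats.ttest_ind` with `equal_var=False`), applied once per gene.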

Figure 1.

Overview of analysis strategies for gene expression microarray data. (A) This gene is differentially expressed between cases (blue boxplot) and controls (orange boxplot) using a Welch's t-test and may have a potential role in the molecular pathogenesis of the disease. (B) The expression profiles of gene 1 (x-axis) and gene 2 (y-axis) are able to separate cases (blue dots) from controls (orange dots) using a predictive rule (dotted line) identified in training data and can be used to classify new samples (squares) in a test set. In this case four out of five samples are classified correctly in the test set resulting in an accuracy of 80%. These two genes could potentially be used as a diagnostic biomarker in a clinical setting. (C) Hierarchical clustering is used to group a set of 10 samples using 10 gene expression profiles. The data are displayed as a heatmap where the columns correspond to samples and the rows correspond to genes. Red represents higher relative expression and blue represents lower relative expression. These data potentially reveal molecular subtypes within the disease.

One difficulty in this type of analysis is the so-called ‘multiple comparison’ or ‘multiple hypothesis testing’ problem. When statistical inferences are made for thousands of genes in one analysis, a sizable number are expected to pass a particular P-value threshold (e.g. P < 0.05) just by chance even when there is no true differential expression with respect to the clinical phenotype. For example, in a microarray dataset that measures the expression of 20 000 genes, we would expect approximately 1000 genes (20 000 × 0.05) to have a P-value < 0.05 by chance alone. Methods such as the False Discovery Rate correction adjust the P-value in order to estimate the fraction of genes passing a certain P-value threshold that are likely to be false positives and should usually be applied before statistical inferences are made.9 Other supervised methods fall under the realm of class prediction. These methods use a set of case and control samples (i.e. a training set) to identify and combine the expression levels of multiple genes into a single composite measure that can predict the disease status of samples where disease status is unknown. In many instances, only gene expression profiles that are informative (e.g. are differentially expressed) are selected to generate the model, a process referred to as ‘feature selection’. Methods that have been applied to microarray data for combining these features into a composite predictor include weighted voting, k-nearest neighbours, random forests, artificial neural networks and support vector machines.10–14 In example B of Figure 1, two gene expression profiles are used to build a prediction rule (dotted line) that differentiates a set of 20 controls (orange dots) from a set of 20 cases (blue dots). When trying to classify an independent cohort, any new samples that fall above the dotted line are predicted to be controls while any new samples that fall below the dotted line are predicted to be cases. 
In a test set of five samples (squares), the two cases that fall below the dotted line and the two controls that fall above the dotted line are correctly classified. However, one sample is misclassified because it is predicted by the model to be a case when in reality it is a control. Therefore, this two-gene biomarker has an accuracy of 80% as four out of five samples in a test set were classified correctly. The clinical utility of these approaches lies in their ability to potentially diagnose otherwise occult disease or disease properties.
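Returning to the multiple hypothesis testing problem described earlier in this section, the sketch below shows the expected number of chance findings for 20 000 genes and one widely used False Discovery Rate method, the Benjamini-Hochberg procedure. Whether this is the exact method in the cited reference is an assumption; the P-values are invented for illustration and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Gene indices ordered from smallest to largest raw P-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values
    # never increase as the raw P-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Expected number of genes passing P < 0.05 by chance alone.
n_genes = 20000
alpha = 0.05
expected_by_chance = n_genes * alpha  # about 1000 genes

# Hypothetical raw P-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
```

Genes are then called significant when their adjusted value falls below the chosen false discovery rate (for example 0.05), which limits the expected fraction of false positives among the reported genes, not the chance of any single false positive.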

An important consideration in supervised gene expression studies with a ‘case versus control’ design is which types of samples are chosen for the control group. The biological interpretation of a list of genes found to be differentially expressed will depend on which type of control samples were used to represent a ‘normal’ baseline. For example, genes differentially expressed between the lung tissue of patients with disease and the lung tissue of completely healthy individuals may reflect general biological processes of being sick (e.g. a common inflammatory response) rather than the specific biological processes responsible for the pathogenesis of the disease. Including controls from patients with other diseases may be helpful for finding genes associated with processes specific to that particular disease. In addition, the cases and controls should be isolated and processed in the same way and at the same time in order to minimize gene expression differences arising from non-biological sources. Another consideration pertinent to gene expression studies in whole lung tissue is the fact that the lung is composed of many different types of cells. A gene may appear to be differentially expressed because the proportion of the cell type expressing that gene differs between cases and controls. Further in vitro studies may be needed to confirm that the gene is differentially expressed within the relevant cell type.

Unsupervised methods seek to use the gene expression data alone to identify molecular subtypes of disease, which may or may not be associated with known clinical phenotypes but might also represent subclasses that will respond differently to molecularly targeted therapy. Unsupervised clustering techniques attempt to identify gene expression profiles that can separate samples into distinct groups without using any prior knowledge of the sample origin. Hierarchical clustering, k-means clustering, self-organizing maps and consensus clustering are examples of commonly used clustering methods.15,16 An example of how distinct groups can be observed in gene expression data can be seen via a heatmap in example C of Figure 1. A heatmap is a common way of displaying gene expression data where each column represents a sample and each row represents a gene. Each square is the relative expression level of a gene within a patient. In this example, red represents higher relative expression and blue represents lower relative expression. Hierarchical clustering was applied to a set of measurements for 10 genes to group patients with disease into distinct clusters. These clusters could represent molecular subclasses of disease. Knowledge of these molecular subclasses can aid in understanding mechanisms of pathogenesis by uncovering how different patterns of biological dysregulation can lead to a common disease state and can also be important for developing targeted therapeutic strategies when these patterns of dysregulation can be ascribed to different biological pathways.
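To show how samples can be grouped without any knowledge of their origin, the following is a minimal sketch of agglomerative hierarchical clustering with average linkage and Euclidean distance. These distance and linkage choices are assumptions made for illustration (correlation-based distances are also common), the expression profiles are invented, and the function names are hypothetical. It requires Python 3.8 or later for `math.dist`.

```python
from math import dist  # Euclidean distance between two points


def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the samples of two clusters."""
    total = sum(dist(profiles[a], profiles[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def hierarchical_clusters(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    profiles: list of samples, each a list of gene expression values.
    Returns a list of clusters, each a list of sample indices.
    """
    # Start with every sample in its own cluster.
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], profiles)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i and drop cluster j.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical profiles: four samples measured on three genes.
profiles = [
    [1.0, 1.1, 0.9],
    [1.2, 1.0, 1.0],
    [5.0, 5.2, 4.9],
    [5.1, 4.8, 5.0],
]
groups = hierarchical_clusters(profiles, n_clusters=2)
```

Recording the order of merges rather than stopping at a fixed number of clusters yields the dendrogram that is usually drawn alongside a heatmap. No sample labels are used at any point, which is what distinguishes this approach from the supervised methods above.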

While a few of the most straightforward methods for analysing gene expression data have been summarized here, the list of computational approaches is much longer and continually growing based on the need to develop sensitive methods for discovering potentially subtle differences in gene expression that distinguish clinically important states. In a similar way, the list of possible clinical applications for gene expression microarrays is considerable. In the next section we focus on four of these applications and provide examples of each focusing on studies in pulmonary diseases (Fig. 2).

Figure 2.

Overview of clinical applications for gene expression microarrays.


Disease pathogenesis

Elucidating molecular mechanisms of disease pathogenesis

One of the primary uses of microarrays has been to illuminate the underlying molecular mechanisms responsible for disease pathogenesis (Fig. 2). Whole-genome expression profiling can reveal genes that are dysregulated between samples from patients with and without lung disease, which can in turn be analysed to gain insights into the biology of a disease such as the deregulation of signalling pathways or an abnormal immune response. As microarrays measure the expression of the majority of known genes, no specific hypotheses about which genes are involved in disease pathogenesis need to be in place before a microarray experiment can take place. Instead, gene expression microarrays allow researchers to cast a much wider net in discovering gene-disease associations compared with a candidate ‘gene-by-gene’ approach that is guided by prior knowledge to determine which genes to test for differential expression. Whole-genome approaches are especially useful when prior information about the molecular pathogenesis of disease is sparse or when examining a few selected genes may not encompass the biological complexity of heterogeneous diseases.

While gene expression microarrays provide an unbiased and comprehensive method for associating genes with a disease, important follow-up experiments are often needed to show that a candidate gene from the microarray analysis has a mechanistically important role in the disease. Many studies take the approach of identifying a specific, differentially expressed gene with an interesting biological function and perturbing the expression of that gene in cell lines or animal models relevant to the disease. While the selected gene's biological function is usually consistent with prior knowledge of disease pathogenesis, the researcher may not have thought to test that particular gene without having first found the gene to be differentially expressed in the microarray profiling experiment. If the cell line or animal model exhibits characteristics of the disease after perturbation, the hypothesis of that gene's involvement in pathogenesis is further supported. Gaining insights into the mechanisms of disease pathogenesis will hopefully provide better targets for therapeutic development. While many high-quality gene expression studies that elucidate molecular mechanisms in pulmonary diseases exist, we chose to highlight a few examples from a variety of different diseases that identified aberrantly expressed genes using microarray technology and pursued these findings with experiments in cell lines or animal models.

A recent study by Kicic et al.17 profiled primary airway epithelial cell cultures from children with asthma (n = 36), healthy atopic control subjects (n = 23) and healthy non-atopic control subjects (n = 53) in order to identify cellular processes that are affected by asthma. They found that fibronectin, along with other genes involved in repair and tissue remodelling, were downregulated in samples from asthmatic children while genes involved in apoptosis and metabolism tended to be upregulated. Protein levels of fibronectin were also found to be significantly reduced in culture supernatants and cell lysates from asthmatic children. Following these findings, fibronectin was knocked down in cells cultured from healthy non-atopic patients. These cells exhibited reduced wound repair abilities. When fibronectin was added back into these cultures, wound repair was restored, which supports the notion that fibronectin is necessary for epithelial restoration under normal conditions.

Several studies have examined gene expression in patients with COPD.18–22 In order to find genes that play a role in COPD pathogenesis, Ning et al.23 performed gene expression profiling by microarrays and SAGE on pools of RNA from smokers with COPD. Expression profiles for 261 genes by microarray and 327 genes by SAGE were found to be differentially expressed between smokers without COPD (GOLD-0) and those with moderate COPD (GOLD-2). These gene expression profiles were enriched in genes with functions related to adhesion and cytoskeleton, metabolism, ECM production, cell cycle and oxidative stress. The levels of expression of EGR1, CTGF, CYR61 and TGFB1 were validated by qRT-PCR. EGR1, a transcription factor involved in a variety of biological processes including response to tissue injury,24 and TGFB1, a growth factor involved in tissue remodelling and repair, had been previously implicated in emphysema,25,26 while CTGF and CYR61, which have roles in angiogenesis, were novel associations. These genes were also found to be significantly induced in primary lung fibroblasts from emphysema patients compared with normal donors. Further investigation showed that the protein levels of the EGR1 gene increased in fibroblasts after exposure to cigarette smoke extract and that EGR1 was important for matrix metalloproteinase (MMP) activity in mice. MMPs are a class of proteins hypothesized to be involved in COPD pathogenesis because of their capability to cause ECM degradation.27 The abnormal expression of EGR1 in COPD and the ability of EGR1 to regulate MMP activity provide new insights into mechanisms of COPD pathogenesis and may provide new therapeutic strategies in the future.

Zuo et al.28 profiled five patients with IPF and compared them with three samples of normal tissue adjacent to cancer and one sample of pooled RNA from five normal lungs. Several proteases were observed to increase in expression in lungs with IPF. Surprisingly, MMP7 was the most induced gene in IPF lungs even though no prior association between MMP7 and pulmonary fibrosis had been made. Immunohistochemistry localized MMP7 to alveolar and bronchiolar epithelial cells and showed that the protein levels of MMP7 were significantly increased in fibrotic lungs. To determine whether MMP7 was necessary for fibrosis, bleomycin, a compound previously used to induce alveolar injury and stimulate fibrosis, was administered to MMP-7−/− and wild-type mice in two different strains. The amount of fibrosis was approximated in the lungs of mice by measuring hydroxyproline, an estimate of total collagen levels. Hydroxyproline increased significantly less in MMP-7−/− compared with wild-type when treated with bleomycin. Histological examination of the mouse lungs also revealed less bleomycin-induced fibrosis in MMP-7−/− mice further supporting the role of this MMP in IPF.


Developing diagnostic biomarkers

In addition to providing new insights into disease pathogenesis, gene expression can be used to develop clinically relevant tools to aid in decision-making. This application is more focused on the predictive power of gene expression than on the biological properties of genes differentially expressed in the disease. The goal is to develop biomarkers that can be used by clinicians in everyday settings to diagnose disease, establish a patient's prognosis and prescribe treatment. Many of the biomarkers that have been developed from microarray-based profiling combine the expression levels of multiple genes into a single score, where it is the score that is considered to be the biomarker rather than the expression levels of any individual gene.

An example of this approach is in the diagnosis of lung cancer. Current and former smokers with suspicion of lung cancer routinely undergo flexible bronchoscopy as an initial, relatively non-invasive diagnostic test. However, in cases where bronchoscopy is non-diagnostic, physicians are faced with the difficult decision of ordering further and more invasive diagnostic tests or following a less aggressive approach of radiographical monitoring. Based on the concept that cigarette smoke creates a ‘field of injury’ throughout the respiratory tract29–32 and that differences in an individual's response to smoking in the airway can be used to detect the presence of lung cancer, our group profiled gene expression in the cytologically normal airway epithelium of 129 current and former smokers undergoing bronchoscopy for suspicion of lung cancer.33 Patients were followed until a diagnosis of lung cancer or an alternative benign diagnosis was made. To develop a biomarker for the presence of lung cancer, 60% of patients (n = 77) were randomly assigned to a training set while the remaining 40% of patients (n = 52) were assigned to a test set. Using an algorithm in which votes are derived based on the expression of a panel of genes within a sample and combined to produce a single prediction for the presence of cancer, an 80-gene biomarker was developed on the training set that could distinguish between smokers with and without lung cancer in the test set with 83% accuracy (80% sensitivity, 84% specificity). The biomarker also performed with an 80% accuracy in a prospective validation set (n = 35) collected independently at a separate institution. 
Importantly, the 80-gene biomarker was independent of other clinical risk factors for establishing the likelihood that a patient has lung cancer.34 Traditional cytopathology of cells obtained at bronchoscopy diagnosed cancer in only 53% of patients with lung cancer and yielded a definitive alternate diagnosis of a non-cancer pathology in 7% of patients without lung cancer; thus highlighting the potential benefit of companion diagnostics like the airway gene expression biomarker.
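The idea of combining per-gene votes into a single prediction, and of summarizing performance as accuracy, sensitivity and specificity, can be sketched as follows. This uses one common formulation of weighted voting (signal-to-noise weights with a midpoint decision boundary per gene); whether the study used exactly this formulation is an assumption. All expression values are invented, and the two-gene model stands in for a much larger gene panel.

```python
from statistics import mean, stdev


def train_weighted_voting(train, labels):
    """Return one (weight, boundary) pair per gene.

    train:  list of samples, each a list of expression values.
    labels: 1 for cancer, 0 for no cancer.
    """
    model = []
    for g in range(len(train[0])):
        pos = [s[g] for s, y in zip(train, labels) if y == 1]
        neg = [s[g] for s, y in zip(train, labels) if y == 0]
        # Signal-to-noise weight: class separation over within-class spread.
        weight = (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))
        # Decision boundary halfway between the two class means.
        boundary = (mean(pos) + mean(neg)) / 2
        model.append((weight, boundary))
    return model


def predict(model, sample):
    """Sum the per-gene votes; a positive total predicts cancer (1)."""
    total = sum(w * (x - b) for (w, b), x in zip(model, sample))
    return 1 if total > 0 else 0


def performance(predicted, actual):
    """Return (accuracy, sensitivity, specificity)."""
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    tn = sum(p == 0 and a == 0 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    return (tp + tn) / len(actual), tp / (tp + fn), tn / (tn + fp)


# Hypothetical training set: two genes, three cancers, three non-cancers.
train = [[8.0, 2.0], [7.5, 2.5], [8.5, 1.5],
         [4.0, 6.0], [4.5, 5.5], [3.5, 6.5]]
train_labels = [1, 1, 1, 0, 0, 0]
model = train_weighted_voting(train, train_labels)

# Hypothetical independent test set with known diagnoses.
test = [[7.0, 3.0], [5.0, 5.0], [7.2, 3.5], [4.2, 5.0], [6.5, 3.8]]
truth = [1, 0, 1, 0, 0]
predictions = [predict(model, s) for s in test]
accuracy, sensitivity, specificity = performance(predictions, truth)
```

In this toy test set one non-cancer sample is called cancer, so accuracy is four out of five while sensitivity and specificity differ. Reporting all three matters clinically because a missed cancer and an unnecessary invasive follow-up carry very different costs.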

Pulmonary arterial hypertension (PAH) encompasses a wide range of diseases that result in similar clinical phenotypes. Bull et al.35 profiled gene expression in peripheral blood mononuclear cells (PBMC) to explore alternative means of defining and diagnosing severe PAH. Generating diagnostic gene expression biomarkers in PBMC rather than in lung tissue specimens is attractive because of the relative ease of collection and higher availability of PBMC samples. A cohort of seven patients diagnosed with idiopathic PAH (IPAH), eight patients with PAH related to a secondary cause (sPAH) and six normal volunteers was collected. A Leave-One-Out-Cross-Validation (LOOCV) procedure was used to test the feasibility of a biomarker for PAH. In this procedure, one sample is ‘set aside’ and a biomarker is created by identifying genes that are differentially expressed between the cases and controls in the remaining samples. The biomarker is then used to predict the class of the sample that was left out. This procedure is repeated for each sample in the cohort. In this study, each sample was correctly predicted as PAH or normal control using class prediction algorithms such as linear discriminant analysis or support vector machines. Two genes, ECGF1 and ADM, were validated to be differentially expressed by q-PCR in a prospective cohort of IPAH patients (n = 14) and healthy volunteers (n = 6). The authors also found that 28 genes were differentially expressed between IPAH and sPAH patients (P < 0.01). Although this is fewer genes than is expected by chance (as the array used in these studies contained approximately 5000 genes, about 50 would be expected to pass this threshold by chance), one of these genes, HVEM, was shown to be differentially expressed in the prospective cohort by q-PCR, suggesting the possibility of a biomarker for the different types of PAH.
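The leave-one-out procedure described above can be sketched in a few lines. The example below repeats gene selection inside every fold, so the held-out sample never influences which genes are chosen. The nearest-centroid classifier and the mean-difference gene ranking are simple stand-ins chosen for brevity and are not the linear discriminant analysis or support vector machines used in the study; all data and function names are hypothetical.

```python
from statistics import mean


def select_genes(samples, labels, n_top):
    """Rank genes by absolute difference in class means; keep the top n."""
    def score(g):
        cases = mean(s[g] for s, y in zip(samples, labels) if y == 1)
        controls = mean(s[g] for s, y in zip(samples, labels) if y == 0)
        return abs(cases - controls)
    return sorted(range(len(samples[0])), key=score, reverse=True)[:n_top]


def nearest_centroid(train, labels, genes, sample):
    """Predict the class whose mean profile is closest to the sample."""
    distances = {}
    for cls in set(labels):
        members = [s for s, y in zip(train, labels) if y == cls]
        centroid = [mean(m[g] for m in members) for g in genes]
        distances[cls] = sum(
            (sample[g] - c) ** 2 for g, c in zip(genes, centroid)
        )
    return min(distances, key=distances.get)


def loocv_accuracy(samples, labels, n_top=2):
    """Leave each sample out once, rebuild the biomarker, predict it."""
    correct = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        # Feature selection uses only the remaining samples.
        genes = select_genes(train, train_labels, n_top)
        predicted = nearest_centroid(train, train_labels, genes, samples[i])
        if predicted == labels[i]:
            correct += 1
    return correct / len(samples)


# Hypothetical cohort: three genes, three cases (1), three controls (0).
samples = [[9.0, 1.0, 5.0], [8.5, 1.5, 5.2], [9.2, 0.8, 4.9],
           [2.0, 7.0, 5.1], [2.5, 6.5, 5.0], [1.8, 7.2, 4.8]]
labels = [1, 1, 1, 0, 0, 0]
cv_accuracy = loocv_accuracy(samples, labels)
```

Selecting genes on the full cohort before cross-validation would leak information from the held-out sample and inflate the estimated accuracy, which is why the selection step sits inside the loop. Even so, cross-validated accuracy in a small cohort is an estimate of feasibility and does not replace validation in an independent cohort.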

Developing prognostic biomarkers

Several studies have used gene expression profiling to define biomarkers prognostic of survival in lung cancer patients or to predict recurrence in order to improve the selection of patients who are most likely to benefit from aggressive adjuvant chemotherapy.36,37 Several potential challenges and complications exist in moving these biomarkers from the experimental bench into the clinic as a useful prognostic tool. Gene signatures that are predictive of outcome in one study tend to show little overlap with other signatures from other studies. This may reflect the idea that many genes could potentially serve as biomarkers, and each biomarker requires only a subset of these genes. In this scenario, we expect that the biomarkers developed in different cohorts will make similar predictions across datasets, but this has not always been observed to be the case. Differences in sample collection methods, processing protocols and microarray platforms are all potentially confounding factors between different studies. Also most biomarkers were developed and tested on samples collected at single institutions. One of the largest studies attempting to overcome these obstacles was recently performed by Shedden et al.38 A total of 442 lung adenocarcinomas, along with relevant clinical and outcome information, were collected across four sites using standardized sample collection and processing procedures. Samples from two sites were used as a training set while samples from the other two sites served as a blinded external validation set. They examined the ability of several proposed analytical methods to build prognostic biomarkers from gene expression data and predict outcome in the independent cohorts. Several methods were able to make significantly better predictions than by chance. The best method divided the entire dataset into 100 clusters of genes with similar expression patterns and chose a representative gene from each cluster to be a part of the biomarker. 
This result suggests that, at least for this biomarker algorithm, simply choosing genes whose expression patterns are correlated with survival may overfit to the training set and thus will not give the most accurate predictions of outcome in a test set. An important observation made by this study is that most of the prognostic classifiers performed better when trained on a set of samples containing all stages than when trained only on samples diagnosed as stage I. This result suggests that gene expression patterns in tumours with more advanced stages are informative for predicting outcome in stage I tumours, where large amounts of heterogeneity may exist. Similar to what was seen with the lung cancer diagnosis biomarker, predictive models built using both clinical variables and gene expression signatures performed better than the models using only one or the other, suggesting that gene expression profiles contain independent information about risk of recurrence and that gene expression in combination with clinical risk factors may produce the most powerful predictors.
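The idea of grouping genes with similar expression patterns and carrying one representative gene per group into the biomarker can be illustrated with a small sketch. Everything here is an assumption made for illustration: the greedy correlation-threshold grouping is a stand-in and is not the clustering method used by Shedden et al., the threshold of 0.8 is arbitrary, the function names are hypothetical, and the toy expression matrix (rows are genes, columns are samples) is invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes neither is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cluster_genes(expression, threshold=0.8):
    """Greedy grouping: a gene joins the first cluster whose founding gene it correlates with."""
    clusters = []
    for gene, profile in enumerate(expression):
        for members in clusters:
            if pearson(profile, expression[members[0]]) >= threshold:
                members.append(gene)
                break
        else:
            clusters.append([gene])  # no existing cluster fits, so start a new one
    return clusters


def representative(expression, members):
    """Pick the gene with the highest mean correlation to the other members of its cluster."""
    if len(members) == 1:
        return members[0]

    def mean_corr(g):
        others = [m for m in members if m != g]
        return sum(pearson(expression[g], expression[m]) for m in others) / len(others)

    return max(members, key=mean_corr)


# Toy matrix: five genes measured in eight samples, built from three underlying patterns.
pattern_a = [1, 2, 3, 4, 5, 6, 7, 8]
pattern_b = [1, -1, 1, -1, 1, -1, 1, -1]
pattern_c = [1, 1, -1, -1, 1, 1, -1, -1]
toy_expression = [
    pattern_a,
    [2 * v + 1 for v in pattern_a],
    pattern_b,
    [3 * v for v in pattern_b],
    pattern_c,
]

toy_clusters = cluster_genes(toy_expression)
biomarker_genes = [representative(toy_expression, c) for c in toy_clusters]
```

Because each cluster contributes a single gene, the resulting gene list is chosen without reference to outcome, which is one plausible reason such a list may be less prone to overfitting than genes selected for their correlation with survival in the training set.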

Studies investigating how to properly build accurate microarray-based biomarkers for everyday clinical use are ongoing. Simple mistakes in study design, data preprocessing or statistical analysis in any microarray experiment can result in erroneous findings and put patients at additional risk. Thus, accurate and reproducible computational methods as well as rigorous and independent validation need to be applied to any proposed biomarkers before they are made available in a clinical setting. The MicroArray Quality Control (MAQC) Project is a consortium funded by the US Food and Drug Administration to address various issues in biomarker development and microarray analysis. The first phase of the project (MAQC-I) developed guidelines for microarray data analysis by establishing quality control metrics and thresholds for objectively assessing the performance of various microarray platforms.39 The goal of the second phase (MAQC-II) was to reach a consensus on the best procedures for the development and validation of predictive models based on microarray gene expression data.40 An important outcome of this work was the finding that, across a range of phenotypes, the clinical phenotype is a much more important predictor of biomarker performance than the biomarker algorithm employed, suggesting that some clinical phenotypes may be better suited to detection by gene expression profiling than others.

Subtypes of disease

Identifying molecular differences between diseases

Some diseases of the lung can present similar symptoms or histopathological features making them difficult to distinguish from one another, potentially resulting in delayed or inappropriate treatment. Gene expression can be used to find molecular differences between known pathological subclasses of disease. For example, interstitial lung disease (ILD) refers to a wide group of diseases affecting the lung interstitium. IPF is an ILD of unknown aetiology characterized by progressive scarring of the lungs and poor prognosis. In contrast, hypersensitivity pneumonitis (HP) is an ILD caused by an allergic response with a prolonged exposure to inhaled organic substances and can usually be reversed by reducing contact with the antigen. Similar to IPF, fibrosis and destruction of the lung parenchyma can develop in chronic HP. In order to gain insights into how these ILD may differ on a molecular level, Selman et al.41 profiled lung tissue from 15 patients diagnosed with IPF and 12 diagnosed with HP. Gene expression profiles were able to differentiate samples from IPF and HP lung in both LOOCV and unsupervised hierarchical clustering analyses. Genes with upregulated expression in IPF lungs included those involved in ECM structure and turnover, cell motility and muscle contraction, suggesting significant induction of tissue remodelling processes. In contrast, genes upregulated in HP lungs included those involved in a variety of host defence and inflammatory functions such as T-cell activation, supporting the notion of an increased immune response in HP compared with IPF. Non-specific interstitial pneumonia (NSIP) is another ILD characterized by varying degrees of inflammation and fibrosis that can be difficult to pathologically differentiate from either HP or IPF. This study also profiled eight patients who were classified as NSIP based primarily on histology of interstitial pneumonia that did not meet the criteria for other idiopathic interstitial pneumonias. 
Based on the differential gene expression signatures for HP and IPF, two NSIP lung tissue samples were classified as IPF and one was classified as HP, supporting the idea that molecular diagnostics may be useful in cases where pathologic diagnosis is difficult. The remaining five NSIP samples could not be classified with certainty into either group, suggesting that they represent true idiopathic NSIP. Further characterizing the distinct biological features in the gene expression profiles of the NSIP patients compared with either HP or IPF may provide new molecular markers that can aid in diagnosis.

Identifying novel molecular subtypes of disease

In addition to identifying differences in known classes of disease, gene expression can uncover novel subtypes within a disease. Bhattacharjee et al.42 used microarray technology to both characterize gene expression differences between known histological subtypes of lung cancer and to find potentially novel subtypes of adenocarcinoma. They profiled lung adenocarcinomas (n = 127), squamous cell lung carcinomas (n = 21), pulmonary carcinoids (n = 20), small cell lung cancer (n = 6) cases, normal lung (n = 17) specimens and other adenocarcinomas (n = 12) suspected to be extrapulmonary metastases based on clinical history. Hierarchical clustering separated tumours by histological subtype and identified molecular markers for each subtype except for adenocarcinoma, which did not appear to have a distinct set of marker genes. In addition to characterizing the gene expression profiles of known histological subclasses, they also sought to define new molecular subclasses of lung cancer. Hierarchical clustering of only primary adenocarcinomas and normal samples revealed four distinct subclasses. One cluster was defined by neuroendocrine markers that were also expressed in small cell lung cancer and pulmonary carcinoids, but also contained serine proteases not expressed in other neuroendocrine lung tumours. The tumours in this cluster had the least favourable outcome (median survival = 21 months) compared with all other tumours (median survival = 40.5 months). In contrast, a cluster of tumours defined by expression of cathepsin and mucin genes had the most favourable outcome (median survival = 49.7 months) compared with all other tumours (median survival = 33.2 months). These molecular differences suggest that adenocarcinomas can be further stratified into subclasses with varying degrees of aggressiveness. This information could be incorporated into clinical decisions about treatment options.


Tailored therapeutics

The ability to improve treatment is an important goal for many gene expression studies in human disease. The knowledge gained by using the previously discussed applications of gene expression microarrays could be used to improve treatment of disease in a number of ways. For example, finding genes involved in the molecular pathogenesis of disease presents new targets for drug development, generating biomarkers for diagnosis or prognosis can aid the clinician in deciding how aggressive the treatment strategies should be, and discovering novel subtypes of disease could elucidate why certain patients seem resistant to therapeutics or have different survival times. However, other approaches exist for identifying which therapeutic strategies should be assigned to which patients, or for identifying novel therapeutic uses for existing drugs, and these will be highlighted here.

Tumorigenesis is a complicated process involving the accumulation of mutations and the dysregulation of signalling pathways that control cellular growth and differentiation. As a result, different pathways may be dysregulated in different tumours, making it difficult to develop a ‘one-size-fits-all’ therapeutic strategy. Akin to the idea of identifying subtypes within a disease, an approach to tailoring therapy for cancer is to identify the specific oncogenic pathways active in each tumour and apply drugs that target that particular pathway. For example, a study by Bild et al.43 identified gene expression signatures of oncogenic pathway activation within various types of cancer and used these signatures to predict therapeutic efficacy. To accomplish this goal, activated forms of key signalling proteins in known oncogenic pathways such as Ras were introduced into human mammary epithelial cells using an adenovirus. Gene expression profiles from these cells with constitutively active proteins were compared with profiles from cells infected with a control vector in order to develop gene expression signatures that reflect the downstream transcriptional consequences of oncogenic pathway activation. As previous work had linked Ras activation to lung adenocarcinomas,44 they examined gene expression in a set of adenocarcinomas and squamous cell carcinomas to validate this approach. Using the Ras pathway activation gene expression signature, they found that the probability of Ras activation was significantly higher in adenocarcinomas compared with squamous cell carcinomas. Furthermore, a subset of tumours that had increased activity of other oncogenic pathways in combination with the Ras pathway was found in patients with significantly shorter survival times compared with all other patients, suggesting that concerted deregulation of these pathways leads to a worse prognosis. 
Once a pathway is determined to be active within a clinical specimen, an appropriate drug targeting proteins in that pathway can be applied. To illustrate this concept, the authors examined the activity of these pathways across many breast cancer cell lines and then measured the ability of a Ras inhibitor to reduce proliferation of each cell line. The cell lines that exhibited higher Ras pathway activation were significantly more likely to show reduced proliferation when the Ras inhibitor was applied than cell lines without Ras pathway activation. These results demonstrate that determining which oncogenic pathways are activated in cancer patients can lead to tailored therapeutic decisions.

We have extended this approach of identifying activated oncogenic pathways in tumour tissue to cytologically normal airway epithelium from lung cancer patients.45 We examined the gene expression signatures of activation of several oncogenic pathways in the same cohort of current and former smokers undergoing bronchoscopy for suspicion of lung cancer where we developed our airway biomarker for lung cancer, and found that the expression of a PI3K pathway signature was higher in smokers who were diagnosed with lung cancer compared with smokers with an alternative benign diagnosis. In a separate lung cancer tissue dataset, PI3K activity was increased in adenocarcinomas compared with matched adjacent non-tumour tissue. Elevated expression of the PI3K activation signature was also observed in patients with dysplastic lesions, and this pattern reverted towards baseline specifically in individuals who had regression of their dysplastic lesions following treatment with myo-inositol, a PI3K inhibitor. These data suggest that relevant oncogenic pathway activation can be measured in cytologically normal airway epithelium and might be useful for tailoring pathway-specific therapy.

Discovering novel therapeutics

The Connectivity Map (CMap) is an example of an approach for using gene expression microarrays to discover therapeutic applications for existing biomolecules and could become a powerful tool to discover new treatments for lung diseases.46 The CMap contains a compendium of gene expression signatures resulting from treating cancer cell lines with different small molecules. A key strength of the CMap dataset is that gene expression profiles have been generated from multiple cell lines treated with many hundreds of compounds. The CMap data can be queried with disease-related patterns of gene expression to identify compounds that coordinately affect the expression of the genes in these expression signatures: for example, identifying compounds that cause the cell lines to have a more ‘diseased-like’ pattern of gene expression, or more importantly, those that cause a more ‘healthy-like’ pattern. Using this strategy, it is also possible to potentially identify compounds that induce phenotypic switching from a subtype of disease with less effective available therapies to one that is more easily treated. Likewise, the CMap can be queried with the gene expression signature that results from treating cells with a known drug, to potentially identify compounds with similar effects but different pharmacologic characteristics, such as a different spectrum of off-target effects that may be useful for designing combination therapy treatment strategies.
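The querying strategy described above can be sketched as a toy scoring function. This is a deliberately simplified illustration and not the CMap implementation, which relies on a Kolmogorov-Smirnov-style enrichment statistic; here the score is just the difference between the average rank positions of the query's upregulated and downregulated genes within each compound's ranked gene list. The gene names, compound names and function names are all hypothetical.

```python
from statistics import mean


def connectivity_score(ranked_genes, up_genes, down_genes):
    """Score in [-1, 1] for one compound.

    ranked_genes lists genes from most induced to most repressed by the compound.
    A positive score means the compound mimics the query signature; a negative
    score means the compound tends to reverse it.
    """
    n = len(ranked_genes)
    # Map each gene to a position from +1 (top of the list) to -1 (bottom of the list).
    position = {gene: 1.0 - 2.0 * i / (n - 1) for i, gene in enumerate(ranked_genes)}
    up = mean(position[g] for g in up_genes if g in position)
    down = mean(position[g] for g in down_genes if g in position)
    return (up - down) / 2.0


def rank_compounds(profiles, up_genes, down_genes):
    """Return (compound, score) pairs sorted so that the strongest 'reversers' come first."""
    scores = {
        name: connectivity_score(ranked, up_genes, down_genes)
        for name, ranked in profiles.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical disease signature: genes up and down in diseased versus healthy tissue.
disease_up = ["G1", "G2"]
disease_down = ["G5", "G6"]

# Hypothetical compound profiles: genes ordered from most induced to most repressed.
toy_profiles = {
    "compound_A": ["G1", "G2", "G3", "G4", "G5", "G6"],  # mimics the disease pattern
    "compound_B": ["G6", "G5", "G4", "G3", "G2", "G1"],  # reverses the disease pattern
    "compound_C": ["G3", "G1", "G5", "G4", "G6", "G2"],  # no consistent relationship
}

toy_ranking = rank_compounds(toy_profiles, disease_up, disease_down)
```

In this toy setting the compound at the top of `toy_ranking` would be the candidate for pushing cells towards a more ‘healthy-like’ pattern, while the compound at the bottom would be the one that most closely reproduces the disease signature.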

While there are not yet many examples of this type of strategy applied to lung disease, similar approaches have been used in the setting of Alzheimer's disease.46 The pathogenesis of Alzheimer's disease is poorly understood and effective therapies remain elusive. Gene expression signatures were developed from datasets comparing tissue from the hippocampus or the cerebral cortex in patients with and without Alzheimer's disease. These signatures were queried against the CMap and one compound called 4,5-dianilinophthalimide (DAPH) was found to significantly reverse the gene expression patterns observed in each dataset. DAPH had been previously shown to reverse the formation of fibrils implicated in accelerated neuronal cell death in the brains of Alzheimer's patients. These results suggest that DAPH is a possible candidate for treatment of Alzheimer's disease and illustrate the potential of the CMap to generate novel hypotheses regarding drug therapy using gene expression signatures.

Elucidating molecular mechanisms of drug resistance

Not only does gene expression analysis provide an approach for examining mechanisms of disease pathogenesis, it also supplies a means by which we can understand mechanisms of resistance to treatment of disease. Corticosteroids are the preferred treatment for management of persistent asthma. However, a subset of asthmatic patients does not respond favourably to this type of therapy and continues to display symptoms of the disease. In order to generate hypotheses about molecular mechanisms of steroid resistance, Goleva et al.47 measured lung function in asthmatic patients after treatment with prednisone for a 1-week period. Patients with a 15% or better improvement in lung function (FEV1%) were designated corticosteroid-sensitive (CS) while patients with a 12% or less change in FEV1% were classified corticosteroid-resistant (CR). Gene expression was profiled in BAL cells, mostly composed of macrophages, collected via bronchoscopy in three CS and three CR patients. The expression of 30 genes was upregulated in CR patients compared with CS patients, and these genes were enriched for genes involved in classical (Th1) macrophage activation via LPS signalling. In contrast, markers of alternative (Th2) macrophage activation were among the genes downregulated in CR patients. To support the hypothesis that LPS may be involved in steroid resistance, levels of LPS in BAL fluid were measured in ten CR and eight CS patients. LPS levels were high in nine of the ten CR patients while only small amounts of LPS were found in the CS patients. To confirm these findings in vitro, normal human monocytes isolated from PBMC (n = 6) were exposed to LPS and treated with various concentrations of dexamethasone, a glucocorticoid. A 24-h exposure to LPS resulted in the loss of the ability of dexamethasone to inhibit production of TNF-α and IL-6, proinflammatory cytokines that increase with classical macrophage activation. 
These results support the role of LPS and classical macrophage activation in resistance to corticosteroids in asthmatic patients. Knowledge of the mechanisms underlying drug resistance will be important for efforts to develop novel therapeutic strategies for asthma.



As previously mentioned, microarrays can be used to measure a number of different properties of a cell's nucleic acids such as DNA methylation, single nucleotide polymorphisms or copy number variants. With regard to transcription, in addition to probes for measuring mRNA expression, it is possible to include probes for measuring expression of non-coding RNAs. MicroRNAs are one type of non-coding RNA that has been receiving increasing attention within the past few years and has become an active subject of study through microarray-based platforms. MicroRNAs are short RNA transcripts (∼20–23 nucleotides) that can modulate expression levels or translation rates of specific mRNA targets via sequence-specific binding to the 3′UTR.48 Thus, microRNAs serve as regulators of gene expression and their key role in regulating biological phenomena has become increasingly clear. Likewise, the dysregulation of microRNA expression patterns may contribute to disease pathogenesis. MicroRNAs may also be able to serve as robust biomarkers. Several commercial platforms have been developed that can simultaneously measure the expression level of hundreds of microRNAs.49 Some studies using microarrays have examined microRNA expression in the physiologic response to tobacco smoke and in lung diseases such as lung cancer and IPF in order to explore the role of microRNAs in pathogenesis or to build biomarkers.31,50,51 Integrating gene and microRNA expression measured in the same clinical sample can reveal how aberrant expression of microRNAs can contribute to the mechanisms responsible for disease-related patterns of gene expression. Additional studies exploring microRNA expression in samples from patients with other types of pulmonary diseases provide an enticing avenue for future work.


Advances in sequencing technology have produced new methods to interrogate the entire transcriptome at single-base resolution. Massively parallel sequencing (MPS), also known as deep sequencing, can be used to quantify the levels of mRNA or microRNA expression (RNA-Seq) much like microarrays. However, unlike microarrays, MPS technologies do not rely on the principle of hybridization to quantify expression, which requires a priori knowledge about the gene sequence in order to design complementary probes. Instead, MPS of the transcriptome begins with a reverse transcription reaction to generate a cDNA library from each sample and then sequences a portion of the millions of resulting cDNAs in each library. Each sequenced fragment is termed a ‘read’. Once sequenced, these reads are computationally mapped to a reference genome in order to find which gene each read most likely originated from. The number of reads that align to a given gene is proportional to the concentration of that gene's cDNA in the library and therefore to the level of expression of that gene. A comparison of read counts for a gene between groups of samples likely gives a good estimate of differences in expression levels for that gene. As sequencing does not rely on predetermined complementary probes, it has the ability to find novel transcribed sequences such as new exons or completely new genes. While several microarray platforms can measure the relative expression of individual exons and describe differential alternative splicing to some degree, the capability of MPS to capture splice junctions and to detect the expression of exons with greater resolution can greatly increase our understanding of how alternative splicing may contribute to lung disease. 
Although current MPS methods can be costly and have lower throughput than array-based methods, they have the capacity to characterize the transcriptome in unprecedented detail without any prior knowledge of gene sequences, and many predict that MPS could eventually replace the microarray as the platform of choice for studies of gene expression. However, microarray technologies can complement MPS technologies by offering the ability to rapidly validate many novel sequences uncovered by RNA-Seq via a custom microarray and to determine whether these sequences are differentially expressed across disease states in large numbers of clinical samples.
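The relationship between read counts and expression described above can be made concrete with a short sketch. It assumes that each read has already been mapped and assigned to a gene, and it adds two steps that the text does not specify: scaling counts to counts per million (so that libraries sequenced to different depths are comparable) and adding a pseudocount before taking a log ratio. The gene names, read lists and function names are hypothetical, and the comparison is between two single samples rather than groups of samples.

```python
from collections import Counter
from math import log2


def count_reads(read_assignments):
    """Tally reads per gene from a list holding one gene name per mapped read."""
    return Counter(read_assignments)


def counts_per_million(counts):
    """Scale raw counts by library size so that differently sized libraries are comparable."""
    total = sum(counts.values())
    return {gene: 1e6 * c / total for gene, c in counts.items()}


def log2_fold_change(cpm_a, cpm_b, gene, pseudocount=1.0):
    """Log2 ratio of normalized expression (a over b); the pseudocount avoids division by zero."""
    return log2((cpm_a.get(gene, 0.0) + pseudocount) / (cpm_b.get(gene, 0.0) + pseudocount))


# Toy libraries: sample B was sequenced twice as deeply as sample A.
reads_sample_a = ["GENE1"] * 6 + ["GENE2"] * 4
reads_sample_b = ["GENE1"] * 6 + ["GENE2"] * 14

cpm_a = counts_per_million(count_reads(reads_sample_a))
cpm_b = counts_per_million(count_reads(reads_sample_b))

gene1_change = log2_fold_change(cpm_a, cpm_b, "GENE1")
gene2_change = log2_fold_change(cpm_a, cpm_b, "GENE2")
```

In this toy example both libraries contain the same raw number of GENE1 reads, yet after scaling by library size GENE1 accounts for a larger share of sample A, which illustrates why raw counts alone can be misleading when sequencing depth differs between samples.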


The ability of microarray technology to simultaneously measure the expression level of every known gene within a sample opens the door to new understanding of biological complexity. By studying gene expression patterns in samples from patients with disease, we can use this technology to elucidate underlying mechanisms of disease pathogenesis and uncover previously undefined molecular subtypes within a disease. We can also use gene expression microarrays to develop biomarkers that can aid in diagnosing disease or guide tailored therapy. Finally, disease-specific gene expression patterns can be used to identify novel therapeutic uses for existing compounds. The use of these approaches in the study and treatment of pulmonary disease is continually expanding and will ultimately provide a better outcome and quality of life for patients with chronic lung disease.