Bone mineral density (BMD) is a complex trait influenced by both genetic and environmental factors.1 In a clinical setting, BMD is used to diagnose osteoporosis, and low bone mass is associated with an increased risk of osteoporotic fracture.2 While many environmental factors are known to affect BMD,3 most of its variance (60% to 80%) is heritable.1 In just the last 2 years, many loci that contribute to the heritable component of BMD have been identified using genome-wide association studies (GWASs).4 However, the nature of the molecular networks that are perturbed by such loci, and that thereby affect BMD, is largely unknown.
Recently, a number of studies have used network analysis to assist in dissecting the basis of complex genetic diseases (as examples, see refs. 5 to 8). This strategy has the advantage of being able to simultaneously account for the subtle effects of multiple genes, the environment, and complex interactions between genes and between genes and the environment. Since the transcriptome is currently the only biologic component, other than the genome, amenable to global interrogation (through the use of DNA microarrays), most network analysis methods have focused on reconstructing transcriptional relationships.9 A commonly used analytical tool for this purpose is weighted gene coexpression network analysis (WGCNA).10 WGCNA parses genes into groups, referred to as modules, based on their coexpression similarities across a set of samples. It has been observed that WGCNA modules are often composed of genes belonging to the same pathway (as examples, see refs. 7 and 11 to 13). Thus WGCNA is capable of reducing the dimensionality of large microarray data sets from tens of thousands of probes to tens of modules composed of genes that share similar expression patterns and the same general function. Once identified, characteristics of modules, such as their overall behavior and topology, can be correlated with disease status. In many disease-oriented networks, hub genes (in a coexpression network, genes that are highly correlated with many other module genes) have been shown to play key roles in orchestrating module behavior and disease.7, 8
The goal of this work was to apply transcriptional network analysis to existing expression data in a population characterized for bone mass. Toward this goal, we identified coexpression modules using monocyte microarray expression profiles from young adults selected on the basis of having extremely low or high BMD. To our knowledge, this is the first study to generate networks in the context of BMD. We chose the monocyte study for this analysis because it was the largest existing data set within the public domain with expression data on individuals with divergent BMD values.14 In addition, several studies have shown that monocytes are relevant to bone. For instance, circulating monocytes produce a number of cytokines and growth factors that affect bone.15, 16 Monocytes and osteoclasts also share a common cell progenitor, and monocytes can be induced to differentiate into bone-resorbing osteoclasts in culture.17, 18
We hypothesized that genetic and environmental differences between the two BMD groups “differentially programmed” monocytes, resulting in alterations in the behavior of osteoclasts. In support of this hypothesis, we identified a module (module 9) of 88 genes whose overall expression behavior was significantly different between the BMD groups. As validation of the biologic relevance of this module, we used two BMD GWAS data sets to demonstrate that module hubs were more likely than other module genes to be genetically associated with BMD. Our analysis demonstrates that, when applicable, combining network analysis of expression profiles and genome-wide association data is a powerful approach to identify BMD candidate genes.
Materials and Methods
Gene expression data processing
To generate gene coexpression networks, we used previously published microarray data from 26 healthy young Chinese females.14 The data set consisted of expression profiles generated from circulating monocytes that were isolated and purified from subjects with low or high BMD values. The two groups differed dramatically in both hip and spine BMD but not in age, height, or weight. In addition, significant care was taken to exclude individuals with diseases or chronic disorders that might influence the skeleton. A full description of the population and study design is provided in ref. 14. The Affymetrix U-133 Plus 2.0 microarray (Affymetrix, Santa Clara, CA, USA) was used for profiling. We downloaded the raw CEL files from NCBI's Gene Expression Omnibus (GSE7158; www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7158). The raw data were imported and processed using the Affy package19 for the R Language and Environment for Statistical Computing (http://cran.r-project.org/).20 The robust multiarray algorithm was used to normalize and generate probe-level expression data.21 Since our generation of a monocyte network was based on a relatively small number of samples, we were concerned that arrays that were outliers, in terms of their global expression profile, would impair our ability to generate informative networks. Thus, as part of our quality-control protocol, we performed a clustering and principal-components analysis (PCA) to identify potential outliers. The R functions hclust and prcomp were used for this analysis. The first principal component (PC1) explained 97.5% of the overall variance. By clustering samples based on global expression values and PC1, two samples from the high-BMD group, GSM172405 and GSM172418, were identified as outliers and thus removed from the analysis (Fig. 1). The same result was obtained by restricting PCA to the top 50% and 25% of probes based on mean expression levels.
Weighted Gene Coexpression Network Analysis
Network analysis was performed using the WGCNA R package.22 An extensive overview of WGCNA, including numerous tutorials, can be found at http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/. To begin, we filtered the array data to remove lowly and nonexpressed genes by selecting probes with a mean expression in the top 50% of all probes. Next, we selected the 8000 most varying genes based on variance across the 24 samples and then, for computational reasons, restricted the network analysis to the 3600 most connected genes (based on k.total, described below). If multiple probes existed for a given gene, only the most connected probe per gene was included in the list of 3600. To generate a coexpression network for the selected probes, we first calculated Pearson correlation coefficients for all gene-gene comparisons across the 24 microarray samples. The matrix of correlations then was converted to an adjacency matrix of connection strengths. The adjacencies were defined as

$$a_{ij} = \left|\mathrm{cor}(x_i, x_j)\right|^{\beta},$$

where $x_i$ and $x_j$ are the ith and jth gene expression traits. The power β was selected using the scale-free topology criterion previously outlined by Zhang and Horvath.10 Network connectivity (k.total) of the ith gene was calculated as the sum of the connection strengths with all other network genes,

$$k.\mathrm{total}_i = \sum_{u \neq i} a_{iu}.$$

This summation performed over all genes in a particular module was defined as the intramodular connectivity (k.in). Modules were defined as sets of genes with high topologic overlap.10 The topologic overlap measure (TOM) between the ith and jth gene expression traits was taken as

$$\mathrm{TOM}_{ij} = \frac{\sum_{u \neq i,j} a_{iu} a_{uj} + a_{ij}}{\min(k.\mathrm{total}_i,\, k.\mathrm{total}_j) + 1 - a_{ij}},$$

where $\sum_{u \neq i,j} a_{iu} a_{uj}$ denotes the number of nodes to which both i and j are connected, and u indexes the nodes of the network. A TOM-based dissimilarity measure (1 – TOM) was used for hierarchical clustering. Gene modules corresponded to the branches of the resulting dendrogram and were precisely defined using the “dynamic hybrid” branch-cutting algorithm.23 Highly similar modules were identified by clustering and merged together. In order to distinguish modules, each was assigned a unique number. Gene significance (GS) for the ith gene was defined as the –log 10 of the p value of a Student's t test measuring differential expression between the low- and high-BMD groups. Module significance (MS) was calculated as the mean GS for all module genes. Module eigengenes were defined as the first principal component calculated using PCA. A Student's two-tailed t test was used to evaluate differences in module eigengene levels between the BMD groups.
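As an illustration of the network quantities defined in this section, the following is a minimal pure-Python sketch, not the WGCNA R implementation used in the study. It assumes an unsigned adjacency (absolute Pearson correlation raised to the power β) with a zero diagonal and the topologic overlap of Zhang and Horvath; the function names and the toy input layout (one expression vector per gene) are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def adjacency(expr, beta):
    """Assumed unsigned adjacency a_ij = |cor(x_i, x_j)|**beta, zero diagonal."""
    n = len(expr)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            adj[i][j] = adj[j][i] = abs(pearson(expr[i], expr[j])) ** beta
    return adj


def connectivity(adj, members=None):
    """Sum of connection strengths per gene.

    With members=None this is k.total; restricting the sum to the gene
    indices of one module gives the intramodular connectivity k.in.
    """
    idx = list(range(len(adj))) if members is None else list(members)
    return {i: sum(adj[i][u] for u in idx if u != i) for i in idx}


def topological_overlap(adj):
    """TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(adj)
    k = connectivity(adj)
    tom = [[1.0] * n for _ in range(n)]  # a gene fully overlaps with itself
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            tom[i][j] = tom[j][i] = (
                (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
            )
    return tom
```

The dissimilarity used for clustering is then `1 - tom[i][j]`. The double loop is quadratic in the number of genes, so this sketch is only practical for small illustrative inputs.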
Gene ontology and pathway enrichment analysis
We performed a gene ontology (GO) enrichment analysis for network modules using the Database for Annotation, Visualization and Integrated Discovery (DAVID, http://david.abcc.ncifcrf.gov/).24, 25 Each analysis was performed using the functional annotation clustering option. Functional annotation clustering combines single categories with a significant overlap in gene content and then assigns an enrichment score (ES, defined as the –log 10 of the geometric mean of the unadjusted p values for each single term in the cluster) to each cluster, making interpretation of the results more straightforward. To assess the significance of functional clusters, we created 10 sets of 316 genes (size of the average module identified in this study) randomly selected from a list of unique gene IDs from the Affymetrix U-133 Plus 2.0 microarray. Functional annotation clustering was performed for all 10 random gene sets. The maximum random ES was 2.8; therefore, we used a conservative ES cutoff of 3.0 or greater as the threshold for significance.
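The enrichment score arithmetic described above can be sketched in a few lines. This is an illustrative reimplementation of the stated definition (the –log 10 of the geometric mean of the unadjusted p values in a cluster), not DAVID's own code; the function names and the cluster dictionary layout are hypothetical.

```python
import math


def enrichment_score(p_values):
    """-log10 of the geometric mean of the unadjusted p values in a cluster.

    The geometric mean of the p values is 10 ** mean(log10 p), so the score
    is simply the negated mean of the log10-transformed p values.
    """
    return -sum(math.log10(p) for p in p_values) / len(p_values)


def significant_clusters(clusters, cutoff=3.0):
    """Keep clusters whose enrichment score meets the cutoff.

    clusters: dict mapping a cluster name to its list of term p values.
    cutoff: 3.0 is the threshold chosen in the text, set just above the
    maximum score (2.8) observed in the random gene sets.
    """
    scores = {name: enrichment_score(ps) for name, ps in clusters.items()}
    return {name: es for name, es in scores.items() if es >= cutoff}
```

For example, a cluster with term p values of 1e-2 and 1e-6 has a geometric mean p value of 1e-4 and therefore a score of 4.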
In silico BMD association analysis
To identify module genes with evidence of association with BMD, we used data from two previously published GWAS data sets, the deCODE Genetics Study (DCG)26 and the Framingham Osteoporosis Study (FOS).27 The DCG GWAS data were downloaded from http://content.nejm.org/cgi/content/full/NEJMoa0801197/DC1 as a list of all study single-nucleotide polymorphisms (SNPs) and precomputed association p values. Details of the statistical analysis are provided in ref. 26. The DCG GWAS consisted of 5861 white subjects [87% female with a mean age of 66.2 years (males) and 59.4 years (females)] phenotyped for hip and spine BMD and genotyped at approximately 300,000 SNPs. The FOS data were downloaded from NCBI's database of Genotype and Phenotype (dbGAP, http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap), also as a list of SNPs with precomputed p values. The FOS GWAS consisted of 1141 white subjects (56% female with a mean age of 62.5 years) phenotyped for femoral neck, trochanter, and lumbar spine BMD and genotyped at approximately 100,000 SNPs. Details of the statistical analysis are provided in ref. 27.
To convert SNP-level p values to gene-level p values, we used ProxyGeneLD (http://ki.se/ki/jsp/polopoly.jsp?d=26072&a=67192&l=en).28 ProxyGeneLD works by identifying clusters of GWAS SNPs (referred to as proxy clusters) in high linkage disequilibrium (LD) (r2 ≥ 0.80) using genotype data from the International HapMap Project.29 It then assigns proxy clusters and singleton SNPs (that did not group within a proxy cluster) to the nearest gene. The gene-level p value is calculated as the minimum p value of any SNP assigned to that gene, either as a singleton or as a member of a proxy cluster. For this analysis, LD patterns were determined using CEU HapMap samples, and genes were defined as the transcript plus a 1 kbp extension upstream, including promoter regions, and 1 kbp downstream. We also repeated the analysis using extensions of 5 or 0 kbp up- and downstream to define a gene; altering these parameters did not affect the results. Associated genes were defined as those with p ≤ 0.05 for at least one of the two BMD traits in DCG and at least one of the three BMD traits in FOS.
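The final two steps of this procedure (taking the minimum SNP p value per gene, then requiring nominal significance in both studies) can be sketched as follows. This is a simplified illustration, not ProxyGeneLD itself: it assumes the SNP-to-gene assignment (including any LD proxy clustering) has already been done, and all function names, SNP IDs, and gene labels are hypothetical.

```python
def gene_level_p(snp_p, snp_to_gene):
    """Minimum SNP p value per gene.

    snp_p: dict mapping SNP ID to its association p value.
    snp_to_gene: dict mapping SNP ID to its assigned gene (assumed given).
    SNPs without an assigned gene are ignored.
    """
    best = {}
    for snp, p in snp_p.items():
        gene = snp_to_gene.get(snp)
        if gene is None:
            continue
        if gene not in best or p < best[gene]:
            best[gene] = p
    return best


def associated_genes(dcg_traits, fos_traits, alpha=0.05):
    """Genes with p <= alpha for at least one trait in each study.

    Each argument maps a trait name to a dict of gene-level p values.
    """
    def hits(traits):
        return {gene
                for gene_p in traits.values()
                for gene, p in gene_p.items()
                if p <= alpha}

    return hits(dcg_traits) & hits(fos_traits)
```

Requiring a hit in both studies is an intersection of two per-study gene sets, which is why a gene that is nominally significant in only one GWAS is not counted as associated.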
Reconstructing the monocyte transcriptional network
To reconstruct a gene network with relevance to BMD, we used previously generated microarray data on purified circulating monocytes from 24 young (mean age 27.3 years) Chinese females, 12 with extremely low (mean ± SD hip Z-score of −1.72 ± 0.60) and 12 with extremely high BMD values (mean ± SD hip Z-score of 1.57 ± 0.57 for the 14 subjects in the full data set; see “Materials and Methods” for an explanation).14 The expression data consisted of transcript levels for 54,675 microarray probes, representing 20,101 unique genes. Using WGCNA, we reconstructed the monocyte transcriptional network through the following four steps: (1) selection of genes expressed in monocytes and whose expression varied across samples, (2) calculation of the connection strengths between network genes, (3) clustering of genes based on connection-strength similarities, and (4) module identification. In the first step, we identified the 3600 genes that were expressed, highly variable, and highly connected. (In this analysis, connected is synonymous with correlated. If two genes are highly connected, they are highly correlated. If a gene has a high connectivity, it is strongly correlated with many other genes.) In the second step, Pearson correlation coefficients were calculated for all gene-gene pairwise comparisons. The resulting correlation matrix then was converted to a matrix of connection strengths by raising the correlations to a power β. This step emphasizes strong correlations and lessens the contributions of weak correlations. 
Biologic networks have been shown to exhibit scale-free topology, in which most genes are sparsely connected, whereas a small number are highly connected hub genes.30 Therefore, to ensure that the monocyte network was biologically relevant, β = 8 was selected, which resulted in a network that was approximately scale-free based on the criteria outlined by Zhang and Horvath.10 We next calculated a topologic overlap measure (TOM) for each gene pair (see “Materials and Methods”), which is a measure of the similarity of shared connection strengths between two genes. If two genes have a high TOM, they are correlated with the same set of genes. Finally, in the last step, genes were grouped into modules using hierarchical clustering of a TOM dissimilarity measure (1 – TOM). Genes with high TOM (low dissimilarity measures) formed the low-hanging branches of the dendrogram and were grouped into gene modules (Fig. 2).
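One common way to check approximate scale-free topology, in the spirit of the Zhang and Horvath criterion, is to bin the connectivities, regress log10 of the bin frequency p(k) on log10 of the mean bin connectivity k, and inspect the fit: a negative slope with a high R² suggests a power-law-like distribution. The sketch below is an illustrative approximation, not the routine used in the study; the equal-width binning, the default number of bins, and the function name are assumptions.

```python
import math


def scale_free_fit(k, n_bins=10):
    """Return (r_squared, slope) for the regression of log10 p(k) on log10 k.

    k: positive connectivity values, one per gene (not all equal).
    Connectivities are grouped into equal-width bins; empty bins are skipped
    because their log frequency is undefined.
    """
    k_min, k_max = min(k), max(k)
    width = (k_max - k_min) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for value in k:
        b = min(int((value - k_min) / width), n_bins - 1)  # clamp the maximum
        sums[b] += value
        counts[b] += 1

    xs, ys = [], []
    for s, c in zip(sums, counts):
        if c:
            xs.append(math.log10(s / c))       # log10 of mean k in the bin
            ys.append(math.log10(c / len(k)))  # log10 of bin frequency

    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy), sxy / sxx
```

In practice the fit would be computed for a range of candidate powers β, recomputing the connectivities each time, and the power chosen where the fit first becomes acceptably high; the acceptance threshold is a user choice and is not specified here.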
Of the 3600 initial genes, 3475 were parsed into one of 11 gene modules (Fig. 2 and Table 1). To distinguish one module from another, each was assigned a number. The network connections of the remaining 125 genes were dissimilar from those within the 11 defined modules and were not assigned to a module. The modules ranged in size from 30 genes in module 1 to 740 in module 5 (Table 1). A complete list of network metrics and the module membership for each gene is provided in Supplemental Table S1.
Table 1. List of the Top GO Term in the Most Significant DAVID Functional Cluster for Each Monocyte Network Module
No. of genes
Top-term p value
ES = enrichment score (–log 10 of the geometric mean of the unadjusted p values for each single term in the cluster); N.S. = significant (ES > 3.0) clusters were not observed.
6.7 × 10−7
GO:0031091—platelet α granule
2.7 × 10−15
GO:0002376—immune system process
6.5 × 10−7
5.0 × 10−27
GO:0043231—intracellular membrane-bound organelle
3.9 × 10−10
GO:0042981—regulation of apoptosis
7.2 × 10−9
8.0 × 10−20
GO:0009615—response to virus
7.6 × 10−11
1.5 × 10−32
5.4 × 10−10
Previous studies have shown that WGCNA modules are robust, even when generated using small sample sizes (n < 30).6, 8, 31 However, owing to the relatively small number of array samples used in this analysis (n = 24), we wanted to determine how robust the connectivity relationships between genes were to fluctuations in sample size. We focused on module 9 (selected based on its apparent biologic importance, to be described below) genes and calculated intramodular connectivity (k.in) in 1000 module 9 gene expression sets generated by randomly selecting 12 of the 24 array samples. Correlations then were calculated between the true module 9 gene k.in values and those from the 1000 randomly selected sets. The mean ± SD correlation was 0.82 ± 0.02. This indicates that, in this data set, the relationships between module 9 genes were robust to the exclusion of 50% of the data.
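The subsampling check described above can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: k.in is computed from an unsigned adjacency (absolute Pearson correlation raised to β), subsets of arrays are drawn without replacement, and the function names, the seed, and the input layout (one expression vector per module gene) are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def k_in(expr, beta):
    """Intramodular connectivity: sum of |cor|**beta with other module genes."""
    n = len(expr)
    k = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            a = abs(pearson(expr[i], expr[j])) ** beta
            k[i] += a
            k[j] += a
    return k


def subsample_stability(expr, beta, n_keep, n_iter=1000, seed=0):
    """Correlate full-data k.in with k.in from random subsets of arrays.

    Returns one Pearson correlation per random subset of n_keep samples.
    """
    rng = random.Random(seed)
    n_samples = len(expr[0])
    full = k_in(expr, beta)
    correlations = []
    for _ in range(n_iter):
        cols = rng.sample(range(n_samples), n_keep)
        subset = [[gene[c] for c in cols] for gene in expr]
        correlations.append(pearson(full, k_in(subset, beta)))
    return correlations
```

With the settings reported in the text this would be called with `beta=8`, `n_keep=12`, and `n_iter=1000`, and the mean and SD of the returned correlations summarize how stable the connectivity ranking is.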
Monocyte network modules are enriched for genes involved in similar functions
One of the aims of network analysis is to identify sets of functionally related genes (eg, genes in the same pathway) based on the similarity of transcriptional patterns of such genes across a set of conditions, in our case differing genetic backgrounds. To determine if the 11 modules in this study were comprised of functionally similar genes, DAVID was used to evaluate GO enrichments. DAVID (http://david.abcc.ncifcrf.gov/) is a Web-accessible tool that calculates overrepresentation of various gene groupings, such as GO categories, within a query gene list.24, 25 Interpretation of enrichment results can be problematic because of the inherent redundancy in GO categories. To address this issue DAVID provides users with a functional clustering tool. This application groups GO categories with significant overlap in gene content into clusters. Enrichment statistics are calculated for each term in a cluster, and based on these results, the entire cluster is assigned an enrichment score (ES; see “Materials and Methods”). This makes interpretation of the results more straightforward.
Using DAVID's functional clustering tool, 10 of the 11 modules had significant (>3.0) ES scores for at least one cluster (Table 1; a full list is provided in Supplemental Table S2). The most significant GO term (based on Bonferroni-adjusted p values) for the most significant cluster for each module is presented in Table 1. The most significant enrichment was observed for module 10, which contained a number of genes belonging to the GO cellular component term “cytoplasmic part” (p = 1.5 × 10−32). We further investigated the full list of enrichments for this module (Supplemental Table S2) and found that it was highly enriched for genes located in the mitochondrion. Of the 378 genes in module 10, 78 (20.9%) belonged to the GO term “mitochondrion” (p = 4.0 × 10−18). In addition, 17 genes (4.5%) belonged to the GO term “oxidative phosphorylation” (p = 5.7 × 10−7). These data demonstrate the ability of WGCNA to organize genes into “pathways” based on coexpression patterns and that the monocyte modules are comprised of genes that share biologically meaningful functional connections.
Expression of module 9 is associated with BMD
The primary goal of this analysis was to identify modules from the monocyte network whose overall behavior was associated with differences in BMD status. To identify such modules, we defined two additional network metrics, gene significance (GS) and module significance (MS). GS was calculated as the –log 10 of the p value generated for each gene using a t test and is a measure of the strength of differential gene expression between the low- and high-BMD groups. MS was calculated as the average GS within each module. As shown in Fig. 3A, module 9 stood out as having the highest MS. To determine if the overall expression of module 9 was significantly associated with BMD status, we calculated its eigengene using PCA.32 The eigengene explained 72.5% of the variance in module 9 gene expression. Congruent with results using MS, the module 9 eigengene was significantly higher in the low-BMD group (p = 0.03; Fig. 3B). Of the 88 module 9 genes, the expression of all but three was higher in the low-BMD group (based on direction of the t statistic; Supplemental Table S1). These data indicate that the overall expression of module 9 in monocytes is inversely correlated with BMD status.
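The three quantities used in this section can be sketched as follows. This is an illustrative pure-Python version, not the WGCNA code used in the study: gene significance is computed from t-test p values that are assumed to be supplied, and the eigengene is obtained by power iteration on the standardized module expression rather than by a full PCA routine. Function names and data layouts are hypothetical, and the sketch assumes no gene has constant expression.

```python
import math


def module_significance(p_by_gene, module_of):
    """Mean gene significance (GS = -log10 p) per module."""
    totals, counts = {}, {}
    for gene, p in p_by_gene.items():
        m = module_of[gene]
        totals[m] = totals.get(m, 0.0) - math.log10(p)
        counts[m] = counts.get(m, 0) + 1
    return {m: totals[m] / counts[m] for m in totals}


def standardize(row):
    """Center a gene's expression to mean 0 and scale to unit SD."""
    n = len(row)
    mean = sum(row) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (n - 1))
    return [(v - mean) / sd for v in row]


def module_eigengene(expr, n_iter=200):
    """First principal component across samples and its variance fraction.

    expr: list of gene vectors for one module (one value per sample).
    """
    x = [standardize(g) for g in expr]
    n = len(x[0])
    # Sample-by-sample Gram matrix of the standardized expression.
    gram = [[sum(g[s] * g[t] for g in x) for t in range(n)] for s in range(n)]

    def times(v):
        return [sum(gram[s][t] * v[t] for t in range(n)) for s in range(n)]

    v = x[0][:]  # start from a gene profile so the start is not orthogonal
    for _ in range(n_iter):
        w = times(v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    eigenvalue = sum(a * b for a, b in zip(v, times(v)))
    explained = eigenvalue / sum(gram[s][s] for s in range(n))

    # The sign of a principal component is arbitrary; orient it so that it
    # follows the mean standardized expression of the module.
    mean_profile = [sum(g[s] for g in x) / len(x) for s in range(n)]
    if sum(a * b for a, b in zip(v, mean_profile)) < 0:
        v = [-c for c in v]
    return v, explained
```

The returned eigengene has one value per sample, so comparing its values between the low- and high-BMD groups with a two-sample t test gives a single test per module instead of one test per gene.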
Independent genetic validation of the biologic importance of module 9 hub genes
Genes that are highly connected are referred to as hubs and have been shown to be important in disease and in controlling module behavior.6–8, 12, 33 Therefore, if hub genes are driving the expression of module 9, which itself is associated with BMD, then it is plausible that module 9 hub genes are more likely to be genetically associated with BMD. To test this hypothesis, we used two publicly available GWAS data sets to perform an in silico association study. Using the ProxyGeneLD tool (see “Materials and Methods”), we generated gene-level p values for all network genes in the DCG26 and FOS27 GWASs. Genes with nominal p values ≤0.05 for at least one of the BMD traits collected in both studies were defined as associated. Next, we determined the mean connectivity of the associated versus nonassociated genes in all modules. Of the 11 modules, only module 9 had a ratio of k.in for associated versus nonassociated genes of greater than 1 (Fig. 4A). As shown in Fig. 4B, the mean ± SE k.in of the associated genes (n = 6) in module 9 was significantly higher (15.6 ± 2.1 versus 9.0 ± 0.8; p = 0.03) than the k.in of nonassociated genes (n = 51). Of the 88 module 9 genes, 57 harbored SNPs genotyped in both studies. Of the 57, six genes met our criterion of associated (Table 2). In fact, three of the top four, including the first and second most connected genes (IFI35 and EPSTI1; Table 2), were among the six that were associated. Stated another way, the percentage of associated genes in the top 5% of genes based on k.in was 3 of 4 (75%) versus 3 of 53 (5.7%) for the rest of the module (Fisher's exact p = 0.02). These data provide strong independent validation that module 9 and specifically its hub genes are involved in the regulation of BMD.
Table 2. Module 9 Genes Displaying Nominal Evidence of Genetic Association With BMD in Two Independent GWASs
Probe set ID
t stat = t test statistic measuring difference in gene expression between low- and high-BMD groups; –log p = –log 10 of the t-test p value; k.in = intramodular connectivity within module 9; FC = fold change in expression between low- and high-BMD groups (FC is the expression of each gene in the low-BMD group relative to high-BMD group); k.in rank = gene rank based on k.in; DCG p = gene-level p value generated by ProxyGeneLD in the DCG GWAS; FOS p = gene-level p value generated by ProxyGeneLD in the FOS GWAS.
1.0 × 10−2
2.0 × 10−2
8.1 × 10−4
3.2 × 10−3
5.4 × 10−3
3.0 × 10−2
1.8 × 10−2
1.6 × 10−3
9.9 × 10−4
4.0 × 10−2
2.6 × 10−4
3.0 × 10−2
We have used transcriptional network analysis to identify a coexpression module (module 9) that is highly enriched for genes involved in the immune process, whose collective expression was correlated with bone mass in young females with divergent BMD status. Importantly, we were able to validate the importance of module 9 by demonstrating that genes displaying evidence of association with BMD, in two independent GWAS data sets, were more highly connected than nonassociated genes. These data suggest that inputs, in the form of distinct genetic and environmental factors specific to each study group, influence the state of module 9 in monocytes, which, in turn, contributes to the discordant levels of BMD.
Two previous reports using these same expression data have been published. The first was the initial description of the population and the array data,14 in which the authors identified three genes, guanylate binding protein 1, interferon-inducible, 67 kDa (GBP1), signal transducer and activator of transcription 1, 91 kDa (STAT1), and chemokine (C-X-C motif) ligand 10 (CXCL10), whose expression differed between the groups using a standard differential gene expression analysis. The second study replicated the difference in expression for STAT1 and GBP1 using a focused reanalysis of the original data.34 The authors also demonstrated that polymorphisms in STAT1 were associated with BMD (a finding that we replicate in this study using GWAS data; Table 2). Our current results add to the data acquired in the two initial studies by associating an entire gene module with BMD. At the same time, our results highlight several advantages of network analysis. The statistical power of the original analysis was limited owing to the need to correct for multiple comparisons. In contrast, by changing the experimental unit of analysis from probes to modules, we reduced the dimensionality of the data (from 54,675 probes to 11 modules) and therefore significantly increased our power to identify concordant changes in the expression of multiple genes. More important than the increase in power, network analysis provided information on the relationships between genes. Specifically, we identified a set of genes that share similar functions and are connected to one another. We hypothesize that not only are many of the individual genes important in the regulation of bone mass but also that the connections between module 9 genes are important. This hypothesis is supported by the fact that a gene's k.in, a network property inherent to module 9, was related to the likelihood of that gene being genetically associated with BMD.
While this study provides an example of the advantages of analyzing microarray data in the context of transcriptional networks, it does have limitations. First, the number of samples used to reconstruct the monocyte network was relatively small. Although studies have generated robust networks using a smaller number of samples6, 8, 31 and we demonstrated that our modules were robust to the exclusion of 50% of the data, it would have been more appropriate to use a larger population. It is likely, however, that the nature of the current study population, two groups enriched for factors resulting in extremely divergent BMD, provided an increase in power to detect network differences. More important, the genetic association of module hubs with BMD in populations that differ both in ethnicity and age is evidence that our results are applicable to the general population. There are several studies demonstrating that monocytes play an important role in bone development. For example, monocytes and osteoclasts share a common cell progenitor, and monocytes can be induced to differentiate into bone-resorbing osteoclasts in culture.17 Monocytes are also known to secrete factors, such as interleukin 1 and prostaglandin, that affect bone development.15, 35 However, while there is a biologic basis for an involvement of monocytes in bone,36, 37 expression data from more obvious sources, such as bone or bone cells, likely will be even more important in explaining network differences in BMD status. Future studies profiling multiple tissues and cell types in larger groups of individuals may result in identification of more informative modules.
Module 9 consisted of 88 genes, 21 (23.9%) of which were annotated as involved in the immune response (p = 5.0 × 10−8; Table 1; a full list is provided in Supplemental Table S2). The enrichments were, in part, due to a large number of genes induced by interferon γ (IFN-γ), including many interferon-induced (IFI) genes (IFI30, IFI35, IFI44, IFI44L, IFI6, IFIT1, IFIT2, IFIT3, and IFITM3), members of the immunoproteasome (PSMB9, PSMB10, PSME1, and PSME2), and STAT1. IFN-γ has been shown to play contradictory roles in bone metabolism.38–40 In vivo it has both indirect pro-osteoclastogenic (proresorptive) and direct anti-osteoclastogenic (antiresorptive) actions.41 Its direct effects involve inhibiting the maturation of osteoclast precursors, whereas its indirect effects are via T-cell activation and production of the pro-osteoclastogenic cytokines, receptor activator of nuclear factor-κB ligand (RANKL) and tumor necrosis factor α (TNF-α).41 Although both actions are seen in vivo, its pro-osteoclastogenic role seems to predominate.41 Therefore, it is possible that IFN-γ acts via the module 9 network to “differentially program” monocytes in the two BMD groups. These differences may result in the generation of larger numbers of osteoclasts, or of more active osteoclasts, from monocytes in the low-BMD group, leading to greater bone turnover and decreased bone mass. This hypothesis is supported by the observation that osteoclasts cultured from monocytes isolated from osteoporotic subjects have higher bone-resorbing activities than those from normal individuals.42
This study provides a number of candidate genes that warrant further investigation. For example, interferon-inducible protein 35 (IFI35), epithelial stromal interaction 1 (breast) (EPSTI1), and SP110 nuclear body protein (SP110) were hubs in module 9 (k.in ranks of 1, 2, and 4) and were associated with BMD (Table 2). All three are novel with respect to a function in bone. IFI35 is induced by interferons43 and is known to physically interact with N-myc (and STAT) interactor (NMI), which binds STATs and augments STAT-mediated transcription in response to cytokines such as IFN-γ.44 EPSTI1 is a recently identified gene induced by epithelial-stromal interaction in human breast cancer.45 SP110 is an interferon-inducible protein46 and is thought to function as a nuclear hormone receptor transcriptional coactivator.47 Further studies will be needed to elucidate their precise roles in the regulation of BMD.
In conclusion, we have used transcriptional network analysis to elucidate a module and genes that appear to play an important role in the regulation of bone mass. This analysis highlights the advantages of such an approach over conventional analytical strategies, provides proof of principle, and sets the stage for larger-scale studies using network analysis to investigate the molecular basis of osteoporosis.
The author states that he has no conflicts of interest.
We would like to thank Drs Steve Horvath and Aldons J Lusis, University of California, Los Angeles, for helpful discussions. We also would like to thank Ana Lira, University of Virginia, for assistance formatting this article.