Presented at the SETAC Pellston Workshop, Forest Grove, Oregon, USA, 18-23 April 2009.
Predictive Ecotoxicology Workshop
Reverse engineering adverse outcome pathways†
Article first published online: 29 NOV 2010
Copyright © 2010 SETAC
Environmental Toxicology and Chemistry
Volume 30, Issue 1, pages 22–38, January 2011
How to Cite
Perkins, E. J., Chipman, J. K., Edwards, S., Habib, T., Falciani, F., Taylor, R., Van Aggelen, G., Vulpe, C., Antczak, P. and Loguinov, A. (2011), Reverse engineering adverse outcome pathways. Environmental Toxicology and Chemistry, 30: 22–38. doi: 10.1002/etc.374
- Issue published online: 14 DEC 2010
- Accepted manuscript online: 20 OCT 2010 06:14PM EST
- Manuscript Accepted: 5 FEB 2010
- Manuscript Revised: 5 NOV 2009
- Manuscript Received: 14 SEP 2009
- Mechanism of action;
- Network inference;
- Adverse outcome pathway
The toxicological effects of many stressors are mediated through unknown, or incompletely characterized, mechanisms of action. Reverse engineering of complex interaction networks from high-dimensional omics data (gene, protein, metabolic, signaling) can be used to overcome these limitations. This approach was used to characterize adverse outcome pathways (AOPs) for chemicals that disrupt the hypothalamus-pituitary-gonadal endocrine axis in fathead minnows (FHM, Pimephales promelas). Gene expression changes in FHM ovaries in response to seven different chemicals, over different times, doses, and in vivo versus in vitro conditions, were captured in a large data set of 868 arrays. Potential AOPs of the antiandrogen flutamide were examined using two mutual information-based methods to infer gene regulatory networks. Representative networks from these studies were used to predict network paths from stressor to adverse outcome as candidate AOPs. The relationship of individual chemicals to an adverse outcome can be determined by following perturbations through the network in response to chemical treatment, thus leading to the nodes associated with the adverse outcome. Identification of candidate pathways allows for formation of testable hypotheses about key biological processes, biomarkers, or alternative endpoints that can be used to monitor an AOP. Finally, the unique challenges facing the application of this approach in ecotoxicology were identified and a road map for the utilization of these tools presented. Environ. Toxicol. Chem. 2011;30:22–38. © 2010 SETAC
The large number of chemicals being released into the environment each year presents a huge challenge for environmental regulatory agencies worldwide 1. In response to this challenge, a new paradigm in toxicity testing has been proposed that would allow screening of thousands of chemicals through high-throughput in vitro assays to determine those that present a potential environmental hazard 2, 3. For this new approach in toxicological assessment to be successful, however, the mechanistic basis for environmental stressor–induced adverse outcomes must be established.
Although the underlying biological processes for many toxicologically relevant adverse outcomes are well known and well characterized in the literature, some responses to stressors are mediated through currently unknown, or incompletely characterized, mechanisms or modes of action 4, 5. The advent of global, genome-wide analysis tools such as the omics technologies (transcriptomics, proteomics, or metabolomics) and new screening methodologies have provided new capabilities for probing entire biological systems, thereby allowing the discovery of novel modes of action for environmental stressors 6. Even when good indications exist on the identity of an adverse outcome pathway (AOP) 7 for a chemical or stressor, a network inference approach based on omics data, as described in this article, can reveal important alternative AOPs.
Microarrays and other transcriptomics-based techniques are powerful tools for assessing chemical impacts on a system and have been used to identify or postulate AOPs or modes of action 8–11. These studies typically have relied on model organisms, either directly or indirectly, in which established genetic, functional, or metabolic pathways have been characterized in detail. A few recent examples include the investigation of the mode of action for hepatotoxicity of triazole fungicides in rats 12, mapping of genes to functional pathways of model animals enabling investigation of the mode of action of explosives in fathead minnows, earthworms, and bobwhite quail 13–15, and metals and ibuprofen in Daphnia magna 16, 17. Several strategies for characterizing chemicals with unknown modes of action have been proposed, using classification models created from gene expression profiles of chemicals with known modes of action 18, 19. The application of omics techniques in prescreening chemicals and mixtures for prioritization has been proposed as part of the ToxCast program 20, which has employed a diverse selection of tests, including toxicogenomics.
Many of the aforementioned efforts involve a priori knowledge of pathways and effects. Recent efforts using mathematical and statistical algorithms offer alternative and complementary approaches to determine relationships between genes, proteins, and metabolites. The results from such algorithms then can be integrated with existing knowledge within a pathway discovery process. The impact of a chemical on a biological system can be described in terms of its effect on a network(s) of biological interactions that specify the relationships between components. An understanding of the overall architecture of the connections between clusters of genes, proteins, and metabolites altered or modulated as a result of exposure can be useful for deducing toxicologically relevant effects on such biological networks 21. In turn, these networks and interactions can be used to elucidate the events or adverse outcome pathways that regulate organismal response to a defined stressor and dictate outcomes 22.
Using new approaches in network inference and reverse engineering, high dimensional data, such as microarray data, can be used to deduce novel interactions, networks, and pathways, thereby enabling the exploration and examination of how chemicals cause toxicity. In the current study, reverse engineering refers to the use of statistical methods to infer a connectivity network given a set of gene expression, or other omic, data 23, 24. Development and validation of reverse engineering approaches using statistical modeling to infer regulatory or causal relationships is an actively evolving research area 25. For example, reverse engineering strategies have been proposed for identifying the mode of action for drugs 18, 26 and have played a significant role in defining regulatory pathways involved in cigarette-smoke–induced oxidative stress of mouse lungs 27.
Identification of AOPs using reverse engineering
In the current study, we discuss the application of reverse engineering gene regulatory networks from microarray data to identify adverse outcome pathways, highlight the unique challenges facing the application of this approach in the field of ecotoxicology, and provide an initial road map for the utilization of these tools. Using the approach outlined in Figure 1, linkage between a stressor and adverse outcome(s) is achieved in three stages: building networks, interrogating networks (e.g., analysis of inferred network topology), and identification and validation of nodes and pathways predictive of adverse outcomes. A stressor can be linked to adverse outcomes through reverse engineering of impacted pathways and networks, using genomics data coupled with phenotypic measurements. The sites of stressor input into a network can be identified by testing the association between a stressor and individual components or modules of the network. Similarly, the sites in a network that are most closely associated with adverse outcomes can be identified. The path from stressor to adverse outcome in a network can be considered a candidate adverse outcome pathway (AOP). The candidate pathway allows for the formation of testable hypotheses about key biological processes, biomarkers, or alternative endpoints that can be used to monitor an AOP.
Building networks: Reverse engineering and network inference
Reverse engineering of the structure of biological networks lies at the core of systems biology. Inference of such biological networks can be done using different approaches, depending on available data sets. Network structures have been determined based on multi (high) dimensional data such as protein–protein binding strength, protein abundance, signaling (protein activation), and metabolic data (metabolite level) 28–33. Here we limit our discussion to inference of transcriptional regulatory networks. Such networks are most often deduced from gene expression data, possibly combined with other prior information on the genes or DNA binding sites. These gene networks trace the direct and indirect interactions that control or influence gene expression. High-throughput microarray experiments are now producing mRNA expression data in quantities large enough for researchers to reconstruct portions of transcriptional regulatory networks based on correlations between measurements of expression response in diverse states 34. Inference of gene networks is probably the most studied case because of the relatively large amount of microarray data available, obtained at relatively low cost. The state of a transcriptional regulatory network refers to the set of gene (node) expression values corresponding to a condition such as a chemical dose, time point of exposure, or a genetically distinct individual.
A case study in reverse engineering of an AOP
The most direct way to illustrate how reverse engineering of biological networks can elucidate AOP elements is to conduct a demonstration. We do so here, using a large gene expression data set to explore and identify the adverse outcome pathways related to a disruptor of the hypothalamus-pituitary-gonadal endocrine axis. As part of a larger project investigating and modeling the impact of endocrine disrupting chemicals on reproduction in FHM 35, we have assembled a data set consisting of 868 single-color, 15,000-probe high-density oligo microarrays. This data set was focused on expression changes in FHM ovary tissue exposed to seven different chemicals, over different periods, with both in vivo and in vitro exposures. Each exposure set contains four to eight biological replicates and controls (Table 1). The data set is large enough to provide an excellent case study for demonstrating how network inference and reverse engineering can be used to examine AOPs. A detailed description of the data and initial analysis is beyond the scope of this review and will be presented in a separate publication (N. Garcia-Reyero, in preparation).
| Chemical/experiment | In vivo/in vitro | Time | Concentration | Conditions (no control) | Conditions (+ control) | Total arrays (n) |
| Fadrozole | In vivo | 0.5, 1, 2, 4, 6 h | 5, 50 µM | 10 | 15 | 60 |
| Fadrozole | In vivo | 6, 12, 24 h | 50 µM | 3 | 6 | 24 |
| Fadrozole | In vitro | 0, 1, 2, 3, 4, 6, 8, 10, 12 h | 50 µM | 8 | 16 | 64 |
| Fadrozole | In vivo | Exposed (1, 2, 4, 8 d), recovery (1, 2, 4, 8 d) | 3, 30 µM | 16 | 3 | 179 |
| Flutamide | In vivo | 1, 2, 4, 8 h | 500 µg/L | 5 | 10 | 39 |
| Ketoconazole | In vitro | 2, 4, 6, 8, 10, 12 h | 0.5 µM, 5 pools of 5 | 6 | 12 | 52 |
| Ketoconazole | In vitro | 15, 30, 45, 60, 75, 90, 105, 120, 135, 150 min | 0.5 µM | 10 | 20 | 84 |
| Ketoconazole | In vivo | 6, 12, 24 h | 0.5 µM | 3 | 6 | 24 |
| Stages | In vivo | NA | Previtellogenic, vitellogenic, mature ovary, ovulated eggs, atretic | 5 | NA | 25 |
| Prochloraz | In vitro | 2, 4, 6, 8, 10, 12 h | 2.5 µM, 5 pools of 5 | 6 | 12 | 53 |
| Prochloraz | In vivo | 6, 12, 24 h | 2.5 µM | 3 | 6 | 22 |
| RDX | In vivo | 1, 21 d | 5 mg/L | 2 | 4 | 22 |
| TNT | In vivo | 0.5, 1, 2, 4, 6, 24 h | 5 mg/L | 6 | 12 | 60 |
| Trenbolone | In vivo | Exposed (1, 2, 4, 8 d), recovery (1, 2, 4, 8 d) | low, high dose | 16 | 3 | 162 |
We focus our discussion on flutamide, one of the seven chemicals with which our data set was created. Flutamide is an androgen receptor antagonist primarily used to treat prostate cancer. It specifically competes with testosterone and dihydrotestosterone for binding to the androgen receptor 36 and ultimately inhibits prostate cell proliferation. Flutamide has been used as a model antiandrogen for endocrine disruption in FHM because of its specific mechanism of action 37, 38. Flutamide exposure causes a decrease in plasma levels of estradiol, number of mature oocytes, and fecundity in female FHM 37. Recent studies also suggest that flutamide may cause adverse effects on FHM through pathways other than that involving the androgen receptor 38.
Experimental designs for network inference and reverse engineering
Inference of the structure of molecular interaction networks is limited by the extent of information on the state of the nodes (genes, proteins, or metabolites) in the network. The more data points we have on each node (the greater the number of expression values for each gene) in as many different network states as possible, the more likely we will accurately infer connections (network edges) between nodes. Data on node states from a large number of conditions involving a diverse range of stressors at different dose levels and time points maximizes the identification of connections between different genes, proteins or metabolites, and a large number of such data points are needed to accurately infer networks from high throughput data 39, 40. Also, experimental designs and data sets of utility in reverse engineering efforts must balance the specific perturbation of interest (e.g., chemical being studied) with the diversity of perturbations required to reveal the underlying network structure 28, 39, 41, 42.
One way to achieve this balance is to survey a broad range of chemicals in a single model system 41. The aggregate results for all chemicals can be used for network inference, and the perturbations produced by individual chemicals then can be evaluated in the context of the resulting network. An alternative to large numbers of chemical exposures in the experimental design is to incorporate genetic diversity in network construction. Genetic variation has been proposed as likely to modulate response to a particular stressor and therefore provides perturbations to phenotypic response (e.g., gene expression) that can be used in reconstruction of networks 43. A potential advantage of using ecologically relevant organisms in network construction is the predominant use of genetically diverse outbred populations, in contrast to studies in other model systems (such as mice) in which more genetically homogeneous organisms are often used. If genetic markers such as single-nucleotide polymorphisms can be incorporated into the study design, then adverse effects can be linked to genetic markers through association with transcriptional networks.
Role of phenotypic anchors in networks
Another critical element for reverse engineering to support AOP discovery is the integration within the network of accurate phenotypic descriptors of the adverse outcome under consideration. This integration can be achieved by identifying the gene, protein, or metabolite state measurements most likely to explain variations in the phenotypic response. In the case study presented here, the adverse outcome is reduced fecundity, which has previously been linked with lower levels of circulating vitellogenin and steroids 44. Circulating hormone levels are also included in several of the individual experiments (Table 1), thus providing additional biological context for interpreting the subnetworks identified as important for regulating vitellogenin levels, which have been linked to population level effects 45. When combined with gene expression data, the phenotypic anchoring hormone and protein data can enable connection of pathways to adverse outcomes via association networks, as described later.
Data normalization issues in reverse engineering
In practice, as in the current case study, the extensive data requirements for reverse engineering networks necessitate the use of data from multiple different experiments. As a result, significant challenges often arise in integrating the diverse data sets. Appropriate normalization procedures must be implemented so that expression levels can be effectively compared across biological samples within an experiment and, even more challenging, between different experiments. Inappropriate normalization can incorporate systematic bias into comparisons 46. Most normalization methods require that either the number of genes changing in expression between conditions be small or that an equivalent number of genes increase and decrease in expression. When normalizing a large compendium of data sets, including different experiments performed by independent laboratories and with different experimental conditions, these assumptions are not necessarily true. Several strategies have been proposed to address these issues. Some involve application of a multistep normalization procedure 47. Others involve incorporating knowledge of experiment and batch variations in the form of a Bayesian prior 48.
If neither of these approaches is feasible, then normalization should be applied only on arrays within each treatment or condition group. However, in many cases, experiments are not properly randomized, and confounding of conditions occurs with sources of nonbiological variation. For this reason, all of the arrays in a particular experiment are typically normalized together as a single group 49. In the current case study, all experiments were performed in the same laboratory, on the same tissue (ovary), and with a range of treatments for which we can hypothesize the original assumptions to be correct. For this reason we have applied quantile normalization to the data set, as discussed in Bolstad et al. 49, to give the same empirical distribution of intensities to each array, although we recognize that more sophisticated approaches may be more appropriate.
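The idea behind quantile normalization can be illustrated with a minimal sketch. This is not the implementation used in the case study; the function name, the toy intensities, and the tie-handling rule (ties broken by probe order rather than averaged) are assumptions made for illustration.

```python
def quantile_normalize(arrays):
    """Give every array the same empirical intensity distribution.

    arrays: list of equal-length lists, one list of probe intensities per array.
    Ties are broken by probe order, a simplification relative to
    implementations that average tied ranks.
    """
    n_probes = len(arrays[0])
    # Sort each array independently, then average across arrays at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        # order[r] is the index of the probe holding rank r in this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = rank_means[rank]
        normalized.append(out)
    return normalized


# Three toy arrays with four probes each (illustrative values only).
toy = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(toy)
```

After normalization every array contains the same set of values, and only the assignment of those values to probes (the within-array ranking) differs between arrays.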
Building networks: approaches and algorithms
Regulatory effects on genes can be established through direct experimental investigation or inferred using algorithms that perform statistical inference from observational data. For example, the expression values of the genes are used to find the connections (edges) between the genes (nodes) of a transcriptional regulatory network. Current approaches for inferring or reverse engineering networks have been reviewed extensively and recently 21, 29, 33, 40. Computational scientists, drawing on knowledge from several fields, have built very sophisticated inference algorithms to construct networks from even partial correlations in state. These algorithms draw on information theory (for example, mutual information and conditional mutual information 50, 51), linear regression models 52–55, probabilistic graphical models such as Bayesian network structure learning 28, 56–60 (for 59, see http://code.google.com/p/bnt/), state space models 61, and data mining approaches such as association rule mining 62.
Mutual information-based analysis
In the analysis of the current case study data, we focused on inference of a network topology from the entire compendium data set. We applied two different algorithms that are based on information theory and the entropy-based estimation of mutual information (MI): the Context Likelihood of Relatedness (CLR) algorithm 63 and the Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) 39. Mutual information is a general measurement of dependencies between pairs of variables. These variables can be genes, proteins, metabolites, or relevant physiology measurements. The higher the MI score between two variables, the greater the information we derive on the states of one from the pattern of states of the other. Although a high MI does not necessarily imply causation, high-scoring MI-derived interactions between gene pairs can be used to formulate trial structures of underlying regulatory networks, as guidance for further exploration. The entropy-based MI captures a broad range of biologically relevant dependencies (positive, negative, and linear as well as nonlinear relationships) and is therefore capable of detecting more general dependencies than measures of linear correlation (i.e., Pearson or Spearman correlation).
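The difference between MI and linear correlation can be made concrete with a small sketch. It uses a simple equal-width histogram (plug-in) estimator, which is an assumption chosen for brevity and is not necessarily the estimator used by CLR or ARACNE; the function names, the bin count, and the toy data are likewise illustrative.

```python
import math
from collections import Counter


def discretize(values, n_bins):
    """Map each value to an equal-width bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y, n_bins=5):
    """Plug-in MI estimate (in bits) from a 2D histogram of x and y."""
    bx, by = discretize(x, n_bins), discretize(y, n_bins)
    n = len(x)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    return sum(
        (c / n) * math.log2((c / n) / ((px[i] / n) * (py[j] / n)))
        for (i, j), c in pxy.items()
    )


def pearson(x, y):
    """Pearson correlation coefficient, for comparison with MI."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# A symmetric, nonlinear dependency: y is fully determined by x.
x = [i / 50 - 1 for i in range(101)]  # -1.0 ... 1.0
y = [v * v for v in x]

r = pearson(x, y)
mi = mutual_information(x, y)
```

For this toy relationship the Pearson correlation is essentially zero because the positive and negative halves cancel, whereas the MI estimate is clearly positive, which is the kind of dependency a linear measure would miss.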
Although estimation of the MI is not based on any assumptions about the distribution of variables, it does require a relatively large number of data points (>50) on the state of each node 39, 63. Such a requirement is not unusual for any statistical/correlation-based approach. However, the computational requirements of the independent pairwise MI calculations between genes are relatively small, with each calculation using only the state (expression) data for those two genes. In contrast, Bayesian network structure learning methods operate by completely rescoring an inferred network topology after each proposed edge change. As a result, MI can be used with thousands of genes whereas other algorithms, such as Bayesian network structure learning algorithms, may be limited to several hundred or fewer genes because of computational overhead 60. If insufficient data are used with Bayesian methods, the problem is underdetermined and cannot provide usable solutions 42. Once the set of MI values is estimated with one of the available MI-based algorithms (e.g., CLR or ARACNE), each algorithm generates a network by using gene-to-gene comparisons (edges) whose raw or adjusted MI-based scores pass a user-defined cutoff, which is set empirically or on the basis of statistical significance 39. Postprocessing of inferred edges can be performed to more accurately define the network by eliminating incorrectly identified indirect connections and false-positive connections between genes. ARACNE performs such postprocessing by using the data processing inequality; CLR, by locally adjusting the cutoff itself.
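The thresholding and postprocessing steps can be sketched as follows. This is a simplified illustration of pruning by the data processing inequality, not the ARACNE code itself; the function name, the `tolerance` parameter, and the toy scores are assumptions.

```python
from itertools import combinations


def prune_with_dpi(mi, threshold, tolerance=0.0):
    """Keep edges above `threshold`, then drop the weakest edge of every
    fully connected gene triplet (data processing inequality, DPI).

    mi: dict mapping frozenset({gene_a, gene_b}) -> MI score.
    tolerance: hypothetical slack; an edge is removed only if it is weaker
    than both other edges by more than this relative margin.
    """
    edges = {pair: s for pair, s in mi.items() if s >= threshold}
    genes = sorted({g for pair in edges for g in pair})
    to_remove = set()
    for a, b, c in combinations(genes, 3):
        triplet = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if not all(e in edges for e in triplet):
            continue
        weakest = min(triplet, key=lambda e: edges[e])
        others = [edges[e] for e in triplet if e != weakest]
        if edges[weakest] < min(others) * (1 - tolerance):
            to_remove.add(weakest)
    return {pair: s for pair, s in edges.items() if pair not in to_remove}


# Toy scores: A-B and B-C are strong; A-C is a weaker, possibly indirect, link.
scores = {
    frozenset(("A", "B")): 0.9,
    frozenset(("B", "C")): 0.8,
    frozenset(("A", "C")): 0.4,
    frozenset(("C", "D")): 0.05,  # below the cutoff
}
network = prune_with_dpi(scores, threshold=0.1)
```

In this toy case the C-D edge fails the cutoff and the A-C edge is removed as the weakest side of the A-B-C triangle, leaving only A-B and B-C.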
The case study data set was first analyzed using an implementation of ARACNE in the statistical environment R 64. The analysis incorporated plasma hormone (testosterone and estradiol) and vitellogenin protein levels in addition to the expression levels of the approximately 4,000 genes present on the microarray for which Gene Ontology annotation was available. It was performed on a filtered version of the data set to remove invariant genes and genes expressed at low levels. The ARACNE-based analysis focused on reconstructing a network representative of the neighborhood of genes encoding receptors. Significant connections to the expression of receptors were selected by applying a p-value threshold of p < 10⁻⁸, as estimated by a resampling procedure. We performed a second network analysis using an alternate MI approach as implemented in the CLR algorithm from within the Software Environment for BIological Network Inference 65, using the original C code for CLR. Although both ARACNE and CLR are mutual information-based algorithms, each imposes a superstructure on the basic MI calculation that differs in important ways. Both algorithms generate a pairwise matrix of MI-based values, and the significance of the MI score between genes X and Y is estimated by comparing the MI score with a background distribution of MI values. The algorithms differ in that CLR determines local background distributions of MI values whereas ARACNE uses one global distribution of all MI values in the matrix. In CLR, the background distribution is created anew for each pair of genes and estimated by a joint normal distribution of the combined set of MI values of all the possible incoming and outgoing edges for each gene of the pair 63. In the work performed here, CLR calculations were made on the entire set of 868 arrays, using a set of 527 FHM genes that were the most differentially expressed across all 187 distinct conditions represented in the data set.
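The effect of a local background can be illustrated with a short sketch. It follows the commonly described CLR scoring, in which each MI value is converted to a z-score against the MI values of each gene in the pair and the two z-scores are combined; the function name, the zero-spread guard, and the toy matrix are assumptions, and this is not the C code used in the case study.

```python
import math
import statistics


def clr_scores(mi):
    """mi: symmetric list-of-lists MI matrix (diagonal ignored).
    Returns a matrix of CLR-style scores sqrt(z_i^2 + z_j^2)."""
    n = len(mi)
    means, sds = [], []
    for i in range(n):
        # Local background for gene i: its MI values with all other genes.
        background = [mi[i][k] for k in range(n) if k != i]
        means.append(statistics.mean(background))
        sds.append(statistics.pstdev(background) or 1.0)  # guard against zero spread
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            zi = max(0.0, (mi[i][j] - means[i]) / sds[i])
            zj = max(0.0, (mi[i][j] - means[j]) / sds[j])
            scores[i][j] = scores[j][i] = math.sqrt(zi * zi + zj * zj)
    return scores


n = 6
mi = [[0.05] * n for _ in range(n)]
for g in (0, 1):  # genes 0 and 1: uniformly high MI with every other gene
    for k in range(n):
        mi[g][k] = mi[k][g] = 0.5
mi[2][3] = mi[3][2] = 0.5  # genes 2 and 3: one specific strong link
scores = clr_scores(mi)
```

The pairs (0, 1) and (2, 3) have the same raw MI of 0.5, yet the pair (2, 3) receives the higher score because 0.5 stands out against the backgrounds of genes 2 and 3 but not against those of genes 0 and 1. A single global cutoff on raw MI would treat the two pairs identically.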
To obtain the 527 genes, each condition (time point, chemical, or treatment) was first normalized to its paired experimental controls. The top 600 genes that differed significantly from controls across all conditions were then identified using an analysis of variance test with a p-value threshold of p < 0.05. Finally, this list of 600 genes was filtered using a fold-change threshold of 1.5 relative to control, which left 527 genes.
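A rough sketch of this kind of two-step gene selection is given below. Several details are assumptions rather than the procedure actually used: genes are ranked by the one-way analysis of variance F statistic instead of being thresholded on a p value, the fold-change rule keeps a gene if any condition mean is changed at least 1.5-fold up or down, and the gene names and ratios are invented.

```python
import statistics


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of replicate groups."""
    all_values = [v for g in groups for v in g]
    grand = statistics.mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_genes(expr, top_n, min_fold):
    """expr: gene -> list of groups of control-normalized ratios
    (treated / control, one group of replicates per condition)."""
    ranked = sorted(expr, key=lambda g: f_statistic(expr[g]), reverse=True)[:top_n]
    kept = []
    for gene in ranked:
        means = [statistics.mean(group) for group in expr[gene]]
        # Keep genes changed at least `min_fold` up or down in any condition.
        if any(m >= min_fold or m <= 1 / min_fold for m in means):
            kept.append(gene)
    return kept


# Hypothetical genes with three conditions and three replicates each.
expr = {
    "geneA": [[1.0, 1.1, 0.9], [2.0, 2.2, 1.9], [1.0, 0.95, 1.05]],
    "geneB": [[1.0, 1.05, 0.95], [1.2, 1.25, 1.15], [1.0, 1.02, 0.98]],
    "geneC": [[1.0, 1.4, 0.6], [1.1, 0.7, 1.3], [0.9, 1.2, 0.8]],
}
selected = select_genes(expr, top_n=2, min_fold=1.5)
```

Here geneA and geneB both rank highly on the F statistic, but only geneA survives the fold-change filter; geneC is too noisy to rank at all. The same logic explains how a list of 600 statistically selected genes can shrink once an effect-size threshold is applied.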
Interrogating network results
Networks derived from gene expression data using reverse engineering methods can be investigated using visualization software such as Cytoscape 66, in which the topology can be viewed and networks can be organized on the basis of the number and strength of connections. A number of Cytoscape plugins are available to facilitate interactive visualization of the network analysis. For example, the Cytoscape plug-in Molecular Complex Detection (MCODE) 67 can find hubs (highly connected genes) and clusters (highly interconnected regions) in a network (Fig. 2). The resulting network can be combined with gene membership in known pathways as well as with other data relating to gene function, such as transcription factors and their predicted binding sites, or receptors hypothesized as being of importance in biological response. This merging of information permits one to form an integrated network that represents both existing knowledge and new information gained by the gene expression study. Additional information can be provided by the use of phenotypic markers that may be associated with adverse outcome. The integrated network forms the basis for identification of adverse outcome pathways as described in Figure 1.
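As a simple stand-in for the hub detection described above, nodes can be ranked by their number of connections. This sketch is not the MCODE algorithm, which also scores local density to find clusters; the function names, the degree cutoff, and the gene labels are hypothetical.

```python
from collections import defaultdict


def degree_table(edges):
    """Count the number of edges attached to each node."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)


def hubs(edges, min_degree):
    """Return nodes with at least `min_degree` connections, most connected first."""
    deg = degree_table(edges)
    return sorted(
        (g for g, d in deg.items() if d >= min_degree),
        key=lambda g: (-deg[g], g),
    )


# Hypothetical inferred edges between placeholder genes g1..g6.
edges = [("g1", "g2"), ("g1", "g3"), ("g1", "g4"), ("g2", "g3"), ("g5", "g6")]
hub_list = hubs(edges, min_degree=3)
```

With these toy edges only g1 reaches three connections, so it is the single hub reported.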
Algorithm for reconstructing accurate cellular networks
The ARACNE network revealed a highly significant connection between serum levels of testosterone and vitellogenin and the molecular state of the ovary (Fig. 2). Functional analysis of a number of subnetworks defined by the modularization methodology in the Cytoscape MCODE plug-in reveals subnetworks associated with specific biological processes of potential relevance for ovary biology (Fig. 2a). The analysis of genes in the neighborhood of testosterone reveals a significant association between testosterone levels and the activity of molecular pathways involved in protein processing (targeted protein degradation, protein export), cell division, and cell differentiation (Fig. 2b).
Context likelihood of relatedness (CLR) network
A second network was retrieved from transcriptomics data, using CLR to look at interactions of a smaller subset of genes, allowing us to focus on genes most affected across all chemical exposures (Fig. 3). Using the methods of Faith et al. 63, we expect fewer than 190 false-positive regulatory edges to be reported with z-scores 3.0 standard deviations higher than the mean CLR score. Using the data set in Table 1, we found 1,771 edges between 425 of the original 527 genes, using a cutoff of 3.0. Comparing against the 190 edges expected by chance, we thus obtained excellent results from CLR for additional analysis. We used MCODE, with default options and loops enabled, to investigate the network topology. Several high-scoring (dense connectivity) clusters were found containing genes with known or suspected involvement in stress responses and AOPs, with 18 clusters (subnetworks) detected in toto. Genes from the top MCODE clusters were analyzed for gene ontology overrepresentation. The clusters were enriched with gene ontology terms such as lipid metabolic process, protein binding, steroid binding, catalytic activity, and transferase activity (Fig. 3).
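The comparison between expected false positives and reported edges can be worked through directly from the numbers quoted above. The variable names are illustrative, and the calculation treats 190 as an upper bound, as the text does.

```python
expected_false_positives = 190  # upper bound quoted in the text
reported_edges = 1771           # edges passing the z-score cutoff of 3.0
genes_in_network = 425
genes_tested = 527

false_positive_fraction = expected_false_positives / reported_edges
gene_coverage = genes_in_network / genes_tested

print(f"expected false-positive fraction: at most {false_positive_fraction:.1%}")
print(f"genes retained in the network: {gene_coverage:.1%}")
```

Under these figures at most about 11% of the reported edges would be expected by chance, and roughly 81% of the 527 input genes have at least one edge passing the cutoff.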
Several genes known to be important in the hypothalamus-pituitary-gonadal axis were found to be present in the overall CLR-derived network, and especially in the top four clusters derived by MCODE from the network (Fig. 3). Most of the hypothalamus-pituitary-gonadal genes were placed in the highest-scoring cluster 1, including Activin receptor types I and II (ACVR1 and ACVR2), aromatase, cholesterol side chain cleavage enzyme, estrogen receptor alpha and beta (ESR1 and ESR2), follicle-stimulating hormone receptor, 17 beta-hydroxysteroid dehydrogenase, inhibin a, luteinizing hormone, steroidogenic acute regulatory protein, androgen receptor, activin receptor (ACVR1), low-density lipoprotein receptor, and syntaxin. Consistent with their known biology, several of these genes are highly connected hub genes, indicating that they may serve as master regulators. Many genes in the inferred network had no annotation available. Hence, their appearance and location in the network can provide new information based on their interactions with other genes of known function and provide suggestions for further exploratory analysis on such genes of unknown classification.
Linking exposure to adverse outcome
The next challenge is to infer a causal pathway through a network linking the environmental stressor to the adverse outcomes through a series of network modules, such as through subsets of functional or topologically related nodes. Network inference methods based on random or stress-induced single-time-point perturbations from homeostasis can be very effective in inferring gene or protein interactions. Where dose–response or time-course data are not available, the network can still be used to identify potential key modules connecting the environmental stressor to the adverse outcome. Ultimately, network inference should be seen as an important approach that aims to generate testable hypotheses for further exploratory analysis and validation, as described in more detail later.
Modules/subnetworks defined on the basis of network connectivity represent active biological pathways across the experimental samples represented in the data set. These are not necessarily associated with adverse outcome. The next step in the identification of an AOP is therefore to associate some of these sub-networks with an adverse outcome.
In the challenge data, the data for the adverse outcome itself (reduced fecundity) were not available, so several surrogate measures (i.e., vitellogenin, plasma testosterone, plasma estrogen) were used. The links between these biomarkers and the endpoints of interest are the subject of a broader study 35. As shown in Figure 2b, plasma testosterone is strongly associated with one of the subnetworks of the ARACNE network, suggesting that the subnetwork may represent an intracellular event in the ovary linked to change in an organism. Additional research is needed, however, to establish whether these changes were in response to plasma testosterone or contribute to setting the levels of plasma testosterone.
In addition to measured endpoints, the functional annotations for the genes themselves can provide clues about which subnetworks might be associated with the adverse outcome of interest. Our initial analysis using CLR and ARACNE identified subnetworks that are potentially associated with ongoing cellular processes modulated across the samples in the data set. To associate these networks with adverse outcome, we link modules to differentially expressed genes of known function.
Once the network has been anchored to the adverse outcome(s), the subnetworks associated with toxicity pathways of interest must be defined. In the case of known toxicity pathways, this is done by identifying subnetworks enriched for genes encoding the proteins involved in the toxicity pathway. Where the toxicity pathways are not defined or additional toxicity pathways are suspected, signature genes associated with specific chemical treatments can be viewed in the context of the network to identify the subnetworks that most closely reflect perturbation of the proximal toxicity pathway by that chemical at an early time point after exposure. The time point should be chosen so that the response represents a toxic effect rather than an immediate acute response that is not chemical specific or a response that is compensating for the chemical effects. Plasma testosterone levels, as a surrogate for reproductive effects, have been linked to highly correlated genes to identify subnetworks related to changes in plasma testosterone levels (Fig. 4). Another subnetwork, containing genes affected by exposure to flutamide, is also highlighted and is used later in this review as an example of examining the AOPs of a chemical by reverse engineering.
Once the subnetworks associated with both toxicity pathway and adverse outcome have been identified, the overall connectivity (topology) of the network can be used to predict the key gene-to-gene regulatory relationships lying between the toxicity pathway and the subnetwork directly linked to the adverse outcome. The topology of the MI-based network could be used with known biological relationships and pathways found within the subnetworks to infer a causal path through the network. Key nodes could then be derived from the critical pathways based on the strength of connectivity from the node (gene) to genes associated with the critical pathway, for example, the connection of an identified hub gene to such target genes. Such analysis can also lead to hypotheses on key biomarkers and validation assays (e.g., knock-down and overexpression studies with key nodes as discussed later) to test the predictive power of perturbation of the key nodes with respect to adverse outcome.
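The idea of tracing a route from a toxicity pathway subnetwork to an outcome-linked subnetwork can be sketched as a shortest-path search over the inferred network. The example below is a minimal sketch, assuming the network is available as an undirected edge list. The breadth-first search, the function names, and the toy gene labels are illustrative assumptions rather than the method used by the authors.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (gene, gene) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def shortest_path(adj, sources, targets):
    """Breadth-first search from any source gene to the nearest target gene.

    Returns the path as a list of genes, or None if no target is reachable.
    """
    targets = set(targets)
    parent = {}
    queue = deque()
    for s in sources:
        if s in adj and s not in parent:
            parent[s] = None
            queue.append(s)
    while queue:
        node = queue.popleft()
        if node in targets:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nb in sorted(adj[node]):  # sorted for reproducible output
            if nb not in parent:
                parent[nb] = node
                queue.append(nb)
    return None


# Hypothetical toy network: "tox1" stands for a stressor-affected gene and
# "out1" for a gene in the outcome-linked subnetwork.
edges = [
    ("tox1", "g1"), ("g1", "hub"), ("hub", "g2"), ("g2", "out1"),
    ("tox1", "g3"), ("g3", "g4"),
]
adj = build_adjacency(edges)
candidate_path = shortest_path(adj, ["tox1"], ["out1"])
```

Because MI-based edges are undirected statistical associations, a path found this way is only a candidate. Direction and causality still have to be supplied by prior biological knowledge or follow-up experiments.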
Mapping the transcriptional response to flutamide exposure
To demonstrate reverse engineering approaches in identifying AOPs, the toxicant flutamide was used as a stressor to examine potential AOPs. Genes modulated by flutamide exposure were mapped onto the overall network inferred using ARACNE (Fig. 4). The genes clustered in proximity of module 4, which was enriched in the Gene Ontology (GO) terms for anatomical structure development, cell motility, and inflammatory response (false discovery rate < 10%). Notably, a clustering of genes down-regulated by flutamide occurred in module 4, consistent with inhibition of cell proliferation (blue circles in Fig. 2b). This cluster of genes can be further explored by using other reverse engineering strategies for more intensive analysis.
Flutamide-affected genes also were examined in the CLR-inferred transcriptional network. Many important biological pathways have critical genes that are transcription factors, receptors, or signaling cascades. These genes appeared in the CLR and ARACNE networks as highly connected hubs. However, the complexity of the networks obscures relationships that could be valuable in understanding the effects of flutamide. In the data set (Table 1), flutamide affected 676 genes from the 15,000-gene microarray when arrays were compared across all conditions by using an analysis of variance method. Twenty-two of these genes were found to be in the 425 genes present in the CLR network.
The CLR network was simplified to focus on genes closely related to these differentially expressed genes to see how perturbation by flutamide impacts the network. First, the subnetwork connecting genes differentially expressed in response to flutamide and the 10 most highly connected genes in the network was isolated (Fig. 5). This was based on the reasoning that subnetworks, and associated pathways, perturbed by exposure could be involved in the AOP. Through this approach, two subnetworks were identified that were enriched in genes differentially expressed when exposed to flutamide. The first subnetwork contained 19 genes, eight of which were flutamide affected. Here, key regulators of androgen signaling, estrogen signaling (ESR2), transforming growth factor beta signaling, ephrin receptor signaling, and antioxidant responses (NFE2L2/Nrf2) are connected to flutamide effects. The second, smaller subnetwork was composed entirely of flutamide-affected genes. These two subnetworks highlight potential pathways through which flutamide may cause adverse effects.
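The first simplification step can be sketched as follows, assuming the network is an undirected edge list with no duplicate edges. Here "connecting" is interpreted as keeping only edges whose two endpoints both belong to the union of the differentially expressed genes and the top hubs. That interpretation, the function names, and the toy data are assumptions for illustration, not the authors' exact procedure.

```python
def top_hubs(edges, k):
    """Return the k genes with the highest degree (ties broken by name)."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    ranked = sorted(degree, key=lambda g: (-degree[g], g))
    return ranked[:k]


def induced_subnetwork(edges, keep):
    """Keep only edges whose two endpoints are both in `keep`."""
    keep = set(keep)
    return [(a, b) for a, b in edges if a in keep and b in keep]


# Hypothetical toy network; "de1" and "de2" stand for stressor-affected genes.
edges = [
    ("h1", "a"), ("h1", "b"), ("h1", "c"), ("h1", "de1"),
    ("h2", "de1"), ("h2", "de2"), ("h2", "d"),
    ("e", "f"),
]
affected = {"de1", "de2"}
hubs = top_hubs(edges, 2)  # the article used the 10 most connected genes
subnetwork = induced_subnetwork(edges, affected | set(hubs))
```

In this toy case both hubs are retained together with the affected genes attached to them, while unrelated neighbors of the hubs drop out.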
Second, high scoring subclusters within the overall network were examined for inclusion of flutamide-affected genes. Three high scoring subnetworks were identified (Fig. 5b, c, d) from the overall CLR network (Fig. 3), using MCODE at default settings. Each of these three subnetworks contained genes affected by flutamide. These subnetworks were then merged to form a simplified yet highly connected subnetwork, from which we were able to investigate potential AOPs for flutamide (Fig. 6).
In the smaller of two networks resulting from the union of subnetworks (Fig. 6), six genes (CRAT, KLHDC3, LMAN1, MSL2, WDR24, and an unknown gene) were differentially expressed in response to flutamide. These genes are involved in meiotic recombination (in mouse) and metabolism and are potentially involved in histone acetylation and signal transduction. These are linked to four highly connected nodes and potential AOPs, representing stress and xenobiotic response, and metabolism control. The heat shock transcription factor is involved in stress responses and development and may play a pivotal role in the balance between reproductive processes and maintenance when stressed 68, 69. Connection to the xenobiotic response is suggested by interactions with the aryl hydrocarbon receptor-interacting protein (AIP), which is required for function of the aryl hydrocarbon receptor 70, and STRADA, a protein that in humans activates pathways involved in energy homeostasis, lipid metabolism, and tumor suppression 71, 72.
The larger network (Fig. 6) contains two clusters of genes. The cluster of 19 genes, including three flutamide-responsive genes, is composed of several signaling and receptor genes involved in estrogen and androgen signaling, cell regeneration, development, and antioxidant response regulation. The activin type I receptor ACVR1 is involved in androgen signaling within the hypothalamus-pituitary-gonadal-axis, thereby influencing development of tissues from germ cells to skeletal muscle 73. Tenascin-R is involved in regenerating optic axons in zebrafish 74, and its expression has been shown to increase with testosterone in zebra finches 75. The ephrin receptor subfamily member EPHA3 is implicated in mediating developmental events. Three genes, EPHA3, ACVR1, and tenascin-R, are linked to three major signaling pathways involved in steroid hormone processes via estrogen receptor alpha (ESR1), the estrogen receptor beta (ESR2), the androgen receptor—a steroid hormone–activated transcription factor, and the Wnt-inducible signaling pathway protein 1. A transcription factor NFE2L2, also called NRF2, is involved in global regulation of antioxidant responses 76. These genes are also connected to a spermine oxidase, whose activity can result in oxidative stress 77. The smaller cluster contains 13 genes, five of which were impacted by flutamide. This cluster appears to be enriched in different aspects of mitochondrial function (electron transport, mitochondrial permeability, and metabolism), cell adhesion, surface receptor signaling, and protein synthesis and folding.
Defining and evaluating a potential AOP
The identification of adverse outcome pathways can assist in the development of targeted predictive screens for toxicity. Not all perturbations of an identified AOP will invariably lead to an adverse outcome, however. Specific qualitative or quantitative changes in an AOP, a critical perturbation profile, may be required. Critical perturbation profiles occur when changes are sufficiently strong to lead to an adverse outcome. A combination of multiple pathways may be required for an adverse outcome, and the specific time period of a pathway modulation by a stressor, in conjunction with other associated responses, may be critical. As a result, multiple biomarkers in one or more AOPs, with measurements at different time points after exposure, may be needed for prediction of an adverse outcome. The quantitative definition of critical perturbation is dependent on the system examined. The amount of perturbation needed to initiate an adverse outcome may vary greatly between different chemicals and pathways. These aspects should not be underestimated, because an AOP potentially could be activated at a subthreshold level from exposure levels insufficient to cause toxicity. Recognition of such activation is nevertheless valuable as indicating an early stress response.
Functional analysis of genes recovered in the CLR subnetworks highlights several potential areas in which flutamide may exert adverse effects (Table 2). The functions and pathways found in the CLR subnetworks are consistent with the findings using ARACNE, in which flutamide-affected genes clustered in proximity of module 4 and were enriched in the GO terms for anatomical structure development, cell motility, and inflammatory response. Also, the central hub genes in the CLR-derived subnetworks are consistent with gene ontology overrepresentation analysis of the entire set of genes differentially expressed in response to flutamide (Table 3).
|Gene symbol||Gene name||PANTHER molecular function||PANTHER biological process||PANTHER pathway|
|CRAT||Carnitine acetyltransferase||Acetyltransferase; acyltransferase||Amino acid metabolism; fatty acid metabolism|
|COQ9||Coenzyme Q9 homolog (S. cerevisiae)||Molecular function unclassified||Biological process unclassified|
|LYRM7||Lyrm7 homolog (mouse)||Molecular function unclassified||Biological process unclassified|
|R3hcc1||R3H domain and coiled-coil containing 1||Molecular function unclassified||Biological process unclassified|
|LARGE||Like-glycosyltransferase||Glycosyltransferase||Carbohydrate metabolism; protein glycosylation|
|L1CAM||L1 cell adhesion molecule||CAM family adhesion molecule||Cell adhesion–mediated signaling; cell adhesion; neurogenesis|
|WISP1||Wnt1 inducible signaling pathway protein 1||Growth factor||Cell surface receptor–mediated signal transduction; ligand-mediated signaling; other oncogenesis; cell structure|
|RAPGEF6||Rap guanine nucleotide exchange factor (GEF) 6||Guanyl-nucleotide exchange factor||Cell surface receptor–mediated signal transduction; MAPKKK cascade|
|TNR||Tenascin R (restrictin, janusin)||Cell adhesion molecule; extracellular matrix glycoprotein||Extracellular matrix protein-mediated signaling; neurogenesis|
|Nrf2||Nuclear factor (erythroid-derived 2)-like 2; NFE2L2||Other transcription factor; Nucleic acid binding||Hematopoiesis|
|YKT6||YKT6 v-SNARE homolog (S. cerevisiae)||SNARE protein||Intracellular protein traffic|
|LMAN1||Lectin, mannose-binding, 1||Other membrane traffic protein||Intracellular protein traffic; protein targeting|
|B2M||Beta-2-microglobulin||Major histocompatibility complex antigen||MHCI-mediated immunity||T cell activation → MHC-antigen|
|FUBP1||Far upstream element (FUSE) binding protein 1||Other RNA-binding protein||mRNA splicing; translational regulation|
|ESR1||Estrogen receptor 1||Nuclear hormone receptor; Transcription factor; Nucleic acid binding||mRNA transcription regulation; steroid hormone-mediated signaling; other neuronal activity; oogenesis; cell cycle control; mitosis; cell proliferation and differentiation; cell motility|
|Esr2||Estrogen receptor 2 (ER beta)||Nuclear hormone receptor; Transcription factor; Nucleic acid binding||mRNA transcription regulation; steroid hormone-mediated signaling; other neuronal activity; oogenesis; cell cycle control; mitosis; cell proliferation and differentiation; cell motility|
|HSF2||Heat shock transcription factor 2||Other transcription factor; Nucleic acid binding||mRNA transcription regulation; stress response|
|HSDL2||Hydroxysteroid dehydrogenase-like 2||Dehydrogenase; reductase||Other metabolism|
|Smox||Spermine oxidase||Oxidase||Other metabolism|
|NDUFB8||NADH dehydrogenase (ubiquinone) 1 beta subcomplex, 8, 19 kDa||Oxidoreductase||Oxidative phosphorylation|
|WDR24||WD repeat domain 24||Other receptor||Peroxisome transport; transport|
|Rpl22l1||Ribosomal protein L22-like 1||Ribosomal protein||Protein biosynthesis|
|RPL31||Ribosomal protein L31||Ribosomal protein||Protein biosynthesis|
|EIF3e||Eukaryotic translation initiation factor 3, subunit 6 48kDa||Translation initiation factor||Protein biosynthesis; translational regulation|
|PPIF||Peptidylprolyl isomerase F (cyclophilin F)||Other isomerase||Protein folding; nuclear transport; immunity and defense|
|ACVR1||Activin A receptor, type I||TGF-beta receptor; Serine/threonine protein kinase receptor; Protein kinase||Protein phosphorylation; cytokine and chemokine mediated signaling pathway; receptor protein serine/threonine kinase signaling pathway||Wnt signaling pathway → Transforming growth factor beta activated kinase 1; TGF-beta signaling pathway → TGF-beta receptor I and II|
|EPHA3||EPH receptor A3||Tyrosine protein kinase receptor; Protein kinase||Protein phosphorylation; receptor protein tyrosine kinase signaling pathway; neurogenesis; mesoderm development; cell proliferation and differentiation||Angiogenesis → ephrin receptor|
|MMP2||Matrix metallopeptidase 2||Metalloprotease; Other extracellular matrix||Proteolysis||Alzheimer disease-presenilin pathway → matrix metalloprotease|
|KLHDC3||Kelch domain containing 3||Chromatin/chromatin-binding protein||Spermatogenesis and motility|
|DHTKD1||Dehydrogenase E1 and transketolase domain containing 1||Dehydrogenase||Tricarboxylic acid pathway||TCA cycle → alpha-ketoglutarate dehydrogenase|
|IDH2||Isocitrate dehydrogenase 2 (NADP+), mitochondrial||Dehydrogenase||Tricarboxylic acid pathway||TCA cycle|
|AIP||Aryl hydrocarbon receptor interacting protein||Chaperone||Vision|
|Category||GO term||Gene count||p|
|BP||Cellular component organization and biogenesis||87||2.10E-04|
|BP||Cellular metabolic process||213||5.20E-04|
|BP||Primary metabolic process||213||6.30E-04|
|BP||Biopolymer metabolic process||144||2.60E-03|
|BP||DNA metabolic process||33||2.60E-03|
|MF||Damaged DNA binding||6||3.20E-03|
|BP||Positive regulation of cellular process||36||4.20E-03|
|BP||Cell cycle process||29||4.40E-03|
|MF||Transcription cofactor activity||16||4.60E-03|
|BP||Response to DNA damage stimulus||16||5.50E-03|
|BP||Organelle organization and biogenesis||40||8.50E-03|
|BP||Maintenance of localization||5||1.10E-02|
|BP||Fatty acid biosynthesis||6||1.30E-02|
|KEGG Pathway||Fatty acid biosynthesis||3||1.10E-02|
|KEGG Pathway||Adipocytokine signaling pathway||7||1.70E-02|
|KEGG Pathway||Urea cycle and metabolism of amino groups||4||5.20E-02|
Analysis of our inferred networks provides hypotheses of which pathways may be associated with adverse outcomes. We know that flutamide causes a reduction in plasma estradiol and reduced tissue development, leading to fewer mature oocytes 37, 38. From analysis of the relationship of flutamide-affected genes to the overall networks constructed using ARACNE and CLR, genes involved in multiple pathways were connected to flutamide exposure. Given that androgen and estrogen receptors, in addition to Wnt and activin signaling (down-regulated), were associated based on our CLR- and ARACNE-based analysis of their gene expression patterns, we can hypothesize that flutamide impacts androgen and estrogen steroid hormone pathways, possibly at the receptor level. This would be consistent with the reduction in estradiol synthesis observed in flutamide-exposed fish 37, 38. Fatty acid biosynthesis appears to be significantly impacted in flutamide-affected genes based on enrichment for both the GO process and Kyoto Encyclopedia of Genes and Genomes pathway (http://www.genome.jp/kegg/) (Table 3). In human cells, flutamide has been found to block androgen regulation of lipoprotein lipase, a key enzyme in lipogenesis, thereby providing some support for our hypothesis of flutamide impacting androgen signaling 78. Diminished mature oocytes may be related to cell proliferation and differentiation pathways involving the ephrin/EPH signaling pathway, which has been implicated in differentiation of a number of tissues 79, 80. Broader impacts in differentiation and development are evident in the enrichment of biological processes in cellular component organization and biogenesis when all flutamide-affected genes are considered (Table 3).
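Overrepresentation p values of the kind reported in Table 3 are commonly obtained from a hypergeometric (one-sided Fisher exact) test. The article does not state which test or software produced Table 3, so the sketch below is an assumption about one standard way such values can be computed. The function name and the small worked numbers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: genes in the background (e.g., all genes on the array)
    K: background genes annotated with the term
    n: genes in the list of interest (e.g., stressor-affected genes)
    k: genes in the list that carry the term
    """
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("require 0 <= K <= N and 0 <= n <= N")
    total = comb(N, n)
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Hypothetical small example: 10 background genes, 3 carry the term,
# 4 genes are in the list of interest, and 2 of those carry the term.
p_value = hypergeom_upper_tail(k=2, n=4, K=3, N=10)
```

In practice the resulting p values would also need correction for the many terms tested, for example by controlling the false discovery rate as was done for the module-level GO enrichment.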
Further adverse effects may occur through oxidative stress or other pathways. This is reflected in the close association of NRF2, the stress-responsive heat shock transcription factor, and the aryl hydrocarbon receptor-interacting protein (AIP) gene with flutamide-affected genes. Evidence has been seen of effects in mouse liver through a specific metabolite of flutamide and the production of reactive oxygen linked to mitochondrial dysfunction 81, 82. Dysregulation of metabolic homeostasis also may occur, as indicated by the impact of flutamide on subnetworks containing several metabolic enzyme and control elements. Consistent with this observation is the overrepresentation in our networks of metabolic processes in GO term analysis of flutamide-affected genes (Table 3). Additionally, NFE2L2/Nrf2 has been shown to be involved in protection from mitochondrial stress 83, 84, thereby strengthening the evidence for a possible relationship between NFE2L2/Nrf2 and mitochondrial metabolism genes in the subnetwork in Figure 6. Whether reactive oxygen is formed in the ovary remains an open question. However, GO term analysis indicates significant enrichment of biological processes related to DNA damage, a possible outcome of reactive oxygen. Further experiments are needed to verify whether these pathways are mechanistically associated with varying testosterone and estradiol levels.
Reverse engineering approaches provide useful tools for discovery of new interactions and associations. However, these results must be evaluated by both leveraging preexisting knowledge of the biological system and targeting follow-up experiments to explicitly test the hypotheses generated. Specific support for predicted interactions can be gained by examining or directly integrating support from other data sets and tools. These methods include transcription factor binding site analysis, correlation with protein–protein binding data sets (found, for example, through mass spectrometry bait–prey experiments), integration with networks from public databases (Database of Interacting Proteins [DIP], Biomolecular Interaction Network Database [BIND], Prolinks, Kyoto Encyclopedia of Genes and Genomes, Human Protein Reference Database [HPRD], the BioGrid, and GO), using, for example, the Cytoscape plug-in BioNetBuilder 85, literature mining, and the use of manually curated pathway databases 86. Empirical testing of network predictions can be approached in different ways. Ideally, one would perform direct interaction tests (chromatin immunoprecipitation [ChIP]-chip, ChIP-seq, and protein interaction assays 87–89) or test the contribution of network/pathway elements to the AOP by gain or loss of function experiments. In loss of function experiments, a key network node is removed by reducing or eliminating expression or inhibiting the activity of the protein product. Conversely, in gain of function tests, expression of a key gene or activation of the protein is artificially increased to assess effects dependent on that protein/gene product. Nodes in networks/pathways can be manipulated using mutants, conditional and permanent gene knockouts 90, gene knockdowns via small RNA technologies: microRNA, short hairpin RNA, or RNA interference 91, or antisense RNA morpholino oligos 92. 
Also, overproduction of key elements and rewiring of pathways via synthetic biology can be used to test network function 93, 94.
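Before committing to laboratory loss of function experiments, candidate nodes might be prioritized computationally by asking which single genes, if removed from the inferred network, would disconnect the stressor-affected genes from the outcome-linked genes. The following is a minimal sketch of that idea only. It is not a method described in the article, and the adjacency map, node names, and function names are hypothetical.

```python
def reachable(adj, start, goal, removed=frozenset()):
    """Depth-first search: can `goal` be reached from `start` avoiding `removed`?"""
    if start in removed or goal in removed:
        return False
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for nb in adj.get(node, ()):
            if nb not in seen and nb not in removed:
                seen.add(nb)
                stack.append(nb)
    return False


def essential_nodes(adj, start, goal):
    """Genes whose individual removal disconnects `start` from `goal`."""
    return sorted(
        n for n in adj
        if n not in (start, goal) and not reachable(adj, start, goal, {n})
    )


# Hypothetical toy network: two parallel routes converge on one hub.
adj = {
    "stressor_gene": {"a", "b"},
    "a": {"stressor_gene", "hub"},
    "b": {"stressor_gene", "hub"},
    "hub": {"a", "b", "outcome_gene"},
    "outcome_gene": {"hub"},
}
knockout_candidates = essential_nodes(adj, "stressor_gene", "outcome_gene")
```

A node flagged in this way is only a topological bottleneck in an inferred network. Whether its knockdown actually alters the adverse outcome remains an empirical question.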
Ideally, the discovery effort will enable development of a hypothetical pathway leading to one or more adverse outcomes. Hypothesis testing may not focus on direct validation of linkages but instead may concentrate on testing key relationships. In fact, if the function of the subnetworks can be predicted, follow-up studies could use an abstraction of the regulatory network and test hypotheses based on perturbation of certain signaling, metabolic, and cellular processes as a key event network for a mode of action 6. Through formulation of a hypothesis, the key components/nodes/genes discovered via reverse engineering may be fitted into a dynamic model that can be used to simulate the AOP response to stressors. In the current case study, the network interactions determined (i.e., the network structure) could, in future work, be incorporated, along with known information on the biological pathway, into a dynamic model and used to examine the impact of the endocrine disruptors on estradiol and testosterone production.
This model, whether conceptual or dynamic, could then be tested by focused empirical experiments. If the components identified by reverse engineering can be manipulated in expression to accurately reproduce observed adverse outcome responses, then we may conclude that we have captured a putative AOP that would otherwise have been difficult to determine from traditional hypothesis-based experimentation.
One of the biggest challenges in adapting reverse engineering approaches for use in ecotoxicology comes in evaluation and interpretation of the resulting networks. Most species being studied have very little functional annotation for the genes identified. Sequence-based bioinformatics approaches can be used to assign potential functional roles to the genes in inferred regulatory networks. However, transitive annotations based on sequence homology become less reliable as the phylogenetic distance between species increases. As a result, we may identify important subnetworks for which very little or no functional information is available. Additional information may be available if other connected gene modules do have functional annotations, especially if those data support inferences made from the sequence-based methods. The inclusion of meaningful biological markers, such as the testosterone example shown previously, will be helpful to link genes to functional outcomes. However, in the absence of corroborative functional information, the interpretation of the possible role of key modules within a reverse engineered network is difficult. Similarly, designing follow-up experiments becomes harder as hypothetical mechanisms become more speculative. A further challenge is expanding from an inferred AOP to potential impact of such at a population level in the environment. The derived AOPs will be particularly valuable in identifying modes of action. In combination with an understanding of the dose–response relationships, measures of exposure, and development of extrapolation models, potential exists to achieve population level predictions.
Reverse engineering approaches provide a systematic way to investigate and discover interactions and associations not only of individual genes but also of gene subnetworks, presumptively acting as functional modules, found in our overall inferred network through topological analysis. We can conceptually organize such modules into pathways that govern response to perturbation by environmental stressors. One of the key advantages of applying reverse engineering algorithms to genomic experiments and the resulting high-throughput bioinformatic data sets (digital biology) is the automatic integration (after any necessary normalization) of, for example, diverse gene expression data from a heterogeneous set of experiments from multiple laboratories. These algorithms can integrate the data from multiple diverse inputs (via multiple perturbation experiments for various stressors and multiple data types) that could lead to an adverse outcome. The resulting networks, inferred from data reaching across the entire set of perturbations, have the potential to predict adverse effects on both individuals and populations for a particular stressor or set of stressors. These outcome/endpoint predictions can be used to enhance the use of mode of action information in risk assessment, provide additional information to expand models of ecosystem effects, inform environmental protection abatement strategies, and support decisions about water quality criteria for a particular stressor or other chemicals with similar or related modes of action or pathways of adverse effect(s). Another appealing advantage of the reverse engineering approach is that the inferred network output of such algorithms can be readily tied to specific mode of action hypotheses and to AOPs, based on the points of contact (genes or proteins) shared by an inferred network and known AOPs.
Reverse engineering thus provides an investigative tool into AOPs and provides the investigator with potential direct linkages between stressors and population effects, especially if full life cycle ecotoxicity assay endpoints are measured in toxicogenomic studies. Such AOPs identified by a reverse engineering method in one species could provide a convenient surrogate model for risk prediction within related organisms. Such models have the potential to reduce the cost of chemical risk assessment–based research by decreasing animal usage and streamlining the effort needed to fulfill chemical toxicological testing requirements for international programs such as the Registration, Evaluation, Authorization and Restriction of Chemical substances regulation in the European Union (http://ec.europa.eu/environment/chemicals/reach/reach_intro.htm), ToxCast™ in the USA 20, and Canada's Chemical Management Plan (http://www.chemicalsubstanceschimiques.gc.ca/plan/index_e.html).
The inferred networks provided by reverse engineering can improve our understanding of AOPs that provide mechanistic insight that is not provided by statistical associations between gene expression and adverse outcome considered in isolation. Although simple association between gene expression and outcome has been used successfully for biomarker identification, the inference is limited to each stressor. As a result, a separate inference is required for each stressor, and dangers exist of overfitting a predictive model. Adverse outcome pathways (here, found using a combination of reverse engineering methods, topological analysis, and other prior background information) provide a more generalized approach that can be used for multiple stressors and thus may be more relevant for analysis of exposure to mixtures of chemicals in the environment. The impact of any stressor, or mixtures of stressors, could be predicted by its effect on identified AOPs. This could be accomplished given that the original data contained a diverse set of perturbations providing substantial information on the structure of the regulatory network involved in the stress responses.
One of the perceived disadvantages identified in the reverse engineering approach is the apparent lack of agreement on standardized bioinformatics/reverse engineering methods. Reverse engineering is an emerging field with many new approaches. This results in difficulty for the computer scientists and bioinformaticians working in this area to provide definitive recommendations. Only recently have tools begun to be developed in which reverse engineering approaches can be tested on actual biological data 95 or compared in a common software environment on synthetic data 65.
Differences in approach (e.g., the many different algorithms available, of many different types) could delay adoption of network inference into routine application. Cost is also a potential disadvantage for wide-scale use of reverse engineering methods, because robust and large experimental data sets with multiple exposures, doses, and time points are needed, along with genomic resources. Data management is currently a relatively minor problem compared with experimental cost and with the problems of algorithm choice, data preparation, and proper experimental design. However, the terabytes of data arising from replacing current microarrays with next-generation sequencers will probably lead to more substantial data management problems as such new high-throughput platforms come into wider use. Additional limitations center on the life stage sensitivity of the targeted organism, that is, on which life stages are critical with respect to the organism's susceptibility to the toxicant. For example, many chemicals are extremely toxic during development but pose little or no risk for adults. Analysis of such issues needs further development if reverse engineering of high-throughput data is to gain greater appeal among toxicologists and risk assessors who need to differentiate between such life stages.
However, reverse engineered mapping is ideally suited for single chemical–based exposure regimens. The application of reverse engineering to chemicals present in effluents or complex mixtures may be particularly useful, because the network approach inherently integrates the effects of mixed stressors, especially if measurements of the network state across the use of multiple stressors and multiple biological levels are included.
Research recommendations and future directions
One great challenge will be coordinating multiple efforts and combining disparate data sets. The creation of consortia bringing together different expertise in physiology, toxicology, biochemistry, genomics, computational biology, engineering, and bioinformatics will greatly facilitate focused efforts, especially in the early development and application of the reverse engineering approach on high-throughput data sets to ecotoxicology. Although a great wealth of data and genomic tools exist for standard model species (e.g., Caenorhabditis elegans, Drosophila melanogaster, mouse, and rat), ecotoxicology examines many different species, from water fleas to polar bears. These correlation-based methods for reverse engineering are also directly applicable to inference of signal transduction pathways, using correlations in protein activation/inactivation states 96, although they have not yet been widely applied in this area (or in the metabolomics area, using correlations in metabolite levels), because of a paucity of suitable experimental data.
Although doing all work in one's own species of interest would be desirable, the cost of this would be prohibitive. Gene annotation (e.g., known transcription factors and targets), known physiological responses, and other data from model ecological species (e.g., fathead minnow or Daphnia sp.) and traditional model species will be invaluable as primary templates to provide functional information. Such information from model species will greatly facilitate discovery of new AOPs via reverse engineering in species in which little information exists. High-throughput expression data, protein–protein interaction data, and DNA binding site information for ecospecies also will be valuable. Results from a recent meeting that explored these approaches, the 3rd Dialogue for Reverse Engineering Assessments and Methods (DREAM-3; http://wiki.c2b2.columbia.edu/dream/index.php/DREAM3conf#DREAM3_Conference), found that incorporation of other data (transcription factor, known binding, physiological measures, and so forth) can increase the quality of a reverse engineered regulatory network found by using high-throughput gene expression data as the primary data source 25. This is, of course, what one would expect: one wishes to use all the information available when inferring new connections. The ease and speed with which whole genome sequences are being produced is opening opportunities in developing metabolic pathway or signaling networks based on genome analysis 97. The potential of using this as prior information in our methods for reverse engineering AOPs primarily from gene expression data should be explored.
A current limitation of data used for reverse engineering is that often a large portion of the genes or proteins (>50%) do not have reliable annotation, or the annotation is minimal. New approaches for annotation of data (i.e., Gene Ontologies) are needed to maximize one's ability to extract information and draw hypotheses. In addition to ontology classification as biological function, subcellular localization, and biochemical pathway, an annotation in relation to role in specific pathological conditions might be useful. Many studies have employed inbred populations. Such studies do not take into account potential subpopulations of different susceptibilities present because of differences in polymorphisms or gene copy number. Use of individuals from field, outbred populations with large genetic diversity might help to identify such aspects. This will be particularly important when identifying the relationship between adverse outcome in individual organisms and a potential effect at the population level. Ultimately the linkage of genomic responses to population or ecologically relevant endpoints (e.g., behavior, reproduction, and development) has to be achieved. This will require better means of identification of an AOP and also a better understanding of the relationship between the activation of combined AOPs and the critical threshold levels of activation. The relative importance of short-term AOP activation and chronic activation in relation to toxicity is key, and thus temporal changes leading up to toxicity need to be understood.
This work was supported by the Society of Environmental Toxicology and Chemistry, U.S. Army Corps of Engineers Engineer Research and Development Center Program in Environmental Quality and Installations, the Natural Environment Research Council, the U.S. Environmental Protection Agency, and Procter & Gamble. Permission was granted by the Chief of the U.S. Army Corps of Engineers to publish this information. This document has been subjected to review by the U.S. Environmental Protection Agency, National Health and Environmental Effects Research Laboratory, and approved for publication. Approval does not signify that the contents reflect Agency views or policies. The authors thank Natalia Garcia Reyero and Daniel Villeneuve, organizers of the SETAC Pellston Workshop, from which this article originated.
- 2. National Research Council of the National Academies. 2007. Toxicity Testing in the 21st Century: A Vision and a Strategy. National Academies Press, Washington, D.C.
- 27. 2008. Network inference algorithms elucidate Nrf2 regulation of mouse lung oxidative stress. PLoS Comput Biol 4:e1000166.
- 28. 2005. Bayesian network analysis of signaling networks: A primer. Sci STKE 281: 14.
- 46. 2007. Comparative analysis of microarray normalization procedures: Effects on reverse engineering gene networks. Bioinformatics 23:i282–288.
- 50. 1991. Elements of Information Theory. John Wiley & Sons, New York, NY, USA.
- 56. 2001. Bayesian Networks and Decision Graphs. Springer-Verlag, New York, NY, USA.
- 57. Jordan MI, ed. 1998. Learning in Graphical Models. MIT Press, Cambridge, MA, USA.
- 58. 2004. Learning Bayesian Networks. Pearson Education, Upper Saddle River, NJ, USA.
- 59. 2001. Bayes net toolbox for matlab. Comput Sci Stat 33: 1024–1034.
- 62. 2001. Data Mining Concepts and Techniques. Morgan Kaufmann Publishers, San Diego, CA, USA.