A key aim of systems biology is to show how the genome, transcriptome, proteome, and metabolome cooperate within a multi-cellular environment to generate cell-based functional processes. An important step here is identifying which gene subgroups underpin the various processes that a cell is undertaking. Such information as we have is collated in the literature and, for some organisms at least, held in databases that carry Gene Ontology annotations (GO, www.geneontology.org). In the GO, biological processes are linked, through curation, with the appropriate gene products on the basis of experimental evidence. The lists of process-associated genes are, however, incomplete and need to be expanded. The traditional approach here is to analyse genes expressed in the tissues of interest to see if mutation, gene targeting, and other similar techniques affect the phenotype generated by the processes in which the genes are involved. Such techniques all suffer from the limitation that it is difficult to analyse more than one or two genes at a time.
To detect sets of candidate genes, techniques such as differential-display and differential microarray analysis, subtractive-hybridization, and yeast two-hybrid technology are used, and these are slowly and expensively yielding results about the molecular basis of cell processes (e.g., Albig et al., 2007; Hirate et al., 2001; Hummerich et al., 2006; Tateossian et al., 2004; Tomczak et al., 2004). Hand in hand with such experiments goes the generation of new proteome and transcriptome data through the use of technologies such as microarray and SAGE (e.g., Ness, 2007; Robinson et al., 2007), and this is expanding the number of expressed genes that are candidates for involvement in the many processes in which a cell participates at any particular moment. The problem lies in trying to link all these expressed genes with the particular processes in which they are involved, and the more complete the proteomes, the more complicated this becomes.
This report takes a bioinformatics approach to the problem and uses subtraction methods on gene sets (e.g., Gosink et al., 2007) to look for common genes expressed by different tissues undergoing the same developmental process. If we can assume, first, that tissues undergoing the same process require at least most of the same genes (an assumption implicit in the Gene Ontology) and, second, that these genes are under the control of the same transcription factors, then we can expect the genes expressed when the process occurs to reflect these coincidences. Intersection analysis of the sets of gene-expression data for each tissue should, therefore, be able to identify candidate genes (Fig. 1), and, here, the more complete the sets of expressed genes, the better the approach should work.
This report uses such an approach to analyse the genetic basis of the mesenchyme-to-epithelium transition (MET). It starts by summarising the biological background to the process and goes on to describe the Boolean analysis method, together with a tool (GXD-search) for extracting and analysing the data held in GXD, the mouse gene-expression database (http://www.informatics.jax.org/expression.shtml). The results of the analysis of the expression profiles of tissues undergoing a MET are then given, and the discussion considers the validity of these results.
BACKGROUND: THE MESENCHYME TO EPITHELIUM TRANSITION (MET)
In this report, we are primarily concerned with the mesenchyme-to-epithelium transition (MET). This process is used to turn mesenchymal cells that are normally found in 3D cell masses into 2D sheets that may be tubes (e.g., blood vessels and nephrons), waterproof bounding membranes (e.g., mesothelium), or transitional structures (e.g., somites). Although this process is quite common during mouse development (Table 1), it has attracted relatively little attention, with the two best-studied examples being the formation of somites from unsegmented paraxial mesenchyme (e.g., Takahashi et al., 2005) and nephrons from metanephric mesenchyme (e.g., Davies, 1996; Oxburgh, 2002). No genes are currently associated with this process (GO:0060231) in the Gene Ontology, and there is little discussion about it in the literature, although Cited1 has been implicated as being involved in the development of nephron epithelium (Plisov et al., 2005). One would, of course, expect the process usually to be marked by at least the down-regulation of mesenchymal markers and the up-regulation of epithelial ones.
Table 1. Tissues That Undergo a Mesenchyme-to-Epithelial Transition With Stages and Core Expression Data (a)
(a) The numbers of genes refer to those tissues about to undergo the process.
A full analysis of this process in the mouse requires not only gene-expression data, but also detailed embryological knowledge about where and when this process occurs, at the temporal resolution of Theiler stages. Such information is documented in Kaufman and Bard (1999) and the key staging details for ten examples are given in Table 1, together with some indicative gene-expression numbers from GXD. It should be noted that somite and nephron formation are not given stages, as new somites form from paraxial mesoderm every 2–3 hr from about TS13 (E8.5) to about TS21 (E12.5), while nephrons start to form from metanephric mesenchyme at about TS21 (E12.5), with the process continuing beyond birth.
BOOLEAN INTERSECTION ANALYSIS
Consider the set of proteins in each of a group of tissues whose cells will soon undertake a particular developmental process. The cells in each tissue will include three, possibly overlapping, sets of proteins: the transcription activators and any signalling pathways needed for the process in question, proteins associated with other processes, and a set of widely expressed housekeeping proteins. In general, the lists for the various tissues will only share genes for housekeeping proteins and for those processes that the tissues are all undertaking. These shared proteins can be identified by Boolean intersection analysis (Fig. 1), and it is reasonable to expect that the more tissues analysed, the fewer proteins they will jointly express, as they will have fewer processes in common. Process and housekeeping transcription factors can be distinguished on the basis that the former group will have a much more restricted set of expression domains than the latter. In principle, the larger the group of tissues undergoing the process and the fuller the datasets, the better will be the analysis.
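The intersection step and the breadth-of-expression filter described above can be sketched in Python. This is a minimal illustration and not the GXD-search implementation: the tissue names, the placeholder gene names, the domain counts, and the `max_domains` threshold are all hypothetical and are not taken from GXD.

```python
from functools import reduce


def shared_genes(tissue_sets):
    """Return the genes expressed in every tissue (Boolean intersection)."""
    return reduce(set.intersection, tissue_sets.values())


def split_by_breadth(genes, domain_counts, max_domains):
    """Split genes into restricted (candidate process) and broad (housekeeping-like).

    domain_counts maps each gene to the number of staged tissues in which it
    is reported; max_domains is an arbitrary, hypothetical cut-off.
    """
    restricted = {g for g in genes if domain_counts[g] <= max_domains}
    return restricted, genes - restricted


# Hypothetical toy data: placeholder gene names, not GXD records.
tissue_sets = {
    "tissue_A": {"proc1", "proc2", "hk1", "a_only"},
    "tissue_B": {"proc1", "proc2", "hk1", "b_only"},
    "tissue_C": {"proc1", "hk1", "c_only"},
}
domain_counts = {"proc1": 12, "hk1": 400}

common = shared_genes(tissue_sets)
candidates, housekeeping = split_by_breadth(common, domain_counts, max_domains=50)
print(sorted(common), sorted(candidates), sorted(housekeeping))
```

In this toy case, adding the third tissue removes `proc2` from the shared set, which illustrates why the number of jointly expressed genes is expected to fall as more tissues are analysed.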
The basic assumption behind the methodology is that wherever and whenever a process is activated within a cell, it will require a core group of proteins that will, in addition to any proteins already present, execute that process. An obvious corollary of this is that just before the process is activated, the cell will include the necessary transcription factors and signalling pathway components to ensure that the core genes are expressed. The timing and other practical implications of this are discussed below; here, we consider the application of this basic assumption to a group of tissues.
A similar intersection analysis of protein sets during the process-execution stage will give the proteins involved in the process, together with housekeeping genes. In addition, a comparison of genes expressed before and during the process will give those proteins expressed as a result of transcription-factor activity as they will not generally be expressed until just before the process starts. Similarly, genes that are to be down-regulated before the process can start will be identified from intersections that identify common genes that are expressed before, but not after the process starts.
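The before/during comparisons can be expressed as set differences followed by an intersection across tissues. The sketch below is one possible reading of the procedure, assuming each tissue has a "before" and a "during" gene set; all tissue and gene names are hypothetical placeholders.

```python
def intersect_all(sets):
    """Intersect an iterable of sets; an empty input gives an empty set."""
    sets = list(sets)
    return set.intersection(*sets) if sets else set()


def switched_on(before, during):
    """Genes that appear only once the process starts, in every tissue."""
    return intersect_all(during[t] - before[t] for t in before)


def switched_off(before, during):
    """Genes present before the process but no longer reported during it, in every tissue."""
    return intersect_all(before[t] - during[t] for t in before)


# Hypothetical toy data with placeholder gene names.
before = {
    "tissue_A": {"tf1", "mes1", "hk1"},
    "tissue_B": {"tf1", "mes1", "hk1", "x"},
}
during = {
    "tissue_A": {"tf1", "epi1", "hk1"},
    "tissue_B": {"tf1", "epi1", "hk1", "y"},
}

common_during = intersect_all(during.values())  # process genes plus housekeeping
on = switched_on(before, during)                # candidate process-induced genes
off = switched_off(before, during)              # candidate down-regulated genes
print(sorted(common_during), sorted(on), sorted(off))
```

Because a gene that is merely unreported looks the same as one that is switched off, results of this kind would still need checking against explicit "absent" reports.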
For this approach to be useful, gene-expression databases are required that hold data for all tissues of an organism at every developmental stage. The only current database that includes such data at this resolution is GXD, the mouse gene-expression database. The quality of the database is impressive: it includes >260,000 expression patterns curated by the Jackson Laboratory on the basis of Theiler stage (12 hr when development is fast [4–12 days] and 24 hr when it is slow) and tissues at histological resolution; the anatomy ontology includes about 8,000 tissues spread over the 28 Theiler stages (Smith et al., 2006).
The GXD-search tool, illustrated in Figure 2, automates this analysis by downloading the data directly from the database and performing the Boolean analysis. The tool is easy to use and freely downloadable (see Experimental Procedures section for details).
Preliminary analysis of the GXD gene-expression lists for the ten examples (Table 1) showed that their numbers varied widely, for three reasons. First, there were timing problems: Theiler staging assumes that mating occurs at midnight and that there are two stages a day between E4.5 (TS6) and E12.5 (TS21) when development is rapid, with even stages starting at midnight and odd stages at midday. It is thus not unexpected that the numbers of genes reported as being expressed at midday stages are larger than those for midnight stages (e.g., the numbers of genes whose expression is reported for endoderm at TS10 (E7), TS11 (E7.5), and TS12 (E8) are 31, 120, and 31, respectively). To make allowance for this, we assayed the before datasets during the “rapid” phase for the two previous stages (a 24-hr period, albeit that TS10-11 mesoderm has yet to partition), and for process data we assayed data in the immediate and the following stage (again, a 24-hr period). In doing this, we felt that, while we might pick up some false positives from the database, we would not get false negatives, and false positives could later be eliminated by examining the full expression profiles of the tissues to see if any of these genes had been shown to be absent (as opposed to not reported) either just before or just after the process was initiated. Second, some tissues attract much more interest than others. For the different tissues considered here, the expression sets for a particular stage ranged from none to a few hundred genes (Table 1), and it was clear that if we required that candidate genes be expressed by every tissue, we would get a null result. It was for this reason that the GXD-search tool was designed to analyse all combinations of gene sets. Third, the functions of the genes whose expression patterns are reported in GXD focus particularly on transcription and signalling, a bias that does, of course, reflect the interests of the mouse development community.
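The 24-hr pooling windows can be made concrete with a short Python sketch. It assumes that the process starts at a stage within the rapid phase, so that two consecutive Theiler stages span roughly 24 hr; the function names and the toy expression data are hypothetical.

```python
def before_window(onset_stage):
    """The two stages preceding the onset of the process."""
    return [onset_stage - 2, onset_stage - 1]


def process_window(onset_stage):
    """The onset stage and the stage that follows it."""
    return [onset_stage, onset_stage + 1]


def pooled(expression_by_stage, stages):
    """Union of the gene sets reported at the given stages."""
    genes = set()
    for stage in stages:
        # A stage with no reported data contributes nothing.
        genes |= expression_by_stage.get(stage, set())
    return genes


# Hypothetical toy data: Theiler stage -> placeholder gene names.
expression_by_stage = {
    10: {"g1"},
    11: {"g1", "g2", "g3"},
    12: {"g2"},
    13: {"g4"},
}

onset = 12
before_set = pooled(expression_by_stage, before_window(onset))
process_set = pooled(expression_by_stage, process_window(onset))
print(sorted(before_set), sorted(process_set))
```

Pooling in this way trades a higher risk of false positives for a lower risk of false negatives, which is the compromise described above.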
We took as the minimum criterion for involvement in the MET process the requirement that at least three of the ten tissues had to express a gene, and that none had been shown not to express it. Once the results of the computation had been presented, the GXD entries and the appropriate literature for each gene were checked by hand to decide whether it should be viewed as a false positive or a housekeeping gene (Table 2).
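The three-tissue criterion can be written as a simple filter. The sketch below assumes that "present" and "absent" reports are available as separate gene sets per tissue; that data layout, the function name, and the toy values are hypothetical, and the hand curation step is not modelled.

```python
from collections import Counter


def met_candidates(present, absent, min_tissues=3):
    """Genes reported in at least min_tissues tissues and shown absent in none.

    present and absent map tissue names to sets of gene names.
    Returns a dict of gene -> number of tissues reporting expression.
    """
    counts = Counter(g for genes in present.values() for g in genes)
    vetoed = set().union(*absent.values())  # any explicit "absent" report excludes a gene
    return {g: n for g, n in counts.items() if n >= min_tissues and g not in vetoed}


# Hypothetical toy data with placeholder tissue and gene names.
present = {
    "t1": {"g1", "g2", "g3"},
    "t2": {"g1", "g2"},
    "t3": {"g1", "g2", "g3"},
    "t4": {"g3"},
}
absent = {"t2": {"g3"}}

result = met_candidates(present, absent)
print(result)
```

Here `g3` is reported in three tissues but is excluded because one tissue has an explicit negative report, whereas a gene that is simply unreported in a tissue is not penalised.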
Table 2. Genes Identified by GXD-Search as Being Expressed in Mesenchymal Tissues About to Become Epithelial
The data from Boyle et al. (2007) have not yet been entered into GXD.
Lhx1 is absent at TS19, but present in TS20 early renal vesicles (Oxburgh et al., 2004).
The most interesting results from the search came from the analysis of genes expressed before the MET process was initiated. The GXD-search analysis identified seven genes that were widely expressed: six code for transcriptional regulators (Cited1,2, Foxc1,2, Lhx1 [LIM1], and Meox2) and the seventh for a retinoic acid binding protein (Crabp1). Perhaps the most remarkable observation is for the mesonephric mesenchyme, the tissue that forms the mesonephric ducts. The GXD expression list for this tissue includes only nine genes. Two of these are signalling molecules (Bmp4 and Gdnf), while four of the remaining seven are in the commonly expressed group. Over-representation analysis of the genes in Table 2 using Fatigo (Al-Shahrour et al., 2004) shows that multicellular organismal development and anatomical structure development are the most significant GO terms associated with these genes. However, after correction for multiple hypothesis testing, the P values are not significant.
In addition, analysis of genes expressed in pairs of tissues showed that components of various signalling pathways were present (e.g., Wnt, FGF, and Notch-Delta), but as these pathways are widely used, no inference could be drawn from the results.
In contrast, the results for the genes involved in the MET process itself (i.e., those present in the newly forming tissue) were disappointing in that only two genes satisfying the three-tissue criterion were identified: Arid5b (a gene that encodes a DNA-binding protein), expressed during angiogenesis, somite formation, and mesonephric-tubule formation, and Smad2 (a signalling-pathway gene), expressed during angiogenesis, nephron formation, and somite formation. In addition, the analysis of genes turned off once the process had started gave no useful information. There was not a single gene turned off in more than a pair of tissues (e.g., Tbx6 is turned off once angiogenesis and somite formation have started). Even the expected down-regulation of mesenchymal genes and the up-regulation of genes associated with epithelialization were not picked up. Close inspection of the GXD gene-expression lists for cell adhesion molecules, laminin, and other markers associated with an epithelial phenotype showed that they have sparse expression profiles. Perhaps, with the more extensive use of array technologies, expression data for such genes will become available in GXD, and it will not be difficult to redo the computations reported here.
The key results of the analysis, which was carried out using only tissue-associated and no genetic input, are that Cited1,2, Foxc1,2, Lhx1, Meox2, and Crabp1 are candidate genes for being involved in preparing for a MET in the developing mouse. The first and obvious concern is that their joint involvement in the transition process merely reflects a chance result, as they all tend to be expressed during the development of relatively early mesoderm. Examination of the full expression patterns of each of these genes, insofar as they are represented in GXD, suggests that the coincidences are not by chance, for three reasons. First, each of these genes has a wide expression pattern and the only regions where they coincide are where METs occur. Second, the expression of these genes in early mesoderm is actually more sporadic than complete. Cited1 and Cited2, for example, seem not to be up-regulated in mesoderm until quite late (TS13). However, as the expression data in GXD are not complete, arguments along these lines are not compelling.
The best reason for suggesting that these genes are all involved in the MET is genetic. Evidence from mutation analysis shows that each of the six transcription regulators is implicated in one or another aspect of epithelial development. Lhx1 is widely expressed in the intermediate mesoderm and its derivatives, but its knockout develops neither a nephric duct nor those tissues that depend on the presence of the duct for their development (Kobayashi et al., 2005). Cited1 is a transcriptional cofactor involved in early nephron patterning (Plisov et al., 2005) and in mammary duct morphogenesis (Howlin et al., 2006). Its exact role is unclear, as over-expression of the gene blocks kidney epithelial morphogenesis, while the knockout of either or both of Cited1 and its sister gene, Cited2, does not affect nephrogenesis (Boyle et al., 2007).
Foxc1 and Foxc2 are also required for normal metanephric kidney (and heart) development, with compound mutant heterozygotes forming hypoplastic kidneys (Kume et al., 2000). Foxc1 has other interesting effects. Homozygous mutants have ectopic mesonephric tubules and ectopic anterior ureteric buds (Kume et al., 2000). It is also noteworthy that, in the zebrafish at least, the knockout of the Foxc1 orthologue blocks normal somite formation, a phenotype observed in mouse embryos that are null mutants for both Foxc1 and Foxc2 (Topeczewska et al., 2001). Mice mutant for Meox1 undergo normal somitogenesis but form abnormal dermatomes, while mice null for both Meox1 and its sister gene Meox2 fail to undergo somitogenesis (Mankoo et al., 2003).
The last of the group, Crabp1, is different. It encodes a very widely expressed cytoplasmic protein (GXD has 433 staged-tissue citations over the period TS9-28) that binds retinoic acid and may prevent its access to the nucleus, where its effects are wide, in early human development at least (Zile, 2001). In the context of the MET, the absence of relevant experimental data makes it impossible to tell whether Crabp1 is just very widely expressed, or whether the transition requires the absence or low levels of retinoic acid.
It is interesting to look at the annotations for these transcriptional regulators in the Gene Ontology. None is assigned to the MET process, but all show some involvement in epithelial development, often in the morphogenesis of some aspect of the cardiovascular system; in addition, they all have various involvements in the development of other organ systems such as the nervous system and the skeleton. Because the reasons for GO annotations are not explicitly given, the basis for the assignment of these genes to particular processes is not clear, but the obvious reasons are expression patterns and mutant phenotypes.
Perhaps the conclusion that can be drawn from the analysis given here is that the identified transcriptional regulators all play a role in the MET, even if it is not possible to integrate their roles within a defined pathway or systems network. The identification of a set of transcriptional regulators associated with a particular process also bears on the validity of the key assumption underlying the computational approach, i.e., that for a particular process to occur during development, a standard set of transcription regulators and proteins needs to be in place. This assumption also lies behind the GO process ontology and its associated database, and one of the aims of the work was to test its validity. It was, therefore, reassuring that the approach identified six transcriptional regulators, each separately known to be involved in epithelial development, that had not previously been grouped in the context of the MET. It would, however, be unrealistic to hope that these proteins are solely responsible for activating this process. There is no reason to suppose, on the one hand, that the fine details of epithelium generation are the same in all tissues or, on the other, that the GXD expression sets are currently full enough to exclude other candidate transcription factors. Nevertheless, this group does provide a set of candidate transcriptional regulators that may merit further study.
The analysis reported here is based on gene expression assays stored in GXD during the period February to March 2008. All assay types were considered.
The GXD-search Tool
In its simplest mode, a GXD-search user first identifies the embryological stages at which some process is initiated in each of a set of N tissues, and then uses the tool in two stages to identify candidate process genes (see Fig. 2 for a worked example). The first is to obtain the appropriate set of genes for each tissue, and this merely requires activating the add button in the main screen (Fig. 2b), identifying the stage and tissue in the new GXD-access window, and activating the “>” button (Fig. 2a). The program downloads from GXD all the expressed genes for the tissue at that stage, removing duplicates (GXD treats every experimental report as a separate entry). If the tissue continues to use the process for several stages (e.g., the MET that underpins nephron production from metanephric mesenchyme starts at TS21 and continues beyond birth), the user repeats the process for each appropriate stage, and GXD-search downloads each gene set and computes the union of the gene sets, thus removing duplicates. The user assigns a name to the gene set and activates the add button, which transfers the set back to the main window. The user repeats the process for each tissue so that the main screen eventually holds all the necessary gene sets (Fig. 2b).
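The duplicate removal and the union across stages can be illustrated in Python. The sketch omits the download step, since the interface GXD-search uses to query GXD is not described here, and starts from records that are assumed to be already retrieved; the record layout of (gene, report identifier) pairs and all names are hypothetical.

```python
def gene_set(records):
    """Collapse (gene, report_id) records to unique gene names."""
    return {gene for gene, _report_id in records}


def named_gene_set(name, records_by_stage):
    """Union the de-duplicated gene sets for every stage of one tissue."""
    genes = set()
    for records in records_by_stage.values():
        genes |= gene_set(records)
    return name, genes


# Hypothetical toy records: several reports may mention the same gene.
records_by_stage = {
    21: [("g1", "report_1"), ("g1", "report_2"), ("g2", "report_3")],
    22: [("g2", "report_4"), ("g3", "report_5")],
}

name, genes = named_gene_set("metanephric mesenchyme", records_by_stage)
print(name, sorted(genes))
```

Five reports across two stages reduce to three distinct genes, which is the form in which each tissue's set would enter the intersection step.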
The user then activates the do intersections button, and the program runs a Boolean intersection algorithm that first identifies common genes expressed in all N sets of genes and then, in turn, those common to each combination of N-1 sets, N-2 sets, and so on, presenting the analysis in the new results screen (Fig. 2c). This procedure was adopted to allow for the fact that not all expressed genes are in GXD, nor will all tissues undergoing a process use exactly the same gene set. By looking at the intersections of successively smaller groups, the program compensates for genes whose presence has not been reported or that may only be used in some of the tissues. In practice, there will be very few if any genes expressed in all tissues, but there can be a very large number jointly expressed in pairs of tissues with large transcriptomes.
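The all-combinations intersection can be sketched with the Python standard library. This is an illustrative reconstruction under stated assumptions rather than the code used in GXD-search: the function name, the `min_size` parameter, and the toy gene sets are hypothetical.

```python
from itertools import combinations


def intersections_by_size(gene_sets, min_size=2):
    """Intersect every combination of tissues, from all N down to min_size.

    gene_sets maps tissue names to sets of gene names. Returns a dict keyed
    by the tuple of tissue names, largest combinations first; combinations
    with no shared genes are omitted.
    """
    names = sorted(gene_sets)
    results = {}
    for k in range(len(names), min_size - 1, -1):
        for combo in combinations(names, k):
            common = set.intersection(*(gene_sets[n] for n in combo))
            if common:
                results[combo] = common
    return results


# Hypothetical toy data with placeholder tissue and gene names.
gene_sets = {
    "A": {"g1", "g2"},
    "B": {"g1", "g3"},
    "C": {"g1", "g2", "g3"},
}

results = intersections_by_size(gene_sets)
for combo, common in results.items():
    print(combo, sorted(common))
```

The number of combinations grows quickly with N: for ten tissues there are 1,013 combinations of two or more sets, which is still small enough to enumerate exhaustively.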
The process is then repeated for the stages before the process is activated to identify candidate before genes that need to be in place in order to ensure that the process-associated genes will be expressed (e.g., signalling pathway genes and transcriptional regulators). This done, the user then checks both sets of results by hand against GXD to see, for example, whether data from mutant mice are giving a false positive or whether a possible candidate gene has been shown not to be expressed in one of the tissues.
The tool also has other specialised functions that allow for additional combinations of unions and intersections to be performed. It is, for example, possible to identify genes that are turned either on or off between stages in either a single tissue, or that are common to sets of tissues. In addition, gene sets can be stored, compared, and updated.
The GXD-search tool can also compute a random set of genes for use as a reference in over-representation analysis that can be used in existing tools such as Fatigo (fatigo.bioinfo.cipf.es; Al-Shahrour et al., 2004). Over-representation analysis assesses the Gene Ontology annotations that exist for a set of genes (e.g., genes up-regulated in a microarray experiment) in comparison with a reference set. The approach can provide measures of statistical significance of the relative enrichment of a GO term in the candidate set in comparison to a reference set. Looking at the MET process, we might hope to find that anatomical structure morphogenesis (GO:0009653) has a P value < 0.05 for the intersection sets we compute for the MET. However, two problems arise in such analyses. First, there is a large variability in the population of the GXD database. For some tissues, many assays have been performed covering a large number of genes, while for others there are no gene-expression data. Second, over-representation analysis relies on existing gene annotations related to development; however, these are not as extensive as we might wish. For instance, there are up to 1,586 mouse genes annotated with anatomical structure morphogenesis but only three are annotated with embryonic epithelial tube formation (Ret, Shank3, and Wnt4, none of which are transcriptional regulators). This further limits the utility of over-representation analysis in the context of development at this point in time.
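The idea behind over-representation analysis, and the effect of correcting for multiple testing, can be illustrated with a one-sided hypergeometric test and a Bonferroni correction. This is a generic sketch and not necessarily the statistic or correction that Fatigo applies; the counts and P values below are invented for illustration, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes in a candidate
    set of size n, when K of the N reference genes carry the GO term."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def bonferroni(pvalues):
    """Multiply each P value by the number of terms tested, capped at 1."""
    m = len(pvalues)
    return {term: min(1.0, p * m) for term, p in pvalues.items()}


# Hypothetical counts: 2 candidate genes, at least 1 annotated,
# drawn from 10 reference genes of which 3 carry the term.
p_small = hypergeom_pvalue(k=1, n=2, K=3, N=10)

# Hypothetical raw P values for two GO terms.
raw = {"term_1": 0.02, "term_2": 0.5}
adjusted = bonferroni(raw)
print(round(p_small, 4), adjusted)
```

With small candidate sets and many GO terms tested, a raw P value below 0.05 can easily rise above that threshold after correction, which is the pattern observed for the genes in Table 2.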