BAR expressolog identification: expression profile similarity ranking of homologous genes in plant species


  • Rohan V. Patel,

    1. Department of Cell & Systems Biology, University of Toronto, 25 Willcocks Street, Toronto, ON M5S 3B2, Canada
    2. Centre for the Analysis of Genome Evolution and Function, University of Toronto, 25 Willcocks Street, Toronto, ON M5S 3B2, Canada
  • Hardeep K. Nahal,

    1. Department of Cell & Systems Biology, University of Toronto, 25 Willcocks Street, Toronto, ON M5S 3B2, Canada
    2. Centre for the Analysis of Genome Evolution and Function, University of Toronto, 25 Willcocks Street, Toronto, ON M5S 3B2, Canada
  • Robert Breit,

    1. Department of Cell & Systems Biology, University of Toronto, 25 Willcocks Street, Toronto, ON M5S 3B2, Canada
    2. Centre for the Analysis of Genome Evolution and Function, University of Toronto, 25 Willcocks Street, Toronto, ON M5S 3B2, Canada
  • Nicholas J. Provart

    Corresponding author
    1. Department of Cell & Systems Biology, University of Toronto, 25 Willcocks Street, Toronto, ON M5S 3B2, Canada
    2. Centre for the Analysis of Genome Evolution and Function, University of Toronto, 25 Willcocks Street, Toronto, ON M5S 3B2, Canada



Large numbers of sequences are now readily available for many plant species, allowing easy identification of homologous genes. However, orthologous gene identification across multiple species is made difficult by evolutionary events such as whole-genome or segmental duplications. Several developmental atlases of gene expression have been produced in the past couple of years, and it may be possible to use these transcript abundance data to refine ortholog predictions. In this study, clusters of homologous genes between seven plant species – Arabidopsis, soybean, Medicago truncatula, poplar, barley, maize and rice – were identified. Following this, a pipeline to rank homologs within gene clusters by both sequence and expression profile similarity was devised by determining equivalent tissues between species, with the best expression profile match being termed the ‘expressolog’. Five electronic fluorescent pictograph (eFP) browsers were produced as part of this effort, to aid in visualization of gene expression data and to complement existing eFP browsers at the Bio-Array Resource (BAR). Within the eFP browser framework, these expression profile similarity rankings were incorporated into an Expressolog Tree Viewer to allow cross-species homolog browsing by both sequence and expression pattern similarity. Global analyses showed that orthologs with the highest sequence similarity do not necessarily exhibit the highest expression pattern similarity. Other orthologs may show different expression patterns, indicating that such genes may require re-annotation or more specific annotation. Ultimately, it is envisaged that this pipeline will aid in improvement of the functional annotation of genes and translational plant research.


The increasing availability of sequence data for genomes of multiple species allows us to compute evolutionary trajectories on a genome-wide basis. Such data may also be of value when attempting to trace the ancestry of genes across multiple species in order to assign orthology. The concept of orthology is one that is central to comparative genomics. The original phylogenetic definition of orthology stated that orthologous genes arise due to speciation events (Fitch, 1970). A widely recognized assumption related to this definition is that orthologous genes exhibit conserved functionality as dictated by the final polypeptides encoded by the corresponding genes, because the sequences are similar. Considering the tissue-specific expression patterns of related genes could contribute to a better understanding of gene function, especially given the high number of incidences of whole genome and segmental duplications in plants (Arabidopsis Genome Initiative, 2000; Jiao et al., 2011).

Orthologous genes across multiple species can be identified through their level of sequence similarity. A variety of tools exist for computational detection of orthologs, and these can be split into two major categories: graph-based methods and tree-based methods. Tree-based methods (e.g. Orthostrapper and RIO) use explicit evolutionary models to infer orthology between genes of multiple species, while graph-based methods (e.g. OrthoMCL and InPARANOID) use sequence similarity alone to infer orthology (Kuzniar et al., 2008).

Databases of orthologous genes across a variety of species, computed using various methods, already exist. These include the National Center for Biotechnology Information Clusters of Orthologous Groups (COG) database (Tatusov et al., 1997), as well as the National Center for Biotechnology Information euKaryotic Orthologous Group (KOG) database (Tatusov et al., 2003). Additionally, online databases of orthologs between a variety of species are provided by OrthoMCL (Li et al., 2003) and InPARANOID (Remm et al., 2001), which contain orthologs that were determined using the respective algorithms. Ortholog databases that include information more specific to plants include Phytozome and Gramene (Liang et al., 2008). The former database uses sequence similarity to identify orthologs, while the latter makes use of phylogenetic trees to infer orthology. Another platform for viewing information concerning orthologous clusters of genes is PLAZA (Proost et al., 2009), an online resource for plant comparative genomics. Pre-computed orthologous groups in this resource were identified through sequence similarity, and were then fed into a phylogenetic pipeline to improve confidence in assignments of homology. CoGe, a platform for visualization of homologous relationships between genes, differs from the others in that it uses synteny to identify putative orthologous relationships (Lyons and Freeling, 2008).

Tools have been developed that permit the exploration of both homologous sequences and co-expression neighbors. GeneCAT (Mutwil et al., 2008) enables users to combine BLAST and condition-independent (Usadel et al., 2009) co-expression analyses to infer functional equivalency between genes. In this way, clusters of orthologs can be identified that share functional equivalency. PlaNet (Mutwil et al., 2011) is a more recent implementation of this idea, extended to cover seven plant species.

Ortholog identification may be made difficult by various events in a species’ evolutionary history. Plant species such as Arabidopsis thaliana, as well as the six other plant species that are the focus of this project, are known to have undergone multiple whole-genome duplication and segmental duplication events (Arabidopsis Genome Initiative, 2000; Jiao et al., 2011). These types of events can create one-to-many or many-to-many orthologous relationships, with duplicated genes within a species becoming in-paralogs of one another. This further complicates the process of identifying orthologs, which is also made difficult by the different evolutionary trajectories that duplicated genes may follow, namely neo-functionalization, sub-functionalization, non-functionalization or retention (Ohno, 1970).

Although it is possible, through a number of tools such as those described above, to detect orthologs through their level of sequence similarity, it may also be possible to use gene expression data to assist in ortholog identification. It is widely assumed that orthologs in different species perform similar functions, and for this assumption to hold true they should have similar gene expression profiles, i.e. they should be expressed in similar tissues. If this is not the case, such data may also be used to assess the degree of sub-functionalization of duplicated genes.

The species considered in this study were Arabidopsis thaliana (Schmid et al., 2005), Medicago truncatula (Benedito et al., 2008), Populus trichocarpa (Wilkins et al., 2009a), Glycine max (Libault et al., 2010), Oryza sativa (Jain et al., 2007; Li et al., 2007), Hordeum vulgare (Druka et al., 2006) and Zea mays (Sekhon et al., 2011), as developmental atlases of gene expression, generated using microarrays or RNA-seq, are available for these species.

Given the unknown nature of the exact evolutionary relationships between the genes in each gene cluster, we use the term ‘homolog’ rather than the strict definition of ‘ortholog’ to describe related genes from different species in gene clusters. Here we created a pipeline to identify homologous gene clusters in these seven species by sequence, and to rank expression profile similarity based on expression patterns in equivalent tissues between species. The top-ranked homolog by expression profile similarity is termed the ‘expressolog’, but the expression pattern similarity scores for all homologs can also be used to help identify potential redundancies in gene function due to duplication events. The results of this pipeline are visualized using a web-based tool, the Expressolog Tree Viewer, which displays the relationships between sequence and expression pattern similarity for a given gene from any of the seven species, highlighting the expressologs for each species.


Identification of homologous clusters and creation of homologous pair sets

In order to compute gene expression divergence within gene families, it was first necessary to determine clusters of homologs. We used the tool OrthoMCL (Li et al., 2003; Chen et al., 2007) to compute these homologous clusters of genes, by using protein sequence files in FASTA format for all seven species of interest.

We identified 49 495 clusters of homologous genes across the seven species studied. Of these clusters, 26 016 contained genes from just one species, i.e. contained only paralogs. Subsequently, 9907 clusters were found to contain genes from two species, 5130 clusters contained genes from three species, 2267 clusters contained genes from four species, 992 clusters contained genes from five species, 2340 clusters contained genes from six species, and 2843 clusters contained genes from all seven species. Of the 26 016 clusters containing genes from one taxon, 8155 clusters contained Z. mays paralogs, 2817 clusters contained O. sativa paralogs, 1177 clusters contained H. vulgare paralogs, 3771 clusters contained M. truncatula paralogs, 3594 clusters contained P. trichocarpa paralogs, 3291 clusters contained G. max paralogs, and 3211 clusters contained A. thaliana paralogs. The top right half of Table 1 provides a summary of the number of one-to-one homologous sequence pairs between each species pair.
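The breakdown above can be tallied with a few lines of Python. The sketch below is illustrative only: the toy clusters and their (species, gene) tuple layout are assumptions and do not reflect the actual OrthoMCL output format. It also checks that the counts reported in the text sum to the stated totals.

```python
from collections import Counter

# Hypothetical toy clusters: each cluster is a list of (species, gene_id) tuples.
toy_clusters = [
    [("A. thaliana", "a1"), ("A. thaliana", "a2")],                   # paralogs only
    [("A. thaliana", "a3"), ("O. sativa", "o1")],                     # two species
    [("Z. mays", "z1"), ("Z. mays", "z2"), ("Z. mays", "z3")],        # paralogs only
    [("A. thaliana", "a4"), ("O. sativa", "o2"), ("Z. mays", "z4")],  # three species
]


def clusters_by_species_count(clusters):
    """Map 'number of distinct species in a cluster' to 'number of such clusters'."""
    return Counter(len({species for species, _ in cluster}) for cluster in clusters)


toy_histogram = clusters_by_species_count(toy_clusters)

# Cluster counts reported in the text, keyed by number of species per cluster.
reported = {1: 26016, 2: 9907, 3: 5130, 4: 2267, 5: 992, 6: 2340, 7: 2843}

# Single-species (paralog-only) clusters reported per species.
single_species = {
    "Z. mays": 8155, "O. sativa": 2817, "H. vulgare": 1177,
    "M. truncatula": 3771, "P. trichocarpa": 3594,
    "G. max": 3291, "A. thaliana": 3211,
}

print(dict(toy_histogram))           # {1: 2, 2: 1, 3: 1}
print(sum(reported.values()))        # 49495, the total number of clusters
print(sum(single_species.values()))  # 26016, the single-species clusters
```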

Table 1.   Summary of results and datasets

 | Arabidopsis thaliana | Populus trichocarpa | Medicago truncatula | Glycine max | Hordeum vulgare | Oryza sativa | Zea mays
Number of isoforms (a) | 35 386 | 45 033 | 53 423 | 55 787 | 22 634 | 51 258 | 106 046
Number of genes (b) | 27 415 | 40 668 | 50 962 | 46 367 | 22 634 | 40 655 | 77 355
Expression platform (c) | GPL198 | GPL4359 | GPL4652 | RNA-seq | GPL1340 | GPL2025 | GPL12620
Number of probe sets | 22 810 | 61 413 | 50 902 (e) | 66 210 (f) | 22 840 | 38 548 (g) | 70 062 (h)
Number of genes mapping to probe sets (d) | 23 861 | 32 533 | 21 654 | 66 210 | 21 071 | 33 982 | 45 826
A. thaliana | 19 485 | 3380 (2491) | 3821 (2986) | 1606 (1606) | 2292 (2144) | 3669 (3366) | 2998 (2998)
P. trichocarpa | 7094 (5610) | 29 914 | 2935 (2690) | 1604 (1604) | 2269 (1550) | 2195 (1871) | 2209 (2209)
M. truncatula | 7842 (4653) | 6898 (5378) | 26 127 | 1898 (1898) | 1798 (1733) | 3023 (2498) | 2314 (2118)
G. max | 6585 (6446) | 5971 (5971) | 8751 (8333) | 38 784 | 706 (687) | 1237 (1198) | 1155 (1049)
H. vulgare | 3569 (3480) | 3991 (3270) | 2734 (2692) | 3317 (3293) | 13 699 | 5020 (4768) | 2290 (2287)
O. sativa | 5262 (4899) | 5145 (4407) | 4187 (3608) | 5013 (4974) | 6888 (6636) | 23 367 | 8524 (8524)
Z. mays | 4392 (4159) | 3605 (3605) | 3942 (3728) | 4098 (3731) | 5696 (5688) | 12 428 (11 878) | 38 221

Top right of diagonal: total number of one-to-one homolog pairs. Bottom left of diagonal: sum of the number of one-to-one and randomly selected one-to-many homolog pairs, used for tissue equivalency calculations. Rows 1–5 show the number of protein isoform sequences covering the denoted number of genes in each BLAST dataset, the expression platform, number of probe sets on it, and the number of genes mapping to the probe sets. Diagonal: number of sequences not in singleton groups within a given species. The numbers in parentheses denote the number of homolog pairs with gene expression information for both members of the pair. Only primary transcript sequence and expression data were used for analyses.

(a) Number of protein sequences in the BLAST dataset, including isoforms.
(b) Number of genes represented in the BLAST dataset.
(c) GEO microarray platform identifier for expression analyses, unless RNA-seq was used.
(d) Number of different genes/primary transcripts present in mapping files (see Experimental procedures).
(e) 61 278 probe sets in total for M. truncatula, M. sativa and S. meliloti genes.
(f) Implied number of genes detected by RNA-seq from; 69 145 is the number from Note that 69 145 genes were predicted by Schmutz et al. (2010), 46 403 with high confidence.
(g) 57 381 probe sets in total for rice indica and japonica sub-strains, the indica variety was used to generate the expression atlas.
(h) Number for expression measurements.

Duplication events such as whole-genome duplications, segmental duplications and tandem duplications can result in difficulties when inferring orthology. OrthoMCL identifies probable orthologs as well as in-paralogs (genes that arose from duplication events following a speciation event, i.e. recent paralogs) across the species of interest. These in-paralogs are found to be more similar to each other than to any sequence from any other genome. For tests involving computation of tissue equivalencies prior to ranking expression profile similarity, we also generated two other sets of data for each species pair to address the issue of uneven vector lengths caused by such duplication events. The first set consisted of a collection of all one-to-one homologs plus homologous pairs generated by randomly picking one sequence from the ‘many’ side of one-to-many homologous clusters and pairing it with its ‘one’ partner. The bottom left half of Table 1 shows the number of sequences in each set for each species pair. The second set consisted of the first set plus homologous pairs generated by randomly picking one sequence from each side of the many-to-many homologous clusters (Table S1).
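The construction of the first pair set (all one-to-one homologs plus one randomly drawn pair per one-to-many cluster) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function name `build_pair_set`, the dictionary layout of a cluster and the gene identifiers are all hypothetical.

```python
import random


def build_pair_set(clusters, species_a, species_b, rng):
    """Return one-to-one pairs plus one random pair per one-to-many cluster.

    clusters: list of dicts mapping species name -> list of gene ids
    (a hypothetical layout chosen for this sketch).
    Many-to-many clusters are skipped here; the second set described in the
    text would additionally draw one gene at random from each side of those.
    """
    pairs = []
    for cluster in clusters:
        genes_a = cluster.get(species_a, [])
        genes_b = cluster.get(species_b, [])
        if len(genes_a) == 1 and len(genes_b) == 1:
            pairs.append((genes_a[0], genes_b[0]))            # one-to-one
        elif len(genes_a) == 1 and len(genes_b) > 1:
            pairs.append((genes_a[0], rng.choice(genes_b)))   # one-to-many
        elif len(genes_a) > 1 and len(genes_b) == 1:
            pairs.append((rng.choice(genes_a), genes_b[0]))   # many-to-one
    return pairs


toy_clusters = [
    {"A": ["a1"], "B": ["b1"]},                  # one-to-one
    {"A": ["a2"], "B": ["b2", "b3", "b4"]},      # one-to-many
    {"A": ["a3", "a4"], "B": ["b5"]},            # many-to-one
    {"A": ["a5", "a6"], "B": ["b6", "b7"]},      # many-to-many, skipped
    {"A": ["a7"]},                               # single species, skipped
]

pairs = build_pair_set(toy_clusters, "A", "B", random.Random(0))
print(pairs)
```

Because every gene appears in at most one pair, the expression vectors built from such a set have the same length in both species, which is what the correlation step requires.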

Identification of tissue equivalencies: effect of homolog pair set size

Prior to ranking the homologs from multiple species by expression profile similarity, equivalent tissues between each pair of species being compared were deduced. Comparisons between expression profiles of genes in multiple species are made more difficult by the differences in physiology and anatomy of the species of interest. One way to overcome this problem is to use the available gene expression data to compute correlations between plant tissues (Cho et al., 2002). This can provide a means to quantify the relationship between tissues in multiple species, in the absence of accurate sample annotation by Plant Ontology terms (Avraham et al., 2008) for tissue expression data.

An example of how tissue equivalencies may be calculated between species is shown in Figure 1. For correlations to be calculated, it is necessary to have the same number of gene pairs (i.e. rows in Figure 1) to generate the corresponding tissue expression vectors. We tested four sets of gene pairs: the most sequence-similar MYB homologs as an example of a small dataset, and then the three datasets described above: all one-to-one homologs, all one-to-one homologs plus random pairs picked from one-to-many homologous clusters, and finally all one-to-one homologs plus random pairs picked from both the one-to-many and many-to-many homologous clusters. In order to see which set performed the best for the purposes of tissue equivalency calculations, Spearman’s rank correlations (SCC scores) were first calculated between 28 tissues of G. max (Libault et al., 2010) and all tissues in the A. thaliana AtGenExpress Development Atlas (Schmid et al., 2005). We used a semi-manual scoring system to assess whether the best tissue as assessed by the SCC score matched that in Arabidopsis, by assigning a value of 1 if the Plant Ontology term (Avraham et al., 2008) matched, a value of 0.5 if we were unsure whether the tissues looked similar based on their description, and zero if they were clearly not similar. The results are shown in Table 2 and Table S2. The set of homolog pairs comprising all one-to-one homologs plus those generated by randomly picking one of the ‘many’ in one-to-many homolog clusters gave the best tissue equivalency identification performance, with 22.5 of 28 tissues positively identified.

Figure 1.

 Method for computing tissue equivalencies, using a hypothetical dataset.
Expression profiles are found for tissue 1A from species A, and tissues 1B–4B from species B using a set of homolog pairs. The similarity between these expression profiles is correlated using a correlation metric such as Spearman’s correlation coefficient (SCC). The tissue from species B most correlated with that from species A is considered the most equivalent tissue.
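The procedure in Figure 1 can be sketched in plain Python. The expression values, tissue labels and function names below are invented for illustration; Spearman's coefficient is computed here as the Pearson correlation of tie-averaged ranks, which is a standard formulation but not necessarily the implementation used in the pipeline.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns NaN if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's correlation coefficient (SCC) as Pearson on ranks."""
    return pearson(ranks(x), ranks(y))


def best_equivalent_tissue(query_vector, candidate_vectors):
    """Return the candidate tissue with the highest SCC, plus all scores."""
    scores = {t: spearman(query_vector, v) for t, v in candidate_vectors.items()}
    return max(scores, key=scores.get), scores


# Rows are homolog pairs: position i in each vector refers to the same pair.
tissue_1a = [10, 200, 35, 80, 5]             # species A, hypothetical values
species_b = {
    "1B": [12, 180, 40, 90, 3],              # same rank order as tissue 1A
    "2B": [300, 20, 15, 60, 100],
    "3B": [50, 55, 45, 52, 48],
    "4B": [5, 10, 200, 30, 90],
}

best, scores = best_equivalent_tissue(tissue_1a, species_b)
print(best, round(scores[best], 3))
```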

Table 2.   Scoring tissue equivalencies between Arabidopsis thaliana and Glycine max using different homolog pair set sizes

Homolog pair set | Number of pairs | Score
Number of tissues examined | | 28
Best sequence matches, one gene family (MYBs) | 55 | 15.5
One-to-one homolog pairs | 1606 | 17.5
One-to-one homolog pairs plus random one-to-many pairs | 6446 | 22.5
One-to-one homolog pairs plus random one-to-many pairs plus random many-to-many pairs | 11 413 | 21.5

In order to assess the effect of random sampling for generating the one-to-many homolog pairs, we generated 100 sets of randomly sampled one-to-many pairs and many-to-many pairs and performed the tissue equivalency calculation for each set. The results of this experiment are shown in Table 3 for the poplar–Arabidopsis tissue equivalency analysis. Little variation is introduced by the procedure used to randomly create homolog pairs from the one-to-many homolog cluster space: the standard deviation of the SCC values is approximately two orders of magnitude smaller than the SCC values themselves. Thus we used just a single set of randomly created homolog pairs from the one-to-many homolog cluster space for further analyses.

Table 3.   Summary of tissue equivalency calculations

Tissue equivalency calculated with: (i) 2491 one-to-one homolog pairs; (ii) 5610 one-to-one homolog pairs plus random one-to-many pairs (x 100); (iii) 9153 one-to-one homolog pairs plus random one-to-many pairs plus random many-to-many pairs (x 100).

Poplar tissue | Arabidopsis tissue (i) | SCC value (i) | Arabidopsis tissue (ii) | SCC value ± SD (ii) | Arabidopsis tissue (iii) | SCC value ± SD (iii)
Male catkin | Flowers stage 12, stamens | 0.49 | Flower stage 12, stamens | 0.46 ± 0.0057 | Flower stage 12, stamens | 0.45 ± 0.0068
Female catkin | Flowers stage 12, stamens | 0.52 | Flower stage 12, stamens | 0.48 ± 0.0051 | Flower stage 12, stamens | 0.47 ± 0.0059
Young leaf | Rosette leaf 12 | 0.56 | Vegetative rosette | 0.52 ± 0.0058 | Vegetative rosette | 0.51 ± 0.0068
Mature leaf | Cotyledon | 0.62 | Rosette leaf 8 | 0.54 ± 0.0052 | Rosette leaf 8 | 0.52 ± 0.0054
Xylem | Root 1.04 | 0.47 | Root 1.04 | 0.44 ± 0.0057 | Root 1.04 | 0.42 ± 0.0051
Root | Root 1.09 | 0.55 | Root 1.04 | 0.50 ± 0.0053 | Root 1.04 | 0.48 ± 0.0066

Table shows equivalent tissues between poplar and Arabidopsis thaliana computed using all one-to-one homolog pairs, homolog pair sets comprising all one-to-one homolog pairs plus pairs picked randomly from one-to-many homolog groups, and homolog pair sets comprising all one-to-one homolog pairs plus pairs picked randomly from one-to-many and many-to-many homology groups. In the latter two instances, 100 homolog pair sets were generated, the SCC values for all tissues were computed 100 times, and the mean and standard deviation are shown for the best ranked tissue considered equivalent. A complete list of data for computing equivalencies is presented in Data S1, Figures S3–S22, and the equivalencies may also be viewed online. Root 1.04 and 1.09 refer to growth stages; see Schmid et al. (2005).
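As a quick arithmetic check of the "two orders of magnitude" statement, the ratio of mean SCC to standard deviation can be computed from the Table 3 values for the set of one-to-one plus random one-to-many pairs. The snippet only restates numbers from the table; the variable names are illustrative.

```python
import math

# (mean SCC, SD) for the best-ranked Arabidopsis tissue, copied from Table 3
# (one-to-one homolog pairs plus random one-to-many pairs, 100 resamplings).
table3 = {
    "Male catkin": (0.46, 0.0057),
    "Female catkin": (0.48, 0.0051),
    "Young leaf": (0.52, 0.0058),
    "Mature leaf": (0.54, 0.0052),
    "Xylem": (0.44, 0.0057),
    "Root": (0.50, 0.0053),
}

# Ratio of mean to SD, and the same ratio expressed in orders of magnitude.
ratios = {tissue: mean / sd for tissue, (mean, sd) in table3.items()}
orders = {tissue: math.log10(ratio) for tissue, ratio in ratios.items()}

for tissue in table3:
    print(f"{tissue}: mean/SD = {ratios[tissue]:.0f}, log10 = {orders[tissue]:.2f}")
```

The ratios fall between roughly 77 and 104, i.e. close to a factor of 100 in every row.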

Identification of tissue equivalencies: effect of correlation metrics

Three correlation metrics were examined, the Pearson correlation coefficient (PCC), the uncentred Pearson correlation coefficient (UPCC) and Spearman’s correlation coefficient (SCC), using two datasets, in order to decide which metric would be best suited to the analysis (see Usadel et al., 2009 for a discussion of the advantages and disadvantages of each). A larger dataset incorporating all 7094 pairs of one-to-one homologs plus randomly selected one-to-many pairs from homologous clusters between A. thaliana and P. trichocarpa was used for this experiment. All three correlation metrics mentioned above were used to compute tissue equivalencies. A summary of the results from these analyses is given in Figure 2 (for use of the SCC metric) and Data S1, Figures S1 and S2 (for use of the PCC and UPCC metrics, respectively). Based on these analyses, and using a semi-qualitative scoring method similar to that used to generate Table 2 (data not shown), it was decided that the SCC analysis gave the most coherent results. Similar results were seen for other test species pairs (see Table S2; UPCC and PCC data not shown). Therefore, tissue equivalencies were computed between all species using the dataset of all one-to-one homologs plus the randomly generated homolog pairs from one-to-many homologous clusters and the SCC metric. The results of these analyses can be found in Data S1, Figures S3–S22. Between six and 47 tissue samples were available for use in these analyses. Examples from these analyses show that rice anthers are equivalent to maize anthers (Data S1, Figure S21), 16-day-old M. truncatula seeds are equivalent to 28-day-old soybean seeds (Data S1, Figure S14), and carpels from stage 12 flowers of Arabidopsis are equivalent to rice ovaries (Data S1, Figure S4). The complete results of these analyses are also available online.
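For readers unfamiliar with the three metrics, the following sketch shows one common way to define them and how they respond to a single outlier. The definitions are the textbook ones (the uncentred Pearson coefficient omits mean subtraction); the data are invented, and nothing here is taken from the pipeline's code.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pcc(x, y):
    """Pearson correlation coefficient (values centred on their means)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def upcc(x, y):
    """Uncentred Pearson correlation: same form, without mean subtraction."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return sxy / (sxx * syy) ** 0.5


def scc(x, y):
    """Spearman's correlation coefficient: Pearson correlation of the ranks."""
    return pcc(ranks(x), ranks(y))


# Two monotonically related profiles; the last value of x is an outlier.
x = [1, 2, 3, 4, 100]
y = [2, 3, 4, 5, 6]

print(f"PCC  = {pcc(x, y):.3f}")
print(f"UPCC = {upcc(x, y):.3f}")
print(f"SCC  = {scc(x, y):.3f}")
```

Because SCC depends only on rank order, the outlier does not lower it, whereas both Pearson variants are pulled down; this is the robustness property that the text cites as a reason for preferring SCC.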

Figure 2.

 Poplar–Arabidopsis tissue equivalencies.
Heatmap showing values of correlations between tissues of Arabidopsis thaliana (rows) and P. trichocarpa (columns), performed using Spearman’s correlation coefficient and 7094 homolog pairs from one-to-one and one-to-many homologous clusters. Tissue equivalencies (best SCC scores) are highlighted by boxes outlined in black.

Expression profile similarity ranking of homologs

Having computed tissue equivalency analyses for all species being studied, it was possible to rank homologs within each cluster of genes by expression profile similarity, as per the schematic shown in Figure 3. The tissue equivalencies were used as comparable data points for expression profiles between species. The SCC metric was used to correlate expression profiles of homologs across all equivalent tissues for sequences in each of the 49 495 clusters if expression information was available for them. We used the SCC metric to minimize the effect of outliers (Usadel et al., 2009).

Figure 3.

 Method for ranking of homologs by expression profile similarity using a hypothetical dataset.
Expression profiles for all genes within a given homologous cluster are retrieved across all equivalent tissues. The expression profile of gene A from species A is compared to the profiles of its homologous genes A’, A’’ and A’’’ from species B in equivalent tissues using a correlation metric such as Spearman’s correlation coefficient (SCC). The top-most ranked homolog in terms of its correlation coefficient score to gene A is termed the ‘expressolog’.
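The ranking step in Figure 3 can be sketched as below. Gene names, expression values and the handling of homologs without expression data (`None`, listed last and left unscored) are assumptions made for this illustration.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's correlation coefficient as Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / (sxx * syy) ** 0.5


def rank_homologs(query_profile, homolog_profiles):
    """Sort homologs by SCC to the query, highest first.

    Profiles are expression values across the equivalent tissues, in the
    same tissue order as the query. Homologs with no expression data
    (profile is None) are appended unscored.
    """
    scored, unscored = [], []
    for name, profile in homolog_profiles.items():
        if profile is None:
            unscored.append((name, None))
        else:
            scored.append((name, spearman(query_profile, profile)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored + unscored


# Hypothetical gene A from species A and its homologs in species B.
gene_a = [5, 50, 500, 20]
homologs = {
    "homolog_1": [6, 40, 450, 30],
    "homolog_2": [400, 30, 10, 90],
    "homolog_3": [10, 20, 15, 12],
    "homolog_4": None,  # no probe set / no expression data
}

ranking = rank_homologs(gene_a, homologs)
expressolog = ranking[0][0]
print(expressolog, ranking)
```

In this toy cluster, `homolog_1` has the same rank order of expression as gene A across the four tissues and is therefore returned as the expressolog.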

Expressolog tree viewer

In order to be able to visualize the combined sequence similarity and expression similarity data, we implemented an Expressolog Tree Viewer to display a phylogenetic tree of sequence relationships and corresponding expression pattern similarities. Circles of differing sizes and shades of grey beside the names of the sequences denote the degree of sequence similarity, while circles of differing sizes in shades of red, yellow and blue denote expression pattern similarity or dissimilarity. Hovering over these circles with the mouse pointer provides additional information, such as the numerical value for the degree of similarity. For cases where no expression information is available, a question mark is displayed instead of the expression similarity score.

An example Expressolog Tree Viewer output is shown for P. trichocarpa homologs of the A. thaliana ATELF5A-1 gene (At1g13950) in Figure 4, together with eFP browser outputs of these genes (Winter et al., 2007; Wilkins et al., 2009a). ATELF5A-1 is expressed most highly in the stamens and pollen, and to a lesser extent in the leaves of A. thaliana. There are four homologs of this gene in poplar, and our pipeline was able to detect a homolog with strong expression in catkins, which was flagged as the expressolog of ATELF5A-1, highlighted with a yellow background on the Expressolog Tree Viewer. An analysis of the other P. trichocarpa genes in this cluster showed varying expression patterns. Additionally, it was found that variation in sequence similarity does not necessarily correlate with variation in expression pattern similarity between P. trichocarpa homologs when compared to the query gene. The levels of sequence similarity and expression profile similarity of poplar homologs to ATELF5A-1 in Figure 4 and Table 4 provide an example of this. In Table 4, the poplar homologs of ATELF5A-1 are ranked by the SCC value of expression profile similarity, and one can clearly see that the level of divergence of expression profiles does not correlate with the level of divergence of sequence similarity for each of the ATELF5A-1 homologs.

Figure 4.

 Expressolog Tree Viewer output.
The Arabidopsis ATELF5A-1 gene (At1g13950) shows highest expression in the stamens and pollen, and to a lesser extent in leaves of A. thaliana. Red denotes higher expression. Additionally, a phylogenetic analysis from the Expressolog Tree Viewer for the Arabidopsis ATELF5A-1 gene (At1g13950) and its poplar homologs is shown. Some eFP browser (Wilkins et al., 2009a) outputs of expression levels from poplar homologs of the Arabidopsis ATELF5A-1 gene are shown, together with their SCC values and sequence similarity scores. Rankings are based on similarity of expression profile. The catkin-specific expressolog (POPTR_0018s11660.1, PtpAffx.2128.1.A1_x_at) was identified by this pipeline. Remaining homologs were more highly expressed in other tissues, and at different expression levels.

Table 4.   Poplar homologs of ATELF5A-1 with SCC expression similarity and sequence similarity values

Expression similarity rank | Homolog | Probe set | SCC value | Sequence similarity (%)

Global analyses of sequence and expression pattern similarity across species

For all our data from homologous clusters from two or more species, we determined how often the expressolog, i.e. the homolog with the most similar pattern of expression to the query gene, is also the most similar at the level of sequence. The results are summarized in Table 5; further details are given in Table S3. The number of times for which this is not the case is often surprisingly high. For instance, between poplar and Arabidopsis, there are 4231 cases (39.1%) in which the homolog with the best sequence match is not that with the best expression profile match, i.e. not the expressolog, compared with 6589 cases in which the homolog with the best sequence match is the expressolog. The proportion of cases in which the expressologs are not the best sequence matches ranges from a low of 15.4% between poplar and M. truncatula to a high of 50.7% between soybean and barley, as shown in Table 5.

Table 5.   Global summary of expressologs and best sequence similarity matches

 | Arabidopsis thaliana | Poplar | Medicago truncatula | Soybean | Rice | Barley | Maize
A. thaliana | | 6589 | 6423 | 5717 | 6590 | 4758 | 4673
Poplar | 4231 (39.1%) | | 9976 | 8954 | 8705 | 6427 | 6532
Medicago truncatula | 1457 (18.5%) | 1810 (15.4%) | | 7110 | 4579 | 3648 | 4063
Soybean | 5696 (49.5%) | 8614 (49.0%) | 5327 (42.8%) | | 4069 | 2891 | 8433
Rice | 1482 (18.4%) | 1913 (18.0%) | 1053 (18.7%) | 3980 (49.4%) | | 7792 | 10 541
Barley | 1122 (19.1%) | 1324 (17.1%) | 866 (19.2%) | 2972 (50.7%) | 1450 (15.7%) | | 5924
Maize | 2964 (38.8%) | 3666 (35.9%) | 2328 (36.4%) | 5999 (41.6%) | 4977 (32.1%) | 2918 (33.0%) |

Data to the top right of the diagonal indicate the number of times the top sequence homolog is the expressolog; data to the bottom left of the diagonal indicate the number of times the top sequence homolog is not the expressolog, with the corresponding percentage of cases for that species pair in parentheses.
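The percentages in Table 5 follow from the two counts for each species pair. The sketch below reproduces three of them and shows, for a hypothetical cluster, how a single comparison might be classified; the function names and the toy homolog tuples are assumptions made for illustration.

```python
def percent_not_expressolog(n_not, n_is):
    """Percentage of cases in which the top sequence match is not the expressolog."""
    return 100.0 * n_not / (n_not + n_is)


def top_sequence_is_expressolog(homologs):
    """homologs: list of (name, sequence_similarity_percent, scc) tuples."""
    best_by_sequence = max(homologs, key=lambda h: h[1])
    best_by_expression = max(homologs, key=lambda h: h[2])
    return best_by_sequence[0] == best_by_expression[0]


# Counts taken from Table 5: (not the expressolog, is the expressolog).
table5_pairs = {
    "poplar vs A. thaliana": (4231, 6589),
    "M. truncatula vs poplar": (1810, 9976),
    "barley vs soybean": (2972, 2891),
}
percentages = {
    pair: round(percent_not_expressolog(n_not, n_is), 1)
    for pair, (n_not, n_is) in table5_pairs.items()
}
print(percentages)

# Hypothetical cluster in which the best sequence match is not the expressolog.
toy_homologs = [
    ("homolog_1", 92.0, 0.21),
    ("homolog_2", 85.5, 0.67),
    ("homolog_3", 78.1, -0.10),
]
print(top_sequence_is_expressolog(toy_homologs))  # False
```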

To view these data in a slightly different way, we created two sets of sequence pairs for each of three species combinations. The first set consisted of the pairs with the best sequence similarity scores from homologous clusters. The second set consisted of the pairs with the best expression similarity scores, i.e. the expressologs. We then extracted the corresponding expression similarity scores for the first set, and the corresponding sequence similarity scores for the second set. The results for comparison of the corresponding Arabidopsis versus poplar sets are shown in Figure 5. The most similar sequences (mean sequence similarity of 70.7%) have a mean expression similarity score just over half that of the pairs with the best expression similarity score (0.29 versus 0.49), while the mean sequence similarity score for the latter set is just a few per cent lower (68.3%). The same is true for comparison of Arabidopsis with maize and maize with rice (see Data S1, Figure S23), and for other species comparisons (see Table S3).

Figure 5.

 Analysis of sequence similarity versus expression pattern similarity for two sets of gene pairs from A. thaliana and P. trichocarpa
The left panel shows the results of an analysis performed using the most sequence-similar homologs. The corresponding expression pattern similarity scores, as measured by Spearman’s correlation coefficient (SCC), were retrieved from our database. Each pair was plotted according to its sequence and expression pattern similarity score. In order to simplify this scatter plot, a hexagonal binning function was used. Each hexagonal bin contains a certain number of points, denoted by the grey shading. If a bin is black, then there are 425 points in it (each point is one pair). The right panel shows the sequence similarity against expression similarity for all homologs with top-ranked SCC values, termed the ‘expressologs’. The same shading scale and binning function were used. The mean sequence similarity and expression pattern similarity scores across all pairs in each graph are shown by dotted blue lines. The red lines are the lines of best fit through all points in the two graphs, the R-squared values of which are shown.

Figure 6 shows an outline of the pipeline used to rank the homologs by expression profile similarity, summarizing the major steps involved in ranking homologs using this method.

Figure 6.

 Schematic of the pipeline used to identify expressologs between species.

eFP browsers

In addition to the existing eFP browsers already available on the Bio-Array Resource (BAR) site (Toufighi et al., 2005) for Arabidopsis (Winter et al., 2007) and poplar (Wilkins et al., 2009a), five new eFP browsers were created to enable cross-species visualization of expression patterns within homologous clusters of genes. Although the Expressolog Tree Viewer provides an indication of the degree of expression similarity of homologous genes, it is also useful to be able to view actual expression values for the species in question, in a pictographic, tabular or graphical format, as is possible within the eFP browser framework. eFP browsers have been created for the developmental atlases of gene expression for M. truncatula (Benedito et al., 2008), G. max (Libault et al., 2010), O. sativa (Jain et al., 2007; Li et al., 2007), H. vulgare (Druka et al., 2006) and Z. mays (Sekhon et al., 2011). Within the Expressolog Tree Viewer output, the circles denoting sequence similarity link to a multiple sequence alignment, while the circles denoting expression similarity link to the corresponding eFP browser outputs for the relevant homolog, to enable more detailed examination of expression patterns of a given homolog. Example outputs for the five new eFP browsers are given in Figure 7 for the expressologs of the A. thaliana IRREGULAR XYLEM 3 (IRX3) gene. IRX3 is expressed most highly in the stem, specifically the xylem, of A. thaliana (Taylor et al., 1999).

Figure 7.

 Example views of five new eFP browsers that have been created to enable cross-species expression browsing within the framework described in this paper.
Views are for expressologs of the Arabidopsis thaliana IRX3 gene in the respective species. The eFP browser views are for Medicago truncatula (top left), Glycine max (top right), Oryza sativa (bottom left), Hordeum vulgare (bottom center) and Zea mays (bottom right). Red indicates higher expression in the depicted tissue.


In this study, we have devised a computational pipeline for ranking of genes within homologous clusters based on expression profile similarity. The similarity of spatio-temporal expression patterns may be thought of as an additional piece of information regarding functional equivalency between homologous genes. Thus, using this pipeline, we are able to identify, within each cluster, which homolog exhibits both the highest sequence similarity and expression pattern similarity. This is a complementary approach to the co-expression neighborhood approach implemented by PlaNet (Mutwil et al., 2011) or by Chikina and Troyanskaya (2011); with both of these methods, it is first necessary to manually identify genes in both species showing similar patterns of expression before exploring the corresponding co-expression neighborhoods: our expressolog method automates this process. Currently, it is possible to view this information for the seven species in this study. It is also possible to view the expression patterns of the ranked homologs using a suite of seven eFP browsers, five of which are new for this study.

Despite the fact that the monocots and the dicots diverged roughly 200 million years ago (Wolfe et al., 1989), our pipeline was still able to broadly identify equivalent tissues, as described in the Results. There are some oddities in terms of our equivalency analyses, especially between the monocots and dicots. For instance, our pipeline identified 1 cm pods from soybean as being equivalent to the base of stage 2 leaves from V5 maize plants. Perhaps this is unsurprising, as the divergence between monocots and dicots has resulted in several differences in morphology between the two groups of flowering plants, and as a result there may not be clear equivalent tissues between these groups. We have added small ‘alert’ icons for comparisons between these groups, but the Expressolog Tree Viewer still provides an easy access point for viewing the actual expression profiles of homologs across the monocot–dicot divide, for manual investigation. Movahedi et al. (2011) point out that rice and Arabidopsis genes may exhibit similar co-expression networks despite having differing tissue-specific patterns of expression, highlighting a more general phenomenon of concerted network evolution.

Statistical analyses of the expression profile similarity versus sequence similarity correlation outputs between A. thaliana and P. trichocarpa showed interesting results. Two different sets of homologs were investigated: those exhibiting maximum sequence similarities, and those showing maximum expression profile similarities (expressologs) within each homologous cluster. Both sets of homologs exhibited a similar mean sequence similarity value. However, major differences could be seen in the expression similarity values between different sets of homologs. For instance, there is a large difference in mean expression profile similarity values for the most sequence similar homologs and expressologs. The homolog pairs with the highest sequence similarity show almost a 50% lower expression profile similarity value than the expressologs (Figure 5). Expression profile similarity can be seen as a piece of information additional to sequence homology for in planta functional equivalence. An extension of this idea would be to use expression similarity and sequence similarity to more accurately annotate homologs.
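The comparison of the two homolog sets can be illustrated with a minimal sketch. It assumes that each homolog pair is available as a record holding a sequence similarity score and an SCC value; the gene names and numbers below are invented placeholders, and the function name `top_homologs` is hypothetical rather than part of our pipeline.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (query gene, homolog, sequence similarity in %, SCC).
# All identifiers and values are placeholders, not results from this study.
pairs = [
    ("AT_GENE_1", "PT_HOMOLOG_A", 82.0, 0.31),
    ("AT_GENE_1", "PT_HOMOLOG_B", 74.0, 0.88),
    ("AT_GENE_2", "PT_HOMOLOG_C", 91.0, 0.65),
    ("AT_GENE_2", "PT_HOMOLOG_D", 69.0, 0.12),
]


def top_homologs(records):
    """For each query gene, pick the most sequence-similar homolog
    and the homolog with the highest SCC (the expressolog)."""
    by_query = defaultdict(list)
    for query, homolog, seq_sim, scc in records:
        by_query[query].append((homolog, seq_sim, scc))

    selected = {}
    for query, homologs in by_query.items():
        selected[query] = {
            "most_sequence_similar": max(homologs, key=lambda h: h[1]),
            "expressolog": max(homologs, key=lambda h: h[2]),
        }
    return selected


selected = top_homologs(pairs)

# Mean expression similarity of each set, analogous to the means in Figure 5.
mean_scc_seq = mean(v["most_sequence_similar"][2] for v in selected.values())
mean_scc_expr = mean(v["expressolog"][2] for v in selected.values())
print(mean_scc_seq, mean_scc_expr)
```

In this toy input the two selections coincide for one query gene and differ for the other, which is the situation that lowers the mean SCC of the most sequence-similar set relative to the expressolog set.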

Although we have used only developmental gene expression atlases for the majority of analyses described here, we also examined use of a ‘response atlas’ to compute expressologs. In this case, we used experiments from Arabidopsis and poplar that were designed to subject the plants to drought stress in similar ways, and that sampled similar tissues at similar time points (Wilkins et al., 2009b, 2010). Data S1, Figure S24 shows an analysis of the drought response in A. thaliana compared with P. trichocarpa. Here expressologs of differentially expressed genes from Wilkins et al. (2010) were determined using both developmental atlas data and the abiotic stress expression data. A more positive correlation for expression responses was found for the analysis performed in the latter case. This may suggest the need to use a specific expression compendium when comparing expression data under certain conditions (see Usadel et al., 2009, for a discussion of the difference between using condition-independent and condition-dependent expression compendia for co-expression analyses).

For the example of ATELF5A-1 in A. thaliana, this gene shows highest expression in the stamens, pollen and leaves. The computational pipeline we devised was able to identify an expressolog with strong expression in catkins in P. trichocarpa. However, the other ranked P. trichocarpa homologs of ATELF5A-1 showed expression patterns that differed from both ATELF5A-1 and its P. trichocarpa expressolog. This may suggest that these other homologs require re-annotation or a more specific annotation. Additional examples, which can be found in Data S2, include the AtG8F gene, which shows highest expression in the internode, stamens, seeds and leaves of A. thaliana. In P. trichocarpa, the expressolog shows highest expression in the xylem and catkins. However, the other homolog in the same gene cluster shows a different pattern of expression. Other sample results are given in Data S2. Currently, many functional annotations are derived through sequence homology. It is envisaged that the results from this pipeline will ultimately aid in the improvement of functional annotations of genes, and we are planning to provide our services to a new International Arabidopsis Informatics Consortium information portal and other bioinformatics resources (International Arabidopsis Informatics Consortium, 2010). Further, this pipeline may be useful in automatically deriving Plant Ontology terms for datasets submitted to gene expression databases.

RNA-seq and whole-genome tiling arrays are powerful methods for transcript-specific gene expression profiling. Our method will continue to prove useful as these data become more common. The gene expression profiling datasets we employed for this study in most cases did not contain splice variant information: only 268 genes from poplar, 623 genes from rice, 550 genes from M. truncatula and 13 683 genes from maize had probe sets that mapped to alternately spliced transcripts. Further, the one RNA-seq dataset that we used (from soybean) did not report expression level differences for different transcripts. Thus, for the tissue equivalency analyses we present here, we only used the expression information for the ‘.1’ or ‘T01’ primary transcript, as in most cases these were the only expression data available. However, when we repeated our tissue equivalency analysis for poplar–Arabidopsis using the expression information for the 268 genes with probe sets for alternately spliced transcripts, we found no difference using the set of one-to-one homologs plus randomly selected pairs of one-to-many homologs. The results were similar for maize–rice when we included the alternate transcripts for the 13 683 genes (see Data S3). As more than 50% of expressed genes with introns in maize exhibit alternate splicing along a leaf developmental gradient (Li et al., 2010), including expression information for large numbers of alternate transcripts would shift the distributions of the one-to-one, one-to-many and many-to-many homolog clusters away from membership in the one-to-one homolog clusters and toward membership in the many-to-many homolog clusters. Thus, with ‘complete’ RNA-seq datasets (i.e. those providing expression information for all transcript variants across many tissues), it may be necessary to use the set of ‘one-to-one homolog pairs plus random one-to-many pairs plus random many-to-many pairs’ for tissue equivalency calculations. 
We do not anticipate this being an issue, as we show in Table 2 that this set performs almost as well as the set we chose for use in this paper for tissue equivalency calculations.

In summary, using the eFP browser framework and Expressolog Tree Viewer, it is now possible to readily view expression patterns of homologs across different species. The ranking of homologs by both sequence similarity and expression profile similarity allows the user to assess the relationship between a given gene and its homologs in terms of expression profile similarities, providing further information regarding functional equivalency and improving the functional annotation of homologous genes. In addition, our pipeline will become more useful as more plant gene expression atlases are generated, perhaps by such efforts as the 1000 Plant Transcriptomes Project (Stewart et al., 2010). We will re-run the pipeline as more gene expression atlases are generated.

Experimental procedures

Sequence files

The protein sequence files used were as follows: A. thaliana, TAIR10_pep_20101214.fa, downloaded from; M. truncatula, Mt3.0_proteins_20090702_NAMED.fa, downloaded from; P. trichocarpa, Ptrichocarpa_156_peptide.fa.gz, downloaded from; G. max, Glyma1.pep.fa.gz, downloaded from; O. sativa, TIGR 6.1 all.pep, downloaded from; H. vulgare, translated with the OrfPredictor (Min et al., 2005) tool from a sequence file provided by Federico Giorgi (Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany); Z. mays, Zmays_166_peptide.fa, downloaded from

Orthologs were identified using OrthoMCL version 1.4 (Li et al., 2003), which was run using the following parameters set in the scripts: Mode = 1; P-Value Cutoff = 1e-10; Percent Identity Cutoff = 60; Percent Match Cutoff = 60; MCL Algorithm Inflation value = 2.2, based on manual investigation and comparison with other online databases.

Sequence similarity calculations

For the analyses performed in this paper and the values presented in our Expressolog Tree Viewer, we computed the sequence similarity scores using the command line version of CLUSTAL W (Thompson et al., 1994) with the following command: ‘clustalw -infile=[INPUT FILE] -outfile=[OUTPUT FILE] -pwmatrix=gonnet -pwgapopen=10 -pwgapext=0.1 > [STDOUT FILE]’, where the [INPUT FILE], [OUTPUT FILE] and [STDOUT FILE] filenames were specified by the scripts that we designed and wrote for our pipeline. These are the default settings. Alignments presented in the output of the Expressolog Tree Viewer are generated ‘on the fly’ using MAFFT (Katoh et al., 2002) from sequences stored in our Expressolog database. We use the command ‘mafft --globalpair --maxiterate 500 --op 1.53 --ep 0.123 --quiet [INPUT FILE] > [OUTPUT FILE]’, where ‘op’ is the gap opening penalty, ‘ep’ is the gap extension penalty, ‘globalpair’ denotes an accurate global alignment, and the ‘maxiterate’ option tells MAFFT to refine the alignment for up to 500 iterations. The ‘op’ and ‘ep’ values are default values, and the [INPUT FILE] and [OUTPUT FILE] are file names specified by the Expressolog Tree Viewer script.
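As an illustration of how a script might assemble these two command lines, the following sketch builds the argument lists in Python. The function names are hypothetical and this is not the code used in our pipeline; it assumes that the `clustalw` and `mafft` executables are on the PATH, and it writes MAFFT's long options with two hyphens, as MAFFT expects.

```python
import shutil
import subprocess


def clustalw_command(infile, outfile):
    """Argument list for the pairwise-scoring CLUSTAL W call described above."""
    return [
        "clustalw",
        f"-infile={infile}",
        f"-outfile={outfile}",
        "-pwmatrix=gonnet",
        "-pwgapopen=10",
        "-pwgapext=0.1",
    ]


def mafft_command(infile):
    """Argument list for the MAFFT call; the alignment is written to stdout."""
    return [
        "mafft", "--globalpair", "--maxiterate", "500",
        "--op", "1.53", "--ep", "0.123", "--quiet", infile,
    ]


def run_mafft(infile, outfile):
    """Run MAFFT and redirect its stdout to `outfile` (hypothetical helper)."""
    if shutil.which("mafft") is None:
        raise FileNotFoundError("mafft executable not found on PATH")
    with open(outfile, "w") as handle:
        subprocess.run(mafft_command(infile), stdout=handle, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names are generated by a script.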

Expression datasets, mapping and normalization

Expression datasets, platforms with GEO platform identifiers and mapping files used for each species for use in ranking of orthologs and subsequent production of eFP browsers were as follows: A. thaliana, AtGenExpress data series of Schmid et al. (2005), Affymetrix ATH1 platform GPL198, mapping to gene models performed using; P. trichocarpa, GEO accession number GSE13990, Affymetrix poplar genome array GPL4359, mapping to gene models performed using; M. truncatula, ArrayExpress experiment name E-MEXP-1097, Affymetrix Medicago genome array GPL4652, mapping file IMGAGv3MAPPINGS.txt for mapping to IMGAG version 3 gene models provided by Jeremy Murray (Samuel Roberts Noble Foundation,; G. max,, data are RNA-seq data so no mapping file was necessary; O. sativa, GEO accession numbers GSE7951 and GSE6893, Affymetrix rice array GPL2025, mapping to gene models performed using; H. vulgare, GEO accession number GSE16754, ArrayExpress experiment name E-AFMX-3, Affymetrix barley genome array GPL1340, mapping to gene models performed using; Z. mays, PlexDB experiment number ZM37, Nimblegen maize whole-genome microarray 385K (version V1_4a.53), mapping of Sekhon et al. (2011) expression data based on probes to maize gene models performed by Ethalinda Cannon (Iowa State University, Ames) for PlexDB.

The number of genes mapping to each platform is shown in Table 1. For all Affymetrix platforms, .CEL files were normalized in R/Bioconductor (Gentleman et al., 2004) using the MAS5 algorithm, with a TGT value of 100. For soybean RNA-seq expression data, FPKM-normalized data were obtained from the URL indicated above. For maize gene expression data obtained from PlexDB as described above, RMA-normalized expression values were linearized for use in this work using the equation $y = 2^{x}$, where x represents the RMA-normalized expression value, and y is the value used for this work. Only primary transcripts and protein isoforms were considered for tissue equivalency calculations. However, for expressolog computation, alternative protein isoforms and their corresponding transcripts were used where available.
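The linearization step for the maize data can be written out as a short example. It assumes that the RMA-normalized values are on a log2 scale, which is the usual output of RMA, so the equation is read as y equal to 2 raised to the power x; the function name and the input values are illustrative only.

```python
def linearize_rma(log2_value):
    """Convert a log2-scale RMA expression value back to a linear scale."""
    return 2.0 ** log2_value


# Placeholder log2-scale values, not data from this study.
rma_values = [3.0, 7.5, 10.0]
linear_values = [linearize_rma(x) for x in rma_values]
print(linear_values)
```

Linearizing puts the maize values on the same kind of scale as the MAS5 and FPKM values used for the other species.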

Correlation metrics

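Three correlation metrics were compared in this study. The Pearson correlation coefficient

$$r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$$

measures correlation between two vectors X and Y standardized by the standard deviation of the vectors.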

The formula for the uncentred Pearson correlation coefficient is similar to the one above for the Pearson correlation coefficient, but assumes that the mean is always equal to zero. The difference is seen if there are two vectors with identical shapes but a constant offset relative to each other. In this case, the Pearson correlation coefficient would give a value of 1 but the uncentred Pearson correlation coefficient would not.

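Spearman’s correlation coefficient

$$\rho = \frac{\sum_{i=1}^{n}\bigl(R(X_i) - \bar{R}_X\bigr)\bigl(R(Y_i) - \bar{R}_Y\bigr)}{\sqrt{\sum_{i=1}^{n}\bigl(R(X_i) - \bar{R}_X\bigr)^2}\,\sqrt{\sum_{i=1}^{n}\bigl(R(Y_i) - \bar{R}_Y\bigr)^2}}$$

takes the ranks R(X_i) and R(Y_i) of the expression values into account, rather than the absolute values themselves.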
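The behaviour of the three metrics can be compared with a small, dependency-free sketch. The function names and the toy expression profiles are illustrative assumptions and are not taken from our pipeline; ties are given average ranks, which is one common convention for Spearman’s coefficient.

```python
import math


def pearson(x, y):
    """Centred Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mean_x) ** 2 for a in x))
           * math.sqrt(sum((b - mean_y) ** 2 for b in y)))
    return num / den


def uncentred_pearson(x, y):
    """Pearson-like coefficient computed as if both means were zero."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return num / den


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's coefficient: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy expression profiles across four tissues (placeholder values).
profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [11.0, 12.0, 13.0, 14.0]   # same shape as profile_a, offset by 10
profile_c = [1.0, 4.0, 9.0, 100.0]     # same ordering, non-linear increase

print(pearson(profile_a, profile_b), uncentred_pearson(profile_a, profile_b))
print(pearson(profile_a, profile_c), spearman(profile_a, profile_c))
```

With these inputs the offset profile gives a Pearson value of essentially 1 but an uncentred value below 1, while the non-linear but monotonic profile gives a Spearman value of essentially 1 and a lower Pearson value.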

Software and webservices

The Expressolog Tree Viewer may be accessed on the BAR webserver at or by clicking on the ‘Expressolog’ icon at the top of eFP browser output pages, which appears if a given query gene has homologs in other species. Our eFP browser code has been adapted to permit easy implementation for any species of interest. It is available on (version 1.5). Additionally, information can be retrieved through the use of JSON-based web services at the following URL: [{"gene":"GENE OF INTEREST"},{"gene":"GENE OF INTEREST"},…], where the ‘GENE OF INTEREST’ may be any gene from the seven species of interest to this study. Any number of genes may be submitted in the format shown above, and a JSON data structure is retrieved giving information in the following order: {"probeset_A": "Probeset of original gene", "gene_B": "Homologous gene", "probeset_B": "Probeset of homologous gene", "correlation_coefficient": "SCC value of expression profile correlation", "efp_link": "Link to eFP output for species-specific browser of the homologous gene"}. A separate data structure such as the one above is given for each homolog, separated by a comma. This structure is straightforward to parse and facilitates bulk retrieval of the relevant information.
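A minimal sketch of how a client might prepare a query and parse a response is given below. It makes several assumptions: the response is treated as a flat JSON list of records with the keys listed above, the function names are hypothetical, and the sample response consists entirely of placeholder values rather than real output from the web service. No network request is made, because the service URL is not reproduced here.

```python
import json


def build_query(genes):
    """Serialize gene identifiers into the list-of-objects request format."""
    return json.dumps([{"gene": gene} for gene in genes])


def rank_by_scc(response_text):
    """Parse a response and sort the homolog records by descending SCC."""
    records = json.loads(response_text)
    return sorted(
        records,
        key=lambda rec: float(rec["correlation_coefficient"]),
        reverse=True,
    )


# Placeholder response (assumed shape, invented values).
sample_response = json.dumps([
    {"probeset_A": "PROBESET_A", "gene_B": "HOMOLOG_1",
     "probeset_B": "PROBESET_B1", "correlation_coefficient": "0.42",
     "efp_link": "EFP_LINK_1"},
    {"probeset_A": "PROBESET_A", "gene_B": "HOMOLOG_2",
     "probeset_B": "PROBESET_B2", "correlation_coefficient": "0.87",
     "efp_link": "EFP_LINK_2"},
])

query = build_query(["GENE_1", "GENE_2"])
ranked = rank_by_scc(sample_response)
best = ranked[0]["gene_B"]
print(query)
print(best)
```

Converting the coefficient with `float` lets the same code handle services that return the SCC either as a number or as a string.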


We would like to acknowledge funding provided by the Natural Sciences and Engineering Research Council of Canada to N.J.P. for this study, and from the Agricultural BioProducts Innovation Program of Agriculture and Agri-Food Canada to R.B. We are also grateful to Ethalinda Cannon, Wimalanathan Kokulapalan, and Carolyn Lawrence (Department of Genetics, Development and Cell Biology) of MaizeGDB at Iowa State University, Ames, IA, U.S.A., and Shawn Kaeppler and Rajandeep Sekhon (Department of Agronomy) of the University of Wisconsin, Madison, WI, U.S.A., for discussions and images helpful in creating the Maize eFP browser. We are also grateful to Darshan Brar from the International Rice Research Institute for providing us with images for the rice eFP browser. Federico Giorgi at the Max Planck Institute for Molecular Plant Physiology kindly provided barley sequences. Jeremy Murray from the Samuel Roberts Noble Foundation kindly provided M. truncatula mapping files.