Keywords:

  • microarray;
  • database;
  • genomics;
  • expression profiling;
  • transcriptome;
  • bioconductor

Summary

The increasing accessibility and use of microarrays in transcriptomics has accentuated the need for purpose-designed storage and analysis tools. Here we present UPSC-BASE, a database for analysis and storage of Populus DNA microarray data. A microarray analysis pipeline has also been established to allow consistent and efficient analysis (from small to large scale) of samples in various experimental designs. A range of optimized experimental protocols is provided for each step in generating the data. Within UPSC-BASE, researchers can perform standard and advanced microarray analysis procedures in a user-friendly environment. Background corrections, normalizations, quality-control tools, visualizations, hypothesis tests and export tools are provided without requirements for expert-level knowledge. Although the database has been developed primarily for handling Populus DNA microarrays, most of the tools are generic and can be used for other types of microarray. UPSC-BASE is also a repository of Populus microarray information, providing data from 21 experiments on a total of 407 microarray hybridizations in the public domain of the database. There are also an additional 10 experiments containing 347 hybridizations, where the automatically analysed data are searchable.


Introduction

Advanced global gene-expression profiling tools, such as Gene-Chips (Lockhart et al., 1996) and spotted microarrays (Schena et al., 1995), are becoming common features in laboratories, used not only by highly trained specialists but increasingly also by researchers who lack training in the sophisticated data-handling procedures required to optimize their use. The goal for biologists using DNA microarrays is to find relevant information and answers to their specific questions as quickly and conveniently as possible, rather than spending large amounts of time identifying the ideal cDNA-synthesis, labelling and statistical techniques for their experiments. Problems with labelling, hybridization, washing, scanning, image analysis, normalization and statistical treatment of data strongly influence the outcome of the analysis, and re-analysis of published data can often lead to results that differ significantly from those obtained originally (Gu and Gu, 2003; Wang et al., 2002). If DNA microarray results from several experiments are to be compared, new problems may appear relating not only to the biological source material, but also to differences in analytical procedures. In recent years, several large-scale array-analysis projects, such as AtGenExpress (Schmid et al., 2005), have been conducted in which a large number of samples have been analysed in a standardized, ordered fashion. These projects create invaluable resources for the scientific community, but require resources far beyond those of a normal research project.

Appropriate logistical tools are required for handling and analysing the tremendous amounts of data produced in even a single microarray experiment. To gain as much knowledge as possible from microarray experiments, at least two types of database are essential: a storage and analysis database of expression data obtained from the analyses; and an annotation database connecting independent array elements with second-level sequence information and possible gene identification, and third-level functional classification. As increasing amounts of data from array experiments are published, the need for public repositories has become increasingly evident to allow re- and meta-analysis of data (Ball et al., 2004). This need is being met by the large numbers of commercial and public DNA microarray database structures now available (Penkett and Bahler, 2004). Such repositories, for instance ArrayExpress (Brazma et al., 2003) and Gene Expression Omnibus (GEO) (Edgar et al., 2002), require data to be submitted in a standardized format. Within the microarray community such standards have been established by the Microarray Gene Expression Data Society, which has presented the ‘Minimum information about a microarray experiment’ (MIAME) standards (Brazma et al., 2001; Stoeckert et al., 2002).

The genome of Populus trichocarpa was the third plant genome to be fully sequenced (Tuskan et al., 2006), making Populus the most important model tree system for plant genomics currently available. Extensive Populus expressed sequence tag (EST) collections have been compiled (Bhalerao et al., 2003; Kohler et al., 2003; Nanjo et al., 2004; Sterky et al., 1998, 2004), which not only are important for annotation of the genome and to confirm the expression of predicted genes, but also can be used to obtain digital expression profiles of genes (Ewing et al., 1999; Sterky et al., 2004). However, these digital expression profiles do not yield very accurate estimates of expression levels. DNA microarrays have much greater potential to provide precise information on gene expression, and have been used in several cases to analyse changes in gene expression in Populus (Andersson et al., 2004; Hertzberg et al., 2001; Israelsson et al., 2003; Kohler et al., 2003; Lafarguette et al., 2004; Moreau et al., 2005; Rishi et al., 2004; Schrader et al., 2004a,b; Smith et al., 2004; Taylor et al., 2005). Furthermore, full-genome arrays, based on the genome sequence, are under production. We were the first to produce Populus cDNA microarrays (Hertzberg et al., 2001), and our most recently generated array is based on a 100 K EST data set (Sterky et al., 2004), estimated to represent 17 345 of the gene models in the Populus genome (B. Segerman, UPSC, Umeå, Sweden personal communication). This corresponds to a significant part of the transcriptome (Tuskan et al., 2006). As a large number of experiments are being performed with our DNA microarrays, we wanted to establish a standard operating procedure that should make the analysis simple and more reliable, especially for less experienced researchers, to allow a higher throughput of array experiments. 
A further advantage of a standard operating procedure is that it should facilitate comparisons of array data generated in different experiments and by different researchers, and thus help make the overall value of the array experiments greater than the sum of the individual experiments. For this reason, we wanted to develop a DNA microarray analysis pipeline and a database to store the results.

We have developed the UPSC-BASE database (http://www.upscbase.db.umu.se) for hosting plant microarray data (more specifically, data from Populus and Arabidopsis arrays). The database provides the user with up-to-date laboratory procedures for microarrays, as well as tools for downstream data analysis. It connects to annotation databases (PopulusDB) as well as other gene-expression databases [for Arabidopsis, The Arabidopsis Information Resource (TAIR) (Huala et al., 2001), Nottingham Arabidopsis Stock Centre (NASC) Affymetrix service (Craigon et al., 2004), Gene Expression Omnibus (GEO) (Edgar et al., 2002) and Genevestigator (Zimmermann et al., 2004)]. It is based on the free web-based database solution BASE (BioArray Software Environment) (Saal et al., 2002). In our setup, the intention has been to provide the researcher with logical steps without bottlenecks in the data production and analysis steps, so that data deposition, normalization, transformation, and statistical and hypothesis tests are easy to follow, and generate understandable results that can lead researchers to valid conclusions. All experimental protocols (all tested and optimized), all relevant information on the Populus DNA microarrays, all plug-ins developed, and a description of all modifications of the original BASE package are freely available at the UPSC-BASE website.

Experimental design

Early microarray experiments were typically small-scale and had little biological or technical replication. As the technology has matured, it has become possible to perform more complex experiments examining the effects of several factors (e.g. time, mutations and environmental treatments). Consequently, it has become increasingly important to apply appropriate experimental designs to cDNA microarray analyses to ensure results are reliable (Churchill, 2002; Kerr and Churchill, 2001; Yang and Speed, 2002). UPSC-BASE features an interactive tool for generating experimental designs, referred to as the design advisor. The design advisor calculates optimal design solutions for situations where many biological samples are present and an exhaustive pairwise hybridization scheme is unrealistically labour-intensive and costly. The functionality of the design advisor is described in more detail on the UPSC-BASE website and by Vinciotti et al. (2005), Wit and McClure (2004) and Wit et al. (2005). By using the design advisor prior to hybridization, we believe the final quality of data can be increased while keeping the required number of hybridizations to a minimum, thus reducing both cost and manual effort. Although few published microarray studies have used a loop or factorial design, the published data suggest that this approach is superior to other alternatives (Vinciotti et al., 2005). This is the approach we follow in our recommended pipeline.

To demonstrate the reproducibility of our analysis pipeline and the features and usefulness of UPSC-BASE, we used a data set obtained from rehybridizing six leaf samples collected during various stages of development from a free-growing aspen. The biological samples are described in (and the raw data for the original experiment stored as) experiment UMA-0032 at http://www.upscbase.db.umu.se. The ‘wet’ part of the pipeline and the initial downstream steps were performed by another individual, and the experimental design was completely different from that used during the ‘original’ data collection, to assess the robustness and consistency of our pipeline. In contrast to the common reference in the original experiment, we chose to use an all-versus-all design for the demonstration experiment, with one sample date included in triplicate in the hybridization design (Figure 1a,b). UPSC-BASE features a visualization plug-in for producing overview graphs of the experimental design. The raw data for the all-versus-all experiment, plus the results obtained after each step of the analysis, can be downloaded from experiment number UMA-0013 at http://www.upscbase.db.umu.se. We aimed at demonstrating both the high reproducibility of our array analysis pipeline, and the fact that a good design can make it possible to obtain faithful data from many samples using a minimum of hybridizations. In this design we analysed eight samples using a total of 28 microarrays.
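The hybridization count quoted above can be checked with a short sketch. It assumes that the eight samples are five single sample dates plus one date in triplicate, and that an all-versus-all design uses one microarray per unordered pair of samples; the sample labels and function names are hypothetical and are not taken from UPSC-BASE.

```python
from itertools import combinations


def all_versus_all(samples):
    """Return every unordered pair of samples (one hybridization per pair)."""
    return list(combinations(samples, 2))


def common_reference(samples, reference="REF"):
    """Return one hybridization per sample, each against a shared reference."""
    return [(sample, reference) for sample in samples]


# Hypothetical labels: five single dates plus one date in triplicate.
samples = ["d1", "d2", "d3", "d4", "d5", "d6_rep1", "d6_rep2", "d6_rep3"]

pairs = all_versus_all(samples)
reference_pairs = common_reference(samples)

print(len(pairs))            # 28 hybridizations for eight samples
print(len(reference_pairs))  # 8 hybridizations in a common reference design
```

In the all-versus-all layout every sample is hybridized against each of the other seven, which is why eight samples require 28 microarrays.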

Figure 1.  The two designs represented in the microarray experiment. The same samples were hybridized in the ‘original’ experiment in a common reference design (a) and in the ‘demonstration’ experiment in an all-versus-all design (b).

Production of the Populus (POP2) microarray

The microarrays used here constitute the second generation of the global Populus cDNA microarrays and contain, in total, 24 735 cDNA fragments. This array is based on the first-generation 13-k Populus array (Andersson et al., 2004) with clones from seven cDNA libraries, representing the cambial zone (AB), young leaves (C), floral buds (F), tension wood (G), senescing leaves (I), dormant cambium (UA) and active cambium (UB). The 25-k array contains clones from the 13-k array plus 12 additional cDNA libraries, representing the apical shoot (K), cold-stressed leaves (L), roots (R), bark (N), shoot meristem (T), male catkins (V), dormant buds (Q), female catkins (M), petioles (P), fibre death (X), imbibed seeds (S) and virus/fungus-infected leaves (Y). For a detailed description of the construction and sequencing of the cDNA libraries, see Sterky et al. (2004). The arrays were produced and quality-tested as described by Moreau et al. (2005).

Generating the array image

The flow chart of the ‘wet’ part of our analysis pipeline is depicted in Figure 2. We provide several protocols that we have found suitable for extracting RNA from different Populus tissues. The cetyl trimethyl ammonium bromide (CTAB)/lithium chloride (LiCl) method (Chang et al., 1993; Doyle and Doyle, 1987) is a general-purpose RNA-extraction method that is useful if plenty of material is available. Modified protocols are also available for small samples and for extracting RNA from stem tissue. A TRIzol reagent method (Invitrogen, Carlsbad, CA, USA) works less robustly for Populus than for Arabidopsis. A modified RNeasy kit (Qiagen, Valencia, CA, USA) gives lower yields, but the procedure is quicker. The Dynabead-based mRNA-extraction method (Hertzberg et al., 2001) allows extraction from tissue samples as small as 1 mg using direct labelling of the cDNA. If plenty of material is available (>20 μg RNA per hybridization), the standard cDNA synthesis/indirect labelling protocol is the most robust method. More recently, the MessageAmp amplification method (Ambion, Austin, TX, USA) with indirect labelling has also proved useful when smaller amounts of RNA (approximately 1 μg) are available.

Figure 2.  Flow-chart of laboratory work. Schematic overview of methods used in the laboratory component of the microarray analyses stored in the UPSC-BASE database.

Hybridization in an automated slide processor (Amersham Bioscience, Little Chalfont, UK) is a highly reproducible method, suitable for most experiments, which generates very high-quality data. The standard protocols have been optimized in several ways for our material to give stronger hybridization signals, less background and more uniform hybridization results. In our setup, 12 slides can be processed in parallel within a total run time of 24 h. As an alternative, the standard manual hybridization protocol is believed by some researchers to give stronger hybridization signals, but typically at the expense of the evenness of the hybridization.

The linear range of the signal intensity is a limitation of microarrays. It is often impossible to select the scanning parameters in such a way that both very strongly and weakly expressed genes can be faithfully analysed simultaneously. We provide two parallel scanning procedures. First, when the aim is to study a subset of genes, the channel calibration method can be used. The results are limited to genes with signals within the linear range (with moderate expression levels); spots with too low or too high intensity will not give ratio values that are directly correlated with the ‘true’ hybridization signals. Second, to standardize the procedure and obtain useful data for genes with both high and low expression levels, we have implemented an alternative method based on multiple scanning of the microarray slides. Each microarray slide is scanned three times with predetermined increases in laser power and photo-multiplier tube settings using a Scanarray 4000 scanner (Perkin-Elmer Life and Analytical Sciences, Wellesley, MA, USA). The data from each physical microarray slide are then merged with a regression method, restricted linear scaling, within the linear range (Ryden et al., 2006). Restricted linear scaling is a method to handle the problems associated with missing and saturated signals, which occur in most types of microarray experiments.
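The idea of merging scans by regression within the linear range can be illustrated with a deliberately simplified sketch. This is not the published restricted linear scaling algorithm of Ryden et al. (2006). It assumes only two scans instead of three, a least-squares line through the origin, and hypothetical limits for the linear range (`lower`, `upper`). All intensities are invented.

```python
def fit_scale(low, high, lower=200.0, upper=50000.0):
    """Least-squares slope through the origin, fitted only on spots whose
    intensity lies inside [lower, upper] in both scans."""
    usable = [(x, y) for x, y in zip(low, high)
              if lower <= x <= upper and lower <= y <= upper]
    if not usable:
        raise ValueError("no spots inside the linear range in both scans")
    return sum(x * y for x, y in usable) / sum(x * x for x, _ in usable)


def merge_scans(low, high, lower=200.0, upper=50000.0):
    """Keep the high-setting scan where it is below the upper limit;
    otherwise substitute the rescaled low-setting scan."""
    scale = fit_scale(low, high, lower, upper)
    return [y if y <= upper else scale * x for x, y in zip(low, high)]


# Invented intensities: the last spot is saturated in the high-setting scan.
low_scan = [100.0, 500.0, 1000.0, 5000.0, 20000.0]
high_scan = [400.0, 2000.0, 4000.0, 20000.0, 65535.0]

merged = merge_scans(low_scan, high_scan)
print(merged)
```

Restricting the fit to spots inside the linear range in both scans keeps weak and saturated spots from distorting the estimated scale factor.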

Database installation

For storing and processing the data, we have incorporated two major microarray-analysis packages, BASE (Saal et al., 2002) and Bioconductor (Gentleman et al., 2004), into UPSC-BASE. Compared with the downloadable BASE system, our installation features a large set of local adjustments, primarily related to large-scale data handling and a simplified user interface. A comprehensive description of the modifications is beyond the scope of this paper; however, a brief summary of the most wide-ranging extensions is given below, and additional information is available on the UPSC-BASE website.

Depending on the design of the experiment, data can be analysed in several different ways. In contrast to Bioconductor, which uses the command-line interface of R (Ihaka and Gentleman, 1996), UPSC-BASE is accessed via a web interface and helps the user perform advanced analyses without expert knowledge of statistics and computer programming. Many plug-ins have been implemented in the system (Table 1), most based on published methods, but some developed in-house.

Table 1.   Current implemented plug-in tools for microarray analysis in UPSC-BASE

Category                Plug-in
Analysis                Multi-dimensional scaling
                        Principal components analysis
                        Principal components analysis (Nonlinear iterative partial least squares)
                        Pearson/Spearman correlation of signal intensities or ratios
Background correction   Subtraction
                        Moving minimum
                        Half subtraction
                        Minimum subtraction
                        Edwards
                        Normal and exponential
                        UmeaSAMED linear background correction
Clustering              K-means clustering
                        Hierarchical clustering
                        Self-organizing map
Export (a)              FASTA
                        GeneSpring
                        MA/RG
                        MapMan/AraCyc
Hypothesis test         t-test/Mann–Whitney test
                        B-statistics
                        DEDS (Differential expression via distance synthesis)
                        manova (fixed model)
                        Significance analysis of microarrays
Normalization           Global median ratio
                        Loess
                        Print-tip Loess
                        2D spatial location
                        Composite
                        RobustSplines
                        Robust neural networks (neural nets normalization)
                        Optimized local intensity-dependent normalization
                        Stepwise normalization
                        Between arrays (scale or quantile)
Quality control         Array plots
                        Bias-estimation plots
                        MA control spots
                        Rank-intensity plot
                        UmeaSAMED QC
Transformation          UmeaSAMED restricted linear scaling
Visualization           Chromosome viewer
                        Design plot
                        Digital Northern
                        Gene ontology plots
                        Time-series plot

(a) Only the advanced export functions are listed as plug-ins.

Data import en masse is facilitated using a batch import tool suitable for importing complete experiments. Required data files (such as scanned images and data files from the feature extraction) are uploaded by the user via file transfer protocol (FTP) and linked to a suitable experiment based on an experiment description file, a tab-delimited text file that defines various properties of the experiment.
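As a rough illustration of how a tab-delimited experiment description file could link uploaded files to an experiment, consider the sketch below. The column names (`slide`, `cy3_sample`, `cy5_sample`, `data_file`) and file names are hypothetical; the actual layout of the UPSC-BASE description file is not given here.

```python
import csv
import io

# Hypothetical layout: one row per hybridization, tab-delimited.
EXAMPLE = (
    "slide\tcy3_sample\tcy5_sample\tdata_file\n"
    "S001\tleaf_d1\tleaf_d2\tS001.gpr\n"
    "S002\tleaf_d2\tleaf_d3\tS002.gpr\n"
)


def read_description(handle):
    """Parse the tab-delimited description into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def link_files(rows, uploaded):
    """Separate rows whose data file was uploaded from those still missing."""
    linked = [row for row in rows if row["data_file"] in uploaded]
    missing = [row["data_file"] for row in rows
               if row["data_file"] not in uploaded]
    return linked, missing


rows = read_description(io.StringIO(EXAMPLE))
linked, missing = link_files(rows, uploaded={"S001.gpr"})
print(len(linked), missing)
```

Checking the description against the uploaded files before import makes incomplete uploads visible early, which matters when a whole experiment is imported in one batch.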

The experiment description file can handle virtually all aspects of a MIAME-compliant experiment (Brazma et al., 2001). For cross-experiment comparisons, a database search tool has been implemented that browses the entire set of available analysed data in the database. The search procedure queries a set of array elements (based on internal ID, annotation information or functional class) and displays matching slides grouped by experiment or array element.

All public data in UPSC-BASE are analysed automatically with a standard procedure, including linear scaling and stepwise normalization (Wilson et al., 2003), to give reliable and standardized data. Analysis procedures have been simplified by sending supplementary information from the laboratory information management system (LIMS) to the analysis tools, making it possible to utilize the complete design of the experiment without user intervention. This is particularly useful for hypothesis tests such as B-statistics (Lonnstedt and Speed, 2002), which would otherwise require the potentially error-prone step of manually inputting the design matrix of the experiment. Furthermore, to provide a basis for standard analysis packages, a batch plug-in feature is available: instead of running analysis tools one at a time, several plug-ins can be queued to run sequentially. This also makes it possible to provide standard analysis pathways that bring conformity to the analysis of the data within the database. Currently, several proposed analysis packages have been pre-defined and are available for all regular users.

In addition to enabling cross-experiment comparisons and large-scale data handling, integration with internal and public databases has been a key consideration in the design of the UPSC-BASE microarray database. Annotation information and functional class assignments are updated automatically from PopulusDB (Sterky et al., 2004) on a weekly basis to provide up-to-date annotations. Furthermore, PopulusDB has been extended with links to the microarray database search tool so that information on specific clones can be found quickly. For the Complete Arabidopsis Transcriptome MicroArray (CATMA; Huala et al., 2001), annotations are also downloaded from TAIR on a weekly basis. A more detailed description of database modifications is available on request.

Data generation and quality assessment

An overview of the analysis pipeline, from TIFF image to interpretation, is shown in Figure 3. The image analysis is performed in GenePix 5.0 (Axon Instruments, Union City, CA, USA) with standardized settings. In our experience, the best results are obtained when TIFF images are analysed with composite pixel intensity (CPI) settings configured to find circular features with a diameter of 80–150% of the expected size, and with the CPI threshold set to 300. In this way, very weak spots are automatically marked as ‘not found’. The extracted data are stored as plain text files and composite JPEG images. There are three alternative ways to handle bad spots: no flagging, manual flagging or automatic flagging. Although time-consuming, the manual method works well for experiments including relatively few (and not too large) microarrays, while the automatic flagging method MASQOT (microarray spot quality control; Bylesjo et al., 2005) is a reproducible alternative for high-throughput studies.

Figure 3.  Flow-chart of analysis. Schematic overview of methods used in the analysis of microarrays stored in the UPSC-BASE database.

Microarrays to be included in the analysis are selected by creating a BioAssaySet containing extracted data from the raw files. Raw data can be imported as median- or mean-quantified values for background and foreground. Spatial and intensity visualizations of foreground and background intensities are used as quick checks for spatial quality-control problems; unbalanced signal intensities are common, especially for manual hybridizations. We have implemented several quality-control plug-ins, for instance array plots (Dudoit and Yang, 2002; Smyth, 2004), bias estimation (Futschik and Crompton, 2004a), rank-intensity plot (Kroll and Wolfl, 2002) and UmeaSAMED QC (http://www.umu.se/climi/bact/Microarray/R-libraries.htm), to visualize potential problems. The rank-intensity plot plug-in can be an effective tool for deciding whether to remove or keep a microarray in the data set, based on the number of missing spots and intensity distributions.

Cross-hybridization, incomplete washing and dust are all factors that contribute to background noise in the observed intensities. The ordinary local background correction, in which the observed background intensities are subtracted from the foreground intensities, is likely to underestimate the true background noise. We have implemented several methods for advanced background correction (for an overview see Table 1). For example, the linear background correction method (http://www.umu.se/climi/bact/Microarray/R-libraries.htm) combines information from observed background intensities and observations from negative control genes to estimate background-corrected intensities.

A median normalization is capable of coping with linear signal-intensity differences between two channels, but with large differences the systematic error is typically not linear. Loess normalization can remove non-linear intensity dependencies, often visualized as a curvature in an MA plot (Yang et al., 2002). An MA-plot is a plot of log-intensity ratios (M-values) versus log-intensity averages (A-values). Normalization methods, such as optimized local intensity-dependent normalization (Futschik and Crompton, 2004a,b), neural nets normalization (Tarca et al., 2005) or stepwise normalization (Wilson et al., 2003; Yang et al., 2002), are needed to remove spatial problems. To obtain highly reproducible and robust results between microarray experiments, we have chosen to use the stepwise normalization in our pipeline.
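The M- and A-values defined above, and the effect of a global median normalization, can be sketched as follows. The sketch assumes base-2 logarithms (the conventional choice, although the base is not stated here) and already background-corrected, strictly positive channel intensities; the intensities and function names are invented for illustration.

```python
import math
from statistics import median


def ma_values(red, green):
    """Return (M, A) lists with M = log2(R/G) and A = 0.5 * log2(R*G)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def median_normalize(m):
    """Shift all M-values by a constant so that their median becomes zero."""
    centre = median(m)
    return [value - centre for value in m]


# Invented intensities: the red channel is uniformly twice as bright,
# except for the last spot, which shows a larger difference.
red = [200.0, 800.0, 3200.0, 12800.0]
green = [100.0, 400.0, 1600.0, 1600.0]

m, a = ma_values(red, green)
m_norm = median_normalize(m)
print(m_norm)
```

A constant shift of this kind corrects only a uniform channel imbalance. It cannot remove the intensity-dependent curvature in an MA plot, which is why Loess-type methods are applied in those cases.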

Analysis and visualization tools

The effects of systematic error are minimized in normalized data sets, and different kinds of hypotheses can be tested to pinpoint the biological implications of the results. UPSC-BASE has several plug-ins for hypothesis testing and visualization. For pairwise comparisons, methods based on different types of t-test can be applied. However, for experiments with three or more samples, or with multiple factors, more advanced methods are needed that use all the information in the multivariate experimental design and so avoid multiple pairwise comparisons. In UPSC-BASE, there are two choices: either ANOVA (Kerr et al., 2000) or analysis by linear models (Smyth, 2004).

Overview plots are helpful to obtain unbiased indications of general trends in the results, but the final goal is biological interpretation. Microarrays can be used to extract a few candidate genes for further studies, but can also be used as a first screening tool to elucidate biological themes (Hosack et al., 2003) or regulatory gene networks (Banerjee and Zhang, 2002). The gene ontology classification (Ashburner et al., 2000), TAIR (Huala et al., 2001), Munich Information Center for Protein Sequences (Mewes et al., 1999; Schoof et al., 2002), and Kyoto Encyclopaedia of Genes and Genomes (Kanehisa and Goto, 2000) classification schedules are very useful resources for plant biology. Gene lists generated in earlier analysis steps can be used to look for over-represented categories. The results can then be visualized as dendrograms or graphs. In Figure 4, over-represented gene ontology cellular component categories are highlighted, based on the list of upregulated genes in the sample collected on 27 May. This represents an early leaf-development stage and was compared against the overall expression profiles from young to mature leaves. The directed acyclic graph structure makes it easy to follow affected processes from general to specific categories. In our demonstration data set of 27 May, over-representation was found, for example, for the tubulin, ribosome and nucleosome cellular component categories.
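Searching a gene list for over-represented categories can be illustrated with a one-sided hypergeometric test. The text does not state which test the UPSC-BASE gene ontology plug-in uses, so the choice of test is an assumption, and all counts in the example are invented.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): probability of seeing at least k category members in a
    list of n genes drawn from N genes, of which K belong to the category."""
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Invented counts: 40 of 200 upregulated genes fall in a category that
# holds 300 of the 10000 array elements with an annotation.
p_value = hypergeom_pvalue(k=40, n=200, K=300, N=10000)
print(p_value < 0.001)
```

A category would be highlighted when its P-value falls below the chosen cut-off; testing many categories at once normally calls for a multiple-testing correction, which this sketch omits.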

Figure 4.  Gene ontology pathway analysis. Interesting groups in the directed acyclic graph structure of the gene ontology cellular component classification system. Grey boxes indicate over-represented gene ontology cellular component category groups (P < 0.001); white boxes are parental groups needed to complete the graph.

The ‘digital Northern’ tool can compare gene expression in different tissues/treatments, based on the EST data from 19 tissues/treatments (for a detailed description of the source material for the libraries see Sterky et al., 2004), mapped onto the different Populus gene models (B. Segerman, personal communication). For example, the same subset of genes as shown in Figure 4 can be compared with the digital expression profiles of 19 libraries (Figure 5), demonstrating that the leaf transcriptome on that particular day during early leaf development is most similar to the transcriptome of the apical shoot meristem, and most dissimilar to the transcriptome of senescing leaves. This tool can also be used, with some limitations, to provide a rough verification of microarray results, if Northern blotting or real-time RT-PCR is not going to be performed.

Figure 5.  Digital Northern analysis. Clustered correlation map of the EST library distribution found in PopulusDB for upregulated genes in the 27 May sample.

The ‘chromosome viewer’ displays the chromosomal locations of a gene list, based on EST mapping (B. Segerman, personal communication) onto the Populus genome sequence (Tuskan et al., 2006). This is particularly useful for analysing features such as the co-localization of differentially expressed genes or quantitative trait loci (Kennedy and Wilson, 2004; Wu and Stettler, 1994).

Integration with external software

Although many different tools are included in the database, compatibility with other analytical tools is an important feature. Several export functions were implemented in the original BASE installation to facilitate advanced analysis in external software. In addition to those, we have added a general function for exporting data as channel-wise intensities or MA values (Dudoit and Yang, 2002). We have also simplified data export from UPSC-BASE to the commercial software GeneSpring (Silicon Genetics, Redwood City, CA, USA) and used the r-genespring package to make a plug-in for this purpose, whereby all information about the biological samples can be transferred to GeneSpring as parameters.

Other useful microarray data visualization software packages for the plant research community are MapMan (Thimm et al., 2004; Usadel et al., 2005) and AraCyc (Mueller et al., 2003). The MapMan/AraCyc export plug-in extracts the microarray data in a format that can be directly imported into these packages. In Figure 6 an overview of the metabolic pathways is presented, visualizing gene-regulation patterns in early leaf development (27 May). Array elements showing positive B-statistics and at least twofold differences were exported to the MapMan software package. As visualized, genes involved in photosynthetic light reactions and in the Calvin cycle were downregulated, while those involved in cell wall degradation and mitochondrial electron transport/ATP synthesis were upregulated.

Figure 6.  MapMan results. Graphic visualization of metabolic overview pathway for spots with positive B-statistics and at least twofold changes in gene expression in the 27 May sample compared with the overall expression profile.

It is often desirable to obtain alternative confirmation of the biological interpretation of the results from microarray studies. Most commonly, real-time RT-PCR is used for gene-wise confirmation. To simplify the primer design procedure, a FASTA (Pearson and Lipman, 1988) export sends the sequences of the elements in a gene list as a multi-FASTA file. The exported file can be used directly in various external primer design software packages.
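For readers unfamiliar with the format, a multi-FASTA export of a gene list amounts to the following minimal sketch. The identifiers and sequences are invented, and the function name and the 60-character line width are assumptions rather than properties of the UPSC-BASE plug-in.

```python
def to_multi_fasta(records, width=60):
    """Format (identifier, sequence) pairs as multi-FASTA text, wrapping
    each sequence at the given line width."""
    lines = []
    for identifier, sequence in records:
        lines.append(f">{identifier}")
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Invented array elements and sequences.
gene_list = [
    ("element_001", "ATGGCTTCTTCAATGCTCTCTTCCGCTACT"),
    ("element_002", "ATGGAGAAGCTTGTTCCAGAG"),
]

print(to_multi_fasta(gene_list, width=20), end="")
```

Each record starts with a header line beginning with `>`, followed by the sequence lines, which is the layout that most primer design tools accept as input.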

To assess the robustness of the microarray analysis pipeline, we compared the data obtained from the ‘original’ analysis of the leaf development samples with data generated by two individuals using a different experimental design. Most importantly, in the experimental loop for the demonstration data set, we included a triplicate of one sample in the hybridization design. These three samples were virtually indistinguishable from each other (data not shown). When the original data set was compared with the demonstration data set using principal components analysis (Wold et al., 1987), it was clear that the results were very consistent. The total pattern of gene expression from one sample (one date) in the two experiments was always almost identical (Figure 7), demonstrating the quality of the microarray and analysis. Thus variations introduced by differences in experimental design, sample handling and hybridization appeared to be much smaller than the genuine differences found between samples. Therefore we conclude that data generated using the procedure described here generally have reasonable levels of confidence. Data displayed in the public domain of UPSC-BASE provide reliable estimates of expression levels of specific genes.

Figure 7.  Principal components analysis of design comparison. Score plot of the first two components (t1, t2) from principal components analysis of the two experimental designs. Results from the all-versus-all experiment (○) show the same pattern as the results from the common reference experiment (•).

UPSC-BASE – a public resource for Populus genomics

Data from all experiments performed using our Populus DNA microarrays are stored in UPSC-BASE. The database can be accessed via the web and has an anonymous (public) login option. Analysis is performed by individual researchers, and at this stage data can be shared with others in the same working unit; after publication of the results, or after an appropriate time lag, the data are transferred to the public domain of the database. There, anonymous users can currently access data from 21 experiments comprising 407 microarray hybridizations (Table 2). Data from a further 10 unpublished experiments, comprising 347 microarray hybridizations, are searchable under public access.

Table 2.   Overview of the publicly accessible microarray experiments in UPSC-BASE (at publication date)
| UMA no. | Experiment description | Slides |
|---|---|---|
| 1a | Seasonal variation in gene expression – whole season | 37 |
| 2 | Popface experiment; changes in gene expression in elevated CO2 | 22 |
| 3 | Virus infection of Populus tremula | 12 |
| 5 | Analysis of secondary cell wall genes during tension wood formation | 8 |
| 6a | Fungal infection of P. tremula | 35 |
| 7 | Global profile of wood-forming tissues | 20 |
| 9 | Popyomics EU programme; drought stress | 55 |
| 10 | Assessment of impact of elevated CO2 of biomass production in cottonwood | 18 |
| 11 | Impact of constitutive expression of CCAAT-binding factor (CBF) on frost tolerance of hybrid aspen | 32 |
| 12 | Finding genes involved in regulation of fibre cell death in hybrid aspen wood | 5 |
| 13 | Optimization of experimental design in microarray analysis | 28 |
| 17a | Comparative analysis of cambial and bud dormancy | 80 |
| 20a | Transcript profiling of the apical region during primordia/leaf development | 33 |
| 21 | Meristem identity in the cambial zone | 49 |
| 22 | Meristem identity | 24 |
| 25a | Effects of ozone on oxidative stress responses and respiratory processes in poplar leaves | 3 |
| 28a | Analysis of auxin-responsive gene expression during annual cambial cycle | 40 |
| 30a | Global tissue profiling | 47 |
| 31 | Transcriptomics of poplar in response to poplar mosaic virus | 18 |
| 32a | Seasonal variation in gene expression – spring 2000 and 2002 | 26 |
| 35 | Popface experiment; changes in gene expression in elevated CO2 (continuation) | 8 |
| 36 | Popyomics EU programme; UK drought-stress extremes | 24 |
| 38 | Assessing the effect of altered carbohydrate supply | 14 |
| 42 | Dynamics of leaf growth | 20 |
| 43a | Resistance of Salix viminalis to the gall midge Dasineura marginemtorquens | 26 |
| 44 | Active versus dormant cambium | 6 |
| 45a | Adventitious root formation | 20 |
| 48 | A transcriptional roadmap to wood formation | 18 |
| 49b | Changes in gene expression in the wood-forming tissue of transgenic hybrid aspen | 2 |
| 50 | A transcriptional timetable of autumn senescence | 16 |
| 81 | Expression analysis of genes encoding putative cellulose synthases in hybrid aspen | 8 |

a Data searchable but not downloadable batch-wise.
b Data available as Supplementary Material at the journal site where data was initially published.

In combination with the JGI genome browser (http://genome.jgi-psf.org/Poptr1) and PopulusDB, UPSC-BASE allows Populus researchers not only to access the Populus genome conveniently, but also to obtain expression characteristics for a considerable fraction of the genes. UPSC-BASE is continuously being developed and improved with novel analysis tools, and is growing rapidly as more experiments are transferred into the public domain, where they are accessible to external users. By streamlining the analysis procedure, we aim to provide the community with a resource containing data generated in a standardized way, which should simplify comparisons between data sets. We must, however, point out that many of the experiments included in the public domain of UPSC-BASE were performed before this analysis pipeline was established. Details of these experiments are stored in the database, but the procedures used could differ from those described here. As the individual experiments, like most other cDNA array experiments, typically use different designs and reference samples, direct between-experiment comparisons are not always possible. A considerable weakness of the Populus model system is the lack of a standardized vocabulary to describe different tissues. This is a problem that the Populus community has to solve if the value and user-friendliness of the microarray data are to be maximized. Despite these limitations, we believe that the rapidly increasing number of experiments in UPSC-BASE will constitute a useful resource for the Populus community, and perhaps for the plant science community in general.

Conclusions

We have described the pipeline for DNA microarray analysis developed for our Populus microarrays. Most of the tools are generic and can also be used for other microarrays; for example, we use them to analyse Arabidopsis CATMA microarrays (Hilson et al., 2004). These efforts have provided a large set of protocols and generic plug-ins for the microarray community using the BASE system. With minor modifications, the plug-ins can be used in all kinds of BASE installations, regardless of the organism and microarray system concerned. We believe, however, that the most significant value of this contribution is the description of the publicly accessible database, which will increase the attraction of Populus as a model system for molecular biology, genetics and genomics.

Acknowledgements

This work was supported by the Knut and Alice Wallenberg Foundation, Swedish Foundation for Strategic Research, the Swedish Research Council, Kempestiftelserna and the European Commission through the Directorate General Research within the Fifth Framework for Research – Quality of Life and Management of the Living Resources Programme, contract no. QLK5-CT-2002-00953 (POPYOMICS).

References

  • Andersson, A., Keskitalo, J., Sjödin, A. et al. (2004) A transcriptional timetable of autumn senescence. Genome Biol. 5, R24.
  • Ashburner, M., Ball, C.A., Blake, J.A. et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29.
  • Ball, C.A., Brazma, A., Causton, H. et al. (2004) Submission of microarray data to public repositories. PLoS Biol. 2, E317.
  • Banerjee, N. and Zhang, M.Q. (2002) Functional genomics as applied to mapping transcription regulatory networks. Curr. Opin. Microbiol. 5, 313–317.
  • Bhalerao, R., Keskitalo, J., Sterky, F. et al. (2003) Gene expression in autumn leaves. Plant Physiol. 131, 430–442.
  • Brazma, A., Hingamp, P., Quackenbush, J. et al. (2001) Minimum information about a microarray experiment (MIAME) – toward standards for microarray data. Nat. Genet. 29, 365–371.
  • Brazma, A., Parkinson, H., Sarkans, U. et al. (2003) ArrayExpress – a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 31, 68–71.
  • Bylesjo, M., Eriksson, D., Sjodin, A., Sjostrom, M., Jansson, S., Antti, H. and Trygg, J. (2005) MASQOT: a method for cDNA microarray spot quality control. BMC Bioinformatics, 6, 250.
  • Chang, S., Puryear, J. and Cairney, J. (1993) A simple and efficient method for isolating RNA from pine trees. Plant Mol. Biol. Rep. 11, 113–116.
  • Churchill, G.A. (2002) Fundamentals of experimental design for cDNA microarrays. Nat. Genet. 32, 490–495.
  • Craigon, D.J., James, N., Okyere, J., Higgins, J., Jotham, J. and May, S. (2004) NASCArrays: a repository for microarray data generated by NASC's transcriptomics service. Nucleic Acids Res. 32, D575–D577.
  • Doyle, J.J. and Doyle, J.L. (1987) A rapid DNA isolation procedure for small quantities of fresh leaf tissue. Phytochem. Bull. 19, 11–15.
  • Dudoit, S. and Yang, Y.H. (2002) Bioconductor R packages for exploratory analysis and normalization of cDNA microarray data. In The Analysis of Gene Expression Data: Methods and Software (Parmigiani, G., Garett, E.S., Irizarry, R.A., Zeger, S.L. eds). New York: Springer, pp. 73–101.
  • Edgar, R., Domrachev, M. and Lash, A.E. (2002) Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30, 207–210.
  • Ewing, R.M., Ben Kahla, A., Poirot, O., Lopez, F., Audic, S. and Claverie, J.M. (1999) Large-scale statistical analyses of rice ESTs reveal correlated patterns of gene expression. Genome Res. 9, 950–959.
  • Futschik, M. and Crompton, T. (2004a) Model selection and efficiency testing for normalization of cDNA microarray data. Genome Biol. 5, R60.
  • Futschik, M.E. and Crompton, T. (2004b) OLIN: optimized normalization, visualization and quality testing of two-channel microarray data. Bioinformatics, 21, 1724–1726.
  • Gentleman, R.C., Carey, V.J., Bates, D.M. et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80.
  • Gu, J. and Gu, X. (2003) Induced gene expression in human brain after the split from chimpanzee. Trends Genet. 19, 63–65.
  • Hertzberg, M., Aspeborg, H., Schrader, J. et al. (2001) A transcriptional roadmap to wood formation. Proc. Natl Acad. Sci. USA, 98, 14732–14737.
  • Hilson, P., Allemeersch, J., Altmann, T. et al. (2004) Versatile gene-specific sequence tags for Arabidopsis functional genomics: transcript profiling and reverse genetics applications. Genome Res. 14, 2176–2189.
  • Hosack, D.A., Dennis, G. Jr, Sherman, B.T., Lane, H.C. and Lempicki, R.A. (2003) Identifying biological themes within lists of genes with EASE. Genome Biol. 4, R70.
  • Huala, E., Dickerman, A.W., Garcia-Hernandez, M. et al. (2001) The Arabidopsis Information Resource (TAIR): a comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant. Nucleic Acids Res. 29, 102–105.
  • Ihaka, R. and Gentleman, R. (1996) R: a language for data analysis and graphics. J. Comput. Graph. Stat. 5, 299–314.
  • Israelsson, M., Eriksson, M.E., Hertzberg, M., Aspeborg, H., Nilsson, P. and Moritz, T. (2003) Changes in gene expression in the wood-forming tissue of transgenic hybrid aspen with increased secondary growth. Plant Mol. Biol. 52, 893–903.
  • Kanehisa, M. and Goto, S. (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30.
  • Kennedy, G.C. and Wilson, I.W. (2004) Plant functional genomics: opportunities in microarray databases and data mining. Funct. Plant Biol. 31, 295–314.
  • Kerr, M.K. and Churchill, G.A. (2001) Statistical design and the analysis of gene expression microarray data. Genet. Res. 77, 123–128.
  • Kerr, M.K., Martin, M. and Churchill, G.A. (2000) Analysis of variance for gene expression microarray data. J. Comput. Biol. 7, 819–837.
  • Kohler, A., Delaruelle, C., Martin, D., Encelot, N. and Martin, F. (2003) The poplar root transcriptome: analysis of 7000 expressed sequence tags. FEBS Lett. 542, 37–41.
  • Kroll, T.C. and Wolfl, S. (2002) Ranking: a closer look on globalisation methods for normalisation of gene expression arrays. Nucleic Acids Res. 30, e50.
  • Lafarguette, F., Leple, J.C., Dejardin, A., Laurans, F., Costa, G., Lesage-Descauses, M.C. and Pilate, G. (2004) Poplar genes encoding fasciclin-like arabinogalactan proteins are highly expressed in tension wood. New Phytol. 164, 107–121.
  • Lockhart, D.J., Dong, H., Byrne, M.C. et al. (1996) Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. 14, 1675–1680.
  • Lonnstedt, I. and Speed, T. (2002) Replicated microarray data. Stat. Sin. 12, 31–46.
  • Mewes, H.W., Heumann, K., Kaps, A., Mayer, K., Pfeiffer, F., Stocker, S. and Frishman, D. (1999) MIPS: a database for genomes and protein sequences. Nucleic Acids Res. 27, 44–48.
  • Moreau, C., Aksenov, N., Lorenzo, M., Segerman, B., Funk, C., Nilsson, P., Jansson, S. and Tuominen, H. (2005) A genomic approach to investigate developmental cell death in woody tissues of Populus trees. Genome Biol. 6, R34.
  • Mueller, L.A., Zhang, P. and Rhee, S.Y. (2003) AraCyc: a biochemical pathway database for Arabidopsis. Plant Physiol. 132, 453–460.
  • Nanjo, T., Futamura, N., Nishiguchi, M., Igasaki, T., Shinozaki, K. and Shinohara, K. (2004) Characterization of full-length enriched expressed sequence tags of stress-treated poplar leaves. Plant Cell Physiol. 45, 1738–1748.
  • Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.
  • Penkett, C.J. and Bahler, G. (2004) Navigating public microarray databases. Comp. Funct. Genomics 5, 471–479.
  • Rishi, A.S., Munir, S., Kapur, V., Nelson, N.D. and Goyal, A. (2004) Identification and analysis of safener-inducible expressed sequence tags in Populus using a cDNA microarray. Planta, 220, 296–306.
  • Ryden, P., Andersson, H., Landfors, M., Naslund, L., Hartmanova, B., Noppa, L. and Sjostedt, A. (2006) Evaluation of microarray data normalization procedures using spike-in experiments. BMC Bioinformatics, 7, 300.
  • Saal, L.H., Troein, C., Vallon-Christersson, J., Gruvberger, S., Borg, A. and Peterson, C. (2002) BioArray Software Environment (BASE): a platform for comprehensive management and analysis of microarray data. Genome Biol. 3, SOFTWARE0003.
  • Schena, M., Shalon, D., Davis, R.W. and Brown, P.O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270, 467–470.
  • Schmid, M., Davison, T.S., Henz, S.R., Pape, U.J., Demar, M., Vingron, M., Scholkopf, B., Weigel, D. and Lohmann, J.U. (2005) A gene expression map of Arabidopsis thaliana development. Nat. Genet. 37, 501–506.
  • Schoof, H., Zaccaria, P., Gundlach, H., Lemcke, K., Rudd, S., Kolesov, G., Arnold, R., Mewes, H.W. and Mayer, K.F. (2002) MIPS Arabidopsis thaliana Database (MAtDB): an integrated biological knowledge resource based on the first complete plant genome. Nucleic Acids Res. 30, 91–93.
  • Schrader, J., Moyle, R., Bhalerao, R., Hertzberg, M., Lundeberg, J., Nilsson, P. and Bhalerao, R.P. (2004a) Cambial meristem dormancy in trees involves extensive remodelling of the transcriptome. Plant J. 40, 173–187.
  • Schrader, J., Nilsson, J., Mellerowicz, E., Berglund, A., Nilsson, P., Hertzberg, M. and Sandberg, G. (2004b) A high-resolution transcript profile across the wood-forming meristem of poplar identifies potential regulators of cambial stem cell identity. Plant Cell, 16, 2278–2292.
  • Smith, C.M., Rodriguez-Buey, M., Karlsson, J. and Campbell, M.M. (2004) The response of the poplar transcriptome to wounding and subsequent infection by a viral pathogen. New Phytol. 164, 123–136.
  • Smyth, G. (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3, No 1, Article 3.
  • Sterky, F., Regan, S., Karlsson, J. et al. (1998) Gene discovery in the wood-forming tissues of poplar: analysis of 5692 expressed sequence tags. Proc. Natl Acad. Sci. USA, 95, 13330–13335.
  • Sterky, F., Bhalerao, R.R., Unneberg, P. et al. (2004) A Populus EST resource for plant functional genomics. Proc. Natl Acad. Sci. USA, 101, 13951–13956.
  • Stoeckert, C.J. Jr, Causton, H.C. and Ball, C.A. (2002) Microarray databases: standards and ontologies. Nat. Genet. 32, S469–S473.
  • Tarca, A.L., Cooke, J.E. and Mackay, J. (2005) A robust neural networks approach for spatial and intensity dependent normalization of cDNA microarray data. Bioinformatics, 21, 2674–2683.
  • Taylor, G., Street, N.R., Tricker, P.J., Sjödin, A., Graham, L., Skogström, O., Calfapietra, C., Scarascia-Mugnozza, G. and Jansson, S. (2005) The transcriptome of Populus in elevated CO2. New Phytol. 167, 143–154.
  • Thimm, O., Blasing, O., Gibon, Y., Nagel, A., Meyer, S., Kruger, P., Selbig, J., Muller, L.A., Rhee, S.Y. and Stitt, M. (2004) MapMan: a user-driven tool to display genomics data sets onto diagrams of metabolic pathways and other biological processes. Plant J. 37, 914–939.
  • Tuskan, G., DiFazio, S., Jansson, S. et al. (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313, 1596–1604.
  • Usadel, B., Nagel, A., Thimm, O. et al. (2005) Extension of the visualization tool MapMan to allow statistical analysis of arrays, display of corresponding genes, and comparison with known responses. Plant Physiol. 138, 1195–1204.
  • Vinciotti, V., Khanin, R., D'Alimonte, D. et al. (2005) An experimental evaluation of a loop versus a reference design for two-channel microarrays. Bioinformatics, 21, 492–501.
  • Wang, J., Delabie, J., Aasheim, H., Smeland, E. and Myklebost, O. (2002) Clustering of the SOM easily reveals distinct gene expression patterns: results of a reanalysis of lymphoma study. BMC Bioinformatics, 3, 36.
  • Wilson, D.L., Buckley, M.J., Helliwell, C.A. and Wilson, I.W. (2003) New normalization methods for cDNA microarray data. Bioinformatics, 19, 1325–1332.
  • Wit, E. and McClure, J. (2004) Statistics for Microarrays: Design, Analysis and Inference. Chichester: Wiley.
  • Wit, E., Nobile, A. and Khanin, R. (2005) Near-optimal designs for dual-channel microarray studies. Appl. Stat. 54, 817–830.
  • Wold, S., Esbensen, K. and Geladi, P. (1987) Principal component analysis. Chemom. Intell. Lab. Syst. 2, 37–52.
  • Wu, R. and Stettler, R.F. (1994) Quantitative genetics of growth and development in Populus.1. A three-generation comparison of tree architecture during the first 2 years of growth. Theor. Appl. Genet. 89, 1046–1054.
  • Yang, Y.H. and Speed, T. (2002) Design issues for cDNA microarray experiments. Nat. Rev. Genet. 3, 579–588.
  • Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V., Ngai, J. and Speed, T.P. (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 30, e15.
  • Zimmermann, P., Hirsch-Hoffmann, M., Hennig, L. and Gruissem, W. (2004) GENEVESTIGATOR. Arabidopsis microarray database and analysis toolbox. Plant Physiol. 136, 2621–2632.