S. Cavallaro, MD, PhD, Istituto di Scienze Neurologiche, CNR, Viale Regina Margherita 6, 95123 Catania, Italy. E-mail: email@example.com
The characterization of the molecular mechanisms whereby our brain codes, stores and retrieves memories remains a fundamental puzzle in neuroscience. Despite the knowledge that memory storage involves gene induction, the identification and characterization of the effector genes has remained elusive. The completion of the Human Genome Project and a variety of new technologies are revolutionizing the way these mechanisms can be explored. This review will examine how a genomic approach can be used to dissect and analyze the complex dynamic interactions involved in gene regulation during learning and memory. This innovative approach is providing information on a new class of genes associated with learning and memory in health and disease and is elucidating new molecular targets and pathways whose pharmacological modulation may allow new therapeutic approaches for improving cognition.
One of the most ambitious goals of modern neuroscience is to identify the mechanisms whereby our brain codes, stores and retrieves memories. Two general forms of memory can be classified by their duration: short-term memory (STM), which is rapidly formed and can outlast training for minutes or hours, and long-term memory (LTM), which lasts from hours to days, weeks or even years. STM involves post-translational modifications of preexisting molecules that alter the efficiency of synaptic transmission. In contrast, LTM can be blocked by inhibitors of transcription or translation, indicating that it is dependent on de novo gene expression (Davis & Squire 1984; Stork & Welzl 1999). Proteins newly synthesized during memory consolidation may contribute to restructuring processes at the synapse and thereby alter the efficiency of synaptic transmission beyond the duration of STM. Revealing the dependence of LTM on protein synthesis, however, provides no information about the identity and specificity of the required proteins.
Because the quantity of a particular protein is often reflected by the abundance of its mRNA, a variety of methods have been used to look for genes that are differentially expressed during LTM, but they have described only a limited number of them (Stork & Welzl 1999). Successive screenings were therefore needed to uncover how many and which genes are involved in memory and how they interact functionally to effect memory storage. To survey the gene-based molecular mechanisms that underlie LTM, genome-wide expression analysis is being used in a variety of behavioral conditions (Cavallaro et al. 2001; Cavallaro et al. 2002; D'Agata & Cavallaro 2003; Dubnau et al. 2003; Leil et al. 2003; Luo et al. 2001; Rampon et al. 2000). This genomic approach requires the development of various laboratory protocols, as well as the development of database and software tools for efficient data collection and analysis. A basic understanding of these computational tools is therefore required for optimal experimental design and meaningful data analysis. For this reason, in the present review, we will outline the main procedures of microarray analysis (Fig. 1) before describing its application in the learning and memory field. For a more general description of microarray technology, the reader is referred to other reviews (Heller 2002; Hess et al. 2001; Noordewier & Warren 2001; Quackenbush 2001).
Expression profiling by DNA microarray technology
A DNA microarray is a grid of DNA spots, called probes, each containing a unique DNA sequence (Fig. 2). Spots contain either DNA oligomers or a longer DNA sequence designed to be complementary to a particular mRNA of interest. When a microarray is hybridized to fluorescence-tagged complementary DNAs or RNAs derived from messenger or total RNA, each spot is a target for the mRNA encoded by a gene. A laser can then excite the bound cDNAs or cRNAs and a scanner collects fluorescence intensities from each spot on the slide. The intensity of the fluorescence at each array element is proportional to the expression level of that gene in the sample. The choice of having oligomers or longer cDNA sequences yields two different microarray technologies: oligonucleotide and cDNA microarrays, respectively. What makes microarrays the most promising technology for genome-wide expression analysis is the number of DNA probes that can be placed on a single array. Microarrays already exist with probes for every gene in yeast, and others carry probes for over 40 000 human genes. This allows researchers to observe the response of whole genomes to various stimuli instead of one gene at a time.
Computational analysis of microarray data
Microarray analysis results in large amounts of data that are difficult to interpret without computational methods. The simplest analysis involves two samples, representing a test condition and a control condition, and yields a list of paired expression values, one pair for each gene. As illustrated in Fig. 2, these pairs can be represented graphically by a scatter plot, with the values of sample one plotted on the x-axis and the values of sample two plotted on the y-axis. The resulting correlation plot provides a visual image of the relationship between the two expression profiles. In this plot, genes with similar expression levels in the two samples should have points on the identity line (y = x), and genes that are expressed differentially lie at some distance from this line. Microarrays, however, do not measure expression levels directly, but rather intensity levels, as represented by the amount of fluorescent dye that was recorded by a scanner. Many other factors, such as the overall mRNA concentration of the two samples, saturation effects in the hybridization or quenching of the fluorescent dyes, can affect these intensity values. In order to correct these differences in intensity levels, the raw data can be normalized, for example, by using a normalization constant derived from housekeeping or spiked control genes.
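As an illustration of the normalization step just described, the following minimal sketch rescales the test sample by a constant derived from a set of reference (housekeeping or spiked control) genes and then expresses each gene as a log2 ratio. The gene names, the intensity values and the choice of a simple ratio of means are assumptions made for the example; they are not taken from the studies reviewed here.

```python
import math

def normalization_constant(test, control, reference_genes):
    """Ratio of mean control to mean test intensity over the reference genes."""
    n = len(reference_genes)
    test_mean = sum(test[g] for g in reference_genes) / n
    control_mean = sum(control[g] for g in reference_genes) / n
    return control_mean / test_mean

def log2_ratios(test, control, reference_genes):
    """Normalized log2(test/control) for every gene (intensities must be > 0)."""
    k = normalization_constant(test, control, reference_genes)
    return {g: math.log2(k * test[g] / control[g]) for g in test}

# Hypothetical raw intensities: the test sample is twice as bright overall.
control = {"hk1": 1000.0, "hk2": 2000.0, "geneA": 500.0, "geneB": 800.0}
test = {"hk1": 2000.0, "hk2": 4000.0, "geneA": 4000.0, "geneB": 1600.0}

ratios = log2_ratios(test, control, ["hk1", "hk2"])
for gene, value in ratios.items():
    print(gene, round(value, 2))
```

After normalization, geneB falls on the identity line (log2 ratio of 0) although its raw intensity doubled, whereas geneA remains fourfold higher in the test sample.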
Once normalized, a series of restrictions (or filters) can be applied to the data obtained. These restrictions include factors such as quality control, expression level constraints, sample-to-sample fold comparison and statistical group comparisons. The simplest way to identify interesting genes in DNA-microarray experiments is to search for those that are consistently either up- or downregulated. Relative differences in expression levels (fold changes) have been typically employed in group comparisons of gene expression. This approach, however, is somewhat arbitrary and inherently subject to high error rates because information on sample variance is not exploited. If array experiments are replicated to an extent that permits direct estimates of the variance of each individual transcript, parametric or non-parametric statistics can be applied. In these cases, however, a large number of false-positive results is expected by chance when one relies on the nominal P-value. For instance, when testing 10 000 transcripts we would expect to misidentify about 500 genes as significant (P < 0.05), even when there is no real difference in gene expression. Multiple testing corrections are therefore needed to adjust the individual P-values to account for this effect.
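The arithmetic behind this example, and two widely used corrections, can be sketched as follows. The Bonferroni and Benjamini-Hochberg procedures are shown only as common choices; the text above does not prescribe a particular correction, and the P-values are invented for illustration.

```python
def expected_false_positives(n_tests, alpha):
    """Number of tests expected to pass by chance when no gene truly differs."""
    return n_tests * alpha

def bonferroni(p_values):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Adjusted P-values controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that the adjusted
    # values never decrease as the raw P-values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(expected_false_positives(10000, 0.05))  # about 500 genes by chance alone

p_values = [0.01, 0.04, 0.03, 0.005]  # hypothetical nominal P-values
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

Bonferroni is the more conservative of the two; which correction is appropriate depends on how many false positives an experiment can tolerate.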
More complex computational methods are needed to monitor several gene expression profiles, such as those arising from time-course studies, and various clustering techniques have been applied to the identification of patterns in gene-expression data. Cluster analysis is a commonly used method to investigate and interpret gene expression data sets. By grouping together genes that have similar expression profiles, cluster analysis can be used for extraction of regulatory motifs, inference of functional annotation and classification of cell types or tissue samples.
The term clustering refers to a method that partitions a set of objects (genes) into subgroups with similar features, called clusters. These partitions have to satisfy two requirements: homogeneity within each cluster (objects that belong to the same cluster have to be as similar as possible) and heterogeneity among clusters (objects that belong to different clusters have to be as different as possible).
A clustering method generally consists of two distinct components: a distance measure (or similarity coefficient) that indicates how similar two gene expression patterns are (or more generally, two clusters), and a clustering algorithm, which sorts the data and groups genes together on the basis of their separation in expression space.
Measure of distance or similarity coefficient
Many of the advanced analysis techniques are based upon measures of gene similarity. Similarity between genes is usually based on the correlation between the expression profiles of the genes. For expression data, we can solve the problem of ‘similarity’ mathematically by defining an ‘expression vector’ for each gene that represents its location in ‘expression space’. In this way, expression data can be represented in n-dimensional expression space, where n is the number of experiments and where each gene-expression vector is represented as a single point in that data space.
In any clustering algorithm, the calculation of a ‘distance’ between any two objects is fundamental for placing them into groups. There are various methods for measuring distance, typically falling into two general classes: metric and semimetric (or similarity coefficient). Each of these takes two expression patterns and produces a number representing how similar the two genes are.
A detailed mathematical description of the distance metrics used in clustering analysis can be found in other reviews (Quackenbush 2001). The most common metric distance is the Euclidean distance, which is simply the geometric distance in the multidimensional expression space. The most commonly used semimetric distance measure in the analysis of gene expression data is Pearson's correlation coefficient r.
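The two measures can be compared on a small example. In the sketch below each gene is an expression vector with one value per experiment; the profiles are invented, and the conversion of r into a distance as 1 - r is one common convention rather than something specified in the text.

```python
import math

def euclidean_distance(x, y):
    """Geometric distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_r(x, y):
    """Pearson's correlation coefficient (undefined for constant profiles)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def pearson_distance(x, y):
    """Semimetric distance: 0 for perfectly correlated profiles, 2 for opposite."""
    return 1.0 - pearson_r(x, y)

# Hypothetical profiles over four experiments.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]   # same shape as gene_a, tenfold larger
gene_c = [4.0, 3.0, 2.0, 1.0]       # opposite trend to gene_a

print(euclidean_distance(gene_a, gene_b), pearson_distance(gene_a, gene_b))
print(euclidean_distance(gene_a, gene_c), pearson_distance(gene_a, gene_c))
```

The Euclidean distance places gene_a far from gene_b because their magnitudes differ, whereas the correlation-based distance treats them as identical in shape. This is one reason the choice of distance measure influences which genes end up in the same cluster.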
Once a means of measuring distance between genes has been chosen, clustering algorithms sort the data and group genes together on the basis of their separation in expression space. Various clustering techniques have been applied to the identification of patterns in gene-expression data. Most cluster analysis techniques are hierarchical: the resultant classification has an increasing number of nested classes and the result resembles a phylogenetic classification. Non-hierarchical clustering techniques also exist, such as k-means clustering, which simply partitions objects into different clusters without trying to specify the relationship between individual elements. Examples of hierarchical and k-means clustering are reproduced in Fig. 2.
Although cluster analysis techniques are extremely powerful, great care must be taken in applying this family of techniques. Even though the methods used are objective in the sense that the algorithms are well defined and reproducible, they are still subjective in the sense that selecting different algorithms, different normalizations, or different distance metrics will place different objects into different clusters. Furthermore, clustering unrelated data would still produce clusters, although they might not be biologically meaningful.
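To make the hierarchical case concrete, the following is a minimal sketch of agglomerative clustering: every gene starts as its own cluster and the two closest clusters are merged repeatedly. Average linkage and Euclidean distance are assumptions chosen for brevity, and the four profiles are invented; the recorded merge order is what a dendrogram would display.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the genes of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[name] for name in profiles]
    merges = []  # (cluster, cluster, distance) in the order they were joined
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((tuple(clusters[a]), tuple(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical profiles: g1 and g2 rise across experiments, g3 and g4 fall.
profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.1, 2.1, 2.9],
    "g3": [3.0, 2.0, 1.0],
    "g4": [2.9, 2.2, 1.1],
}
clusters, merges = agglomerate(profiles, n_clusters=2)
print(clusters)
```

This exhaustive search over cluster pairs is adequate for a handful of genes only; array-scale data sets call for an optimized library implementation.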
Numerical, semantic and mixed clustering. Cluster analysis is a methodology to identify groups of genes that share common expression characteristics and behaviors. It has been frequently exploited in the analysis of genome-wide expression data, as the experimental observation that a set of genes is coexpressed implies that the genes share a biological function and are under common regulatory control.
Frequently, clustering is used to group genes solely on the basis of similar expression profiles, without considering other well-known properties of the genes. Genes with different expression profiles, however, may also have similar functions, and classical clustering methodologies do not reveal this.
In order to extract knowledge from gene expression information, cluster analysis can be organized in three different approaches: numerical, semantic and mixed clustering.
The numerical clustering method is applied to the levels of gene expression. It tends to group genes with a similar expression profile in the same cluster and, as a consequence, places genes that have different profiles but similar semantic features in different clusters. These considerations suggest that simple numerical clustering algorithms are inadequate to infer the roles of genes and proteins.
In order to discover more complex relationships among genes, semantic clustering is used. The term semantic clustering indicates methods of clustering based on semantic characteristics, such as gene ontologies. When categorical domains are ordered, they can be converted into numerical values so that similar categories are mapped to nearby values. After that, methods of classical numerical clustering can be applied to that data set. When categorical domains are not ordered, however, this approach does not necessarily produce meaningful results. In this case, the k-modes algorithm can be used to remove this limitation.
Finally, more useful analyses can be performed using mixed clustering. In this case, each gene can be represented by a vector in the (n + m)-dimensional space, where n is the number of levels of gene expression and m is the number of semantic features transformed into numerical values. Then, each gene-expression-semantic vector is represented as a single point in data space and any measure of distance can be adopted to calculate the distance between any two genes. Mixed clustering tends to group genes with similar expression profiles as well as genes with similar semantic features.
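For unordered categories, the idea behind k-modes can be sketched as follows: distance is the number of attributes on which two records disagree, and each cluster is summarized by the most frequent value of every attribute (its mode) rather than by a mean. This is a simplified illustration with fixed starting modes; the annotation values are invented and the code is not the implementation used in the software described in this review.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def column_modes(rows):
    """Most frequent value of each attribute across the given records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def k_modes(records, initial_modes, max_iter=10):
    """Alternate between assigning records to the nearest mode and updating modes."""
    modes = list(initial_modes)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(modes)),
                key=lambda k: matching_dissimilarity(r, modes[k]))
            for r in records
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the modes will not change
        assignment = new_assignment
        for k in range(len(modes)):
            members = [r for r, a in zip(records, assignment) if a == k]
            if members:
                modes[k] = column_modes(members)
    return assignment, modes

# Hypothetical annotations: (process, compartment, molecular role).
records = [
    ("signaling", "membrane", "kinase"),
    ("signaling", "membrane", "receptor"),
    ("signaling", "cytoplasm", "kinase"),
    ("transcription", "nucleus", "factor"),
    ("transcription", "nucleus", "repressor"),
]
assignment, modes = k_modes(records, initial_modes=[records[0], records[3]])
print(assignment)
print(modes)
```

In practice the starting modes are usually chosen at random and the procedure is repeated several times, because the final partition depends on the initialization.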
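The construction of such a vector can be sketched in a few lines. Here n = 3 expression values are concatenated with m = 1 numerically coded semantic feature; the values and the `semantic_weight` parameter, which balances the two parts, are hypothetical and introduced only for this example.

```python
import math

def mixed_vector(expression, semantic_codes, semantic_weight=1.0):
    """Concatenate n expression values with m weighted semantic codes."""
    return list(expression) + [semantic_weight * c for c in semantic_codes]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical genes: a and b share an expression profile but differ in
# annotation; a and c share an annotation but differ in expression profile.
gene_a = mixed_vector([1.0, 2.0, 3.0], [0.1])
gene_b = mixed_vector([1.0, 2.0, 3.0], [0.9])
gene_c = mixed_vector([3.0, 2.0, 1.0], [0.1])

print(len(gene_a))                # n + m dimensions
print(euclidean(gene_a, gene_b))  # driven only by the semantic difference
print(euclidean(gene_a, gene_c))  # driven only by the expression difference
```

Because expression values and semantic codes are on different scales, the relative weight given to each part determines which kind of similarity dominates the resulting clusters.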
In order to perform semantic and mixed clustering we have developed new informatics applications (Fig. 3). Functional information is automatically retrieved by means of the software application genelink (Fig. 3a). This application is designed to retrieve genomics and proteomics information from external worldwide databases (NCBI GenBank and LocusLink). Before performing the semantic clustering, the functional annotations have to be converted into numerical values so that functionally similar features are mapped to nearby values. For each gene ontology (GO) number, for example, the software application gene ontology system (GOS) assigns a new GO number in order to encode the hierarchical relationships among GOs (Fig. 3b). The basic idea is that two related GOs must be coded with two close GO numbers. After this renumbering process, methods of classical numerical clustering can be applied to the original data set (Fig. 3c) to perform numerical (Fig. 3d), semantic (Fig. 3e), or mixed (Fig. 3f) clustering. In this way, cluster analysis can effectively extract functional information from gene expression data.
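One simple way to obtain such a numbering is a depth-first traversal of the term hierarchy, which gives each term a number close to those of its parent and siblings. The sketch below is only an illustration of the idea: it is not the algorithm implemented in GOS, it treats the ontology as a tree (the real GO is a directed acyclic graph in which a term may have several parents), and the term labels are invented.

```python
def renumber_depth_first(children, root):
    """Assign consecutive integers to terms in depth-first (preorder) order."""
    new_ids = {}
    stack = [root]
    while stack:
        term = stack.pop()
        if term in new_ids:
            continue  # already numbered (guards against repeated terms)
        new_ids[term] = len(new_ids) + 1
        # Push children in reverse so they are numbered in their listed order.
        stack.extend(reversed(children.get(term, [])))
    return new_ids

# Hypothetical toy hierarchy; these labels are not real GO identifiers.
children = {
    "root": ["signaling", "transcription"],
    "signaling": ["kinase activity", "receptor activity"],
    "transcription": ["transcription factor"],
}
new_ids = renumber_depth_first(children, "root")
for term, number in new_ids.items():
    print(number, term)
```

With this numbering the two signaling terms receive adjacent codes, while the transcription factor term lies further away, so a numerical distance computed on the codes loosely reflects relatedness in the hierarchy.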
The use of a genomic approach for studying learning and memory
The following part of this review will focus on the use of a genomic approach to dissect and analyze gene-based mechanisms underlying learning and memory. We will highlight gene expression microarray analyses performed in different behavioral paradigms (eye-blink conditioning, water- and T-maze learning and passive avoidance conditioning). Because of space limitations, we will give a broad view of the results obtained and refrain from discussing each of the genes implicated by microarray analysis.
Messenger RNA levels from cerebellar lobule HVI and hippocampus of unpaired and paired rabbits were simultaneously analyzed with high-density cDNA microarrays containing more than 8700 cDNAs (Cavallaro et al. 2001). When gene expression patterns were compared, mRNA levels of 79 and 17 genes differed more than twofold in lobule HVI and hippocampus, respectively (Figs 4b,c). These genes were operationally defined as ‘memory related genes’ (MRGs). Approximately 50% of the MRGs differentially expressed in the hippocampus were also differentially expressed in the HVI lobule, suggesting common mechanisms of memory storage in the two areas.
A majority of MRGs were downregulated, whereas only two genes that differed by a factor greater than 2 were upregulated in lobule HVI of paired animals (Fig. 4b). Because LTM can be blocked by transcription and protein synthesis inhibitors, most previous reports have focused on the identification of proteins whose expression is upregulated (Davis & Squire 1984). The preponderant reduction of gene expression during LTM therefore would not have been predicted and provides new and unexpected insights into the molecular mechanisms that underlie it. The specific role of the downregulation of MRGs following learning remains a matter of speculation. Downregulation of a gene may be the end point in a dynamic gene expression process that begins with upregulation during acquisition of the learned response. Alternatively, memory storage may require a balance of upregulation of some genes and downregulation of genes that exert inhibitory constraints on memory formation (Alberini et al. 1994). These latter genes might be termed memory suppressor genes (Abel & Kandel 1998).
A majority of the MRGs implicated have no currently recognized function and are not yet named. Complete nucleotide sequence determination, conceptual translation, expression monitoring and biochemical analysis are currently underway (D'Agata et al. 2003) and should provide a detailed functional understanding of these genes. Seventeen genes have significant similarity to known genes and can be grouped into different functional classes (Fig. 4d).
Our microarray analysis of eye blink-conditioned rabbits (Cavallaro et al. 2001) was the first reported in the literature to demonstrate the feasibility and utility of a cDNA microarray system as a means of dissecting the molecular mechanisms of associative memory. Further studies, however, were required at different time points and behavioral conditions to better understand the role of the implicated genes. To perform such studies we and others have moved to rats or mice, two species that are better suited to genomic studies than rabbits in terms of sequenced genes and available microarrays.
Microarrays have been used to analyze hippocampal gene expression in rats following training in a multiunit T-maze (Luo et al. 2001). In this study, the expression of 28 genes (18 known genes and 10 ESTs) was found to be increased in maze-trained animals compared with yoked control rats that were trained in a straight runway. Some of the known genes are involved in Ca2+ signaling, Ras activation, kinase cascades and extracellular matrix function. None of them, however, overlaps with genes identified by microarray analysis in the other experimental paradigms examined in this review. Although the aversive foot shock was applied in equal duration and frequency to both the trained and control rats, changes in gene expression could be ascribed to other differences between the two groups, such as locomotor activity. In addition, because the animals were pretrained on day 1 and T-maze-trained on days 2 and 3, the time-dependent patterns of gene regulation during acquisition and consolidation of memory are unknown.
To detect learning-related changes, microarray analysis has been used by two laboratories to characterize gene expression profiles in animals trained in the Morris water maze (Cavallaro et al. 2002; Leil et al. 2002; Leil et al. 2003). In this learning paradigm, a rodent learns to locate a submerged island in a large pool by creating a spatial map using extra-pool cues. Leil et al. (2002) used cDNA microarrays containing approximately 9000 clones to detect hippocampal gene expression changes between F1 hybrid mouse strains that perform well on the Morris water maze and inbred strains that perform poorly. Although this study was performed in a brain region intimately involved in spatial learning, genes differentially expressed in mouse strains may subserve other behavioral processes or functions. Indeed, in a later study, the same authors (Leil et al. 2003) used microarray analysis to characterize the differential expression of genes (n = 3) in the hippocampus of F1 hybrid mice after 2 days of water-maze training. Although the mouse strains used were the same, no overlap was found between the genes revealed in the two studies. In addition, no overlap was found between genes differentially expressed in water-maze training (Leil et al. 2003) and those identified in the other behavioral paradigms examined in this review. This is probably due to a number of factors, including the different genes on the arrays and the different species or strains, brain regions and time points studied. In addition, because mice are very reactive to placement in water, gene expression differences may be due to stress responses rather than learning and memory.
To analyze the time-dependent patterns of gene regulation during water-maze training, we measured hippocampal gene expression profiles in naïve, swimming control and water-maze trained rats (Fig. 5a,b), using microarrays containing more than 1200 genes relevant to neurobiology (Cavallaro et al. 2002). When gene expression profiles in naïve and swimming control animals 1, 6 and 24 h after swimming sessions were compared, 345 genes were found differentially expressed more than twofold in at least two of the four conditions (Fig. 5c). These genes, operationally defined as ‘physical activity related genes’ (PARGs), indicate that physical activity and mild stress associated with behavioral training have a significant impact on hippocampal gene expression.
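A filter of this kind can be sketched as follows. The layout of the comparisons is an assumption made for the example (each time point compared with a single naïve reference), as are the gene names and intensities; the exact comparison scheme of the original study is not reproduced here.

```python
def exceeds_fold(value, reference, threshold=2.0):
    """True if value differs from reference by more than `threshold`-fold,
    in either direction (both intensities must be positive)."""
    ratio = value / reference
    return ratio > threshold or ratio < 1.0 / threshold

def select_genes(reference, conditions, min_conditions=2, threshold=2.0):
    """Genes changed beyond the threshold in at least `min_conditions` conditions."""
    selected = []
    for gene, ref_value in reference.items():
        hits = sum(
            1 for profile in conditions.values()
            if exceeds_fold(profile[gene], ref_value, threshold)
        )
        if hits >= min_conditions:
            selected.append(gene)
    return selected

# Hypothetical normalized intensities.
naive = {"g1": 100.0, "g2": 100.0, "g3": 100.0}
swimming = {
    "1h": {"g1": 250.0, "g2": 120.0, "g3": 40.0},
    "6h": {"g1": 300.0, "g2": 90.0, "g3": 45.0},
    "24h": {"g1": 110.0, "g2": 210.0, "g3": 100.0},
}
print(select_genes(naive, swimming))
```

Here g1 (upregulated at 1 and 6 h) and g3 (downregulated at 1 and 6 h) pass the filter, whereas g2 exceeds twofold at a single time point and is excluded. Requiring a change in more than one condition is a simple guard against isolated, possibly spurious measurements.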
When gene expression levels in swimming control animals were compared with water-maze trained animals 1, 6, or 24 h after training, 140 MRGs were found (Fig. 5c). The majority of these MRGs (110 of 140) were also PARGs, i.e. influenced by physical activity. Among MRGs, 55 genes were upregulated in the hippocampus of water-maze-trained animals, whereas 91 genes were downregulated (Figs 5d,e).
Most of the MRGs, those differentially expressed between the swimming and spatial learning animal groups, were also affected during swimming alone but with entirely different temporal patterns of expression (Fig. 5f). Although learning and physical activity involve common groups of genes, learning and memory can be distinguished by unique patterns of gene expression across time.
All of the MRGs identified during water-maze learning have a recognized function and can be classified into six major groups based on their translated product: (i) cell signaling, (ii) synaptic proteins, (iii) cell–cell interaction and cytoskeletal proteins, (iv) apoptosis, (v) enzymes and (vi) transcription or translation regulation. Some of these genes have been previously related to synaptic plasticity, memory, or cognitive disorders. For a complete description of the MRGs implicated by microarray technology, the reader is referred to our previous study (Cavallaro et al. 2002). In the following paragraph we will discuss only one of the MRGs, FGF-18, which has been further tested for its memory regulatory function.
FGF-18 is a novel member of the FGF family, which was shown to stimulate neurite outgrowth (Ohbayashi et al. 1998). Although the function of this peptide is still unknown, the other members of its family are important signaling molecules in several inductive and patterning processes and act as brain organizer-derived signals during formation of the early vertebrate nervous system. Water-maze training but not physical activity induced the expression of FGF-18. To explore the effect of FGF-18 in spatial learning, we tested the effects of a single exogenous dose of FGF-18. Rats were trained in a Morris water maze for two trials and then injected intracerebroventricularly with 0.94 pmoles of FGF-18 or vehicle. As shown in Fig. 6, FGF-18 treatment improved spatial learning behavior by inducing a 49% reduction in the escape latency but no significant changes in motor activity.
The data obtained in the hippocampus of water-maze-trained rats (Cavallaro et al. 2002) represent the first temporal gene expression comparison reported in the long-term retention of learning and memory and further demonstrate the utility of a genomic approach as a means of dissecting the molecular basis of associative memory. This approach provides information on the gene expression changes that occur during physical activity, stress, learning and memory, allowing the identification of molecular targets and pathways whose modulation may generate new therapeutic approaches for facilitating learning and memory.
Passive avoidance learning
We have recently extended our genome-wide screenings to an additional behavioral animal model, a step-through passive avoidance test, known to require hippocampus-dependent learning and to depend upon transcription (Stubley-Weatherly et al. 1996). In these experiments (D'Agata & Cavallaro 2003), conditioned animals (CA) were trained to avoid moving from the lighted to the darkened section of a conditioning chamber by delivering a foot shock when they entered the darkened section. Control rats included untrained (naïve) animals, and animals exposed to the unconditioned (USTA) or the conditioned (CSTA) stimulus. To verify that the trained rats in fact learned the passive avoidance task, learning was assessed in a comparable group of animals by evaluating the latency of step-through in a retention test. Twenty-four hours after the one-trial training period, only CA had learned to associate stepping through to the darkened chamber with the foot shock (Fig. 7a).
Hippocampal gene expression profiles in CA, USTA, CSTA and naïve animals were measured 6 h after training using microarrays containing 1263 genes relevant to neurobiology (D'Agata & Cavallaro 2003). When gene expression profiles of naïve animals were compared with those of CSTA or USTA, 46 and 60 genes, respectively, were found differentially expressed (Fig. 7b). These genes further demonstrate that physical activity and mild stress associated with behavioral training have a significant impact on hippocampal gene expression.
When gene expression levels in naïve animals were compared with CA, 38 MRGs were found (Fig. 7b). Among these, 21 genes were downregulated and 17 genes were upregulated. Some of these MRGs (21/38) were also differentially expressed in CSTA (16) and USTA (16) (Fig. 7b).
A hierarchical clustering method was used to group MRGs on the basis of similarity in their expression patterns (Fig. 7c). The most evident trait of the clustered data was that MRGs showed entirely different patterns of expression in CA vs. CSTA or USTA. Genes segregating into nine major branches of the dendrogram were assigned to nine clusters (Fig. 7c). Clusters 1–4 represent those genes that were downregulated, whereas clusters 5–9 include those that were upregulated in CA. Some of the MRGs, those differentially expressed between naïve and CA, were also affected by exposing the rats to the conditioned or the unconditioned stimulus alone, whereas others were uniquely induced when the two were associated and the animals were conditioned (Fig. 7c, clusters 2 and 8). Expression changes of MRGs in CSTA or USTA had different magnitudes or, more often, opposite trends than in CA (Fig. 7c, clusters 1, 2, 3, 5, 8 and 9). As we have previously observed in water-maze-trained animals, learning, physical activity and mild stress associated with behavioral training involve common groups of genes. Their behavior in learning and memory, however, could be distinguished by unique patterns of gene expression, as shown in the clustered data.
All of the MRGs identified have a recognized function and can be classified into different functional classes based on their translated product (Fig. 7c). Some of these genes have been previously related to synaptic plasticity, memory or cognitive disorders. Six of 38 MRGs found in the hippocampus of rats after passive avoidance training (Fig. 7c, shown in bold) were also differentially expressed in the same brain area following water-maze learning (Fig. 5f), suggesting common mechanisms of memory storage in different behavioral paradigms. For a complete description of the MRGs implicated by microarray technology during passive avoidance conditioning, the reader is referred to our previous study (D'Agata & Cavallaro 2003).
These ‘early’ studies are limited by the experimental design (animal strain, behavioral condition and time), technology (microarray platforms and number of genes) and computational analysis (normalization, filtering and statistical analysis) used. The value of these experiments will progressively increase as more is learned about the function of each gene and as software applications, such as those presented in this paper, enable us to identify the complex correlations that exist between the genomic profiles obtained by microarray experiments and functional information (Fig. 8).
Although sure to be just the tip of the iceberg, the results already obtained point toward genes or sets of genes that may play critical roles in learning and memory. The discovery of these genes represents the key to developing novel and efficacious therapies to improve learning and memory, under normal conditions as well as in disorders that affect cognitive functioning, such as Alzheimer's disease.
We gratefully acknowledge Alfia Corsino, Maria Patrizia D'Angelo and Francesco Marino for their administrative and technical support. This work was partly sponsored by grants from the Italian Ministry of Health and the Italian Ministry of Education, University and Research to SC.