A Modified Polymerase Chain Reaction-Long Serial Analysis of Gene Expression Protocol Identifies Novel Transcripts in Human CD34+ Bone Marrow Cells


  • Yun Zhao,

    1. Terry Fox Laboratory, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
  • Afshin Raouf,

    1. Terry Fox Laboratory, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
  • David Kent,

    1. Terry Fox Laboratory, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
    2. Genetics Program, University of British Columbia, Vancouver, British Columbia, Canada
  • Jaswinder Khattra,

    1. Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
  • Allen Delaney,

    1. Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
  • Angelique Schnerch,

    1. Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
    2. Genetics Program, University of British Columbia, Vancouver, British Columbia, Canada
  • Jennifer Asano,

    1. Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
  • Helen McDonald,

    1. Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
  • Christina Chan,

    1. Terry Fox Laboratory, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
  • Steven Jones,

    1. Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
    2. Departments of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
  • Marco A. Marra,

    1. Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
    2. Departments of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
  • Connie J. Eaves Ph.D.

    Corresponding author
    1. Terry Fox Laboratory, British Columbia Cancer Agency, Vancouver, British Columbia, Canada
    2. Departments of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
    3. Medicine, University of British Columbia, Vancouver, British Columbia, Canada
    4. Experimental Pathology and Laboratory Medicine, University of British Columbia, Vancouver, British Columbia, Canada
    • Terry Fox Laboratory, 675 West 10th Avenue, Vancouver, BC, Canada V5Z 1L3. Telephone: 604-675-8122; Fax: 604-877-0712


Transcriptome profiling offers a powerful approach to investigating developmental processes. Long serial analysis of gene expression (LongSAGE) is particularly attractive for this purpose because of its inherent quantitative features and independence of both hybridization variables and prior knowledge of transcript identity. Here, we describe the validation and initial application of a modified protocol for amplifying cDNA preparations from <10 ng of RNA (<10³ cells) to allow representative LongSAGE libraries to be constructed from rare stem cell-enriched populations. Quantitative real-time polymerase chain reaction (Q-RT-PCR) analyses and comparison of tag frequencies in replicate LongSAGE libraries produced from amplified and nonamplified cDNA preparations demonstrated preservation of the relative levels of different transcripts originally present at widely varying levels. This PCR-LongSAGE protocol was then used to obtain a 200,000-tag library from the CD34+ subset of normal adult human bone marrow cells. Analysis of this library revealed many anticipated transcripts, as well as transcripts not previously known to be present in CD34+ hematopoietic cells. The latter included numerous novel tags that mapped to unique and conserved sites in the human genome but had not previously been identified as transcribed elements in human cells. Q-RT-PCR was used to demonstrate that 10 of these novel tags were expressed in cDNA pools and present in extracts of other sources of normal human CD34+ hematopoietic cells. These findings illustrate the power of LongSAGE to identify new transcripts in stem cell-enriched populations and indicate the potential of this approach to be extended to other sources of rare cells.

Disclosure of potential conflicts of interest is found at the end of this article.


Genome-wide expression profiling has become an important tool for analyzing cell behavior and has been particularly useful for identifying molecular events associated with early developmental decisions and disease pathogenesis. Two technologies are now commonly used for comparing or characterizing the complete transcriptome of specific cell populations: hybridization-based arrays [1] and serial analysis of gene expression (SAGE) [2]. In the first of these procedures, known transcript sequences or expressed sequence tags (ESTs) present on a solid phase surface were originally used to capture reverse-transcribed DNA copies of extracted cellular mRNA. The extent of hybridization achieved by competing cDNAs prepared from two cell sources was then determined to allow a comprehensive survey of differences in the gene expression profiles of the two cell populations being compared. The subsequent substitution of annotated oligonucleotides as capture probes has further improved consistency and signal detection.

SAGE involves the construction of large libraries of tags (typically 10 or 17 nucleotides long) that have been reverse-transcribed from the 3′ end of mRNAs present in the sample. The tags are then sequenced, and bioinformatics methods are used to derive transcript identities. Transcript levels can then be inferred directly from tag frequencies, bypassing any need for comparison to a reference cDNA preparation. As a result, each SAGE library becomes a permanent digital data resource accessible for repeated interrogation. The fact that SAGE does not require prior knowledge of the transcripts being surveyed also makes it useful for gene discovery. SAGE has thus become a particularly attractive technology for studies of cellular transcriptomes from organisms for which comprehensive genomic sequence information is available. Nevertheless, a major limitation of the original SAGE methodology has been the need for relatively large quantities of starting RNA (originally 5 μg, the amount typically obtained from approximately 10⁶ cells [2]). Subsequent modifications to decrease the amount of starting material needed (microSAGE [3], amplified antisense RNA-LongSAGE [4], small amplified RNA-SAGE [5], SAGE-lite [6], and polymerase chain reaction [PCR]-SAGE [7]) have now made it possible for either SAGE or LongSAGE libraries to be generated from much smaller amounts of RNA (down to 40 ng). However, these are still not readily applicable to isolates containing fewer than 10⁴ cells. Because of the very low frequency of normal or malignant stem cells in many primary tissues, this limitation still hampers the use of any SAGE approach for characterizing a variety of stem cell populations.

Here, we describe a method that adapts recent technology for amplifying cDNAs from a few nanograms of total cellular RNA [8, 9] in a fashion that meets the requirements for SAGE library construction, minimizes the generation of ambiguous tags, and preserves the initial transcript representation. Using this approach, we have created the first LongSAGE library reported to date from the CD34+ subset of normal adult human bone marrow cells. Analysis of the tags obtained indicates the capture of many expected transcripts, as well as a number of transcripts not previously known to exist.

Materials and Methods


Normal human cord blood cells were obtained, with consent, from anonymized discarded placentas, and the low-density (<1.077 g/cm³) fraction of cells isolated by centrifugation on Ficoll-Hypaque (Pharmacia, Calgary, AB, Canada, http://www.pfizer.ca) was then cryopreserved. Samples were thawed, and the CD34+ cells were separated immunomagnetically using a CD34+ cell positive selection kit (EasySep; Stem Cell Technologies, Vancouver, BC, Canada, http://www.stemcell.com). The cells were then stained with a phycoerythrin-conjugated anti-human CD34 antibody (8G12; BD Biosciences [BD], San Jose, CA, http://www.bdbiosciences.com) and propidium iodide (PI) (Sigma-Aldrich, St. Louis, http://www.sigmaaldrich.com), and a population of viable (PI−) CD34+ cells was obtained using a FACSVantage machine (BD). Aliquots of 100, 500, 10³, and 10⁵ viable CD34+ cells were collected directly into vials containing 100 μl of RNA extraction buffer from the PicoPure RNA extraction kit (Arcturus, Mountain View, CA, http://www.arctur.com). Cryopreserved normal adult human bone marrow cells obtained with informed consent were provided by the Northwest Tissue Center (Seattle). After thawing, human cells expressing lineage (lin) markers of mature blood cells (CD2, CD3, CD14, CD16, CD19, CD24, CD56, CD66b, and glycophorin A) were removed immunomagnetically using a column (StemSep; Stem Cell Technologies) as recommended by the manufacturer and cryopreserved. The lin− cells were thawed at 37°C and incubated in 50% fetal calf serum in Hanks' balanced salt solution overnight at 4°C to minimize effects of freezing and thawing on the levels of different mRNAs present. This was established by comparing the levels of transcripts for 11 variably expressed genes using quantitative real-time (Q-RT)-PCR. 
We also found that the gentle thawing process adopted for previously cryopreserved cells did not perturb the differentiation capabilities of these cells as determined by colony-forming cell (CFC) assays (data not shown). Thawed cells were then stained with allophycocyanin-conjugated anti-human CD34 antibody (8G12; BD), fluorescein isothiocyanate-conjugated lineage marker antibodies, and PI. Viable (PI−) lin−CD34+ cells were then isolated using a FACSVantage flow cytometer. The purity of the sorted cells was determined to be >98% as assessed by a second fluorescence-activated cell sorting (FACS) analysis of an aliquot of the sorted cells. Total RNA extracts were prepared from viable lin−CD34+ cells isolated by FACS in the same manner as described for cord blood cells.

Hematopoietic Progenitor Cell Assays

CFC assays were performed by plating human lin−CD34+ bone marrow cells at 800 cells per milliliter in serum-containing methylcellulose medium (Methocult 4230; Stem Cell Technologies) supplemented with 3 U/ml erythropoietin (Stem Cell Technologies), 50 ng/ml Steel factor (SF) (prepared and purified in the Terry Fox Laboratory), 20 ng/ml each of interleukin-3 (IL-3) and granulocyte-macrophage colony-stimulating factor (both from Novartis International, Basel, Switzerland, http://www.novartis.com), 20 ng/ml granulocyte colony-stimulating factor (G-CSF) (Stem Cell Technologies), and 20 ng/ml IL-6 (Cangene Corp., Mississauga, ON, Canada, http://www.cangene.com) [10]. Long-term culture-initiating cell (LTC-IC) assays were performed by culturing 2 × 10⁴ lin−CD34+ bone marrow cells in 2 ml of myeloid LTC medium (Myelocult; Stem Cell Technologies) supplemented with 10⁻⁶ mol/l hydrocortisone sodium hemisuccinate (Sigma-Aldrich) for 6 weeks on pre-established, irradiated feeder layers of mouse fibroblasts genetically engineered to produce human SF, G-CSF, and IL-3. At the end of this time, the number of CFCs present was determined, and the number of input LTC-IC calculated assuming an average 6-week output of 18 CFCs per LTC-IC [10].
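
The LTC-IC calculation described above is simple arithmetic and can be sketched as follows. This is a minimal illustration rather than the authors' analysis code: the function names and the example CFC count of 1,080 are hypothetical, whereas the divisor of 18 CFCs per LTC-IC and the input of 2 × 10⁴ cells come from the text.

```python
def estimate_ltc_ic(total_cfc_output, cfc_per_ltc_ic=18.0):
    """Estimate the number of input LTC-ICs from the 6-week CFC output."""
    if cfc_per_ltc_ic <= 0:
        raise ValueError("cfc_per_ltc_ic must be positive")
    return total_cfc_output / cfc_per_ltc_ic


def ltc_ic_frequency(total_cfc_output, cells_plated, cfc_per_ltc_ic=18.0):
    """Express the LTC-IC estimate as a fraction of the cells plated."""
    if cells_plated <= 0:
        raise ValueError("cells_plated must be positive")
    return estimate_ltc_ic(total_cfc_output, cfc_per_ltc_ic) / cells_plated


# Hypothetical example: 1,080 CFCs recovered from a culture of 2 x 10^4 cells.
example_cfc_output = 1080
example_cells_plated = 2e4
print(estimate_ltc_ic(example_cfc_output))
print(f"{ltc_ic_frequency(example_cfc_output, example_cells_plated):.2%}")
```

The default of 18 CFCs per LTC-IC is exposed as a parameter because it is an assumed average and could differ for other culture conditions.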

RNA Isolation and cDNA Preparation and Amplification

An RNA extract prepared from undifferentiated H9 human embryonic stem cells was kindly provided by Dr. J. Thomson (University of Wisconsin, Madison, WI). RNA extracts were also prepared separately from 100, 500, 10³, or 10⁵ FACS-purified human CD34+ cord blood cells using the PicoPure RNA extraction kit. To minimize contamination with genomic DNA, RNA isolates were treated with DNaseI (Amplification Grade; Invitrogen, Burlington, ON, Canada, http://www.invitrogen.com) according to the manufacturer's protocol. To quantify the extent of genomic contamination in the final purified cDNA used for SAGE library construction, we used intron-specific primers to amplify sequences for two genes on different chromosomes: forward primer 5′-CCCCATGAGTCAGGTCGG-3′ and reverse primer 5′-CCCAGACTGCATCTCAGCCA-3′ for the DGCR8 gene (22q11.2), and forward primer 5′-AGTTTCTCCTCTCTCCTCCCAAG-3′ and reverse primer 5′-TCACTTCACTTCATTTTCACTTCTC-3′ for the ATP11A gene (13q34), by quantitative PCR (Q-PCR). The results obtained with both pairs of primers showed that genomic DNA accounted for <0.1% of the cDNA sample. RNAs were reverse-transcribed, and the cDNAs obtained were amplified using the switching mechanism at the 5′ end of RNA transcripts (SMART) cDNA synthesis kit (catalog number 635000; Clontech, Mountain View, CA, http://www.clontech.com) following the manufacturer's protocol but using modified template switching (TS) and cDNA amplification primers, as detailed below (Fig. 1A). The first-strand cDNA was synthesized with an oligo(dT) primer (5′-AAG CAG TGG TAA CAA CGC AGG CTA CTT TTT TTT TTT TTT TTT TTT TTT TTT TTT TVN-3′, where V denotes A, C, or G and N denotes A, C, G, or T) and the PowerScript reverse transcriptase provided in the kit, in the presence of the modified TS primer. The TS primer was modified by introducing a sequence containing an AscI digestion site (5′-AAG CAG TGG TAA CAA CGC AGG CGC GCC GGG-3′ [the AscI site is GG CGC GCC]). 
The first-strand cDNA was purified with a NucleoSpin column and then amplified using a modified PCR primer that contained a biotin molecule at its 5′ end (5′-biotin-AAG CAG TGG TAA CAA CGC AGG C-3′) and the Advantage II PCR Kit (Clontech). The biotinylated 5′ ends of the amplified cDNAs were then removed by digestion of the initial amplified product with AscI (New England Biolabs, Beverly, MA, http://www.neb.com). The cDNA was purified on a Chroma-Spin 200 Column (Clontech), and its concentration was determined using a spectrophotometer (GeneQuant Pro; Biochrom, Cambridge, U.K., http://www.biochrom.co.uk).
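
The relationship among the three primers can be checked with a few lines of code. The sketch below uses only the primer sequences and the AscI recognition sequence (GGCGCGCC) quoted in the text; the helper names are hypothetical. It illustrates that the amplification primer is a prefix of both the oligo(dT) and TS primers, so a single primer can prime both ends of the cDNA, whereas the AscI site occurs only in the TS primer, so digestion is expected to remove the biotinylated primer from that end alone.

```python
# Primer sequences copied from the text, with spaces removed.
TS_PRIMER = "AAG CAG TGG TAA CAA CGC AGG CGC GCC GGG".replace(" ", "")
PCR_PRIMER = "AAG CAG TGG TAA CAA CGC AGG C".replace(" ", "")
OLIGO_DT_PRIMER = (
    "AAG CAG TGG TAA CAA CGC AGG CTA CTT TTT TTT TTT TTT "
    "TTT TTT TTT TTT TTT TVN"
).replace(" ", "")
ASCI_SITE = "GGCGCGCC"


def find_site(primer, site=ASCI_SITE):
    """Return the 0-based start of `site` in `primer`, or -1 if absent."""
    return primer.find(site)


def shares_amplification_prefix(primer, pcr_primer=PCR_PRIMER):
    """True if `primer` begins with the amplification primer sequence."""
    return primer.startswith(pcr_primer)


for name, seq in [("TS", TS_PRIMER), ("oligo(dT)", OLIGO_DT_PRIMER)]:
    print(name, "AscI site at:", find_site(seq),
          "| shares PCR primer prefix:", shares_amplification_prefix(seq))
```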

Figure 1.

cDNA amplification protocol for PCR-serial analysis of gene expression (PCR-SAGE) library generation. (A): Incorporation of a template switching primer containing an AscI sequence allows an end-to-end amplification of the first-strand cDNA using a single biotinylated oligonucleotide primer and then subsequent removal of the 3′ biotin via AscI digestion. The 5′ end of the double-stranded cDNA is then available for capture on streptavidin-coated beads for SAGE library construction. (B): Rationale for choosing AscI to eliminate the 3′-biotinylated end of amplified cDNAs based on the low frequency of its recognition sequences in human cDNAs. (C): The amplified cDNA smear before and after AscI digestion. Amplified cDNA prepared from 10 ng of RNA was subjected to digestion with AscI restriction endonuclease for 1 hour and size-fractionated on an ethidium bromide-stained agarose gel in parallel with undigested amplified cDNA sample. The results demonstrate that digestion with AscI does not perturb the overall distribution of the amplified cDNA fragment size. Abbreviations: bp, base pairs; PCR, polymerase chain reaction; RT, reverse transcription.

SAGE Library Construction

For the PCR-LongSAGE libraries, the amplified cDNA was first digested with NlaIII and incubated with streptavidin beads (M-280; Invitrogen); the immobilized, truncated cDNAs were then linked to two different adaptors, and LongSAGE libraries were constructed using the I-SAGE kit (Invitrogen) following the manufacturer's protocol. The I-SAGE kit was also used to construct a SAGE-lite library from 400 ng of PCR-amplified cDNA (22 cycles) obtained using a methodology described before [6] that also uses SMART cDNA technology.
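
To make the origin of a LongSAGE tag concrete, the following sketch derives the expected tag from a transcript sequence in silico. It does not reproduce the I-SAGE kit chemistry. It assumes that the tag is the 17 nucleotides immediately 3′ of the 3′-most NlaIII site (CATG) on the sense strand; the function names and the toy sequence are hypothetical.

```python
from collections import Counter

NLAIII_SITE = "CATG"   # NlaIII recognition sequence (anchoring enzyme)
LONGSAGE_TAG_LENGTH = 17  # tag length, not counting the CATG itself


def extract_longsage_tag(transcript, tag_length=LONGSAGE_TAG_LENGTH):
    """Return the tag 3' of the last CATG, or None if no full tag exists."""
    seq = transcript.upper()
    site_start = seq.rfind(NLAIII_SITE)
    if site_start == -1:
        return None  # no NlaIII site, so no tag can be generated
    tag_start = site_start + len(NLAIII_SITE)
    tag = seq[tag_start:tag_start + tag_length]
    return tag if len(tag) == tag_length else None


def count_tags(transcripts):
    """Tally the predicted tags over an iterable of transcript sequences."""
    tags = (extract_longsage_tag(t) for t in transcripts)
    return Counter(tag for tag in tags if tag is not None)


# Toy transcript (hypothetical): two NlaIII sites followed by a poly(A) tail.
toy = "AAACATGCCC" + "CATG" + "ACGTACGTACGTACGTA" + "GGG" + "A" * 20
print(extract_longsage_tag(toy))
```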


RNA was reverse-transcribed with SuperScriptII (Invitrogen) to generate first-strand cDNA for use as the template for Q-RT-PCR analysis of transcript levels in nonamplified RNA preparations. Q-RT-PCR was performed using SYBR Green PCR MasterMix (Applied Biosystems, Foster City, CA, http://www.appliedbiosystems.com) and an iCycler PCR machine (Bio-Rad, Hercules, CA, http://www.bio-rad.com). After an initial denaturation step at 94°C for 5 minutes, 50 cycles of a three-step PCR with a single fluorescence measurement were undertaken (94°C for 15 seconds, 60°C for 20 seconds, and 72°C for 30 seconds). The PCR products were also subjected to melting curve analysis for verification of single amplicons and absence of primer dimers. Q-RT-PCR and data analysis were performed on an iCycler iQ system, using iCycler iQ real-time detection software (Bio-Rad). The primers used are shown in supplemental online Table 1. Q-PCR assays were used to confirm the expression of the unique tags identified by bioinformatics analysis of the lin−CD34+ human adult bone marrow LongSAGE library. For this purpose, RNA was extracted from lin−CD34+ cells isolated from human bone marrow samples from three different normal adult donors; one of these samples was the same as that used for construction of the PCR-LongSAGE library. cDNA was generated as described, and as a negative control, the same amount of RNA was processed without adding reverse transcriptase. Primers for detecting novel transcripts were selected from the human genome (Human BLAT Search, http://www.genome.ucsc.edu/cgi-bin/hgBlat) flanking 5′ and 3′ regions of the identified unique tags in such a way that the amplicons would include the unique tag sequences (supplemental online Table 2).

Bioinformatics and Statistical Methods

DiscoverySpace software (http://www.bcgsc.ca/platform/bioinfo/software/ds) was used to determine the similarity of different pairs of LongSAGE libraries using Audic-Claverie statistics [11] and for tag-to-gene mapping using the RefSeq database (build 35, August 26, 2004; http://www.ncbi.nlm.nih.gov/RefSeq). Pearson correlation coefficients were calculated using the regression program from the !STAT package [12], and hierarchical clustering was performed using Phylip software [13; http://www.med.nyu/rcr/phylip.main.html].

Results


Development of a cDNA Amplification Protocol Suitable for Constructing LongSAGE Libraries

To allow LongSAGE libraries to be constructed from highly PCR-amplified preparations of 3′ cDNAs without major distortion of the original transcript representation, we used the SMART technology developed by Clontech [9] and also used in the SAGE-lite protocol [6] with two modifications. The original technology makes use of a TS primer containing a short poly-guanine sequence at its 3′ end for the first-strand cDNA synthesis step. First, we modified the cDNA amplification primer so that it contained a biotin molecule at the 5′ end. Second, we modified the TS primer by introducing an eight-base (GGCGCGCC) AscI restriction endonuclease recognition sequence into its 3′ end (Fig. 1A). These modifications allowed the biotinylated primers incorporated into the 3′ ends of the cDNA products to later be removed to yield a final product in which the cDNAs were biotinylated exclusively at their 5′ ends, as required for SAGE library construction (Fig. 1A). This approach is a variation of the previously described introduction of a seven-base SapI site for the same purpose [7]. However, from in silico analyses, we found that 24% of Ensembl transcripts contain at least one SapI site, which could result in a potential loss of >600 tag types following SapI digestion. In contrast, only 3% of Ensembl transcripts contain one or more AscI restriction sites, and only 80 contain an AscI site between the first NlaIII site 5′ of the poly(A) tail and the poly(A) tail itself (Fig. 1B). Consistent with the expectation of a minimal loss of tags after digestion with AscI (at 37°C for 1 hour), we found that there was no detectable change in the size distribution of the amplified cDNAs when they were analyzed electrophoretically (Fig. 1C).
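
The kind of in silico survey described above can be sketched as follows. This is an illustrative outline, not the pipeline used for the Ensembl analysis: the function names and toy sequences are hypothetical, and the SapI recognition sequence (GCTCTTC) is assumed from standard enzyme references rather than taken from the text. Because SapI is non-palindromic, both strands are searched.

```python
ASCI_SITE = "GGCGCGCC"  # eight-base palindromic site quoted in the text
SAPI_SITE = "GCTCTTC"   # assumed seven-base SapI site (non-palindromic)
NLAIII_SITE = "CATG"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_site(seq, site):
    """True if `site` occurs on either strand of `seq`."""
    seq = seq.upper()
    return site in seq or reverse_complement(site) in seq


def site_in_tag_region(seq, site):
    """True if `site` lies between the 3'-most NlaIII site and the 3' end."""
    seq = seq.upper()
    anchor = seq.rfind(NLAIII_SITE)
    if anchor == -1:
        return False
    return has_site(seq[anchor:], site)


def fraction_with_site(transcripts, site):
    """Fraction of transcripts containing `site` at least once."""
    transcripts = list(transcripts)
    if not transcripts:
        return 0.0
    return sum(has_site(t, site) for t in transcripts) / len(transcripts)


# Toy sequences (hypothetical) standing in for a transcript set.
toy_transcripts = [
    "AAACATGTTT" + ASCI_SITE + "TTTAAAA",   # AscI site 3' of the last CATG
    ASCI_SITE + "AAACATGTTTTAAAA",          # AscI site 5' of the last CATG
    "TTGAAGAGCTTCATGAAAA",                  # SapI site on the reverse strand
]
print(fraction_with_site(toy_transcripts, ASCI_SITE))
print(fraction_with_site(toy_transcripts, SAPI_SITE))
```

Only sites falling in the region checked by `site_in_tag_region` would be expected to cost a tag, which is why the count of such transcripts is the more relevant figure than the overall fraction.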

To determine the number of cycles of amplification to use, we generated cDNA samples independently from three separate 10-ng aliquots of RNA extracted from undifferentiated human H9 embryonic stem cells (http://www.transcriptomES.org) and then examined the electrophoretically separated products obtained after 18–24 cycles of amplification. The results showed that the PCR amplification reaction had not yet reached a plateau after 21 cycles, by which time there was already sufficient product to construct a SAGE library (supplemental online Fig. 1A). This result was also validated by Q-RT-PCR analyses (supplemental online Fig. 1B).

Evidence of the reproducibility of the cDNA amplification protocol and its ability to preserve relative transcript levels in amplified cDNA products was obtained from separate Q-RT-PCR measurements of the levels of six differentially expressed mRNAs in the H9 cell extract described above on samples taken before and after three independent amplifications of the starting cDNA pool (Fig. 2).

Figure 2.

Real-time polymerase chain reaction (PCR) of replicate amplified cDNA samples. Three 10-ng aliquots of RNA extracted from a single pool of undifferentiated H9 human ESCs were independently amplified using a 21-cycle PCR step. Levels of seven transcripts (ACTB, GAPD, BLP1, SAFB2, CCND1, ABCG2, and RPS4X) were quantified by real-time PCR, and the values were normalized to the levels of ACTB transcripts measured in the same preparations. Q-RT-PCR data were also obtained from an initial aliquot of 100 ng of the same RNA after reverse transcription but with no amplification. Values shown are the mean ± SEM. The sequences of the primers used are shown in supplemental online Table 1. Abbreviation: Ct, cycle threshold.

We next asked what would be the minimum number of normal adult human cells from which a suitable amplified cDNA product could be obtained to allow construction of a 200,000-tag LongSAGE library. To address this question, we used a combination of immunomagnetic cell separation and multiparameter FACS procedures to isolate CD34+ cells from a pool of cells from three normal human cord blood harvests (Fig. 3A). cDNA products were prepared from separately collected aliquots of 100, 500, 10³, and 10⁵ of the CD34+ cells isolated, and they were then amplified or not (10⁵-cell samples). Figures 3B and 3C show comparisons of the levels of 10 transcripts quantified in these extracts by Q-RT-PCR before and after amplification. All 10 transcript species were detected in the amplified cDNA products obtained from as few as 500 cells, and their levels were highly correlated with those measured in the nonamplified material (R = 0.83 ± 0.04; Fig. 3B). In addition, the RNA extracted from the 500-cell sample yielded more than 400 ng of amplified cDNA, which is more than enough to build a one million-tag LongSAGE library using the I-SAGE protocol (Invitrogen). The cDNA products generated from 100 CD34+ cord blood cells also showed a significant correlation between the levels of the more prevalent transcript species before and after their amplification, although some of the rarer transcript species were not detectable in the amplified products generated in this case (data not shown).

Figure 3.

Validation of the applicability of the cDNA amplification procedure to small cell numbers. (A): Fluorescence-activated cell sorting (FACS) profile showing the immunomagnetically enriched CD34+ low-density human cord blood cells from which the final CD34+ cells used in this study were isolated by FACS. (B): cDNA products were generated individually from three replicate aliquots of 500 CD34+ cells sorted directly into RNA lysis buffer and then subjected individually to our modified cDNA amplification procedure. The levels of 11 transcripts (GAPD, ACTB, ABCG2, CCND2, ABCB1, CCND1, BCR, ABL1, PIAS4, CD34, and SELL) in each of the amplified cDNA samples were then quantified by quantitative real-time polymerase chain reaction and normalized to the levels of ACTB cDNA in the same samples. (The sequences of the primers used are shown in supplemental online Table 1.) Values shown are the mean ± SEM. Pearson correlation coefficients and the best line fit to the data derived by least squares analysis are shown. (C): Similar analysis of RNA from replicate samples of 10³ CD34+ cells. Abbreviations: Ct, cycle threshold; FSC, forward light scattering; PE, phycoerythrin.

Comparison of Replicate LongSAGE Libraries Prepared from Amplified and Nonamplified cDNAs

We then compared the complete tag profiles from LongSAGE libraries constructed from amplified and nonamplified cDNAs derived from the same original RNA extract. For this analysis, two of the independently amplified H9 cDNA preparations analyzed in Figure 2 were used to prepare replicate libraries. The two PCR-LongSAGE libraries were sequenced to depths of 57,470 (library A) and 112,517 (library B) total tags (all analyses performed using http://www.transcriptomES.org). To minimize effects due to poor-quality tags, we applied sequence quality cut-offs of 95.0% and 99.9% to the nonsingleton and singleton tags, respectively. This reduced the number of tags in the two PCR-LongSAGE libraries to 46,241 (library A) and 83,557 (library B). The library prepared from nonamplified material was a 467,522-tag library constructed from 20 μg of RNA using the standard I-SAGE protocol. Also included in this analysis was a 60,492-tag SAGE-lite library prepared from a 100-ng aliquot of the same RNA extract. All four libraries showed the expected predominance of low-abundance tags and, in this respect, were indistinguishable from one another (data not shown). They also contained readily detectable frequencies of tags unique to transcripts of known relevance to undifferentiated human embryonic stem cells (supplemental online Table 3) [14].
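
The two-tier quality filter described above can be illustrated with a short sketch. The text does not specify how per-tag sequence quality was computed, so the code assumes, purely for illustration, that a tag's quality is the product of the per-base probabilities of a correct call derived from Phred scores. The function names and example scores are hypothetical; only the 95.0% and 99.9% cut-offs come from the text.

```python
from collections import Counter
from math import prod


def tag_quality(phred_scores):
    """Probability that every base in the tag was called correctly."""
    return prod(1.0 - 10 ** (-q / 10) for q in phred_scores)


def filter_tags(observations, nonsingleton_cutoff=0.95, singleton_cutoff=0.999):
    """Keep tag observations that pass the cut-off for their class.

    `observations` is a list of (tag_sequence, phred_scores) pairs. A tag
    sequence seen only once in the raw library is treated as a singleton
    and must meet the stricter cut-off.
    """
    raw_counts = Counter(seq for seq, _ in observations)
    kept = Counter()
    for seq, scores in observations:
        is_singleton = raw_counts[seq] == 1
        cutoff = singleton_cutoff if is_singleton else nonsingleton_cutoff
        if tag_quality(scores) >= cutoff:
            kept[seq] += 1
    return kept


# Hypothetical observations: (17-nt tag, per-base Phred scores).
example = [
    ("A" * 17, [30] * 17),
    ("A" * 17, [30] * 17),
    ("A" * 17, [20] * 17),  # low quality, fails the 95% cut-off
    ("C" * 17, [30] * 17),  # singleton, fails the 99.9% cut-off
    ("G" * 17, [50] * 17),  # singleton, passes the 99.9% cut-off
]
print(dict(filter_tags(example)))
```

A stricter threshold for singletons reflects the fact that a tag seen once is more likely than a repeatedly observed tag to be a sequencing error.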

We then used DiscoverySpace software to compare the tag representation in these four libraries on a pairwise basis. This software uses Audic-Claverie statistics [11] to allow the tag composition of SAGE libraries to be compared independent of library size. This analysis showed the two replicate PCR-LongSAGE libraries to be 98% similar to one another using a 95% confidence interval, that is, only 2% of tag types were present at significantly different levels (p < .05) in one of the two PCR-LongSAGE libraries (Fig. 4A). Comparison of each of these libraries to the conventional LongSAGE library prepared from nonamplified material gave corresponding similarity values of 95% (for PCR-LongSAGE library B; Fig. 4B) and 84% for the PCR-LongSAGE library A (data not shown). Values for parallel similarity comparisons with the SAGE-lite library were 96% (library B; Fig. 4C) and 97% (library A; data not shown), and the value for comparison of the LongSAGE library with the SAGE-lite library was 97% (data not shown). In fact, only seven tags were consistently over- or under-represented in both of the PCR-LongSAGE libraries compared with the tags from the LongSAGE library prepared from nonamplified material, and none of these mapped to a unique site in the most recent version of the human genome (RefSeq database, build 35, August 26, 2004).
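
A pairwise comparison of this kind can be sketched using the commonly cited form of the Audic-Claverie conditional probability, p(y|x) = (N2/N1)^y (x+y)! / [x! y! (1 + N2/N1)^(x+y+1)], where x and y are the counts of a tag in libraries of total size N1 and N2. The code below is not the DiscoverySpace implementation. The two-sided p-value (twice the smaller tail, capped at 1) and the definition of similarity as the fraction of tag types not significantly different are assumptions made for illustration, and all names are hypothetical.

```python
from math import exp, lgamma, log


def audic_claverie_pmf(x, y, n1, n2):
    """p(y | x) for tag counts x, y in libraries of total size n1, n2."""
    if n1 <= 0 or n2 <= 0:
        raise ValueError("library sizes must be positive")
    ratio = n2 / n1
    log_p = (y * log(ratio)
             + lgamma(x + y + 1) - lgamma(x + 1) - lgamma(y + 1)
             - (x + y + 1) * log(1.0 + ratio))
    return exp(log_p)


def audic_claverie_pvalue(x, y, n1, n2):
    """Two-sided p-value: twice the smaller tail probability, capped at 1."""
    lower = sum(audic_claverie_pmf(x, k, n1, n2) for k in range(y + 1))
    upper = max(0.0, 1.0 - lower) + audic_claverie_pmf(x, y, n1, n2)
    return min(1.0, 2.0 * min(lower, upper))


def library_similarity(counts_a, counts_b, alpha=0.05):
    """Fraction of tag types whose counts do not differ at level alpha."""
    n1, n2 = sum(counts_a.values()), sum(counts_b.values())
    tag_types = set(counts_a) | set(counts_b)
    if not tag_types:
        raise ValueError("no tags to compare")
    n_different = sum(
        1 for tag in tag_types
        if audic_claverie_pvalue(counts_a.get(tag, 0),
                                 counts_b.get(tag, 0), n1, n2) < alpha
    )
    return 1.0 - n_different / len(tag_types)


# Hypothetical tag counts for two small libraries.
lib_a = {"tag1": 10, "tag2": 5, "tag3": 985}
lib_b = {"tag1": 10, "tag2": 50, "tag3": 940}
print(library_similarity(lib_a, lib_b))
```

Because the statistic depends on the ratio N2/N1, libraries of different depth can be compared without first rescaling the counts.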

Figure 4.

Comparisons of four LongSAGE libraries prepared from the same original RNA sample using different protocols. (A): Comparison using DiscoverySpace software of PCR-LongSAGE-B and PCR-LongSAGE-A. (B): Comparison using DiscoverySpace software of PCR-LongSAGE-B and the LongSAGE library constructed from the same RNA without prior amplification. (C): Comparison using DiscoverySpace software of PCR-LongSAGE-B and the SAGE-lite library constructed from the same RNA. (D): Hierarchical cluster analysis of the same four LongSAGE libraries demonstrates the similarity of the two PCR-LongSAGE libraries indicative of the reproducibility of the amplification process. Abbreviations: LongSAGE, long serial analysis of gene expression; PCR, polymerase chain reaction; SAGE, serial analysis of gene expression.

Pearson correlation analysis of tag frequencies in each pair of libraries generated correlation coefficients of 0.8 for the two PCR-LongSAGE libraries and somewhat lower values when these were compared with the library obtained from nonamplified material (0.61 and 0.65, respectively) or to a corresponding SAGE-lite library (0.61 and 0.66, respectively) (Fig. 4D). This latter method of comparison is more sensitive to differences between higher frequency tags. Hence, to avoid distortion from repetitive sequences, only tags that could be matched to a unique sequence in the most recent version of the human genome (build 35, August 26, 2004) were included in this analysis.

Construction and Analysis of a PCR-LongSAGE Library from CD34+ Cells Isolated from Normal Adult Human Bone Marrow

We then used this method to construct a library from ∼3,000 highly purified lin−CD34+ cells isolated by FACS from a sample of normal adult human bone marrow cells (Fig. 5A). Functional assays applied to these CD34+ cells demonstrated that 12% had granulopoietic, erythroid, or mixed granulopoietic and erythroid CFC activity in vitro. In addition, 0.3% of these cells were detectable as 6-week precursors of CFCs in LTC-IC assays [10], as described in Materials and Methods. From this library, 201,106 tags were sequenced, and 42,310 unique tag types were obtained with a typical SAGE tag frequency distribution (Fig. 5B). A complete listing of all the tags is given at http://www.transcriptomES.org. Q-RT-PCR of cDNA preparations generated from extracts of independently purified lin−CD34+ cells from the same bone marrow sample showed a good correlation between the transcript levels measured and those inferred from the PCR-LongSAGE tag counts using DiscoverySpace for tag-to-transcript identification (Fig. 5C).

Figure 5.

Description and validation of PCR-long serial analysis of gene expression (PCR-LongSAGE) library from human lin−CD34+ adult bone marrow cells. (A): Fluorescence-activated cell sorting (FACS) plot demonstrating the high purity (98%) of the lin−CD34+ cells isolated by two successive re-sorts of lin− low-density normal adult human bone marrow cells and reanalyzed in a third run of the cells through the FACS instrument. (B): Distribution of tags in the PCR-LongSAGE library generated from the RNA extracted from these cells. (C): Ten transcripts identified from the PCR-LongSAGE library were chosen to test the correlation between PCR-LongSAGE and Q-RT-PCR methodologies. lin−CD34+ cells were isolated from the same donor, and Q-RT-PCR was performed on the cDNA products obtained. EEF1A1 was the most abundantly expressed transcript of those analyzed (based on SAGE tag counts) and was therefore chosen as a standard against which the other nine transcripts were compared. The y-axis shows the ΔCt value obtained in each case from the Q-RT-PCR measurements (ΔCt = Ct(X) − Ct(EEF1A1)), and the x-axis shows the corresponding tag frequency expressed as a log2 value after normalization against the EEF1A1 tag frequency (log2 [EEF1A1 tag count/X tag count]). Abbreviations: APC, allophycocyanin; Ct, cycle threshold; FITC, fluorescein isothiocyanate; Q-RT-PCR, quantitative real-time polymerase chain reaction.
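
The two quantities plotted in panel C can be computed as in the sketch below, which applies the formulas given in the caption: ΔCt = Ct(X) − Ct(EEF1A1) and log2(EEF1A1 tag count / X tag count). The transcript labels, Ct values, and tag counts are hypothetical placeholders, not data from the study, and the function names are illustrative. If PCR efficiency were ideal (a doubling per cycle), the two measures would be expected to agree closely.

```python
from math import log2, sqrt


def delta_ct(ct_x, ct_reference):
    """Delta-Ct of transcript X relative to the reference transcript."""
    return ct_x - ct_reference


def log2_tag_ratio(reference_count, x_count):
    """log2 of (reference tag count / X tag count)."""
    if reference_count <= 0 or x_count <= 0:
        raise ValueError("tag counts must be positive")
    return log2(reference_count / x_count)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical measurements: (label, Ct, tag count); reference listed first.
reference = ("REF", 18.0, 1600)
others = [("X1", 20.1, 400), ("X2", 22.8, 50), ("X3", 25.2, 10)]

x_axis = [log2_tag_ratio(reference[2], count) for _, _, count in others]
y_axis = [delta_ct(ct, reference[1]) for _, ct, _ in others]
print(pearson_r(x_axis, y_axis))
```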

The tag-to-transcript analysis showed that 8,959 tags in the PCR-LongSAGE library mapped to single RefSeq transcripts or multiple variants of a single gene in the RefSeq database. This included transcripts that are known to be expressed in CD34+ human bone marrow cells, such as transcripts that encode various transcription factors and cell surface receptors [15–18]. A number of these transcripts have not been found in previously published libraries generated from phenotypically similar cell populations using the original 14-mer SAGE protocol (examples highlighted in Table 1) [17, 19]. Nevertheless, when DiscoverySpace was used to compare all of the tags present in our library with those present in the two related published libraries [18, 20], 96% and 94% similarity values, respectively, were obtained (at a 95% confidence interval; Fig. 6A, 6B). When we compared the nonsingleton tags in the newly constructed lin−CD34+ bone marrow library with nonsingleton tags in the other two CD34+ human cell libraries, the result showed that 2,166 tags were present in all three (Fig. 6C). The tag-to-transcript mapping of these 2,166 tags yielded 718 RefSeq transcripts (the tag and annotation information are summarized in supplemental online Table 5). The consistent expression of these transcripts in the three CD34+ libraries suggests that these genes may play important roles in the maintenance and/or differentiation of human hematopoietic stem/progenitor cells.

Table 1. Transcripts detected in a polymerase chain reaction-long serial analysis of gene expression library prepared from normal adult human linCD34+ bone marrow cells
Figure 6.

Comparison of polymerase chain reaction-long serial analysis of gene expression (PCR-LongSAGE) library from human linCD34+ adult bone marrow cells with published data. (A): Comparison, using DiscoverySpace software, of the tags present in the PCR-LongSAGE library constructed from the adult human linCD34+ bone marrow cells in this study and those identified in a published 14-mer serial analysis of gene expression (SAGE) library constructed from a different source of CD34+ adult human bone marrow cells [18]. (B): Comparison, using DiscoverySpace software, of the tags present in the PCR-LongSAGE library constructed from the adult human linCD34+ bone marrow cells in this study and those identified in a published 14-mer SAGE library constructed from CD34+CD38+ adult human bone marrow cells [20]. (C): A Venn diagram showing the intersection of commonly expressed tags in the PCR-LongSAGE, linCD34+CD38+, and the CD34+ SAGE libraries. To carry out this comparison, we converted our LongSAGE tags to short SAGE tags using the DiscoverySpace software. Of the 2,166 tags, 718 could be annotated using the human RefSeq database. These tags and their annotations are listed in supplemental online Table 4.
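The comparison in panel C can be illustrated with a minimal sketch. It assumes that each tag is stored as a string beginning with the anchoring-enzyme site, that a LongSAGE tag is shortened by keeping its first 14 bases, and that singletons are removed after conversion. These are illustrative assumptions rather than a description of how DiscoverySpace performs the conversion, and the tag sequences and counts are invented.

```python
from collections import Counter

SHORT_TAG_LENGTH = 14  # assumed: 14-mer tag including the anchoring site

def to_short_tags(long_tag_counts, length=SHORT_TAG_LENGTH):
    """Truncate each long tag and pool the counts of tags that collapse together."""
    short = Counter()
    for tag, count in long_tag_counts.items():
        short[tag[:length]] += count
    return short

def nonsingleton(tag_counts):
    """Return the set of tags observed more than once in a library."""
    return {tag for tag, count in tag_counts.items() if count > 1}

def shared_tags(*libraries):
    """Tags that are nonsingleton in every library (center of the Venn diagram)."""
    return set.intersection(*(nonsingleton(lib) for lib in libraries))

# Toy libraries (invented sequences and counts, for illustration only).
long_lib = {
    "CATGAAAAAAAAAACCCCCCC": 1,
    "CATGAAAAAAAAAAGGGGGGG": 1,   # collapses onto the same short tag as above
    "CATGTTTTTTTTTTACGTACG": 3,
}
short_lib_b = {"CATGAAAAAAAAAA": 5, "CATGTTTTTTTTTT": 1}
short_lib_c = {"CATGAAAAAAAAAA": 2, "CATGGGGGGGGGGG": 4}

converted = to_short_tags(long_lib)
common = shared_tags(converted, short_lib_b, short_lib_c)
print(sorted(common))
```

Note that two distinct long tags can collapse onto one short tag, so the order in which singletons are filtered and tags are truncated can change the result.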

Gene Ontology analysis of these 718 RefSeq transcripts showed the presence of cell death-related genes, with a balance between positive and negative regulators of cell death. We also observed the presence of several positive regulators of cell growth, reflecting the likelihood that some of the cells in the CD34+ subset of human bone marrow are proliferating [20]. In addition, we observed the presence of several transcripts encoding proteasome components and members of the ubiquitination complex (supplemental online Fig. 3). Interestingly, it was recently demonstrated that the proteasomal activity of human hematopoietic progenitor cells prevents their infectability with lentiviral vectors [21].

We also compared our normal adult linCD34+ human bone marrow cell SAGE library to 287 publicly accessible SAGE libraries prepared from multiple types of human cells (available primarily through the Cancer Genome Anatomy Project at http://www.cgap.nci.nih.gov, including the two human CD34+ cell libraries mentioned above). This more extensive comparison revealed 936 tags that appeared only in our linCD34+ bone marrow cell library, of which 192 mapped to a single sequence in the human genome and not to any site included in the Mammalian Gene Collection (ftp://ftp.ncbi.nih.gov/repository/MGC/MGC.sequences), RefSeq (ftp://ftp.ncbi.nih.gov/refseq/daily), or Ensembl, version 20. We then estimated the probability of single-base pair errors by combining a library-wide construction error rate and a tag-specific sequencing error probability [22], which indicated that 190 of the 192 tags could be judged to be error-free (p ≤ .05). Of the 190 tags, 23 mapped to regions that are highly conserved in the mouse, rat, and human genomes, were located in the human genome at least 5,000 base pairs away from well-annotated transcripts, and were not present in any human EST database. These 23 novel tags are listed with their chromosomal locations in supplemental online Table 4. Q-RT-PCR was then used to investigate the expression of these 23 novel tags in three cDNA samples prepared independently from three samples of linCD34+ adult human bone marrow cells, including one prepared from the same pool of RNA used for making the PCR-LongSAGE library.
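The successive filters described above can be sketched as follows. Every field name, the placeholder construction error rate, and the way the two error sources are combined (independent per-base Phred-style sequencing errors multiplied by a library-wide term) are assumptions made for illustration; they are not taken from reference [22] or from the pipeline actually used.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TagRecord:
    sequence: str
    in_public_libraries: bool        # seen in any other public SAGE library
    genome_hits: int                 # number of matching sites in the human genome
    matches_known_transcript: bool   # hits a known-transcript database entry
    base_qualities: List[int]        # hypothetical Phred-style scores, one per base
    conserved_in_mouse_rat: bool
    distance_to_annotated_bp: int    # distance to nearest well-annotated transcript
    in_est_database: bool

def error_probability(base_qualities, construction_error_rate):
    """Assumed model: the tag is correct only if construction and every base call are."""
    p_correct = 1.0 - construction_error_rate
    for q in base_qualities:
        p_correct *= 1.0 - 10 ** (-q / 10)
    return 1.0 - p_correct

def is_candidate_novel(tag, construction_error_rate=0.01,
                       max_error=0.05, min_distance_bp=5000):
    """Apply the filters in the order given in the text."""
    return (
        not tag.in_public_libraries
        and tag.genome_hits == 1
        and not tag.matches_known_transcript
        and error_probability(tag.base_qualities, construction_error_rate) <= max_error
        and tag.conserved_in_mouse_rat
        and tag.distance_to_annotated_bp >= min_distance_bp
        and not tag.in_est_database
    )

# Two invented records that differ only in sequencing quality.
good = TagRecord("CATG" + "ACGTACGTACGTACGTA", False, 1, False,
                 [40] * 21, True, 12000, False)
noisy = TagRecord("CATG" + "ACGTACGTACGTACGTA", False, 1, False,
                  [10] * 21, True, 12000, False)
print(is_candidate_novel(good), is_candidate_novel(noisy))
```

Expressing the filters as one predicate keeps each criterion (uniqueness, genomic mapping, error estimate, conservation, distance, EST absence) visible and individually adjustable.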

To assess the possibility of genomic DNA contamination and its contribution to the detection of the unique tag expression, we included a strict negative control in which RNA from each bone marrow sample was used as PCR template (described in Materials and Methods). Q-RT-PCR analyses showed 10 of the 23 tags to be consistently detectable in the cDNA samples examined with no detectable amplification in the negative controls. Four of these 10 novel tags were also observed in nine additional PCR-LongSAGE libraries that we have recently prepared from related sources of primitive human hematopoietic cells (i.e., the linCD34+CD38CD7CD36CD45RACD71 and linCD34+CD38+CD7CD36CD45RACD71 subsets of cells in normal adult human bone marrow, umbilical cord blood, G-CSF-mobilized peripheral blood, and human fetal liver; Y.Z. and C.J.E., unpublished data), and 1 of the 10 novel tags was present in two of these nine libraries (supplemental online Table 4).


SAGE technology offers a powerful approach to global gene expression profiling of defined cell populations and can serve as an important gene discovery tool. It is therefore particularly attractive for investigations of changes in cellular programs, both normal and aberrant. However, many key events take place in rare cell types from which the amounts of RNA required for SAGE cannot be obtained, and this often precludes its use. Here, we describe a modified method for preparing amplified cDNA products that enables LongSAGE to be reproducibly applied to samples 10-fold smaller than was previously possible (10³ cells or fewer). This modification makes use of a template switching primer containing a rare (AscI) restriction site and a 21-cycle PCR that yields sufficient cDNA product to allow the construction of SAGE libraries from which millions of tags can be derived by direct sequencing. We used the LongSAGE protocol because a larger proportion of the tags obtained from such libraries can be uniquely mapped to genomic DNA [23].

Currently, many of the methods available to amplify RNA make use of the error-prone T7 RNA polymerase. If applied to material to be used for SAGE, a high frequency of ambiguous or incorrect tags might be expected. Amplification of cDNAs by the PCR method makes use of Titanium Taq polymerase with a TaqStart antibody to provide automatic hot-start PCR, as well as proofreading activity. These latter features maximize reliability by ensuring that the amplified cDNA contains very little product derived from nonspecific cDNA strand amplification or mismatched sequence errors (estimated at 1/50,000 nucleotides). Here, we validated these predictions by a series of experimental and statistical comparisons of the tag or transcript representation in amplified versus nonamplified cDNA preparations and SAGE libraries prepared from these samples. The results demonstrated that PCR-LongSAGE is a reproducible method for performing SAGE analyses on small numbers of cells without significant distortion or loss of transcripts present in the original RNA extract.
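To give a sense of scale for the quoted error estimate, the following sketch computes the chance that a single tag carries at least one mismatch. It assumes that the 1/50,000 figure applies independently to each nucleotide of the final amplified product and that a LongSAGE tag is 21 bases long; both are simplifying assumptions made here, not statements from the study.

```python
def tag_error_probability(per_base_error, tag_length):
    """P(at least one mismatch in a tag), assuming independent per-base errors."""
    return 1.0 - (1.0 - per_base_error) ** tag_length

PER_BASE_ERROR = 1 / 50_000   # estimate quoted in the text
TAG_LENGTH = 21               # assumed LongSAGE tag length, including the anchor site

p_tag = tag_error_probability(PER_BASE_ERROR, TAG_LENGTH)
print(f"Probability a tag contains an amplification error: {p_tag:.2e}")
```

Under these assumptions the result is on the order of 4 in 10,000 tags, which is consistent with the expectation that amplification adds few incorrect tags.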

The power of this method is illustrated here for the transcriptome analysis of the small fraction of linCD34+ cells present in normal adult human bone marrow. These cells are of particular interest because they are highly enriched in hematopoietic stem and progenitor cells [18]. Comparison of the PCR-LongSAGE library obtained from this subset with published (SAGE) libraries prepared from nonamplified cDNA obtained from similar cells showed extensive similarities in tag composition and the presence of many expected transcripts. In addition, our studies underscore the power of the LongSAGE protocol for identifying novel transcripts and transcripts of potential developmental importance because of their restricted but reproducible detection in closely related primitive cell populations. We therefore expect that this method will broaden the application of SAGE to other purified or microdissected subsets of cells and thereby facilitate the investigation of many processes not previously accessible to global gene expression analysis.

Disclosure of Potential Conflicts of Interest

The authors indicate no potential conflicts of interest.


We are grateful to members of the Stem Cell Research Laboratory and FACS facility in the Terry Fox Laboratory for assistance in the initial processing and FACS isolation of the human cord blood and bone marrow cells used, to Dr. J. Thomson (University of Wisconsin, Madison, WI) for providing the H9 cell RNA extract, and to A. Wanhill and D. Wytrykush for assistance in preparing the manuscript. This work was supported by funds from Genome BC and Genome Canada, the Stem Cell Network, and the Terry Fox Run (as a grant from the National Cancer Institute of Canada). Y.Z. held postdoctoral fellowships from the Stem Cell Network and the Leukemia Research Fund of Canada. A.R. held postdoctoral fellowships from the Canadian Breast Cancer Foundation and the Canadian Institutes of Health Research. D. Kent held Studentships from the Stem Cell Network and the Canadian Institutes of Health Research. M.A.M. and S.J. are Scholars of the Michael Smith Foundation for Health Research, and M.A.M. is a Terry Fox Young Investigator of the National Cancer Institute of Canada. Y.Z. and A.R. contributed equally to this work.