Deoxyribonucleic acid (DNA) is built from the same four nucleotides in all organisms. Their nucleobases comprise two purines, adenine (A) and guanine (G), which form Watson–Crick base pairs with the two pyrimidines, thymine (T) and cytosine (C), respectively. The sugar–phosphate backbone with the nucleobases attached to it polymerizes to form a right-handed double helix. The human genome is divided into 24 distinct chromosomes: 22 autosomes (one copy of each inherited from the father and one from the mother) and two sex chromosomes, X and Y, present as XX in females and XY in males. DNA does not exist as a ‘naked’ molecule, but is packaged with proteins into chromatin whose density correlates with the transcriptional status. The completion of the human genome sequencing project in 2003 provided a (nearly) complete and accurate sequence of the three billion bases of which the human genome is composed. This multinational effort has already provided valuable insights into biology with impact on medical issues, and permits research to advance at an unprecedented pace. With the support of the sequence, research has turned to understanding the function and regulation of the different parts of the genome and to mapping the regulatory elements of the approximately 30 000 genes that are found in the genome.
Polymorphisms of the human genome are a powerful tool for investigating interactions and pathways. It is hoped that a better understanding of the biological processes will help unravel the complex genetic traits of, for example, multifactorial disorders. Alterations in gene sequences, expression levels and protein structure and function have been associated with many types of diseases. Two types of variation at the DNA level are thought to influence transcription into ribonucleic acid (RNA) as well as the subsequent translation into proteins: genetic variations such as polymorphisms and mutations, which give rise to different DNA sequences on either of the two alleles of an individual or between individuals at specific loci; and epigenetic variations (mediated by DNA methylation and histone modifications) that are heritable in the short term, but do not involve mutations of the DNA nucleotide sequence itself.
Mass spectrometry (MS), in particular matrix-assisted laser desorption/ionization (MALDI) time-of-flight (ToF) MS, is one of the most versatile tools in the post-genome sequencing era for the analysis of biomolecules. The instrumental method can be applied to the analysis of DNA and DNA methylation, expression profiling and proteomics. In this review, we focus solely on methods for the analysis of nucleic acids (DNA/RNA) and nucleic acid–based variations mainly by MALDI-MS, and the interested reader is directed to one of the numerous reviews dealing with aspects of proteomic analysis by MS.1, 2
DNA ANALYSIS BY MALDI MASS SPECTROMETRY
The principle of MALDI was discovered in the late 1980s independently by Tanaka in Japan and Karas and Hillenkamp in Germany.3, 4 This was approximately the same time as the idea of the Human Genome sequencing project was put forward by Watson.5 However, initially, MALDI-MS predominantly found its application in the analysis of proteins and peptides. The use of the instrument for DNA analysis was at first very tentative, as only small oligonucleotides up to 20 bases could be analyzed in the negative ion mode after substantial optimization.6 The discovery of 3-hydroxypicolinic acid (HPA) as a good matrix for DNA analysis in the positive ion mode7 and the integration of delayed extraction into the MALDI process dramatically improved resolution8 and set the stage for the development of commercial instruments. The major obstacles were that larger DNA fragments showed depurination, and the presence of metal contaminants such as sodium and potassium resulted in adducts. These cations interact with the multiply negatively charged sugar phosphate backbone causing peak broadening and reduction of resolution, sensitivity and accuracy. As it became apparent that the problems of limited size range, resolution, fragmentation and adducts were very difficult to improve on, DNA analysis by MALDI started shifting from the idea of using it for DNA sequencing to an emerging problem—that of analyzing mutations and single nucleotide polymorphisms (SNPs) where only small products need to be analyzed (3 to 25 bases). Duplex structures of extension primers and templates dissociate under MALDI conditions so that only the short extension primer and primer extension products are detected. As research realized the potential of SNPs as a means to discover the underlying genetic traits of disease susceptibility and drug response, MALDI-MS was ready to rise to the challenge and turned into one of the major players in the field of SNP genotyping. 
Speed increased from one spectrum every few minutes in 1993 to 30 000 spectra per day by 2001. For the analysis of polymorphisms, exceedingly high resolution is not required in contrast to proteomics. The problem lies rather in the number of spectra that has to be handled.9 To circumvent the problems associated with adduct formation and the resulting low resolution, various purification protocols were implemented to get rid of components of enzyme buffers such as detergents, stabilizers and glycerol as well as salts that might interfere with MALDI analysis. This process can be carried out by various methods, such as reversed-phase purification, ethanol precipitation, and streptavidin coated magnetic beads. Ion-exchange beads allow a homogeneous assay format and are suitable for automation of this step.10, 11 Another strategy consists in charge-neutralizing a modified sugar–phosphate backbone by chemical means to render it insensitive to adduct formation.12
DNA-based polymorphisms contribute to an individual's variability of cellular functionality, metabolism, RNA turnover, protein production and disease susceptibility. In recent years research has focused on the analysis of SNPs because of their simpler mutational dynamics and greater prevalence compared to microsatellites (short tandem repeats (STR) or CA repeats of variable length), which had been successfully applied to the identification of many disease susceptibility loci of monogenic disorders.13 SNPs are biallelic single base changes that occur at a specific position in a genome with a frequency of at least 1% in a given population (Fig. 1).14 Owing to their binary nature, SNPs are fairly easy to genotype and the interpretation of the readout can be automated. On average one SNP is found every 300 bases in humans, and the largest public database now contains approximately ten million human SNPs of which five million have been validated. Population frequency data is available for 600 000 polymorphisms (dbSNP build 125; http://www.ncbi.nlm.nih.gov/SNP). An allele of an SNP can constitute a genetic risk factor, as it may increase susceptibility to certain diseases either directly by changing the coding sequence of proteins or indirectly by affecting gene regulation, influencing mRNA stability and conformation, and the quantity and quality of expressed gene product. Association studies, in which the genotypes of a cohort of cases and controls are compared, are expected to detect even moderate effects of the individuals' genotypes in correlation with their phenotype. It is assumed that with a dense map of markers it will be possible to localize and characterize genes involved in complex human diseases such as cancer, asthma, diabetes and cardiovascular diseases.15 In the past few years, it has been found that the human genome is structured in blocks of complete linkage disequilibrium (i.e. 
the nonrandom association of two alleles) due to evolution.16, 17 Using high-frequency SNPs, usually only a very limited number of haplotypes is detected, and depending on the region of the genome, linkage disequilibrium (LD) can extend to more than 100 kb. The HapMap project (http://www.hapmap.org) has now provided a selection of SNP markers that tag common haplotype blocks in order to reduce the number of genotypes that have to be measured for a genomewide association study.18 These SNPs are surrogate markers that are themselves probably functionless but segregate with a gene or allele that is associated with a disease. The first positive findings seem to confirm this approach and justify the enormous efforts involved in the human genome sequencing project and its follow-up projects. Using either the HapMap data or haplotypes reconstructed from their respective samples, three independent studies pinpointed a single SNP in the complement factor H gene as the underlying cause in age-related macular degeneration, the leading cause of blindness among the elderly.19–21
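The strength of the nonrandom association between two alleles can be expressed numerically. The following is a minimal Python sketch, assuming two biallelic loci with known haplotype and allele frequencies, of the standard coefficients D and r²; the function name and the example frequencies are hypothetical and are not taken from the studies cited here.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, r_squared) for allele A at locus 1 and allele B at locus 2.

    p_ab: frequency of the haplotype carrying both A and B
    p_a:  frequency of allele A
    p_b:  frequency of allele B
    """
    # D is the excess of the observed haplotype frequency over the
    # frequency expected if the two alleles were inherited independently.
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    # r^2 normalizes D^2 to the range 0 (independence) to 1 (complete LD).
    return d, d * d / denom


# Hypothetical example: A and B always occur on the same haplotype.
d, r2 = linkage_disequilibrium(p_ab=0.5, p_a=0.5, p_b=0.5)
```

An r² close to 1 means that typing one of the two SNPs is sufficient to predict the other, which is the rationale behind tag SNP selection.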
On the basis of the HapMap project, Perlegen, Illumina and Affymetrix have now developed array-based products with up to 500 000 SNPs for genomewide association studies. Other high-volume applications of SNP genotyping are found in pharmacogenomics, where the promise of individually tailored medicine could prevent adverse and inefficient drug responses, or agricultural applications such as quantitative trait loci analysis or generating genetic fingerprints of animals for identification and traceability purposes.
Methods for SNP genotyping are very diverse, and new or improved methods are still emerging.22 Broadly, each method can be separated into two elements, the first being a method for interrogating a SNP. This is a sequence of molecular biological, physical and chemical procedures for the distinction of the alleles of an SNP, such as hybridization, ligation, primer extension and cleavage or a combination thereof.23 The second element is the actual analysis or measurement of the allele-specific products, which can be an array reader, a mass spectrometer, a plate reader, a gel separator/reader system or any other. Each principle for allele differentiation has been combined with every detection device.23 Many SNP genotyping technologies have reached maturity in the last five years and have been integrated into large-scale genotyping operations. The choice of the method depends on the scale of the study and the scientific question a project is trying to answer. A project might require genotyping of a limited number of SNP markers in a large population or the analysis of a large number of SNP markers in one individual. Flexibility in choice of SNP markers and DNA to be genotyped or the possibility to precisely quantify an allele frequency in pooled DNA samples might be issues. Studies might also use combinations of typing methods at different stages.
MALDI-BASED ANALYSIS OF SNPS
In contrast to most other genotyping methods that use indirect detection of the alleles of a polymorphism such as measuring a fluorescent label that has been attached to a product of the allele differentiating reaction, MS directly measures a physical property of the allele—its mass, which eradicates a potential source of doubt and error. The four principles for allele discrimination mentioned above have been combined with MALDI analysis for SNP genotyping and are described in detail elsewhere.24 All assays for SNP genotyping by MALDI that were developed beyond the stage of proof-of-principle and that were implemented with a significant degree of automation into routine genotyping are based on primer extension. Primer extension uses an oligonucleotide (a primer) that is complementary to a sequence upstream of the polymorphism of interest, a DNA polymerase and nucleotides. In a thermocycling reaction, complementary bases are added to the primer, and specific termination in the chain extension allows the identification of the alleles of the SNP (Fig. 2). Primer extension assays are flexible, robust and well suited for high-throughput applications, and the incorporation of a complementary nucleotide by a DNA polymerase distinguishes more accurately between the two alleles than the different thermal stability of hybridizing allele-specific probes.25 In the late 90s and the beginning of this decade, it was thought that the combination of primer extension with MALDI-ToF-MS could develop into a gold standard for high-throughput genotyping. 
Following this assumption these MS-based methods for SNP genotyping were continuously improved, and MALDI is now one of the most automated and efficient detection platforms for SNP genotyping delivering results of the highest accuracy and reliability.26 It fulfils the criteria of a rapid, precise and cost-effective high-throughput method that is required to perform the large number of genotypes/measurements that are necessary for the discovery of genetic markers involved in the etiology and pathophysiology of complex disorders.
Several versions of primer extension assays using MALDI-MS as the detection platform have been developed and commercialized. In all these assays, the DNA target sequence is amplified by a polymerase chain reaction (PCR) (Fig. 3). Remaining deoxynucleotide triphosphates (dNTPs) and primers are removed from the reaction by digestion with shrimp alkaline phosphatase or similar methods, because these compounds could interfere in the ensuing primer extension reaction. A primer for primer extension anneals with its 3′ terminal base immediately upstream of the SNP on the target sequence. Discrimination of the alleles is achieved by essentially three different strategies: single base primer extension (SBE), multiple base primer extension and nucleotide depletion (Fig. 3). After the allele-specific reaction, samples need to be conditioned prior to the analysis step and sample preparation is of crucial importance to obtain satisfactory results. In SBE assays each allele product of an SNP is detected at a specific mass because of the addition of terminating dideoxynucleotides (ddNTPs), which are complementary to the alleles of the SNP on the template. As the four bases differ in mass, the extension products for different alleles are separated by the mass of the terminating bases. Examples for this approach are the PinPoint assay,27 the GOOD assay28, 29 and the genoSNIP assay.30 Analysis of some SNPs is difficult by SBE assays, as the smallest mass difference between two nucleotides (A and T) is only 9 Da, which is demanding to resolve in the typically used mass range of 4000 to 9000 Da. 
The separation of alleles can be enhanced by the use of tags or modified nucleotides that increase the mass difference between signals of the two alleles.31, 32 Incorporation of chemically modified12 or photochemically cleavable30, 33 building blocks into the extension primer enables the reduction of the extension products to a core sequence of four to five nucleotides occupying a mass range of 1200–2000 Da where the resolution and sensitivity of the mass spectrometer are best. By means of a chemical modification strategy (charge tagging) employed in the GOOD assay, signal intensity is increased to that of peptides.12
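The 9 Da figure for an A/T polymorphism follows directly from the masses of the terminating residues. The sketch below, in Python, assumes commonly quoted approximate average masses for the ddNMP residue added to the primer; these values, the function names and the example primer mass are illustrative assumptions and are not taken from the assays cited above.

```python
# Approximate average mass (Da) added to the primer by each terminating
# dideoxynucleotide residue (assumed values, for illustration only).
DDNMP_MASS = {"C": 273.2, "T": 288.2, "A": 297.2, "G": 313.2}


def sbe_product_masses(primer_mass, alleles):
    """Expected masses of the single base extension products of one SNP."""
    return {base: round(primer_mass + DDNMP_MASS[base], 1) for base in alleles}


def allele_mass_difference(base_a, base_b):
    """Mass difference (Da) between the extension products of two alleles."""
    return round(abs(DDNMP_MASS[base_a] - DDNMP_MASS[base_b]), 1)


def required_resolving_power(product_mass, delta_mass):
    """Resolving power m / delta-m needed to separate two product signals."""
    return product_mass / delta_mass


# Hypothetical 5000 Da primer interrogating an A/T SNP.
products = sbe_product_masses(5000.0, "AT")
gap = allele_mass_difference("A", "T")
```

With these assumed masses, an A/T SNP is the hardest case and a C/G SNP the easiest, and the required resolving power grows linearly with the mass of the extension product, which is why shortening the product to a small core sequence helps.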
Multiple base extension enhances the clear assignment of the product peaks in the mass spectrum by generating products with a mass difference of at least 300 Da. A mixture of dNTPs and ddNTPs is used, and termination occurs at the first nucleobase in the template complementary to a ddNTP. An alternative approach developed by Decode Genetics uses nucleotide depletion, i.e. the nucleotide complementary to one allele is absent in the primer extension mix.34 Multiple base extension was proposed and spearheaded by Sequenom during the last ten years and is probably the most widely applied SNP genotyping assay with MALDI-MS detection. Sequenom offers a turnkey solution with chemistries, sample preparation robotics, a dedicated MALDI mass spectrometer and supporting software (MassARRAY system). This facilitates the genotyping procedure and makes it accessible to non-mass spectrometrists. The original protocol put forward in the PROBE assay35 was continuously improved to the homogeneous MassEXTEND (hME) assay running on a highly automated, high-throughput platform, available for different dimensions of throughput.11, 36 Nanoliter quantities of the samples are transferred by piezoelectric pipetting onto silicon chips precharged with 3-hydroxypicolinic acid (HPA) as matrix.37 As the entire preparation is volatilized with a few laser pulses, the need to search for ‘sweet’ spots of the HPA matrix is avoided and the spot-to-spot reproducibility increases. Multiplex primer extension reactions of up to 12 SNPs have been reported and a four-plex format can be used routinely.38 Sequenom has recently introduced a highly multiplexed genotyping assay termed ‘iPLEX’ that routinely analyzes 24 SNPs, and multiplexes of up to 29 are feasible.39 The assay is based on single base extension using acyclic mass-modified terminating nucleotides, which create mass differences large enough to differentiate between all four bases. 
Compared to the hME assay, greater multiplexing capacities can be achieved by the utilization of a universal termination mixture. Sophisticated design software for multiplex PCR and primer extension ensures that no signals are within 15 Da of each other by also adding nucleotides noncomplementary to the template at the 5′ terminus of the extension primer. An added advantage is the reduction of the heterozygous skew (in a heterozygous individual the peaks corresponding to the two alleles do not display the same intensity/peak area) that is encountered in some hME reactions. This permits more stringent calling algorithms for genotyping.
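The 15 Da spacing rule can be checked with a few lines of code. The following Python sketch assumes that the expected masses of all primers and extension products in a multiplex are already known; it only flags conflicts and is not a description of the commercial design software. The signal names and masses are hypothetical.

```python
from itertools import combinations


def conflicting_signals(expected_masses, min_gap=15.0):
    """Return pairs of signal names whose masses lie closer than min_gap Da.

    expected_masses: dict mapping a signal name to its expected mass in Da
    """
    return [
        (name_a, name_b)
        for (name_a, mass_a), (name_b, mass_b)
        in combinations(sorted(expected_masses.items()), 2)
        if abs(mass_a - mass_b) < min_gap
    ]


# Hypothetical two-SNP multiplex: the product of the second SNP falls
# between the two allele signals of the first SNP.
multiplex = {"snp1_A": 5297.2, "snp1_G": 5313.2, "snp2_C": 5305.0}
conflicts = conflicting_signals(multiplex)
```

A conflict such as the one in this example would be resolved at the design stage, for instance by lengthening one extension primer at its 5′ terminus so that all of its signals shift by the mass of the added nucleotides.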
A SELECTION OF RESULTS OF SNP GENOTYPING BY MALDI
Sequenom has carried out ten whole-genome scans using 28 000 to 91 000 SNPs per case–control association study for the discovery of the susceptibility genes of a variety of common diseases and traits including diabetes, high-density lipoprotein levels, schizophrenia, melanoma and breast and lung cancer.40 In the scans, some of the known susceptibility genes for a condition were confirmed and new genes were discovered and subsequently biologically validated. Published results include the nuclear mitotic apparatus protein (NuMA) gene region as a susceptibility locus for breast cancer, and polymorphisms in the intercellular adhesion molecule (ICAM) genes associated with breast and prostate cancer risk.41, 42 Another example is the very recently published association between a polymorphism in the first intron of the leucine-rich repeats and calponin homology containing 1 (LRCH1) gene and knee osteoarthritis.43 A recent review describes in detail most of the genetic association studies, gene-based or genome-wide, that used MALDI-MS for SNP genotyping, as well as MALDI-based diagnostic tests for various diseases and conditions and the use of MS for bacterial and viral identification.44 Although these published studies clearly demonstrate the capabilities of MALDI for genomewide scans of SNPs in association studies, commercially available microarray-based genotyping products (Affymetrix, Illumina), which enable the analysis of up to 500 000 SNPs in a single experiment, are becoming increasingly popular for this type of work. SNP genotyping by MALDI-MS will probably find its place in fine typing in gene-based approaches and/or in the confirmation and replication of associations found in genomewide scans carried out on other technology platforms. It is ideally suited for studies of a limited number of polymorphisms in large sample cohorts. 
With the newly developed iPLEX system, multiplexes of up to 29 SNPs become an alternative to methods like SNPlex (Applied Biosystems), which analyzes up to 48 SNPs in one experiment on a capillary sequencer,45 and here the features of MALDI-MS as a rapid, precise and cost-effective high-throughput method show some advantage.
MORE DEMANDING APPLICATIONS
Recently mass spectrometers have lost their position for very high-throughput SNP genotyping mainly because of cost, effort and failure rates in contrast to the competition. However, advances in the application of MALDI-MS for more demanding DNA analysis, such as molecular haplotyping, human leukocyte antigen (HLA) microhaplotyping, gene expression profiling, DNA methylation analysis and resequencing/mutation detection, demonstrate the potential of MALDI-MS as a versatile analyzer for nucleic acids besides simple SNP genotyping. In a recent review, the versatility of MALDI-MS was nicely demonstrated using the platform for the discovery of SNPs in the coding sequence of the TP73 gene, individual genotyping and allele frequency determination of the discovered SNPs in pooled DNA samples, quantitative measurement of allelic gene expression levels and correlating imbalanced expression to DNA methylation patterns.46
MALDI-MS has been demonstrated for the quantification of proteins and peptides.47 Quantification of oligonucleotides is more demanding owing to the heterogeneity of matrix crystallization, different ionization rates of the analytes and signal-to-signal interactions leading to low reproducibility. However, the chemical similarity of DNA components does provide some compensation. Several approaches have been presented using MALDI-ToF for the quantitative analysis of SNPs in pools of DNA, as pooling of DNA has been proposed as a means to reduce the number of genotypes necessary for detecting the subtle effects of genetic variations in multifactorial diseases for large-scale association studies.48 In addition, pooling strategies are applied to SNP verification and allele frequency determination. To circumvent the problems of low baseline resolution and confounding of the peak area measurement by salt adduct peaks, the two SNP alleles are separated by 300–400 Da by a single and double base extension. Several factors have to be taken into account that may cause bias and complicate accurate quantification. Alleles can be differentially amplified during PCR, and terminating ddNTPs may have different incorporation rates in the primer extension reaction. The mass spectrometer may have an instrumental and detection bias toward smaller molecules, as detection sensitivity decays with increasing mass and smaller oligonucleotides are favored during ionization, while larger DNA molecules have more opportunity to form salt adducts.49, 50 Differential signal response is corrected by measuring a known heterozygous individual. Highly linear calibration curves with correlation factors > 0.98 are obtained when pooling two PCR products in different ratios.51 The limit of detection is ∼2% for the rare allele and the limit of quantitation between 5 and 10%. 
The chip-based DNA MassARRAY™ system (Sequenom) has been used to characterize allele-specific primer extension products of ∼9000 gene-based SNPs in DNA pools of 94 individuals.26 By irradiation of the complete matrix/analyte spots, high reproducibility of ± 1.6% for alleles with a frequency of at least 10% can be achieved. Accuracy of allele frequency measurement is independent of the pool size50 and does not differ when assays are performed in either simplex or multiplex format (4-plex).11 Deviation of the measured allele frequencies in pooled DNA compared to genotypes obtained in individual DNAs at several loci is usually better than 4%.26, 52 In several comparative studies, results from MALDI-MS performed as well as or better than other techniques.53, 54 Combining the accuracy of primer extension assays with the flexible, high-throughput platform of analysis of pooled DNA samples is a cost- and time-effective method, if abundant variation and/or common alleles are of interest. However, rare alleles may be missed and individual information like haplotypes is lost. Statistical power is somewhat reduced as homozygous and heterozygous genotypes can no longer be distinguished. Quantification by MALDI is now in routine use for various applications. The first step of the whole-genome scans carried out by Sequenom for the various conditions consists of comparison of allele frequencies in pools of cases and controls, only SNPs that show a significant difference between the two pools being selected for individual genotyping.40 Another recent example is the detection of Down's syndrome (trisomy of chromosome 21) by quantitative genotyping of the amniotic fluid.55
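The correction of differential signal response with a known heterozygous individual can be illustrated as follows. This Python sketch assumes the simplest model, in which the heterozygote (true allele ratio 1:1) yields a single correction factor k that is applied to the peak areas measured in the pool; the function names and peak areas are hypothetical.

```python
def heterozygote_correction(area_a_het, area_b_het):
    """Correction factor k = area(A) / area(B) in a known heterozygote."""
    return area_a_het / area_b_het


def pooled_allele_frequency(area_a, area_b, k=1.0):
    """Estimated frequency of allele A in a DNA pool.

    The signal of allele B is scaled by k so that a true 1:1 mixture
    gives a frequency of 0.5 even when the two alleles respond unequally.
    """
    return area_a / (area_a + k * area_b)


# Hypothetical peak areas: allele A gives a 20% stronger signal than B.
k = heterozygote_correction(area_a_het=1200.0, area_b_het=1000.0)
freq_a = pooled_allele_frequency(area_a=600.0, area_b=1500.0, k=k)
```

Without the correction the same pool would be called at 600 / 2100, that is roughly 0.29 instead of 0.25, a deviation of the same order as the difference between cases and controls that such a study tries to detect.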
RESEQUENCING AND MUTATION DETECTION
In the beginning, MALDI-MS was thought of as an alternative to conventional Sanger sequencing, which uses fluorescently labeled, chain-terminating ddNTPs.56 MS enables separations of the products in milliseconds and has much higher resolution compared to conventional sequencing techniques.57, 58 MALDI analysis of cycled Sanger sequencing using separate reactions for each of the terminating ddNTPs revealed the limitations of the approach.59–61 Owing to the problems of adduct formation, fragmentation and low sensitivity in the high mass range, only short sequences of 50 to 100 bases could be routinely analyzed, while gel-based analysis techniques today are well established with read-lengths of up to 1000 nucleotides. Randomly occurring fragments prematurely terminated by dNTP complicate correct assignment of peaks in the mass spectra. Modified nucleobases to reduce ion fragmentation,62 solid phase purification by hybridization63 or biotinylated terminating ddNTPs64 improved the sequencing readout to some extent, but did not resolve the general problem. This clearly limits the use of MS for sequencing in routine applications. Limited exonuclease digestion of the template from either the 3′ and/or 5′ terminus was proposed as an alternative approach but read-length is limited to even fewer nucleotides.65 A method to circumvent some of the problems associated with the analysis of DNA in a MALDI mass spectrometer is to transcribe the DNA into RNA with the help of an RNA polymerase. RNA is more easily analyzed in a MALDI mass spectrometer, as the stabilizing effect of the 2′OH group on the sugar ring reduces N-glycosylic bond fragmentation.66 It should be noted that a drawback of RNA polymerases is their lack of 3′ → 5′ exonuclease activity, which leads to an increased rate of false nucleotide incorporation. The RNA can be sequenced either in a Sanger-type extension termination strategy67 or by the creation of base-specific fragments. 
Rather than generating a sequence ladder and determining the presence or absence of a specific fragment of the ladder, these latter protocols rely on endonucleolytic cleavage using different RNases, e.g. RNase T1 (G specific), U2 (A specific), PhyM (A and U specific) and A (C and U specific).68 Several similar protocols were developed that differ in the polymerase and/or endonuclease employed.69–71 The base-specific fragments are generated by carrying out PCR with a primer containing at its 5′ end an RNA polymerase-specific transcription initiation site. The polymerase transcribes the DNA template into an RNA or a mixed DNA/RNA oligonucleotide, which permits the subsequent base-specific cleavage using a variety of RNases. In Sequenom's MassCLEAVE protocol, a modified T7 polymerase is used that does not discriminate between ribonucleotides (RNA) and deoxyribonucleotides (DNA). One dNTP is replaced by the respective ribonucleotide. RNase A is used for specific C or U cleavage. Additional information is obtained by transcription of the complementary (reverse) strand and subsequent cleavage with RNase A, which corresponds to a G and A specific cleavage reaction on the forward template strand, as G base-pairs with C and A with U.71 Changes in the expected peak patterns identify unknown sequence changes and allow their immediate assignment without the need for sequencing. The use of universal forward and reverse transcription primers makes this procedure suitable for high-throughput applications. A multiplexed version of the protocol, which allows the simultaneous amplification and analysis of several target regions, has been demonstrated in a proof-of-principle experiment.72 However, this protocol might be impractical in routine use, as clear assignment of the peaks can be hampered by the increased number of fragments. Similar RNA cleavage protocols have also been applied to the analysis of DNA methylation patterns after sodium bisulphite treatment.73, 74
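The comparison of an observed cleavage pattern with the pattern expected from a reference sequence can be sketched in silico. The Python example below assumes complete cleavage 3′ of the specified bases and uses the base composition of each fragment as a stand-in for its mass, since fragments of identical composition have identical mass; it is a simplified illustration, not the algorithm of any of the cited protocols, and the sequences are hypothetical.

```python
from collections import Counter


def cleave(rna, cut_after):
    """Split a transcript 3' of every base contained in cut_after."""
    fragments, start = [], 0
    for i, base in enumerate(rna):
        if base in cut_after:
            fragments.append(rna[start:i + 1])
            start = i + 1
    if start < len(rna):
        fragments.append(rna[start:])  # 3'-terminal fragment without a cut site
    return fragments


def fingerprint(rna, cut_after):
    """Multiset of fragment compositions (sorted bases stand in for mass)."""
    return Counter("".join(sorted(f)) for f in cleave(rna, cut_after))


def compare_to_reference(reference, sample, cut_after):
    """Signals expected from the reference but missing, and unexpected ones."""
    ref, obs = fingerprint(reference, cut_after), fingerprint(sample, cut_after)
    return {
        "missing": sorted((ref - obs).elements()),
        "new": sorted((obs - ref).elements()),
    }


# Hypothetical G-specific cleavage (RNase T1-like) of a reference and a
# sample transcript that differ by a single C to A substitution.
changes = compare_to_reference("AAGUCGGA", "AAGUAGGA", cut_after="G")
```

In this example the substitution removes the fragment of composition CGU and creates one of composition AGU, which is the kind of paired disappearance and appearance of signals that localizes a sequence change. A substitution that creates or destroys a cleavage site changes the number of fragments as well.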
In a method under development in our laboratory (Mauger et al., manuscript in preparation), the PCR is carried out with a DNA polymerase that accepts NTPs as substrate for incorporation. One of the dNTPs is replaced by the corresponding NTP, and fragments are generated by a simple and inexpensive alkali cleavage. After a desalting step, the fragments are analyzed in the mass spectrometer and the obtained fragment pattern compared to the fingerprint from a reference sequence.
RNA cleavage–based MALDI-MS methods have been implemented for routine use in recent years. Although they will probably not displace conventional DNA sequencing, MS using these approaches provides an interesting and reliable option for DNA stretches that are difficult to sequence by gel-based techniques (such as GC rich regions), for gap closures and resequencing to detect polymorphisms and identification of genes at the cDNA level.
GENE EXPRESSION STUDIES
After the completion of the human genome sequence and with the large number of validated polymorphisms in public databases, attention increasingly has turned to the elucidation of gene function and regulation, as inter-individual variation in gene expression can lead to phenotypic differences and thereby potentially confer differential disease susceptibility. Recently, the principles of MALDI-MS combined with primer extension were also extended to gene expression analysis and transcriptional profiling making use of the excellent quantitative detection capabilities of the instrument. A high-throughput method of measuring both absolute and relative mRNA levels of specific alleles was recently presented by combining competitive PCR and MALDI.75 As for most gene expression profiling methods, the RNA was reverse-transcribed to yield a cDNA product. A DNA standard was designed containing an artificial single nucleotide mismatch to the cDNA of interest and this standard was added to the reverse transcription product prior to competitive amplification, i.e. the co-amplification of the internal standard with the target DNA. Subsequent to this PCR of a short amplicon (<100 bp), a single base extension was carried out onto this artificially created SNP and the ratio of the standard compared to the gene expression product was quantified by MALDI-ToF-MS by measuring the peak areas of the two signals. The same principle was also applied to the detection of allelic imbalance of gene expression that can be caused by various genetic and epigenetic phenomena such as imprinting, differential activity of promoters and mRNA stability. 
For these analyses, a transcribed polymorphism/mutation is used for analyzing relative expression levels.76 This method was recently applied to demonstrate preferential expression of one haplotype in the interleukin-8 gene, which correlated with the severity of bronchiolitis.77 A similar approach (haploChIP) that quantitatively assesses allele-specific transcripts does not depend on a transcribed polymorphism, but rather on a polymorphism in the vicinity (1 kb) of an active transcription start site, which is characterized by the binding of the phosphorylated RNA polymerase II containing complex.78 Cross-linked DNA fragments containing active transcription start sites are immunoprecipitated using antibodies against the phosphorylated polymerase complex. After reversal of the cross-link, a target sequence containing a heterozygous SNP is amplified and the alleles of the SNP are quantified by primer extension. The phosphorylated polymerase load is thereby used as a marker for the transcriptional activity of an allele. The same approach was very recently used to identify a haplotype that reduces the expression of the KIAA0319 gene on chromosome 6p22.2, which leads to impairment in neuronal migration and contributes to reading disability.79
Absolute quantification of transcripts by MS is also feasible by including a third—competitor—oligonucleotide with an artificial variant close to the transcribed variation thereby generating three extension products with a multiple base primer extension.76 The concentration of this competitor oligonucleotide is titrated to establish a standard curve, with which absolute concentration of the analytes can be determined. This method showed equal performance when compared to quantitative real-time PCR with the added advantage of multiplexing to determine gene expression signatures of several genes simultaneously.80 Instead of differentiating between gene expression levels of two alleles, the same approach can be used to distinguish alternatively spliced transcripts (of approximately the same length). In this case, the primer in the allele discriminating reaction is extended onto a potential alternative splice site, and mass spectrometric signals corresponding to the relative levels of the extension products determine the relative amounts of the different transcripts.81
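The two quantitative readouts described in this section, the relative signal of two alleles and the absolute concentration read from a competitor titration, can be illustrated with a minimal sketch. The function names, the log-log linear fit and the example numbers are assumptions made for illustration only; the cited studies may use a different curve model.

```python
import math


def allelic_ratio(area_a, area_b):
    """Fraction of the total signal contributed by allele A (peak areas)."""
    total = area_a + area_b
    if total <= 0:
        raise ValueError("peak areas must sum to a positive value")
    return area_a / total


def equivalence_concentration(competitor_conc, area_ratios):
    """Estimate the analyte concentration from a competitor titration.

    competitor_conc: known competitor concentrations (any consistent unit).
    area_ratios: analyte peak area divided by competitor peak area at each
        titration point.
    Assumption: log10(ratio) is linear in log10(concentration); the analyte
    concentration is taken as the point where the fitted ratio equals 1.
    """
    if len(competitor_conc) != len(area_ratios) or len(competitor_conc) < 2:
        raise ValueError("need at least two matched titration points")
    xs = [math.log10(c) for c in competitor_conc]
    ys = [math.log10(r) for r in area_ratios]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("titration points must differ in concentration")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    if slope == 0:
        raise ValueError("ratios do not respond to the competitor titration")
    intercept = mean_y - slope * mean_x
    # ratio == 1 corresponds to log10(ratio) == 0
    return 10 ** (-intercept / slope)
```

With an idealized titration in which the ratio is exactly inversely proportional to the competitor concentration, the estimate returns the concentration at which both signals are equal.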
DNA METHYLATION ANALYSIS
All cells of a multicellular organism carry the same genetic material coded in their DNA sequence, but cells obviously display a broad morphological and functional diversity. This heterogeneity is caused by the differential expression of genes. Epigenetics can be defined as the study of heritable changes in gene expression without alteration of the DNA sequence itself, i.e. epigenetic variants are stable alterations that are heritable during somatic cell divisions (and sometimes germ line transmissions) but do not involve mutations of the DNA itself.82 Epigenetic phenomena are mediated by at least three molecular mechanisms: histone modifications, polycomb/trithorax protein complexes and DNA methylation.
The human genome actually consists of five bases rather than four, as a methyl group can be enzymatically and covalently attached to the 5 position of cytosines in the context of the CpG dinucleotide to yield 5-methylcytosine (5MeC), a base with distinct biological properties, although C and 5MeC behave identically in terms of base pairing. The methylation at these CpG dinucleotides is of crucial importance for proper embryogenesis and development83 and for imprinting (i.e. the asymmetric expression of either the maternal or paternal allele in a parent-of-origin-specific manner in somatic cells of the offspring),84 and is implicated in gene regulation.85 Aberrant methylation patterns were found in various neurodevelopmental disorders and imprinting anomalies. Its implication in complex, non-Mendelian disorders like type 2 diabetes has been postulated.86 A cumulative series of genetic and epigenetic alterations leading to unregulated cell growth is the foundation of tumorigenesis. Epigenetic changes occur early in the progression process and often precede malignancy. A global hypomethylation of the genome is accompanied by a region- and gene-specific hypermethylation of CpG islands, which can lead to inappropriate inactivation of tumor-suppressor genes.87 Methylation patterns can be shared by different types of tumors or be tumor type-specific, and the extent of hypo- and hypermethylation often correlates with the grade of malignancy and/or the disease stage.88 Concurrently with the increasing interest in the analysis of DNA methylation patterns, a broad variety of methods has been devised, each with its advantages and inconveniences.89, 90 The choice of the method mainly depends on the analytical question to be answered. 
Irrespective of whether the aim is the analysis of whole-genome or gene-specific methylation patterns, most methods can be classified into two approaches using either methylation-sensitive restriction endonucleases or the chemical treatment of genomic DNA with sodium bisulphite. The latter reaction induces hydrolytic deamination of nonmethylated cytosines to uracils, while methylated cytosines are resistant to conversion under the chosen reaction conditions.91 This method permits one to chemically ‘freeze’ the methylation status that would otherwise be lost during PCR amplification and converts the methylation signal into a sequence difference. After PCR, the methylation status at a given position is manifested in the ratio of C (formerly methylated cytosine) to T (formerly nonmethylated cytosine) and can be analyzed as a virtual C/T polymorphism spanning the entire allele frequency spectrum from 0 to 100% in the bisulphite-treated DNA. A quantitative analysis method is therefore required, and MALDI has repeatedly proven its ability for accurate quantification in, for example, allele frequency determination in pooled DNA samples as described above. Until recently, MS was mainly used for the analysis of global DNA methylation patterns by quantitatively hydrolyzing DNA samples and subsequently analyzing them by liquid chromatography (LC) ESI-MS.92, 93 We have developed a method for the accurate quantification of DNA methylation levels at individual CpG positions based on the GOOD assay following bisulphite treatment of genomic DNA.94 The establishment of good calibration curves is essential to compensate for various parameters that might confound accurate and absolute quantification of CpG methylation levels, such as preferential amplification of a certain methylation pattern during PCR (a phenomenon often reported for bisulphite-treated DNA95) or the sequence-specific annealing behavior of the extension primers. 
The second effect is due to the close proximity of CpG dinucleotides in CpG islands, which often makes it unavoidable to have additional methylation positions underlying the primer annealing sites. Extension primers containing between zero and three degenerate bases show increased complexity of the annealing behavior correlating with an increasing number of degenerate bases within the extension primer. The calibration effort is easily outweighed by the throughput and accuracy of the resulting assay. However, extreme PCR biases resulting in curves with a low slope do complicate accurate quantification of the affected methylation states. An advantage of the GOOD assay is that degenerate bases within the extension primers do not add to spectral complexity, in contrast to other primer extension methods with or without mass spectrometric detection, as during the phosphodiesterase digestion the extension primers are reduced to a core sequence of four to five bases and the bases at the degenerate positions are removed. Accordingly, the mass spectrometric signatures remain simple and allow multiplexing, which was demonstrated in a pilot study for the Human Epigenome Project where this assay in a multiplexed form was used as a reference method for verification and quantitative fine typing of the results obtained by direct bisulphite sequencing.96 Analyses of individual CpG positions require prior knowledge of the CpG positions of interest, as, for example, CpG islands can contain hundreds of CpGs and therefore many potential positions of interest. Recently developed approaches based on RNA transcription and subsequent cleavage by RNases, which have been described in detail in the ‘Resequencing and mutation detection’ section, offer a high-throughput scanning tool to identify positions that might be differentially methylated between two samples, e.g. a cancerous and a normal tissue.73, 74 Owing to their multiplexing capabilities, quantitative readout and simple and reliable procedure, these MALDI-MS-based assays are fine tools for the identification of methylation variable positions in a gene-targeted approach.
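The bisulphite readout described in this section can be sketched in a few lines: nonmethylated cytosines read as T after conversion and PCR, the methylation level at a position is the C fraction of the C and T signals, and a calibration curve from standards of known methylation corrects for PCR and annealing bias. The function names, the piecewise-linear calibration and the example values are illustrative assumptions, not the published GOOD assay protocol.

```python
def bisulphite_convert(sequence, methylated_positions):
    """Top-strand sequence as read after bisulphite treatment and PCR.

    Cytosines whose 0-based index is in methylated_positions stay C;
    all other cytosines are read as T.
    """
    methylated = set(methylated_positions)
    converted = []
    for index, base in enumerate(sequence.upper()):
        if base == "C" and index not in methylated:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def methylation_level(area_c, area_t):
    """Uncorrected methylation level: C signal over the sum of C and T."""
    total = area_c + area_t
    if total <= 0:
        raise ValueError("peak areas must sum to a positive value")
    return area_c / total


def correct_with_calibration(measured, standards):
    """Map a measured level to a calibrated one by linear interpolation.

    standards: (measured_level, true_level) pairs from mixtures of known
    methylation. Values outside the calibrated range are clamped.
    """
    points = sorted(standards)
    if measured <= points[0][0]:
        return points[0][1]
    if measured >= points[-1][0]:
        return points[-1][1]
    for (m0, t0), (m1, t1) in zip(points, points[1:]):
        if m0 <= measured <= m1:
            if m1 == m0:
                return t0
            return t0 + (t1 - t0) * (measured - m0) / (m1 - m0)
    raise ValueError("calibration points do not bracket the measurement")
```

A calibration curve with a low slope, as mentioned above, shows up here as standards whose measured levels lie close together, so that small measurement errors translate into large errors in the corrected value.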
Haplotypes are specific combinations of genetic variants located on one allele, i.e. the series of polymorphisms in physical vicinity on one chromosomal molecule. It is assumed that the detection of disease susceptibility genes via fine mapping association studies is facilitated by the consideration of marker haplotypes, as these carry more information about the genotype–phenotype relationship owing to their multiallelic nature and higher level of heterozygosity compared to individual, biallelic SNPs.97 Haplotype structure, rather than individual SNPs, is the determinant of phenotypic consequences. Theoretically, for a region of interest containing n SNPs, 2^n haplotypes are possible, although usually only a limited number of them are found. Knowledge of the haplotype structure of a region of interest makes it possible to reduce the number of markers that need to be genotyped in association studies. Only those that distinguish the different haplotypes from each other (htSNPs) are genotyped.98 This is the rationale behind the HapMap project (http://www.hapmap.org), a recently finished international effort to catalogue the most frequent human SNPs, their allele frequencies and linkage disequilibrium (a measure used to infer the allele of an untyped SNP from the alleles of a typed polymorphism).18 These are also the SNPs included in most whole-genome SNP analysis tools. However, the usefulness of these approaches for the dissection of the complex genetic traits underlying multifactorial diseases is still under debate.99 Several methods for the physical determination of the phase of SNPs have been devised, most based on either allele-specific amplification of a region of interest or the physical separation of entire chromosomes. However, most techniques for molecular haplotyping are tedious, time consuming or prone to errors, so that haplotypes are mainly inferred from genotype data using mathematical algorithms, at the risk of the consequences of their inherent statistical uncertainty. 
Three approaches for molecular haplotyping using MALDI-MS were demonstrated in proof-of-principle studies.
In the first, fragments of up to 4 kb in length are amplified by allele-specific PCR using heterozygous boundary positions for anchoring the allele-specific primers. The alleles of the SNPs contained in these haploid fragments are subsequently determined by a multiplexed primer extension reaction and MS, exploiting the multiplexing potential and resolving power of MALDI-MS.100 For the second approach, phase information from polymorphisms separated by larger distances was obtained by the construction of cosmid/fosmid libraries and the subsequent genotyping of the polymorphic positions using the GOOD assay in the negative ion mode.101 Pools of 96 clones were constructed representing ∼10% of the genome, and pools were genotyped for the region of interest. If a pool was positive, its individual clones were genotyped. The third approach uses dilution to the statistical level of a single DNA molecule, on which the SNP alleles are then analyzed by the homogeneous MassEXTEND assay.102 Each of the presented methods has its advantages and inconveniences. The first method is restricted to the analysis of relatively small amplicons (such as haplotype block boundaries), and analysis of larger fragments would necessitate walking from one fragment to the next. The second method requires the construction of clone libraries and relies on the two informative positions being on the same clone to construct the haplotypes. The third relies heavily on template integrity and is very sensitive to contamination. However, the possibility to multiplex SNP genotyping assays that are far apart compensates for some of this inconvenience.
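The reason molecular haplotyping is needed at all can be made concrete with a small sketch: n biallelic SNPs allow 2^n haplotypes, and unphased genotype data from an individual who is heterozygous at more than one position are compatible with several haplotype pairs. The function names and example alleles below are assumptions chosen for illustration.

```python
from itertools import product


def possible_haplotypes(snp_alleles):
    """All allele combinations for a list of SNPs, e.g. [("A", "G"), ...].

    For n biallelic SNPs this yields 2**n haplotypes.
    """
    return ["".join(combo) for combo in product(*snp_alleles)]


def compatible_haplotype_pairs(genotypes):
    """Unordered haplotype pairs consistent with unphased genotypes.

    genotypes: one two-character string per SNP, e.g. "AG" for a
    heterozygote or "AA" for a homozygote.
    """
    pairs = set()
    for orientation in product((0, 1), repeat=len(genotypes)):
        hap1 = "".join(g[o] for g, o in zip(genotypes, orientation))
        hap2 = "".join(g[1 - o] for g, o in zip(genotypes, orientation))
        # Sort so that (hap1, hap2) and (hap2, hap1) count once.
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)
```

For a sample heterozygous at k positions this enumeration returns 2^(k-1) candidate pairs; a physical method such as allele-specific PCR or single-molecule dilution selects the one pair that is actually present.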
The major histocompatibility complex (MHC) is the most gene-dense region of the human genome and many diseases with a genetic component have been associated with the MHC.103, 104 It is therefore one of the best-studied regions of the genome and several common haplotypes of the entire MHC have been sequenced.105 The MHC is divided into three classes (I, II and III), which are all potentially implicated in pathogenesis. Disease gene identification within the MHC is difficult because of gene density, the high density of polymorphisms in some of the genes and limited understanding of the functionality of the genes. For tissue matching between a donor and recipient, primarily genes in the MHC class II region are relevant, followed by genes in class I. HLA-typing is conventionally carried out by either serological methods using antibodies or by PCR-based methods such as Sequence Specific Oligonucleotide Probe Hybridisation (SSOP),106 Reference Strand Conformation Analysis (RSCA)107 or Sequence Based Typing (SBT).108, 109 While serological typing is hampered by the potentially high degree of cross reactivity and limited resolution capabilities, the PCR-based methods suffer from difficulties associated with the efficiency of the PCR, as polymorphic positions leave very limited possibilities for positioning primers. We have developed an effective method to screen bone marrow donor registries by typing a set of SNPs in groups of microhaplotypes (phased SNPs within four to five bp segments) that could be used to distinguish ‘frequent’ and ‘rare’ HLA alleles in HLA-A, HLA-B, and HLA-DRB1. This screening permits the identification of ‘rare’ HLA-types for which no known bone marrow donor might exist. It should be noted that for HLA-B, a class I gene, over 600 alleles are known, while HLA-DRB1, a class II gene, has approximately 400 known alleles. 
We used a modified protocol of the GOOD assay,28, 29 exploiting the observation that a mismatch within the first eight nucleotides from the 3′ end of the primer prevents extension of the primer by the DNA polymerase. This characteristic allows the use of a pool of primers for the primer extension reaction, where each primer is complementary to a different HLA allele. In the GOOD assay, only the last four nucleotides of the primer plus the single extended base are carried to the mass spectrometric analysis. Consequently, the maximum number of potential mass peaks was usually fewer than eight, with each mass representing a unique microhaplotype. Heterozygous samples result in two out of the ensemble of possible masses, thus both parental microhaplotype alleles are resolved in a single reaction. By selecting 19 positions for HLA-A, 19 positions for HLA-B and 10 positions for HLA-DRB1, resolution of frequent and rare HLA alleles and in some instances four-digit resolution could be achieved, which is substantially higher than what can be achieved with serological methods (Kucharzak et al., manuscript in preparation). This demonstrates one of the great strengths of MS for detection: it has no problem distinguishing two out of eight possible species unambiguously. This procedure is not possible with fluorescent detection methods.
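The primer-pool logic of this microhaplotyping scheme can be sketched as follows. The sketch assumes, as a simplification, that a primer is extended whenever its eight 3'-terminal nucleotides match the template and that mismatches further 5' are ignored; it also writes primers and templates in the same strand orientation for readability. All sequences, names and the site index are hypothetical.

```python
CHECK_WINDOW = 8  # 3'-terminal primer nucleotides that must match (from the text)
CORE_LENGTH = 4   # primer nucleotides carried to the mass spectrometer (from the text)


def extends(primer, template, site):
    """True if the primer's 3' window matches the template just before site.

    site is the 0-based template index of the base added by the extension.
    """
    window = primer[-CHECK_WINDOW:]
    start = site - len(window)
    if start < 0 or site >= len(template):
        return False
    return template[start:site] == window


def microhaplotype_signatures(primers, alleles, site):
    """Core sequences (last four primer bases plus the extended base)
    expected in the spectrum for a sample carrying the given alleles."""
    signatures = set()
    for template in alleles:
        for primer in primers:
            if extends(primer, template, site):
                signatures.add(primer[-CORE_LENGTH:] + template[site])
    return signatures
```

In a real assay each signature would correspond to one product mass; here the signatures are kept as sequences because the masses depend on the chemical modifications of the primers, which are not reproduced in this sketch.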
WHAT IS THE CURRENT POSITION OF MASS SPECTROMETRY IN GENETICS?
The human genome sequencing project was initiated in the late 1980s at a time when the first automated fluorescence DNA sequencers were introduced to the market. These instruments were capable of sequencing 300 bases of contiguous sequence in 32 samples in a 12-h run. By the early 1990s, several large sequencing centers equipped with hundreds of these automated sequencers were being established. This coincided with the launch of the first commercial MALDI mass spectrometers. Initially, many people believed that MALDI mass spectrometers would be a good choice of analyzer for sequencing in the human genome sequencing project.58 However, owing to the technical problems outlined above, it was clear by the mid-1990s that the human genome sequencing project would quite likely be completed with the initial technology, which in its own right had gone through a dramatic evolution. So the focus for the DNA applications of MS shifted to the problem of analyzing DNA variation, such as mutations and SNPs. By the late 1990s, mainly through the work of Sequenom (San Diego), MALDI-MS had positioned itself as one of the prime candidate technologies for SNP genotyping. Many of the initial association studies using SNPs were carried out on this technology.26 The current generation of array-based DNA analysis tools allows the analysis of 500 000 SNPs spread throughout the human genome. As a result of the HapMap project, these markers were selected to best capture the variability of the human genome while taking into account regions of the genome that have undergone different degrees of recombination and capture haplotypes that are frequent in the general population. There is an inherent danger when only a selection of high-frequency SNPs is used to look for regions of the genome that show association with a disease. That is, if the disease susceptibility is made up of many different rare mutations, trawling with this large-mesh net will not capture the susceptibility variants. 
The alternative would be to sequence each individual in a study. Even though this would be possible technically, it is prohibitively expensive because DNA sequencing strategies have effectively not changed since the 1980s. There is a huge reward out for the technology that can provide the entire sequence of a human genome at a cost of US$ 1000.110 However, apart from approaches like massively parallel signature sequencing (MPSS),111 polony sequencing112 and the recently presented GS20 sequencer now marketed by Roche,113 DNA sequencing still largely relies on approaches developed 20 years ago. Most approaches for DNA sequencing that are applied nowadays use optical detection of either fluorescence or chemiluminescence. However, looking at the rate at which the hardware and software of mass spectrometers have evolved in recent years in terms of speed, accuracy, automation and integration, this trend is not entirely understandable. With the current generation of mass spectrometers, hundreds of thousands of spectra can be recorded in a day. Compared to fluorescence scanners that are used, for example, for SNP genotyping on arrays, far more analytical accuracy and resolution can be achieved with a mass spectrometer than by quantifying less than a handful of different fluorescent emissions. The number of discrete detection channels of a mass spectrometer is huge and at least 2 to 3 orders of magnitude greater than for a fluorescent detection system. The problem with using all the analytical might of mass spectrometers lies in the availability of appropriate preparation procedures before samples are introduced into the mass spectrometer. It is crucial that sample preparation is not too cumbersome.
Clearly the approaches taken for SNP genotyping by MS do not hold a lot of promise to be turned into viable options for DNA sequencing of the type geneticists would like to apply. To make a mass spectrometer into a useful DNA sequence analyzer will require a departure from the trodden path of approaches taken for molecular biology and sample preparation for DNA sequencing by MS until now. As demonstrated in the Roche GS20, there are viable options for effective sample preparation.
DNA analysis by MS has three major advantages over that of proteins: first, with its four bases it is quite bland and any sequence can be analyzed with virtually the same settings of the mass spectrometer; second, each nucleated cell carries two copies of genomic DNA and as a consequence there is no problem with dynamic range; and third, amplification procedures exist that allow the replication of the same DNA sequence making millions of copies even when starting with only a few molecules.
Combining sample preparation procedures with arrays and transferring the array into the mass spectrometer for analysis seems one way ahead to solve the problem of sample preparation. The sensitivity of the mass spectrometer does not require the preparation of huge amounts of sample. In the past, less than 1% of the preparation was consumed for DNA analysis by MS. Thus, amounts of products prepared on an array would be sufficient. One further huge advantage is that it does not require the use of expensive fluorescent dyes. This in turn helps in terms of reagent costs.
WHAT ARE THE STRENGTHS OF MASS SPECTROMETERS IN DNA ANALYSIS?
Clearly mass spectrometers have improved by orders of magnitude in terms of speed and accuracy in the past ten years. One feature that is quite striking about the use of a mass spectrometer is its use as a DNA diagnostic tool. A result essentially is the presence or absence of an allele of a polymorphism at a specific mass in an individual spectrum. This means that by detecting only a peak at mass A the tested sample is homozygote A, or a peak at mass B a homozygote B or peaks both at mass A and mass B a heterozygote. Consequently an individual spectrum, for example, of an SNP genotyping assay can be interpreted in the absence of other results. Most other SNP genotyping methods rely on clustering for data interpretation. In practice, several samples have to be compared to assign a value to an unknown. In contrast, MS can provide a stand-alone result. The same feature holds for the interpretation of more complex results such as for the microhaplotyping described above or the MS-based DNA sequencing protocols. Even fairly complex mass spectra of sequence-specifically fragmented DNA can be accurately called.
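The stand-alone interpretation described above amounts to a simple presence/absence rule on two expected masses, which can be sketched as follows. The function name, the mass tolerance, the intensity threshold and the example masses are assumptions for illustration and would have to be set per assay.

```python
def call_genotype(peaks, mass_a, mass_b, tolerance=1.0, min_intensity=0.0):
    """Call a biallelic SNP genotype from a single spectrum.

    peaks: list of (mass, intensity) pairs picked from the spectrum.
    mass_a, mass_b: expected product masses for alleles A and B.
    tolerance: accepted mass deviation, in the same unit as the masses.
    """
    peak_list = list(peaks)

    def present(expected_mass):
        return any(
            abs(mass - expected_mass) <= tolerance and intensity > min_intensity
            for mass, intensity in peak_list
        )

    has_a = present(mass_a)
    has_b = present(mass_b)
    if has_a and has_b:
        return "heterozygote A/B"
    if has_a:
        return "homozygote A"
    if has_b:
        return "homozygote B"
    return "no call"
```

No other samples are needed to make the call, in contrast to cluster-based interpretation where an unknown is assigned relative to the distribution of many samples.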
POTENTIAL OF MASS SPECTROMETER TO SEQUENCE A GENOME
The human genome is composed of 3 × 10^9 bases. If one assumes that 10% make up the genes and some regulatory sequence in their proximity, this leaves 3 × 10^8 base pairs of interesting sequence. If one further assumes that 1000 bases can be sequenced per reaction, sequencing the relevant part of a human genome would require 3 × 10^5 reads. Recording five spectra per second would mean that single coverage of the relevant part of a human genome could be read in less than a day. Three key elements are currently still missing to carry out this experiment:
First, MALDI mass spectrometers capable of delivering this speed of accumulation: However, instruments are approaching this value in applications like MALDI imaging.
Second, molecular biology and chemical procedures to prepare suitable samples: Procedures that could be carried out on arrays seem ideal, as many different entities can be processed in parallel. This takes the onus off the automation for sample preparation. Further, using arrays with small spacing of features would reduce the travel time from one sample to the next, which aids the speed of accumulation.
And third, informatics to deal with data interpretation, calling, assembling and warehousing: Clearly this is a difficult problem to solve. However, a lot of effort is being made in this direction for other MS applications.
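The back-of-the-envelope estimate that opens this section can be reproduced with a short calculation. It uses only the figures given in the text and assumes, as the text implies, that one spectrum corresponds to one read; the variable and function names are illustrative.

```python
GENOME_BASES = 3_000_000_000   # bases in the human genome (from the text)
RELEVANT_PERCENT = 10          # share assumed to be genes and nearby regulation
BASES_PER_READ = 1000          # bases sequenced per reaction (assumed in the text)
SPECTRA_PER_SECOND = 5         # acquisition rate (assumed in the text)


def reads_required(genome_bases, relevant_percent, bases_per_read):
    """Number of reads for single coverage of the relevant sequence."""
    relevant_bases = genome_bases * relevant_percent // 100
    return relevant_bases // bases_per_read


def acquisition_hours(reads, spectra_per_second):
    """Hours of acquisition if each read takes one spectrum."""
    return reads / spectra_per_second / 3600


reads = reads_required(GENOME_BASES, RELEVANT_PERCENT, BASES_PER_READ)
hours = acquisition_hours(reads, SPECTRA_PER_SECOND)
print(f"{reads} reads, about {hours:.1f} h of acquisition")
```

The estimate covers acquisition time only; sample preparation, redundancy of coverage and data analysis, the three missing elements listed above, are not included.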
DNA AND IMAGING BY MASS SPECTROMETRY
Imaging with a MALDI mass spectrometer is an interesting development of recent years.114 Major challenges remain with respect to applying this technology more broadly: one is to improve the spatial resolution, a second is to increase the dynamic range of what can be detected and a third is data handling. It should be possible to improve spatial resolution by improvements in laser adjustments by the movement of the target and by improved methods for the delivery of matrix. Data handling will quite likely improve with next generations of software. This leaves the increase of dynamic range. Proteomics has been grappling with this problem since the beginning. Procedures for the depletion of albumin and other abundant proteins have been developed in solution-based proteomics in order to dig deeper. However, it is still unclear what is lost for the analysis in terms of albumin-associated proteins. Optimistically, probably 4 orders of protein expression dynamics can be captured by MS. In terms of imaging by MS, it is somewhat limiting that only the top layer of dynamic range of protein expression can be captured. Immunohistochemistry makes extensive use of enzyme-induced fluorescence detection.115 In the recent past, fluorescently tagged antibodies have been applied. Optical imaging of histology cuts makes extensive use of antibodies that are specific for target proteins.116 For many years, the groups of Landegren, Nilsson and Koch have been collaborating on the translation of in situ RNA and protein signals into DNA and procedures to carry out in situ DNA amplification.117–120 These methods have matured to a point where they can be used for in situ signal amplification on histological sections to visualize very low abundance markers. The procedures rely on the translation of a specific signal into a generic signal. Specific signals can be alleles of polymorphisms, RNA transcripts or proteins. 
By ligating probes to the specific signals, circularized oligonucleotides can be produced. Once the specific signal has been translated into DNA, amplification procedures such as PCR or rolling circle replication can be applied to boost the signal. In the incarnations these groups have shown so far, the visualization of the amplified signal is done by fluorescence. One of the easiest probes to detect is a molecular beacon, which contains a stem–loop structure with a fluorescing molecule attached to one end and a quencher to the other. If the probe finds a complementary sequence, it hybridizes, which results in the separation of the quencher from the fluorophore. Even though detection of fluorescence is fairly simple and many different versions of microscopes can be used, the number of well-separated detection channels for fluorescence tends to be limited. Fluorescent dyes also tend to get degraded by atmospheric compounds such as ozone. If one were to use tags detectable by MS in the imaging mode, far more species could be interrogated simultaneously. This could potentially solve several problems of imaging: first, it would allow targeting selected proteins; second, it would allow increasing the signal of the target protein by the in situ amplification procedure; and third, owing to the many detection channels of the mass spectrometer, far more proteins/images could be interrogated in each experiment than with optical detection methods.
The analysis of DNA by MS has evolved to where it can be used to analyze every known type of DNA and RNA situation. It can efficiently deal with the analysis of polymorphisms, sequences, haplotypes, HLA-typing, DNA methylation and RNA expression. The latter applications require good quantification over a large dynamic range, which the mass spectrometer with suitable protocols as described here can provide. Unfortunately, because of the evolution of competing technologies for DNA analysis, MS has lost some of its popularity. We believe that this is not justified and that DNA analysis by MS is still some way away from having realized its full potential. Applications such as very cost-effective DNA resequencing or highly multiplexed, targeted protein imaging are two examples where this technology could have massive impact.