Epidemiology is very successful in identifying environmental and lifestyle factors that increase or reduce risk of specific cancers, leading to cancer prevention strategies. However, the etiology of many types of cancer is still poorly understood, despite extensive use of questionnaires and interview-based approaches in conventional epidemiologic studies. The integration of molecular techniques into epidemiology studies may provide new insights and has been referred to as molecular epidemiology. For instance, our ability to make connections between lifestyle and cancer risk is limited by difficulty in accurately measuring exposure to many carcinogens—newer molecular markers of exposure may provide better information. The completion of the Human Genome Project gives us knowledge of the genetic variations that presumably underlie the fact that a family history of cancer is a risk factor for most cancer types. Some of this excess risk has been explained over the last decade by identification of mutations in genes that give rise to a very high familial risk. Molecular epidemiologists are searching for genes that may give rise to much smaller increases in individual risk, but account for much of the residual risk associated with family history. These genes may also interact with environment and lifestyle factors such that cancer risk is not equally elevated in all persons exposed to an environmental factor (but not genetically susceptible), or all gene carriers (but not exposed to the environmental factor). Molecular markers may help to differentiate tumors with the same histologic appearance into different etiologic subtypes. Finally, response to treatment may be determined by molecular subtypes of the tumor, or inherited variation in drug metabolism. Examples will be given of how use of molecular techniques is informative in epidemiological studies of cancer and is predicted to lead to improvements in cancer incidence, early detection, and mortality.
Molecular epidemiology refers to the use of molecular biology techniques in epidemiologic research. A PubMed search on the term “molecular epidemiology” in July 2004 found more than 3,500 citations; the vast majority of these describe the use of molecular techniques to identify subtypes of infectious organisms and thus make inferences about their transmission and pathogenesis. The term was first popularized in the context of infectious diseases, and in the early 1980s it was applied to chronic disease research. Schulte1 defined the term as “the incorporation of molecular, cellular, and other biologic measurements into epidemiologic research”; and inspection of the citations in the above search shows that most of the applications of molecular epidemiology to chronic disease are in cancer research.
A well-established paradigm for molecular epidemiology contrasts it with “traditional” epidemiology (Figure 1). In this comparison, “traditional” epidemiology is concerned with correlating exposures with cancer outcomes, and everything between the cause (exposure) and the outcome (a cancer) is treated as a “black box.” This formulation is somewhat unfair to “traditional” epidemiologists, who have usually been very interested in the biologic basis of their observed associations. Indeed, the Bradford-Hill criteria for assigning causality to an association featured biologic plausibility as an important criterion.2 However, it is by and large fair to characterize the interaction of epidemiology with biology as mostly passive—explanations were sought in the work of molecular biologists, toxicologists, and others for mechanisms by which a specific exposure might cause a specific cancer type. Applications to cancer were also hindered by our limited understanding of the molecular pathways and mechanisms underlying cancer causation, resulting in limited opportunities for rational integration of molecular analysis into cancer epidemiology studies. Furthermore, many “traditional” epidemiologic studies are performed using databases, mailed questionnaires, or telephone interviews, providing little opportunity for obtaining the biologic samples necessary for molecular analyses.
In molecular epidemiology, the epidemiologist is much more of a participant in the assessment of the biologic basis for an association, by using biologic measurements to assess exposure, internal dose, biologically effective dose, early biologic effect, altered structure/function, invasive cancer diagnosis, tumor metastasis and prognosis. In this way, the epidemiologist may help open up the “black box” by examining the events intermediate between exposure and disease occurrence or progression.
The continuum of events represented in Figure 1 may be measured by “traditional” means, eg, a questionnaire measurement of exposure, or a histologic assessment of invasive cancer. Increasingly, however, the tools of molecular biology are used to provide more accurate and specific measurements.
Biomarkers used for molecular epidemiology in cancer research can be categorized into different classes: markers of exposure (eg, presence of a cancer-causing virus),3 markers of dose (eg, amount of the cancer-causing virus),4 markers of internal dose (eg, DNA adducts formed after the metabolic activation of certain carcinogens such as aflatoxin),5,6 markers of biologically effective dose (eg, somatic p53 mutations that may both indicate exposure to a specific carcinogen and be one of the genetic “hits” in the multihit model of carcinogenesis),7 markers of altered structure/function (eg, chromosomal aberrations),8,9 markers of susceptibility (eg, metabolic polymorphisms in genes that are involved in carcinogen metabolism or detoxification),10 markers of cancer subtype (eg, estrogen and progesterone receptors in breast cancer), and markers of prognosis (eg, metabolic polymorphisms in drug-metabolism genes) (Table 1). Some of these examples are discussed in more detail below.
TABLE 1. Examples of Important Contributions of Molecular Epidemiology to Cancer Research
[Table body not recovered. Surviving column header: Class of Biomarkers. Surviving entry: Quantify human papillomavirus viral load by using real-time polymerase chain reaction.]
Measuring exposures (environmental and lifestyle factors potentially related to cancer) can be very challenging, particularly if the relevant exposures occurred in the distant past. If an exposure is measured with substantial error, then the statistical relation between the exposure and cancer occurrence will be greatly attenuated. Since we are often interested in common, but relatively weak, causes of cancer (eg, exposures that may double or halve the risk of cancer, as opposed to a factor like cigarette smoking that increases the risk of lung cancer 20-fold or more), the ability to estimate exposures accurately can be the difference between establishing a causal link and an inconclusive study. Thus, for a strong risk factor like smoking, the association is attenuated, but still observable, if the exposure is misclassified. However, the association between disease and a weak risk factor may be so attenuated as to be statistically indistinguishable from chance results. The history of studies of human papillomavirus (HPV) and cervical cancer provides an example of how the availability of increasingly accurate molecular techniques has strengthened the observed relation between exposure and disease. In this example, discussed below, the relevant measurement concerns the presence or absence of a causal infection; however, the exposure may have occurred in the distant past and cannot be measured, or accurately recalled, at the time of cancer diagnosis. A second example, aflatoxin and liver cancer, illustrates the use of the “molecular fingerprint” left by a past exposure to infer that the exposure occurred.
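The attenuation described above can be made concrete with a small calculation. The following Python sketch is illustrative only: it assumes nondifferential misclassification (the same assay sensitivity and specificity in cases and controls), and the control exposure prevalence (20%), sensitivity (70%), and specificity (90%) are hypothetical values chosen for the example, not data from any study cited here.

```python
def odds_ratio(p_cases: float, p_controls: float) -> float:
    """Odds ratio from exposure prevalence in cases and in controls."""
    return (p_cases / (1 - p_cases)) / (p_controls / (1 - p_controls))


def case_prevalence(p_controls: float, true_or: float) -> float:
    """Exposure prevalence in cases implied by a control prevalence and a true OR."""
    odds = true_or * p_controls / (1 - p_controls)
    return odds / (1 + odds)


def observed_prevalence(p_true: float, sensitivity: float, specificity: float) -> float:
    """Apparent prevalence: true positives plus false positives."""
    return sensitivity * p_true + (1 - specificity) * (1 - p_true)


def attenuated_or(p_controls: float, true_or: float,
                  sensitivity: float, specificity: float) -> float:
    """Odds ratio an imperfect exposure measurement would be expected to show."""
    p_cases = case_prevalence(p_controls, true_or)
    return odds_ratio(
        observed_prevalence(p_cases, sensitivity, specificity),
        observed_prevalence(p_controls, sensitivity, specificity),
    )


# Hypothetical inputs: 20% of controls truly exposed, 70% sensitivity, 90% specificity.
print(round(attenuated_or(0.2, 2, 0.7, 0.9), 2))   # weak risk factor, true OR = 2
print(round(attenuated_or(0.2, 20, 0.7, 0.9), 2))  # strong risk factor, true OR = 20
```

Under these assumed values the true odds ratio of 2 is observed as roughly 1.5, whereas the true odds ratio of 20 is observed as roughly 5.3: both are biased toward the null, but only the weak association moves close enough to 1 to be easily mistaken for chance in a study of modest size.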
HPV and Cervical Cancer
More than 100 genotypes of HPV have been identified, although only a subset of these has carcinogenic potential.15 HPV selectively infects the epithelium of skin and mucous membranes. Specific HPV types are associated with squamous cell carcinoma, adenocarcinoma, and dysplasias of the cervix, penis, anus, vagina, and vulva.16 Using current technologies, HPV DNA can be detected in 95% to 100% of cervical cancer specimens,17 and HPV infection has been called a “necessary cause” of cervical cancer.18,19
HPV Detection Techniques
Initially, nonamplifying techniques (eg, Southern blot, in situ hybridization, and dot-blot) used radio-labeled nucleic acid probes to detect HPV infection in cervical samples.20,21 The disadvantages of these direct-probe approaches include low sensitivity and the need for relatively large amounts of purified DNA.21 More sensitive amplification methods include Hybrid Capture (HC) and polymerase chain reaction (PCR). Hybrid Capture II (Digene, Inc.) is a commercial HPV detection kit used for detection of all high-risk oncogenic and most low-risk nononcogenic HPV genotypes.21 The genotype-specific RNA probes bind to single-stranded HPV DNA, and the hybrids are “captured” and detected using antibodies and chemiluminescent detection.21 The various PCR approaches for HPV detection can detect either a group of HPV types or a single type of HPV.4,20 After PCR is performed by using consensus primers, specific HPV genotypes can be determined by restriction fragment length polymorphisms, linear probe assays, direct sequencing, or genotype-specific PCR primers.21
The observed strength of association between HPV and cervical cancer depends on the sensitivity of the HPV detection technique. Schiffman and Schatzkin22 compared two studies utilizing different methods for HPV detection in which the less sensitive Southern blot approach resulted in a lower estimate of the relative risk and attributable fraction than the more sensitive PCR approach. Similarly, Bosch, et al.23 compared three HPV DNA testing techniques (Southern hybridization, PCR, and Virapap) in relation to the odds ratios (OR) and attributable risk fractions for 926 female cases and controls with cervical scrapes in Spain. Virapap was a commercial dot blot screening test kit and could be used for seven HPV types. For cervical cancer outcome, the ORs obtained by using Southern hybridization (OR, 16.3; 95% confidence interval (CI), 7.7–34.4) and PCR (OR, 24.3; 95% CI, 14.4–41.0) were about 2.5 and 4 times higher, respectively, than the OR obtained by using Virapap (OR, 6.3; 95% CI, 3.4–11.6).23,24 The risk of cervical cancer attributable to HPV infection was substantially higher for Southern hybridization and PCR techniques (attributable risk fraction, 33% and 67%, respectively) as compared with Virapap (attributable risk fraction, 23%). Subsequent studies using increasingly sensitive methods have estimated the proportion of cases of cervical cancer with evidence of HPV infection to be greater than 99%.18 These data show the importance of using the most sensitive molecular analysis in molecular epidemiology studies if accurate estimates of relative risk and attributable risk fractions are to be made.
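To illustrate how assay sensitivity can feed through to both measures, the following Python sketch computes an odds ratio with a Woolf (log-based) 95% confidence interval and a case-based attributable fraction from a 2×2 table. The estimators are standard textbook choices and are an assumption here, since the text does not state which methods the cited studies used. All counts are hypothetical: the second table is simply the first re-measured with an assay that detects only half of the truly infected women in both groups.

```python
import math


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio and Woolf confidence interval.

    a: exposed cases, b: unexposed cases,
    c: exposed controls, d: unexposed controls.
    """
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se_log)
    upper = math.exp(math.log(or_) + z * se_log)
    return or_, lower, upper


def attributable_fraction(a: int, b: int, c: int, d: int) -> float:
    """Case-based attributable fraction: P(exposed | case) * (OR - 1) / OR."""
    or_ = (a * d) / (b * c)
    p_exposed_cases = a / (a + b)
    return p_exposed_cases * (or_ - 1) / or_


# Hypothetical counts per 100 cases and 100 controls.
sensitive_assay = (70, 30, 10, 90)       # assay detects every infection
less_sensitive_assay = (35, 65, 5, 95)   # assay detects half of the infections

for label, table in [("sensitive", sensitive_assay),
                     ("less sensitive", less_sensitive_assay)]:
    or_, lower, upper = odds_ratio_ci(*table)
    af = attributable_fraction(*table)
    print(f"{label}: OR {or_:.1f} (95% CI {lower:.1f}-{upper:.1f}), AF {af:.0%}")
```

In this made-up example, halving the assay sensitivity roughly halves both the odds ratio and the attributable fraction, which is the same direction of effect as in the comparisons described above.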
Most women with evidence of HPV infection do not develop cervical cancer and most infections resolve within 1 to 2 years. Thus, the current focus of molecular epidemiology studies is to find the determinants of initial infection, subsequent persistence of oncogenic HPV infection, and progression of early cervical abnormalities to invasive cancer, using large prospective studies of women with repeated measurements of HPV, cervical cytology, and potential cofactors for infection, progression, and invasion.19 It has been suggested that the amount of virus present in cervical tissues, as estimated by quantitative PCR, may distinguish HPV carriers at low and high risk of development of cervical cancer.3
Interaction Between p53, Rb and HPV in Cervical Carcinomas
Further evidence for both the causality of the HPV-cervical cancer association, and the mechanism for the association, comes from studies of the biological interactions of HPV proteins with cellular proteins that are critical in controlling abnormal cellular proliferation. Integration of the genome of cancer-causing HPV subtypes into the host genome leads to expression of the viral oncogenic proteins E6 and E7. The E6 protein binds to and inactivates p53, and the E7 protein binds to and degrades the Rb protein. Inactivation and degradation of these two tumor-suppressor proteins contribute directly to genetic instability and loss of normal cell cycle controls,25 thus providing a mechanism for the compelling epidemiological association between specific HPV subtypes and cervical cancer. Although smoking has been associated with risk of cervical cancer in epidemiologic studies, there has been concern that this association was confounded by smokers having more or different sexual partners and thus being more likely to be exposed to HPV. In studies in which HPV testing was used to identify infected women, smoking increased the risk of subsequent detection of cervical intraepithelial neoplasia.26 Because the study population was limited to women with HPV, these findings strengthen the evidence that smoking causes cervical cancer, rather than merely reflecting an association of smoking with the probability of acquiring HPV infection.
SOMATIC MUTATIONS AND CANCERS
Ultraviolet-induced p53 Mutation and Skin Cancer
Cyclobutane pyrimidine dimers and pyrimidine-pyrimidone(6–4) photoproducts are two major forms of DNA damage after ultraviolet-B radiation. Among cyclobutane pyrimidine dimers, thymine-cytosine and cytosine-cytosine dimers are the most mutagenic. Thymine-cytosine → thymine-thymine and cytosine-cytosine → thymine-thymine mutations are often found in the p53 sequence of ultraviolet-induced skin cancer cells27–31 and have been observed in animal studies.32,33 Also, pyrimidine-pyrimidone(6–4) photoproducts are formed in DNA after ultraviolet-B radiation and may lead to DNA mutations.34 Although conventional epidemiology has long established the link between sun exposure and skin cancer, the specificity of the relation between ultraviolet-B and the p53 mutation permits epidemiologic data to identify the specific wavelengths (290 to 320 nm) in sunlight that are carcinogenic.
Aflatoxin, Hepatitis Viruses, and Hepatocellular Carcinoma
Aflatoxin is a naturally occurring mycotoxin35 found in contaminated foods such as peanuts, some other nuts, corn, soy sauce, and fermented soybeans.36 Aflatoxin exposure, hepatitis B or hepatitis C infection, and alcohol intake are major risk factors for hepatocellular carcinoma.37 Aflatoxin-nucleic acid adducts in urine and serum are intermediate biomarkers for assessing the biologically effective dose of aflatoxin. Quantification of aflatoxin-serum albumin adducts has also been validated in several experimental and epidemiologic studies, making it a useful screening tool for large-scale studies.5 In prospective studies, levels of albumin adducts are higher in persons who subsequently develop liver cancer, particularly if they are hepatitis B virus carriers.5,6 The finding that a high proportion of liver tumors diagnosed in aflatoxin-exposed areas have a characteristic somatic mutation at a G:C base pair in codon 249 of the p53 gene reinforces the associations between serum levels of aflatoxin and liver cancer risk.38,39 Aflatoxin B1 induces typical G:C to T:A transversions at the third base in codon 249 of p53.40 This molecular marker of aflatoxin-induced mutagenesis helped to establish that aflatoxin is a human liver carcinogen, responsible for a high proportion of the tumors in geographic areas with high dietary exposure to aflatoxin.
MARKERS OF SUSCEPTIBILITY
At each step in the continuum in Figure 1, between-person differences in the absorption, activation, and detoxification of carcinogens, or the response to DNA damage caused by some carcinogens, may mediate the relation between exposure and cancer outcome. The most common markers of susceptibility are mutations in specific genes that confer increased or decreased risk. Familial clustering of specific cancer sites has long been recognized, and family history of a specific cancer is associated with increased risk of that cancer for most cancer types. Molecular analysis now permits some of this familial clustering to be traced to these mutations (for examples in breast and colorectal cancer, see below). A spectrum exists ranging from “high penetrance” susceptibility genes, ie, those with a high probability of cancer in mutation carriers, to “low penetrance” genes, in which cancer is diagnosed in a much lower proportion of carriers, or only in those carriers exposed to specific environmental or lifestyle factors. High penetrance mutations are usually identified by studying families at high risk of specific cancers or multiple cancer syndromes. Linkage analysis is used to identify chromosomal segments that occur more commonly in family members diagnosed with cancer than in those free of cancer. Specific mutated genes are then sought in these segments. Low penetrance gene variants do not give rise to familial clustering and are sought by comparing the prevalence of the variant in a series of unrelated cases with that in controls, an approach often called an “association study.”
Germline Mutations and Cancer
Germline mutations are inherited from a parent and are present in each cell in the body. Up to 5% of breast cancer and up to 15% of colorectal cancer are attributable to these high penetrance hereditary germline mutations, with an even larger percentage associated with low penetrance mutations.
About 5% of breast cancer is attributable to rare high-penetrance mutations in a small number of specific genes (eg, BRCA1, BRCA2, ATM, PTEN, and TP53); mutations in BRCA1 and BRCA2 account for up to 50% of hereditary and familial breast cancer41 (Figure 2). This proportion is higher at younger age at onset, and in families with multiple breast and/or ovarian cancer cases. Each gene shows some differences in the familial pattern of cancer: BRCA1 is more strongly associated with ovarian cancer, BRCA2 with male breast cancer, ATM with radiosensitivity, PTEN with Cowden's syndrome (including hamartomas of the skin and other organs), and p53 with Li-Fraumeni syndrome (including sarcomas and a variety of other tumors). These genes were originally identified in high-risk families in which information from the pedigrees was used to identify the chromosomal localization of the mutated genes. Follow-up studies of carriers of mutations in these genes are examining whether other genes are involved in modifying the age at onset or lifetime incidence of the disease.
Twin and family history studies have shown that the proportion of breast cancer due to inherited susceptibility is substantial, and only part of this is accounted for by these high penetrance genes. Variants in genes that give rise to lower relative risks may be much more common, and thus account for higher attributable risks than for high penetrance genes (Figure 2). These gene variants do not give rise to classic pedigrees, and are studied in conventional molecular epidemiology studies in conjunction with known lifestyle and environmental causes. Because of the lower relative risks being studied, large case-control studies are needed. Most of the current work examines genetic variants in candidate genes such as steroid-hormone metabolism genes. Future approaches will include whole-genome single nucleotide polymorphism searches, in which the whole genome of cases and controls is queried for polymorphisms that are more or less common in cases.
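The point that common, low-risk variants can account for more disease than rare, high-risk mutations follows from the arithmetic of the population attributable fraction. The Python sketch below uses Levin's formula, PAF = p(RR − 1) / (1 + p(RR − 1)), where p is the carrier frequency in the population and RR is the relative risk in carriers. Both the choice of formula and the carrier frequencies and relative risks shown are assumptions made for illustration, not estimates for any gene named in this article.

```python
def population_attributable_fraction(carrier_freq: float, relative_risk: float) -> float:
    """Levin's formula: share of cases attributable to carrying the variant."""
    excess = carrier_freq * (relative_risk - 1)
    return excess / (1 + excess)


# Hypothetical rare, high-penetrance mutation: 0.2% of people carry it, RR = 10.
rare_high = population_attributable_fraction(0.002, 10)

# Hypothetical common, low-penetrance variant: 30% of people carry it, RR = 1.3.
common_low = population_attributable_fraction(0.30, 1.3)

print(f"rare, high-penetrance:  {rare_high:.1%}")
print(f"common, low-penetrance: {common_low:.1%}")
```

With these assumed inputs the rare mutation accounts for about 2% of cases and the common variant for about 8%, even though the individual risk conferred by the common variant is far smaller.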
There are four hereditary colorectal cancer syndromes: familial adenomatous polyposis, hereditary nonpolyposis colorectal cancer (HNPCC), Peutz-Jeghers syndrome, and juvenile polyposis.
HNPCC accounts for up to 3% of all colorectal carcinomas,42,43 and is associated with mutations in one or more of several mismatch repair (MMR) genes. Two human homologues of two bacterial MMR proteins, hMSH2 and hMLH1, were discovered in the early 1990s by studying families with a high burden of colorectal cancer.45,46 Subsequent analyses of a larger set of families showed that germline mutations in these two MMR genes are responsible for 70% to 90% of all HNPCC cases.46
It is possible that a large proportion of colorectal cancer cases result from inherited medium- or low-penetrance alleles, in which environmental factors may interact with the inherited factors. For example, MSH6 is a lower-penetrance gene than MLH1 and MSH2; mutations in MSH6 are associated with a later age of onset of the disease and account for about 10% of all germline MMR gene mutations in HNPCC patients.47 Future studies of medium- and low-penetrance alleles, which are more difficult to identify and assess, will be needed to explain the balance of colorectal cancer cases associated with a family history.
SUSCEPTIBILITY GENES: CARCINOGEN METABOLISM AND DETOXIFICATION GENES
Almost all exogenous carcinogens require activation by metabolic enzymes, and detoxification enzymes frequently exist to deactivate carcinogens or their intermediate metabolites. Inherited polymorphisms in these enzymes may alter their rate of activation or detoxification, thus increasing or decreasing the carcinogenic potential of the environmental exposures they act on.
The high temperature cooking of animal protein creates carcinogens such as heterocyclic amines and polycyclic aromatic hydrocarbons.48 Heterocyclic amines are metabolized by a number of enzymes, including N-acetyltransferase 2 (NAT2); single nucleotide polymorphisms in the coding sequence for NAT2 give rise to different forms of the NAT2 protein and these divide Caucasian populations into slow acetylators (about 55% of the population) and rapid acetylators (about 45%). Thus, a theoretical basis exists for about half of the Caucasian population to metabolize heterocyclic amines differently from the other half, and to have different susceptibility to genotoxic damage after exposure to heterocyclic amines in cooked meat. Almost all studies have shown that when analyzed without consideration of meat consumption, there is very little difference in risk of colorectal cancer between slow and rapid acetylators. Several studies have shown that colorectal cancer cases are more likely to be rapid acetylators if they consume diets high in red meat.10 In some, but not all, studies that assessed exposure to heterocyclic amines by using information on meat intake and preferred methods of cooking the meat (eg, grilling versus broiling versus roasting),49 rapid acetylators with higher predicted heterocyclic amine intake have again been at somewhat higher risk of colorectal cancer. Studies such as these can help identify the specific human carcinogens in complex mixtures, like diet, and may ultimately lead to preventive advice for populations (eg, changes in meat cooking practices) or individuals (eg, those with susceptibility genotypes for common environmental exposures).
As our knowledge of the molecular mechanisms of carcinogenesis has expanded rapidly, additional gene pathways have become the focus of molecular epidemiology research. These include pathways such as DNA repair, cell cycle control, immune response, and the inflammatory response. Genes in these pathways are candidates in the search for between-person genetic variants that alter cancer risk or response to the environment.
MARKERS OF ALTERED STRUCTURE/FUNCTION
The first genotoxicity biomarker shown to be associated with cancer risk was based on chromosomal aberrations in peripheral blood lymphocytes.8 High levels of chromosomal aberrations were associated with increased total cancer incidence in cohorts in Nordic countries and increased total cancer mortality in an Italian cohort.8 These results support an approximately two-fold cancer risk among subjects with high frequencies of chromosomal aberrations.9 Factors related to genetic susceptibility are known to affect the frequency of chromosomal aberrations in peripheral blood lymphocytes.9 Thus, chromosomal aberrations may serve as an integrated index of susceptibility, as well as DNA damaging exposure in populations. Other markers of altered structure/function are being actively sought to identify high-risk individuals.
MARKERS OF DIAGNOSIS
Conventional diagnosis of cancer types has relied on anatomic localization and histopathologic appearance. It has long been speculated, however, that these classifications ignore heterogeneity among tumors of similar histopathologic appearance, so that two or more different types of cancer may currently be treated as a single entity.12 A limited number of studies have assessed whether risk factors for breast cancer, for instance, are different according to the estrogen and progesterone receptor status of the tumors;12,13 these studies have concluded that differences exist, although more studies of this phenomenon are needed to identify the differences that are consistent across multiple studies. For instance, the relative risk associated with combined hormone replacement therapy (estrogen plus progestin) was 1.67 (95% CI, 1.33–2.10) for estrogen receptor-positive, progesterone receptor-positive breast cancers, compared with 1.21 (95% CI, 0.87–1.68) for estrogen receptor-negative, progesterone receptor-negative breast cancer.13 The development of gene expression microarrays with thousands of genes has permitted the characterization of tumors previously treated as a single entity into multiple entities based on gene expression patterns. For hematologic malignancies, such as diffuse large B-cell lymphoma, gene expression patterns have prognostic significance.50 Further work based on large-scale assays has reduced the number of genes involved to a smaller number.
For instance, six genes (LMO2, BCL6, FN1, CCND2, SCYA3, and BCL2) were selected out of a much larger number for prediction of survival in diffuse large B-cell lymphoma based on the gene expression results from microarrays.51 The progression from microarray to a small number of specific RNAs or proteins that could be tested by using real-time PCR or immunohistochemistry assays will permit treatment decisions to be tailored to the tumor subtype using techniques more familiar to clinical laboratories. It may be no accident that the hematologic malignancies are among those for which epidemiologic studies have provided the fewest clearly established risk factors: if we are studying multiple etiologic entities as if they are a single entity, then the statistical signal for environmental and lifestyle factors associated with just a single subtype will be greatly attenuated. Unfortunately, for molecular epidemiologists interested in etiology rather than prognosis, much of the work in this field has been done on samples with either no or minimal demographic information, and essentially zero environmental or lifestyle information. The challenge for molecular epidemiology is to take advantage of the emerging knowledge about tumor subgroups and design studies that permit simultaneous analysis of both risk factors and tumor subclassification.
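The dilution of a subtype-specific signal can be quantified under simple assumptions. Suppose an exposure raises the risk of only one etiologic subtype, which makes up a fraction f of cases among unexposed persons, and has no effect on the other subtypes. The relative risk for the combined entity is then f × RR_subtype + (1 − f), or equivalently 1 + f(RR_subtype − 1). The Python sketch below evaluates this expression; the subtype fractions and the subtype relative risk of 2 are hypothetical.

```python
def combined_relative_risk(subtype_fraction: float, subtype_rr: float) -> float:
    """Relative risk seen when a subtype-specific effect is analyzed across all subtypes.

    Assumes the exposure multiplies risk of one subtype by subtype_rr and leaves
    every other subtype unchanged; subtype_fraction is that subtype's share of
    cases among the unexposed.
    """
    return 1 + subtype_fraction * (subtype_rr - 1)


# Hypothetical: the exposure doubles risk of one subtype only.
for fraction in (1.0, 0.5, 0.2, 0.1):
    rr = combined_relative_risk(fraction, 2.0)
    print(f"subtype is {fraction:.0%} of cases -> combined RR {rr:.2f}")
```

Under these assumptions, a doubling of risk for a subtype that represents one fifth of cases appears as a relative risk of only 1.2 when all subtypes are analyzed together, a size of effect that many studies would be unable to distinguish from no association.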
MARKERS OF TREATMENT RESPONSE
One of the limitations of cancer chemotherapy is that some patients experience toxicity from treatment that may be fatal, undermine quality of life, or limit the dose of drugs that can be given. Much of the variation in drug tolerance is thought to result from inherited differences in drug metabolism. Some people respond very favorably to certain drugs, while others experience treatment failure or toxic reactions. Just as most carcinogens require metabolic activation, most drugs also require metabolic activation to their active form. There are inherited differences in the metabolism of drugs, just as there are inherited differences in metabolism of carcinogens. Perhaps the best elucidated example of inherited differences in cancer drug response is the pharmacogenetics of the thiopurine drugs mercaptopurine and azathioprine. Fewer than 1% of persons given these drugs experience life-threatening drug-related myelosuppression; it has been shown that almost all of these people are homozygous for low activity alleles of a critical enzyme that metabolizes these drugs, thiopurine S-methyltransferase.50 Evidence suggests that carriers of one low activity allele require different doses of these drugs than carriers of two copies of the high activity alleles, to minimize toxicity and maximize response. Thus, prescreening for these metabolic enzyme polymorphisms may not only avoid unnecessary deaths due to acute toxicity, but may also help optimize drug dosage.
Progress has been made in identifying somatic alterations in tumors that may predict response to therapy. For example, about 10% of non-small-cell lung cancer patients had a rapid and strong response to gefitinib, a tyrosine kinase inhibitor. It has recently been shown that most of these tumors had activating mutations in the epidermal growth factor receptor (EGFR) gene, and thus screening lung tumors for these EGFR mutations may identify patients who can expect a good response to this drug.51
The integration of molecular biologic techniques into epidemiologic studies offers a panoply of opportunities to better understand the causes of cancer, the natural history of the different types of cancer, and the determinants of survival once a cancer is diagnosed. A hallmark of molecular epidemiology studies is the availability of tissues in which biological measurements can be made along the spectrum from exposure to death. Another hallmark is the underlying idea of heterogeneity—of exposures, susceptibility, and tumor types. This predicts that very large studies will be needed to permit stratification by these different elements of heterogeneity. On the other hand, there may be opportunities to make more robust conclusions in smaller studies, if we can use molecular tools to prestratify by susceptibility or outcome and use more accurate measurements of exposure, thus reducing attenuation of exposure-cancer associations. Ultimately, the ability to maximize the utility of these new molecular methods to better understand the epidemiology of cancers will depend on close collaboration of epidemiologists with molecular biologists, and the training of a new generation of molecular epidemiologists for whom this collaboration is not exceptional, but normal.