A systematic review of validated methods for identifying lymphoma using administrative data


R. A. Herman, Iowa Drug Information Network, University of Iowa, Research Park, 2500 Crosspark Road, Room W145, Coralville, IA 52241–4710, USA. E-mail: Ronald-A-Herman@uiowa.edu



To systematically review published studies for algorithms that identified lymphoma as a health outcome of interest in administrative or claims data and examined the validity of the algorithm to identify lymphoma cases.


A systematic literature search was executed using PubMed and the Iowa Drug Information Service database. Two investigators reviewed search results to identify studies using administrative or claims databases from the USA or Canada that both reported and validated an algorithm to identify lymphoma.


The search identified 713 unique citations, of which 402 were eliminated by an initial screen of the article abstracts. Full-text review of the remaining 311 yielded one study that reported and validated an algorithm. Ten other studies reported algorithms that were not validated. The validated study reported four possible algorithms, all with high specificity (> 99%). The algorithm requiring two diagnostic codes recorded within 2 months had the best positive predictive value (PPV = 62.83%), with a sensitivity of 79.81%. The most comprehensive algorithm, which required either multiple diagnostic codes within 2 months or diagnostic and procedure codes on the same day, had a sensitivity of 88.31% and a PPV of 56.59%. The algorithm that required only a single diagnostic code had the worst PPV (34.72%).


The International Classification of Disease, Ninth Revision diagnostic, clinical procedure, and complication codes for lymphoma can identify incident hematologic malignancies and solid tumors with high specificity but with relatively low to moderate sensitivity and PPVs. Requiring diagnostic and procedure codes on the same visit, or multiple diagnostic codes across visits, increased the PPV. Relying on a single registry to confirm true positive cases is also not sufficient. Copyright © 2012 John Wiley & Sons, Ltd.


As a part of the US Food and Drug Administration's Mini-Sentinel pilot program, a systematic review was conducted to identify validation studies of algorithms used to detect 20 health outcomes of interest (HOIs) in administrative and claims data. These reviews provide the foundation for future studies of these HOIs in Mini-Sentinel and other administrative or claims data sources, which conduct active surveillance to identify and refine signals for potential safety issues. It is therefore important to understand the performance characteristics of the codes used in these algorithms to identify the HOI, because the presence of a single code in a database is not always sufficient to determine that an HOI actually occurred. Lymphoma was one of the HOIs selected for this systematic review and is the topic of this article.

Lymphoma, multiple myeloma, and lymphoid leukemia develop from a malignant transformation of normal lymphoid cells at various stages of differentiation. Some of these transformations may be related to pharmaceutical exposures; thus, Mini-Sentinel is interested in this potential adverse effect. These lymphoid neoplasms are the sixth most common group of malignancies worldwide in men and fifth in women.[1] There are many different types of lymphoma, and they can be separated into categories on the basis of several different properties, most of which do not track well with the International Classification of Disease (ICD) coding methodology.[2-6] In particular, they can be separated for etiologic purposes into T-cell and B-cell lymphomas as is the preferred method of the World Health Organization (WHO), or they could be separated for outcome purposes into curable and noncurable lymphomas, or for purposes of tracking intensity of therapy into aggressive or nonaggressive lymphomas. The historical approach of separating into Hodgkin's or non–Hodgkin's lymphoma is perhaps not the most clinically meaningful classification; however, it may be important to certain safety questions being evaluated and the population examined.[6] Another strategy might be to classify disorders as nodal or tumor forming (Hodgkin's lymphoma and others), primarily blood-borne lymphoid leukemia (chronic or acute), and plasma cell disorders (amyloidosis, Huppert's disease, or multiple myeloma). The importance of these distinctions depends on the nature of the hypothesized pathogenic mechanisms as they relate to the exposures under study.

The screening criteria for potential articles of interest used broad definitions for lymphoma, including lymphoproliferative diseases, and included lymphoid leukemia and perhaps even plasma cell disorders because they are more biologically related to “lymphomas” than their naming convention might suggest. The decision was made that for the pharmacoepidemiologic studies that would be considered in this systematic review, a wide net should be cast. Potential lymphoma cases from screening the administrative or claims data must be validated. Microscopic diagnosis in the pathology laboratory should be the gold standard in confirming the diagnosis, but the reality is that laboratory tests and radiographic or other imaging are occasionally used.[7]

This report describes the review process and findings for the lymphoma definition algorithms. The full report can be found at http://mini-sentinel.org/foundational_activities/related_projects/default.aspx.


All 20 health outcomes that were systematically reviewed as a part of the Mini-Sentinel program used similar methods. Details of the method for these systematic reviews can be found in the accompanying manuscript by Carnahan and Moores.[8] In brief, the base PubMed search as described in the method manuscript was combined with the following terms to represent the HOI: “Lymphoma” [mesh] and “Lymphoma” [all fields]. A search of the Iowa Drug Information Service was also conducted. The details of both of these searches that were conducted on 23 June 2010 can be found in the full report on the Mini-Sentinel Web site. All searches were restricted to articles published in 1990 or after. Mini-Sentinel collaborators were also requested to help identify any additional relevant validation studies.

The abstract of each citation identified was reviewed by two investigators. The goal was to review any administrative or claims database study that used data from the USA or Canada and studied the HOI. When either investigator selected an article for full-text review, the full text was reviewed by both investigators. During this review, if there was disagreement on whether a study should be included in the evidence table, the two reviewers reached consensus on inclusion by discussion. There was a discrepancy in the reviewer definition of administrative or claims database studies in the first round of full-text reviews, which required this additional step to adjudicate articles selected for inclusion. This discrepancy related to uncertainty about whether registries were administrative data. Agreement on whether to review the full text or to include the article in the evidence table was calculated using a Cohen's kappa statistic. If fewer than five studies were identified that performed validation of an algorithm, up to 10 algorithms used without validation were reported. The data in the evidence table were extracted by one investigator and confirmed by a second for accuracy. This systematic review evaluated only published literature, and no patient data were evaluated or used.
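Reviewer agreement of the kind described above is usually summarized with Cohen's kappa, which compares observed agreement with the agreement expected by chance. The following is a minimal sketch in Python; the function name `cohens_kappa` and the ten screening decisions are hypothetical illustrations and are not data from this review.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of items on which both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical include/exclude decisions for ten abstracts (illustration only).
reviewer_1 = ["include"] * 4 + ["exclude"] * 6
reviewer_2 = ["include"] * 3 + ["exclude"] * 6 + ["include"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 2))
```

In this hypothetical example the reviewers agree on 8 of 10 abstracts, but because chance agreement is 0.52, kappa is noticeably lower than the raw agreement.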


Literature searches and reviews

The PubMed search identified 700 citations, and the Iowa Drug Information Service search identified 24 citations. There were 11 duplicates for a total number of 713 unique citations.

The review of the 713 abstracts identified 311 that would require full-text review. The 402 that were excluded included 223 that did not study lymphoma, 64 that were not administrative or claims database studies, and 115 that were excluded because the data source was not from the USA or Canada. Cohen's kappa for agreement between reviewers on inclusion versus exclusion of abstracts was 0.61. There were a large number of articles identified for inclusion because it was difficult to ascertain from the abstract whether the study used an administrative or claims database or used a registry of confirmed cancer cases, which was not the focus of this project.

The review of 311 full-text articles identified only one study with algorithm validation that was included in the final evidence tables (Tables 1 and 2). Ten studies were identified that used an administrative or claims database without validation of the lymphoma outcome (Table 3). The exclusions included 115 reports where the lymphoma identification algorithm was poorly defined. No additional studies were excluded because they included no validation of the outcome definition or reporting of validity statistics other than the 10 summarized in Table 3. There were an additional 79 studies eliminated because the study outcome was not lymphoma, 87 more did not use an administrative or claims database, and 19 studies were not conducted in the USA or Canada. Reviewers identified no additional citations for review from full-text article references. Cohen's kappa for agreement between reviewers on inclusion versus exclusion of full-text articles reviewed was 0.75. It should be noted that the initial full-text review process selected a larger number of manuscripts for inclusion. However, an attempt to build the evidence table identified a distinction between already validated cancer registries and administrative or claims databases. As a result, an adjudication process was conducted, which excluded most of the originally selected studies because they included only validated registry cases but did not use administrative or claims data of the kind captured through billing claims, as was the intended focus of the report.
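The screening counts reported above can be checked with simple arithmetic. The short Python tally below uses only the numbers stated in the text; the variable names and category labels are shorthand chosen for this sketch.

```python
# Counts as reported in the text; labels are shorthand for this sketch.
pubmed, idis, duplicates = 700, 24, 11
unique_citations = pubmed + idis - duplicates

abstract_exclusions = {
    "did not study lymphoma": 223,
    "not an administrative or claims database study": 64,
    "data source not from the USA or Canada": 115,
}
full_text_reviewed = unique_citations - sum(abstract_exclusions.values())

full_text_outcomes = {
    "validated algorithm (Tables 1 and 2)": 1,
    "algorithm without validation (Table 3)": 10,
    "algorithm poorly defined": 115,
    "outcome was not lymphoma": 79,
    "not an administrative or claims database": 87,
    "not conducted in the USA or Canada": 19,
}
unaccounted = full_text_reviewed - sum(full_text_outcomes.values())

print(unique_citations, full_text_reviewed, unaccounted)
```

Every full-text article is accounted for by one of the six listed dispositions.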

Table 1. Lymphoma algorithm identified and validated

Citation: Setoguchi et al.[9]

Study population and time period: Healthcare utilization data were derived from Medicare claims data linked to pharmacy dispensing data from the PACE in Pennsylvania between 1 January 1997 and 31 December 2000. The gold standard cancer information was the Pennsylvania State (PA) Cancer Registry data from 1 January 1989 to 31 December 2000. Lymphoma patients identified (n = 629) were 21.6% male, 65 years and older with a mean age of 80.8 years (SD = 6.5 years).

Description of outcome studied: Incident cancer cases identified in the administrative or claims database, which were then linked to the PA cancer registry to validate the diagnosis.

Algorithm: Claims data-based definitions of incident cancer using (a) ICD-9 diagnosis codes; (b) CPT codes for screening procedures, surgical procedures, radiation therapy, chemotherapy, and nuclear medicine procedures; and/or (c) National Drug Code prescription codes for medications used for cancer treatment available in PACE. Four different algorithms were evaluated:
Algorithm 1: one or more ICD-9 diagnostic codes for lymphoma plus any procedure code related to complications of cancer within 2 weeks of the diagnosis plus another diagnostic cancer code within 12 months; one diagnostic procedure with biopsy followed by two or more cancer diagnoses at two different occasions within 12 months (recorded on different dates from the procedures); one cancer diagnosis + any surgery related to cancer during the same hospitalization and/or visit; one cancer diagnosis + any cancer chemotherapy during the same hospitalization and/or visit; one cancer diagnosis + any radiation therapy during the same hospitalization and/or visit; one cancer diagnosis + hematopoietic cell transplantation during the same hospitalization and/or visit (for leukemia only); or one cancer diagnosis + oral chemotherapy dispensing within 2 weeks after the diagnosis
Algorithm 2: two or more diagnoses of cancer (ICD-9 codes) within 2 months
Algorithm 3: cases defined by using algorithm definition 1 or 2
Algorithm 4: one or more diagnosis codes for cancer (ICD-9 codes)

Validation/adjudication procedure, operational definition, and validation statistics: Claims data were linked to the cancer registry to confirm diagnosis. Microscopic diagnosis was reported in 87.6% of the cases, laboratory test in 5.9% of cases, radiographic or other imaging (1.4%), clinical diagnosis (1.6%) and unknown (3.5%).
When the ICD-9 diagnosis codes and current procedure terminology codes (CPT) were used for lymphoma according to algorithm 1, 564 cases were identified with a sensitivity of 55.17%, a specificity of 99.86% and a PPV of 61.52%.
When algorithm 2 was used, which required two diagnostic ICD-9 terms used within 2 months, there were 799 cases identified with a sensitivity of 79.81%, a specificity of 99.81%, and a PPV of 62.83%.
Algorithm 3, which identified cases meeting criteria for algorithm 1 or algorithm 2, identified 926 cases with a sensitivity of 88.31%, a specificity of 99.74%, and a PPV of 56.59%.
Algorithm 4, which only required the presence of one ICD-9 diagnostic code, identified 1607 cases with a 88.71% sensitivity, a 99.33% specificity, and a PPV of 34.72%.
The ICD-9 diagnostic codes and clinical procedure codes can identify incident hematologic malignancies and solid tumors with high specificity but with relatively low to moderate sensitivity and PPVs.
The authors noted that registries may not capture all cancers, so the gold standard in this study is limited compared with a medical record review of every identified case. Limitations of case ascertainment methods may be more significant for less common cancers such as lymphoma compared with others such as breast or prostate cancer.

Note: The specific codes used to identify lymphoma and the complications are listed in Table 2.

Table 2. Codes for cancer-related complications, procedures, or treatments (Setoguchi et al.[9])

Type of code | Definition | Code
ICD-9 | Lymphoma | 200.XX, 201.XX, 202.XX (except 202.5X and 202.6X)
ICD-9 | Secondary malignant neoplasm of brain and spinal cord | 198.3
ICD-9 | Unspecified disease of the spinal cord | 336.9
ICD-9 | Superior vena cava syndrome | 459.2
CPT | Pain management or palliative care | 99551, 99552
CPT | Diagnostic procedures with biopsy | 38500, 38505, 38510, 38520, 38525, 38530, 38542, 49180, 76003, 76360, 76365, 76942, 88170, 88171, 88172, 88173, 85095, 85097, 85102, 38220, 38221, 38100, 38101, 38102, 38115, 38589, 38562, 38564, 38570, 61332, 54550, 54505, 54512, 54520, 54530, 19100, 19101, 19102, 42826, 11100, 11101
CPT | Cancer chemotherapies | 36640, 51720, 96400, 96405, 96406, 96408, 96410, 96412, 96414, 96420, 96422, 96423, 96425, 96440, 96445, 96450, 96500, 96501, 96504, 96505, 96508, 96510, 96511, 96512, 96520, 96524, 96530, 96538, 96540, 96542, 96545, 96549, 96450, 99555
CPT | Cancer-related radiations | 77261, 77262, 77263, 77280, 77285, 77290, 77295, 77299, 77300, 77305, 77310, 77315, 77321, 77326, 77327, 77328, 77331, 77332, 77333, 77334, 77336, 77370, 77399, 77401, 77402, 77403, 77404, 77406, 77407, 77408, 77409, 77411, 77412, 77413, 77414, 77416, 77417, 77419, 77420, 77425, 77430, 77431, 77432, 77470, 77499, 76960, 55859, 55860, 55862, 55865, 77750, 77761, 77762, 77763, 77776, 77777, 77778, 77781, 77782, 77783, 77784, 77789, 77790, 77799, 79200, 79300, 79400, 79420, 79440, 79900, 79999
CPT | Hematopoietic transplantations | 38230, 38231, 38240, 38241, 38242

Table 3. Lymphoma algorithms identified, but not validated

Bernstein and Nabalamba[10]
Study population and time period: Hospitalization data from Statistics Canada Health Person Oriented Information hospital database (1994/1995 to 2003/2004); patients with inflammatory bowel disease in the study period were 47 000.
Description of outcome studied: The database was first scanned to identify patients with inflammatory bowel disease, and this subset was then scanned to identify new onset of lymphoma and other comorbidities.
Algorithm: NHL (ICD-9: 200.x or ICD-10-CA: C83.2, C83.3, C83.4, C83.5, C83.7, C96.3 and C88.08); Hodgkin's disease (ICD-9: 201.x or ICD-10-CA: C81)

Caillard et al.[11]
Study population and time period: Medicare data were evaluated from the US Renal Data System for patients who had received a renal transplant from January 1992 through July 2000. A total of 66 159 kidney recipients were included in the analysis. Lymphoid proliferations were diagnosed in 1169 patients (1.8%): 823 (1.2%) were reported as NHL, 160 (0.24%) as myeloma, 60 (0.1%) as Hodgkin's disease, and 126 (0.19%) as lymphoid leukemia. Mean age for each of those groups was 45.4, 46.2, 52.4, 47.3, and 48.3 years, respectively. Gender ranged from 55% to 67% male for each group.
Description of outcome studied: To analyze the institutional Medicare claims with a new diagnosis of lymphoid disorders occurring after transplantation.
Algorithm: ICD-9-CM diagnosis codes: NHL: 200.x and 202.x; Hodgkin's disease: 201.x; myelomas: 203.x; lymphoid leukemia: 204.x.

Dalager et al.[12]
Study population and time period: VA National Administrative data on Vietnam era veterans born between 1937 and 1954. There were 283 HL cases and 404 controls that served in the military from July 1965 to March 1973.
Description of outcome studied: New onset of histology confirmed HL.
Algorithm: ICD-8 codes 201.x

Dalager et al.[13]
Study population and time period: VA National Administrative data on Vietnam era veterans born between 1937 and 1954. There were 201 NHL cases and 358 controls that served in the military from July 1965 to March 1973.
Description of outcome studied: New onset of histology confirmed NHL.
Algorithm: ICD-8 codes 200.x or 202.x

Kasiske et al.[14]
Study population and time period: The Medicare enrollment database was searched to determine Medicare Part A and Part B primary pay status along with coverage start and stop dates. There were 35 765 (47%) fulfilling these criteria out of a total of 76 467 first transplantations between 1995 and 2001. Patient demographics were not provided.
Description of outcome studied: First-time recipients of deceased or living donor kidney transplantations were evaluated for a new onset of cancer.
Algorithm: ICD-9-CM codes used to identify specific cancers: lymphoma 200.x, 202.x, 204.x; Hodgkin's 201.x; myeloma 203.x; leukemia 205.x, 206.x, 207.x, 208.x. All other cancers and their codes, which are not relevant to this report, are summarized in the Appendix of the Kasiske article.[14]

McGinnis et al.[15]
Study population and time period: VA National Administrative data on inpatient files from October 1990 and in outpatient files from October 1996 through September 2004 were used to identify 197 HIV positive NHL patients and 43 HIV negative NHL patients.
Description of outcome studied: First diagnosis of NHL in HIV positive patients.
Algorithm: ICD-9-CM site-specific cancer codes as identified by SEER for hepatocellular carcinoma: 152.0, 155.2; and for NHL: 200.0 to 200.8, 202.0 to 202.2, and 202.8 to 202.9.[22] Cancers were included if the veteran had one or more inpatient or two or more outpatient cancer diagnoses of the same type. This approach was previously validated with the HIV algorithm, but not with the NHL algorithm.

Namboodiri et al.[16]
Study population and time period: VA National Administrative data on veterans 20 years and older were evaluated for the period 1970 to 1982. The population was predominantly male and ages ranged from 20 to 84 years and older.
Description of outcome studied: The neoplasms that were examined included first onset of Hodgkin's disease, NHL, multiple myeloma, and leukemia.
Algorithm: Authors report a subset with ICD codes of 200.x to 208.x.

Shea et al.[17]
Study population and time period: Patients were Medicare beneficiaries with institutional or outpatient claims from the Centers for Medicare and Medicaid Services for the period 2003 through 2006. The mean age of patients was 75 years, and there was no significant difference in the number of males compared with females.
Description of outcome studied: Incident breast cancer, colorectal cancer, leukemia, lung cancer, or lymphoma in patients who received chemotherapy in inpatient hospital, institutional outpatient, or physician office settings.
Algorithm: ICD-9-CM diagnosis codes for lymphoma included 200.00–200.88, 202.00–202.28, 202.80–202.98, V10.71, and V10.79 and leukemia included 202.40–202.48, 203.10, 204.00, 204.10, 204.20, 204.80, 204.90, 205.00, 205.10, 205.20, 205.80, 205.90, 206.00, 206.10, 206.20, 206.80, 206.90, 207.00, 207.10, 207.20, 207.80, 208.00, 208.10, 208.20, 208.80, 208.90, V10.60–V10.63, and V10.69.

Smith et al.[18]
Study population and time period: Data from the US Renal Data System for end-stage renal disease patients that were placed on the transplant waiting list from January 1990 through December 1999. There were 357 cases of lymphoma (64.4% males) over 107 298 follow-up years. The highest rates occurred in Caucasian males.
Description of outcome studied: New onset of lymphoma with the first hospitalization with lymphoma diagnosis after transplantation or placement on the transplant waiting list as the date of onset.
Algorithm: ICD-9-CM codes consistent with a diagnosis of lymphoma. Hospitalizations with a primary or secondary diagnosis code of 200–208 or 238.7 were considered hospitalizations for lymphoma. Note: the authors used this set of codes, but it also contains many leukemia codes which are not appropriate for lymphoma identification.

Watanabe et al.[19]
Study population and time period: Administrative records from the VA Beneficiary Identification and Record Locator Subsystem were used to identify Vietnam era veterans that had died between 4 July 1965 and 30 June 1982 and whose military service ended after 1965 and started before 1973. There were 140 army and 42 marine NHL deaths and 116 army and 25 marine HL deaths.
Description of outcome studied: Cause specific numbers of deaths.
Algorithm: ICD-8-CM codes of 200 or 202 for NHL and 201 for HL.

Note: HL, Hodgkin's lymphoma; NHL, non–Hodgkin's lymphoma.

The study by Setoguchi et al.[9] was the only validation study selected for inclusion after the adjudication process, and it was identified by the initial search strategy. Algorithm details were not in the article but were obtained from the authors. Ten studies of nonvalidated algorithms were also included because only one validation study was identified.[10-19] All were identified by the initial search strategy, but just three were correctly classified at first as lacking only validation. The other seven were identified during the adjudication process. No additional studies were identified through the references in the full-text articles, and none were provided by Mini-Sentinel Collaborators.

Summary of algorithms

The sole study that validated a screening algorithm for lymphoma[9] used the International Classification of Disease, Ninth Revision, Clinical Modification (ICD-9-CM) codes, Current Procedural Terminology (CPT) codes, and codes for the receipt of various chemotherapies. These codes are summarized in Table 2. The authors evaluated four different algorithms using these codes in various possible combinations. Details are provided in Table 1.
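The lymphoma diagnosis code set in Table 2 (200.XX, 201.XX, and 202.XX except 202.5X and 202.6X) can be expressed as a small matching function. The Python sketch below is illustrative only: the function name `is_lymphoma_icd9` is hypothetical, and it assumes codes may arrive with or without the decimal point and that a bare three-digit 202 code with no fourth digit is accepted, neither of which is specified in the source.

```python
def is_lymphoma_icd9(code: str) -> bool:
    """Return True if an ICD-9 diagnosis code falls in the Table 2 lymphoma set.

    Assumptions (not from the source): codes may be stored with or without the
    decimal point, and a bare "202" with no fourth digit is accepted.
    """
    normalized = code.strip().replace(".", "")
    if not normalized.isdigit() or len(normalized) < 3:
        return False  # V codes, E codes, and malformed values are rejected
    category = normalized[:3]
    fourth_digit = normalized[3:4]
    if category in ("200", "201"):
        return True
    if category == "202":
        # 202.5X and 202.6X are explicitly excluded in Table 2.
        return fourth_digit not in ("5", "6")
    return False


for example in ("200.10", "20280", "202.50", "203.00"):
    print(example, is_lymphoma_icd9(example))
```

Normalizing the code before matching avoids missing cases in data sources that store ICD-9 codes without the decimal point.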

They identified patients 65 years and older from Medicare claims data linked to pharmacy dispensing data from the Pharmaceutical Assistance Contract for the Elderly (PACE) in Pennsylvania (PA). Patients identified by the ICD-9 diagnosis codes and/or CPT codes were then linked to the PA cancer registry to validate the lymphoma diagnosis. According to the cancer registry, lymphoma was confirmed by microscopic pathology in 87.6% of the cases, by laboratory test in 5.9%, by radiographic or other imaging in 1.4%, and by clinical diagnosis only in 1.6%; the method of confirmation was unknown in 3.5%.

Setoguchi et al.[9] clearly demonstrated that the ICD-9 diagnostic codes and clinical procedure codes can identify incident hematologic malignancies and solid tumors with high specificity but with relatively low to moderate sensitivity and positive predictive values (PPVs). Within the cases identified by both the registry and the claims-based definition, the agreement in the first dates of cancer diagnosis was sufficient. Most of these cases identified by both sources were identified in the claims database within 2 weeks of the diagnosis, with just a few claims processed much later than the diagnosis. The authors concluded that for most patients, the claims database is able to identify this health outcome of interest within 2 weeks of diagnosis. They further noted that possible misclassification of prevalent cases of lymphoma as incident cases could be avoided by requiring a 6-month cancer-free period in the claims data.
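The cancer-free (washout) requirement mentioned above can be sketched as a check on the claims that precede an index diagnosis. The Python example below is a hedged illustration rather than the authors' implementation: the function name `is_incident`, its parameters, and the approximation of 6 months as 183 days are all assumptions made for this sketch.

```python
from datetime import date, timedelta

WASHOUT_DAYS = 183  # assumption: "6 months" approximated as 183 days


def is_incident(index_date, enrollment_start, prior_cancer_claim_dates,
                washout_days=WASHOUT_DAYS):
    """Treat an index diagnosis as incident only if the patient was observable
    for the whole washout window and had no cancer claims within it."""
    window_start = index_date - timedelta(days=washout_days)
    if enrollment_start > window_start:
        return False  # washout window is not fully observed in the claims data
    return not any(window_start <= d < index_date
                   for d in prior_cancer_claim_dates)


# Hypothetical patient: enrolled in 1997, index diagnosis on 1 July 1998.
print(is_incident(date(1998, 7, 1), date(1997, 1, 1), []))
print(is_incident(date(1998, 7, 1), date(1997, 1, 1), [date(1998, 3, 1)]))
```

Patients who enrolled too recently to have a complete washout window are treated as non-incident here; an alternative design would report them separately as indeterminate.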

ICD-9 diagnosis codes and CPT codes were used to identify lymphoma according to algorithm 1, which combined diagnosis codes with procedure or treatment codes recorded on the same day, within the same hospitalization, or within a defined interval (Table 1). This approach identified the fewest cases (n = 564) and had the best specificity (99.86%), a sensitivity of 55.17%, and what might be considered a moderate PPV (61.52%). Algorithm 2 required only two diagnosis codes appearing within 2 months. This increased the number of cases identified (n = 799); specificity fell slightly to 99.81%, sensitivity improved to 79.81%, and the PPV increased to 62.83%. Algorithm 3 identified an outcome if a person met the criteria for either algorithm 1 or algorithm 2. This further increased the number of cases (n = 926), but specificity fell to 99.74% and PPV dropped to 56.59%. Despite the decrease in PPV, the authors described this as their preferred algorithm, largely because sensitivity increased to 88.31% and, given the rarity of the outcome, specificity was not substantially affected. The fourth algorithm was the least stringent, requiring the presence of only one diagnostic code for lymphoma. It identified the most cases (n = 1607) and had the highest sensitivity (88.71%), but specificity fell to 99.33% and PPV fell to its lowest value (34.72%).
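The rule behind algorithm 2 (two or more diagnosis codes within 2 months) can be sketched as a check on a patient's diagnosis dates. The Python example below is an illustration under stated assumptions, not the published implementation: the function name `meets_algorithm_2` is hypothetical, "2 months" is approximated as 60 days, and the two diagnoses are assumed to fall on different dates.

```python
from datetime import date


def meets_algorithm_2(dx_dates, window_days=60):
    """Return True if two lymphoma diagnosis dates fall within the window.

    Assumptions (not from the source): "2 months" is taken as 60 days and the
    two diagnoses must be recorded on different dates.
    """
    distinct = sorted(set(dx_dates))
    # If any two dates are within the window, some adjacent pair is as well,
    # so checking consecutive dates is sufficient.
    return any((later - earlier).days <= window_days
               for earlier, later in zip(distinct, distinct[1:]))


# Hypothetical claim dates for two patients.
print(meets_algorithm_2([date(1999, 1, 5), date(1999, 2, 20)]))
print(meets_algorithm_2([date(1999, 1, 5), date(1999, 6, 1)]))
```

Under the same assumptions, algorithm 3 would be the logical OR of this check and an algorithm 1 check, and algorithm 4 would only require that the list of diagnosis dates be non-empty.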

All four algorithms had high specificity (> 99%), but algorithm 2, which required two diagnostic codes recorded within 2 months, had the best PPV while also having a reasonable sensitivity. The authors also examined the bias that would result from the less than optimal PPV, using algorithm 3. In their example of a typical pharmacoepidemiologic study or surveillance evaluation in which drug “A” possibly causes lymphoma, the claims-based definition resulted in relatively small bias. Finally, the authors noted that registries may not capture all cancers, so the gold standard in this study (the registry) is limited compared with a medical record review of every identified case. The PPV may therefore be underestimated because of the incomplete capture of cases in the registry. Limitations of case ascertainment methods may be more significant for less common cancers such as lymphoma compared with others such as breast or prostate cancer.
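The validity statistics discussed above all derive from a two-by-two comparison of algorithm results against the gold standard. The Python sketch below shows the standard definitions; the function name `validation_metrics` and the counts are hypothetical and were chosen only to illustrate how a rare outcome can combine specificity above 99% with a modest PPV.

```python
def validation_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, and PPV from two-by-two validation counts."""
    return {
        "sensitivity": tp / (tp + fn),  # gold-standard cases the algorithm finds
        "specificity": tn / (tn + fp),  # non-cases the algorithm leaves unflagged
        "ppv": tp / (tp + fp),          # flagged patients who are true cases
    }


# Hypothetical counts for a rare outcome in a cohort of 10 000 (illustration only).
metrics = validation_metrics(tp=50, fp=50, fn=10, tn=9890)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

In this hypothetical cohort only 50 of 9940 non-cases are falsely flagged, so specificity stays above 99%, yet those 50 false positives equal the number of true positives and hold the PPV at 50%.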

The studies with administrative or claims databases that did not report validations (Table 3) all used only ICD diagnostic codes. Three of the studies used ICD-8-CM codes and five studies used ICD-9-CM diagnostic codes. These codes are summarized in Table 4. One of the challenges that two studies faced was that the databases spanned a broad range of years during which the coding scheme changed, so one study used ICD-8-CM and ICD-9-CM codes and another used ICD-9 and ICD-10-CA diagnostic codes. It should be noted that there was a substantial change in ICD-9 coding for lymphoma that took place in 2007, which resulted in the adoption of 54 new lymphoma codes. The proposed rule that identified these changes can be found at http://www.cms.gov/AcuteInpatientPPS/downloads/CMS-1533-P.pdf. These studies used administrative claims from the VA Health system (five studies), the Medicare database (four studies), or a Statistics Canada hospital database (one study). The challenge for anyone developing algorithms for lymphoma detection is to identify how lymphoma is classified in the system and how the classification has changed over time, so that the algorithm can account for data collected both before and after a change in classification definitions. Without validation of the study algorithms, it is difficult to compare these various claims databases.
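One way to handle databases that span a change in coding scheme is to tag each record with its coding version and select the code set accordingly. The Python sketch below is a simplified illustration: the function name `is_lymphoma_code` and the version labels are hypothetical, and the prefixes are the broad lymphoma categories listed in Table 4 (200.x to 202.x for ICD-8 and ICD-9, C81.x to C85.x for ICD-10) rather than any study's exact definition.

```python
# Broad lymphoma categories taken from Table 4; real studies used narrower or
# wider sets (see Table 3), so treat these prefixes as an illustrative assumption.
LYMPHOMA_PREFIXES = {
    "ICD-8": ("200", "201", "202"),
    "ICD-9": ("200", "201", "202"),
    "ICD-10": ("C81", "C82", "C83", "C84", "C85"),
}


def is_lymphoma_code(code, version):
    """Match a diagnosis code against the lymphoma prefixes for its coding version."""
    try:
        prefixes = LYMPHOMA_PREFIXES[version]
    except KeyError:
        raise ValueError(f"unsupported coding version: {version!r}") from None
    normalized = code.strip().upper().replace(".", "")
    return normalized.startswith(prefixes)


print(is_lymphoma_code("C83.3", "ICD-10"))
print(is_lymphoma_code("204.1", "ICD-9"))
```

Keeping the version explicit on every record prevents a code from one scheme being interpreted under another when the data collection period crosses a coding transition.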

Table 4. List and definitions of cancer codes included in additional algorithms

  Soft tissue sarcoma |  | 171.x |  |  | 
  Kaposi's sarcoma |  | 176.x |  | 9140 | 
Lymphomas | 200.x-202.x | 200.x-202.x |  | 9590–9710 | 9590–9720
Non–Hodgkin's lymphoma (lymphosarcoma) | 200.x | 200.x | C83.x, C85.x | 9670–9710 | 9670–9720
  Small lymphocytic lymphoma |  | 200.0 | C83.0 | 9670 | 9670
  Malignant lymphoma, diffuse |  | 200.1 | C83.2 | 9592 | 9591
  Burkitt's lymphoma |  | 200.2 | C83.7 | 9687 | 9687, 9826
  Marginal zone lymphoma |  | 200.3 |  | 9680, 9710 | 9689, 9699
  Mantle cell lymphoma |  | 200.4 |  | 9673, 9677 | 9673
  Lymphoplasmacytic lymphoma |  | 200.6 |  | 9671 | 9671
  Diffuse large B-cell lymphoma |  | 200.7 | C85.1 | 9680–9683 | 9678–9680 and 9689
  Non–Hodgkin's lymphosarcoma NOS (Not Otherwise Specified) |  | 200.8, 200.9 | C83.8, C83.9 | 9591–9595, 9672, 9686, 9694 | 9591
Hodgkin's lymphoma | 201.x | 201.x | C81.x | 9650–9667 | 9650–9667
Non–Hodgkin's lymphoma (lymphoid/histiocytic) | 202.x | 202.x | C82.x, C84.x |  | 
  Malignant lymphoma, follicular |  | 202.0 | C82.x | 9690–9699 | 9690–9699
  Mycosis fungoides/Sezary disease |  | 202.1, 202.2 | C84.0, C84.1 | 9700, 9701 | 9700, 9701
  Percutaneous and peripheral T-cell lymphomas |  | 202.7 | C84.2 | 9700–9709 | 9700–9709
  Peripheral T-cell lymphomas |  | 202.7 | C84.4 | 9702–9705, 9708, 9714, and 9716 | 9702, 9705, 9708, 9714, and 9716
  NK/other T-cell lymphomas |  | 202.8, 202.9 | C84.5 | 9709 | 9709, 9717–9719, 9831 and 9948
  Non–Hodgkin's lymphoma lymphoid/histiocytic NOS |  |  | C85.7, C85.9 |  | 9820, 9970
Myeloma |  | 203.x | C90.x |  | 
Lymphoid leukemia |  | 204.x | C91.x |  | 
  Lymphoblastic leukemia (acute lymphocytic leukemia) |  | 204.0 | C91.0 | 9828 | 9835
  Chronic lymphocytic leukemia |  | 204.1 | C91.1 | 9823 | 9823
Leukemia |  | 205.x-208.x | C92.x-C95.x |  | 

The validation study focused on the Medicare population and patients 65 years and older, with a mean age of 80.8 years, and only 21.6% were males.[9] Among the nonvalidated studies, two other studies also focused on Medicare patients where all subjects were 65 years and older.[14, 17] One of these studies[14] and two others[11, 18] focused on patients with end-stage renal disease who were on a transplant list or had just received transplants. Here patient ages spanned the entire age spectrum, and there was a near balance in males and females. Four studies[12, 13, 15, 16] were performed in the Veterans Administration system; two looked at lymphoma rates during the Vietnam era,[12, 13] one looked at lymphoma rates in general,[16] and the other focused on lymphoma associated with HIV.[15] One study was performed using a hospital database in Canada where lymphoma was assessed as a comorbidity of inflammatory bowel disease.[10]

The nonvalidated studies were published from 1991 to 2008, but the time frame covered by the databases spanned from 1965 through 2006. The validation study examined data from 1997 to 2000. The validation study[9] focused on new cancer diagnosis, and all of the nonvalidated studies identified focused on the assessment of new cases of lymphoma,[10-18] except one that focused on lymphoma as a cause of death.[19] Occasionally, the lymphoma was assessed as a comorbidity,[10, 11, 14, 15, 18] but in most cases, it was the primary diagnosis.[9, 12, 13, 16, 17, 19]


Characterizing algorithms for lymphoma as an HOI is complicated from the beginning by the lack of uniformity in definitions of lymphoma. Primary lymphoid malignancies would be a term to broadly embrace the diseases of nodal or tumor forming lymphomas (Hodgkin's lymphoma and others), primarily blood-borne lymphoid leukemia (chronic or acute), and plasma cell disorders (amyloidosis, Huppert's disease, or multiple myeloma). By convention, the plasma cell disorders would very commonly be excluded from discussions of “lymphoma,” and the article by Setoguchi et al.[9] indeed excluded ICD-9 code 203.xx (personal communication). Lymphoid leukemia is categorized less consistently. The disease characterized by a clonal proliferation of B cells co-expressing surface molecules CD5, CD19, and CD23 (ICD 204.xx) is arbitrarily referred to by clinicians as either chronic lymphocytic leukemia or small lymphocytic lymphoma, often based on a predominantly nodal or blood-borne presentation. It is officially referred to by WHO as chronic lymphocytic leukemia/small lymphocytic lymphoma in all cases[20] and is classified as lymphoma but not leukemia by WHO[21] and as leukemia but not lymphoma by the National Cancer Institute in their Surveillance, Epidemiology, and End Results (SEER) database[22] (and by Setoguchi et al.[9]). Similarly, lymphoblastic lymphoma/leukemia would generally be called leukemia by clinicians and SEER but is a lymphoma according to WHO. Thus, all published studies need to carefully report definitions of “lymphoma” used in the relevant databases, registries, and resulting algorithms.

A second peril to optimizing sensitivity and specificity of lymphoma capture in these described algorithms is the unfortunate use of historical terms for lymphoid diseases that were recognized as medical conditions and named before they were recognized as lymphomas.[6] Examples (with WHO terminology) include Waldenström's macroglobulinemia (lymphoplasmacytic lymphoma), gastric or orbital MALT (extranodal marginal zone lymphoma), mycosis fungoides (cutaneous T-cell lymphoma), lymphomatoid granulomatosis (angiocentric pulmonary B-cell lymphoma), and large granular lymphocytosis (T-cell granular lymphocytic leukemia). Although each represents a relatively uncommon form of lymphoma, in aggregate they account for 5% to 10% of lymphomas.

Although subtle changes in the subclassification of lymphomas are frequent, only one recent change in disease definition is worth mentioning as having a potentially important effect. Chronic lymphocytic leukemia/small lymphocytic lymphoma has historically included any detectable population of clonal B cells expressing CD5, CD19, and CD23. In 2008, however, the definition was limited to patients with at least 5000 such cells per microliter of blood, and those with fewer cells were classified as having monoclonal B-cell lymphocytosis, which is not recognized as either leukemia or lymphoma.[23] Effectively, thousands of patients in the USA alone would be reclassified as no longer having leukemia or lymphoma after the change in definitions. Dissemination of this knowledge among clinicians has been slow, however, and it is unknown how the change is handled in cancer registries. This matters, of course, only for lymphoma algorithms that include chronic lymphocytic leukemia/small lymphocytic lymphoma.
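The effect of the 2008 threshold on case counts can be illustrated with a short sketch. The function name, the labels, and the example counts below are hypothetical, and the sketch considers only the blood count described above; any other clinical criteria are ignored.

```python
CLL_THRESHOLD_PER_UL = 5000  # clonal B cells per microliter, per the 2008 definition


def classify_clonal_b_cells(count_per_ul: float, use_2008_definition: bool = True) -> str:
    """Label a patient from the clonal CD5/CD19/CD23 B-cell count alone."""
    if count_per_ul <= 0:
        return "no clonal population detected"
    if use_2008_definition and count_per_ul < CLL_THRESHOLD_PER_UL:
        return "monoclonal B-cell lymphocytosis"
    return "CLL/SLL"


# Hypothetical counts, used only to show how labels shift between definitions.
counts = [800, 4999, 5000, 12000]
reclassified = [
    c for c in counts
    if classify_clonal_b_cells(c, use_2008_definition=False)
    != classify_clonal_b_cells(c, use_2008_definition=True)
]
print(reclassified)
```

An algorithm validated against a registry that applies one definition may therefore perform differently against a registry that applies the other.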

Setoguchi et al.[9] offer the lack of complete capture of diseases by cancer registries as a possible explanation for low PPV. This possibility seems especially relevant in the case of lymphoma, where many subtypes are diagnosed on the basis of clinical findings or blood tests rather than a pathologic biopsy, which is the most efficient way for registries to identify cases.[7] Examples include chronic lymphocytic leukemia/small lymphocytic lymphoma, lymphoplasmacytic lymphoma, splenic marginal zone lymphoma, primary cutaneous T-cell lymphomas, T-cell granular lymphocytic leukemia, and vitreoretinal lymphoma. Notably, this problem is compounded by the significant overlap with the previously mentioned diseases that retain older, nonmalignant-sounding clinical terminology. Thus, it seems likely that the PPV of the described algorithms will vary somewhat across registries, as capture completeness of these subtypes is likely operator dependent. This was confirmed by Clarke et al.[24] There are many other potential reasons for registries to miss the diagnosis of lymphoma in patients receiving treatment. For example, patients may travel out of state (or out of their registry region) to be diagnosed with cancer but then return “home” for treatment. Setoguchi et al.[9] also noted that registry case ascertainment for rare cancers has been shown to be incomplete, such that claims data have actually been used in some registries to enhance case ascertainment.

Because of the inconsistencies of terminology and diagnostic techniques, all future algorithms and validations on the topic of “lymphoma” need to consider and report very carefully all definitions of “lymphoma” studied, and algorithm validations should ideally include more than one cancer registry as the gold standard.


There are many different types of lymphoma, and they can be separated into categories on several different properties, most of which do not track well with the ICD coding methodology. In particular, they can be separated for etiologic purposes into T-cell and B-cell lymphomas, for outcome purposes into curable and noncurable lymphomas, or for purposes of tracking intensity of therapy into aggressive or nonaggressive lymphomas. The historical approach of separating into Hodgkin's or non–Hodgkin's lymphoma is perhaps not as globally useful as it once was. For purposes of pharmacoepidemiologic studies and surveillance evaluations, which provided the impetus for this Mini-Sentinel report, it is logical to cast a wide net of lymphoproliferative diseases and include lymphoid leukemia and perhaps even plasma cell disorders, as they are more biologically related to “lymphomas” than their naming convention might suggest. The one validation study that was conducted excluded plasma cell diseases but found two basic algorithms with reasonable performance characteristics, although the PPVs and sensitivities were less than excellent. The combination of these algorithms led to what the authors felt was the best algorithm (algorithm 3). Despite a PPV of only 56.7%, its sensitivity was 88.3%, which was notably better than that of either algorithm 1 or algorithm 2 alone. The first algorithm included a combination of diagnosis codes with codes for various procedures or drugs, typically on the same day. Alternatively, two diagnostic codes could occur within 12 months of one another after a diagnostic procedure with biopsy. The second algorithm required two or more lymphoma diagnosis codes within 2 months. The requirement of only one diagnostic code led to an unacceptable PPV. Thus, it is recommended that a single diagnostic code alone not be used to identify lymphoma.
Finally, this study was limited in that it used only registry-identified cases as true positives, so the performance characteristics reported are dependent on the ability of the registry to identify all cases. This has been brought into question, particularly for less common cancers.
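The contrast between a single-code rule and a two-code rule, and the way each would be scored against a registry, can be sketched as follows. This is an illustrative simplification and not the validated algorithm itself: the code prefixes, the reading of “within 2 months” as 60 days between distinct service dates, the function names, and all data are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, timedelta

LYMPHOMA_PREFIXES = ("200", "201", "202")  # assumed code set, not from the source
WINDOW = timedelta(days=60)                # assumed reading of "within 2 months"


def diagnosis_dates(claims):
    """Map patient_id -> sorted distinct dates carrying a lymphoma diagnosis code."""
    dates = defaultdict(set)
    for patient_id, service_date, icd9 in claims:
        if icd9.startswith(LYMPHOMA_PREFIXES):
            dates[patient_id].add(service_date)
    return {p: sorted(d) for p, d in dates.items()}


def single_code_cases(claims):
    """Patients with at least one lymphoma diagnosis code."""
    return set(diagnosis_dates(claims))


def two_code_cases(claims, window=WINDOW):
    """Patients with lymphoma codes on two distinct dates no more than `window` apart."""
    cases = set()
    for patient, dates in diagnosis_dates(claims).items():
        # In a sorted list the closest pair of dates is always adjacent.
        if any(later - earlier <= window for earlier, later in zip(dates, dates[1:])):
            cases.add(patient)
    return cases


def ratio(numerator, denominator):
    return numerator / denominator if denominator else float("nan")


def validate(flagged, registry_cases, population):
    """Score flagged patients, treating registry cases as the only true positives."""
    tp = len(flagged & registry_cases)
    fp = len(flagged - registry_cases)
    fn = len(registry_cases - flagged)
    tn = len(population - flagged - registry_cases)
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }


# Hypothetical claims: (patient_id, service_date, ICD-9 code).
claims = [
    ("A", date(2000, 1, 5), "202.80"),
    ("A", date(2000, 2, 10), "202.80"),  # 36 days after the first code
    ("B", date(2000, 3, 1), "201.90"),   # one code only
    ("C", date(2000, 1, 1), "200.10"),
    ("C", date(2000, 6, 1), "200.10"),   # 152 days after the first code
    ("D", date(2000, 4, 4), "401.9"),    # not a lymphoma code
]
population = {"A", "B", "C", "D", "E"}
registry = {"A", "C"}

print(validate(single_code_cases(claims), registry, population))
print(validate(two_code_cases(claims), registry, population))
```

In this toy data the two-code rule trades sensitivity for PPV, which is the general direction of the trade-off described above. Because `validate` counts only registry-confirmed patients as true positives, any case the registry missed is scored as a false positive, which is the limitation noted in the text.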

Registry data for hard-to-define cancers like lymphoma are often used as the gold standard to identify cases, but further research is needed to establish clearly that registry data are indeed the gold standard. For example, SEER–Medicare linked data must be carefully evaluated for the completeness of their ascertainment of true cases of lymphoma. It is recommended that a capture–recapture method be considered to determine whether cases that are not confirmed by registry data might be true positive cases when medical record review is conducted. Given that only one validation study of lymphoma was identified, that it focused on an older population in one state, and that it considered only registry-identified cases as true positives, opportunities for studying the validity of administrative or claims data-based lymphoma definitions are many. Future research might include examining the validity of algorithms in multiple geographic settings, among a broader range of ages, and with the inclusion of plasma cell diseases.
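A two-source capture–recapture calculation of the kind recommended above can be sketched with the Chapman form of the Lincoln–Petersen estimator. The sketch assumes that both sources contain only true cases (for example, claims-identified cases confirmed by medical record review), that the two sources capture cases independently, and that records are linked without error. The function name and all counts are hypothetical.

```python
def chapman_estimate(n_source_a: int, n_source_b: int, n_both: int) -> float:
    """Chapman's nearly unbiased two-source estimate of the total number of cases."""
    return (n_source_a + 1) * (n_source_b + 1) / (n_both + 1) - 1


# Hypothetical counts, for illustration only.
registry_cases = 80       # true cases found by the registry
claims_confirmed = 90     # claims-identified cases confirmed by chart review
found_by_both = 72        # cases present in both sources

estimated_total = chapman_estimate(registry_cases, claims_confirmed, found_by_both)
registry_completeness = registry_cases / estimated_total

print(round(estimated_total, 1))
print(round(registry_completeness, 2))
```

A registry completeness estimate well below 1 would suggest that some algorithm-identified cases scored as false positives against the registry were in fact true cases.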


The authors declare no conflict of interest.


  • Lymphoma definitions have changed over the years, so it is important that the algorithm clearly specifies what it is searching for.
  • It is recommended that a single diagnostic code alone not be used to identify lymphoma.
  • Do not restrict search results to confirmation by only registry-identified cases because a single registry may miss true positives, especially for less common cancers.


Mini-Sentinel is funded by the Food and Drug Administration through the Department of Health and Human Services (HHS) contract number HHSF223200910006I. The views expressed in this document do not necessarily reflect the official policies of the Department of Health and Human Services, nor does the mention of trade names, commercial practices, or organizations imply endorsement by the US government.