• oncology
• claims data
• electronic medical record data
• classification and regression trees



To develop algorithms to identify metastatic cancer in claims data, using tumor stage from an oncology electronic medical record (EMR) data warehouse as the gold standard.


Data from an outpatient oncology EMR database were linked to medical and pharmacy claims data. Patients diagnosed with breast, lung, colorectal, or prostate cancer with a stage recorded in the EMR between 2004 and 2010 and with medical claims available were eligible for the study. Separate algorithms were developed for each tumor type using variables from the claims, including diagnoses, procedures, drugs, and oncologist visits. Candidate variables were reviewed by two oncologists. For each tumor type, the selected variables were entered into a classification and regression tree model to determine the algorithm with the best combination of positive predictive value (PPV), sensitivity, and specificity.
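The three measures used to compare candidate algorithms can be computed from a 2×2 table of claims-based predictions against the EMR stage. The following is a minimal sketch, not the study's code; the names `Metrics` and `evaluate` and the boolean encoding (True = metastatic) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    ppv: float
    sensitivity: float
    specificity: float


def _ratio(numerator, denominator):
    # Return NaN instead of raising when a denominator is empty.
    return numerator / denominator if denominator else float("nan")


def evaluate(predicted, actual):
    """Compare claims-based flags with the gold standard (hypothetical helper).

    predicted: iterable of bool, True if the algorithm flags metastatic disease.
    actual:    iterable of bool, True if the EMR stage is metastatic.
    """
    predicted, actual = list(predicted), list(actual)
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p and a:
            tp += 1
        elif p and not a:
            fp += 1
        elif not p and a:
            fn += 1
        else:
            tn += 1

    return Metrics(
        ppv=_ratio(tp, tp + fp),          # flagged patients who are truly metastatic
        sensitivity=_ratio(tp, tp + fn),  # metastatic patients who are flagged
        specificity=_ratio(tn, tn + fp),  # non-metastatic patients left unflagged
    )
```

Calling `evaluate(flags, emr_stage_is_metastatic)` for each candidate tree gives the values on which candidates would be ranked.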


A total of 1385 breast, 1036 lung, 727 colorectal, and 267 prostate cancer patients qualified for the analysis. The algorithms varied by tumor type but typically included International Classification of Diseases, Ninth Revision (ICD-9) codes for secondary neoplasms and use of chemotherapy and other agents typically given for metastatic disease. Across the final models, PPV ranged from 0.75 to 0.86, specificity from 0.75 to 0.97, and sensitivity from 0.60 to 0.81.
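To illustrate the kind of rule such an algorithm might contain, the sketch below flags a patient when the claims include a secondary-neoplasm diagnosis and at least one chemotherapy claim. It is purely hypothetical: the abstract does not give the actual tree splits, and the function names, the AND combination, and the `min_chemo_claims` threshold are assumptions. The only fixed fact used is that ICD-9 categories 196 to 198 denote secondary malignant neoplasms.

```python
# ICD-9 three-digit categories for secondary malignant neoplasms.
SECONDARY_NEOPLASM_CATEGORIES = {"196", "197", "198"}


def has_secondary_neoplasm_code(icd9_codes):
    """Return True if any diagnosis code falls in ICD-9 categories 196-198."""
    return any(
        code.strip()[:3] in SECONDARY_NEOPLASM_CATEGORIES for code in icd9_codes
    )


def flag_metastatic(icd9_codes, chemo_claim_count, min_chemo_claims=1):
    """Hypothetical rule: secondary-neoplasm code AND enough chemotherapy claims.

    The real algorithms differed by tumor type; this combination is assumed
    for illustration only.
    """
    return (
        has_secondary_neoplasm_code(icd9_codes)
        and chemo_claim_count >= min_chemo_claims
    )
```

A stricter threshold would be expected to raise specificity and PPV at the cost of sensitivity, which is the tradeoff the reported ranges reflect.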


While most of these algorithms for metastatic cancer had good specificity and acceptable PPV, a tradeoff with sensitivity kept every model from performing well on all three measures. The results suggest that accurate ascertainment of metastatic status may require access to medical records or other confirmatory data sources. Copyright © 2012 John Wiley & Sons, Ltd.