Keywords:

  • oncology;
  • claims data;
  • electronic medical record data;
  • classification and regression trees

ABSTRACT

  1. Top of page
  2. ABSTRACT
  3. INTRODUCTION
  4. METHODS
  5. RESULTS
  6. DISCUSSION
  7. CONFLICT OF INTEREST
  8. ACKNOWLEDGEMENTS
  9. REFERENCES

Purpose

To develop algorithms to identify metastatic cancer in claims data, using tumor stage from an oncology electronic medical record (EMR) data warehouse as the gold standard.

Methods

Data from an outpatient oncology EMR database were linked to medical and pharmacy claims data. Patients diagnosed with breast, lung, colorectal, or prostate cancer with a stage recorded in the EMR between 2004 and 2010 and with medical claims available were eligible for the study. Separate algorithms were developed for each tumor type using variables from the claims, including diagnoses, procedures, drugs, and oncologist visits. Candidate variables were reviewed by two oncologists. For each tumor type, the selected variables were entered into a classification and regression tree model to determine the algorithm with the best combination of positive predictive value (PPV), sensitivity, and specificity.

Results

A total of 1385 breast cancer, 1036 lung, 727 colorectal, and 267 prostate cancer patients qualified for the analysis. The algorithms varied by tumor type but typically included International Classification of Diseases-Ninth Revision codes for secondary neoplasms and use of chemotherapy and other agents typically given for metastatic disease. The final models had PPV ranging from 0.75 to 0.86, specificity 0.75–0.97, and sensitivity 0.60–0.81.

Conclusions

While most of these algorithms for metastatic cancer had good specificity and acceptable PPV, a tradeoff with sensitivity prevented any model from having good predictive ability on all measures. Results suggest that accurate ascertainment of metastatic status may require access to medical records or other confirmatory data sources. Copyright © 2012 John Wiley & Sons, Ltd.


INTRODUCTION

Cancer stage is a key predictor of survival and other clinical outcomes; patients with metastatic cancer have much poorer prognoses and typically have different treatment patterns than those with earlier-stage malignancies.[1-5] Comparative effectiveness research (CER) in oncology requires classifying patients by cancer stage so that appropriately comparable groups can be formed. Administrative claims databases allow formation of large cohorts of cancer patients but lack crucial information on staging. Metastatic cancer is to some extent identifiable through secondary tumor International Classification of Diseases-Ninth Revision (ICD-9) codes,[6] but the accuracy and completeness of these codes in claims are unknown.

Many studies attempting to validate cancer diagnoses in claims data have focused on identifying incident cancers rather than distinguishing stage.[7-13] These efforts have generally been more successful than studies attempting to identify metastatic cancer. Several studies have used the linked Surveillance, Epidemiology, and End Results (SEER) and Medicare claims data to identify stage IV cancer, but so far none has succeeded in producing an algorithm with high accuracy.[14-16] Limitations specific to Medicare claims, including the restricted age range (almost entirely 65+ years) and, for most years, the complete lack of data on pharmacy-dispensed drugs, might in part explain the poor performance of the various models.

Comparative effectiveness research in oncology using claims data would be considerably strengthened through the use of valid algorithms that differentiate metastatic from earlier-stage cancer. This study sought to develop such algorithms using staging information from an electronic medical record (EMR) system linked to claims data.

METHODS

Data sources

Data were obtained from the Oncology Services Comprehensive Electronic Records data warehouse, an Amgen proprietary database of EMRs from 46 outpatient oncology/hematology practices, and data from several health care claims clearinghouses maintained by SDI Health, a US-based healthcare analytics organization. All data sources are de-identified and fully compliant with the Health Insurance Portability and Accountability Act. In 2009, the EMR data captured records from approximately 173,000 cancer patients in 26 states. Claims data were derived from electronic pre-adjudicated medical claims and National Council for Prescription Drug Programs claims. The former includes claims for 2.2 million cancer patients from approximately 6000 oncologists/hematologists nationwide in 2009 and includes ICD-9 diagnosis and procedure codes as well as Current Procedural Terminology[17] and Healthcare Common Procedure Coding System (HCPCS) procedure codes. The pharmacy claims include approximately 1.6 billion dispensed prescriptions per year from approximately 55% of all pharmacies and contain National Drug Code, date written, date filled, quantity, and days' supply.

Patient selection

Adults (≥18 years) in the EMR database diagnosed with breast (women only; ICD-9 code 174.xx), lung (162.xx), colorectal (153.xx, 154.xx), or prostate cancer (men only; 185.xx) with at least one valid cancer stage recorded between 1 January 2004 and 30 April 2010 and with medical claims data available were eligible for the study. The index date was defined as the date of the most advanced stage recorded, although most patients had only one record for the stage. The earliest qualifying entry of the most advanced stage was used as the index date if the same stage was entered more than once.
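The index date rule above (the most advanced stage recorded, and the earliest entry of that stage if it appears more than once) can be illustrated with a short sketch. The record layout and the `STAGE_RANK` ordering are assumptions made for illustration; the text does not describe how stages are coded in the EMR.

```python
from datetime import date

# Hypothetical ordering of stage labels; the EMR's actual coding is not given in the text.
STAGE_RANK = {"I": 1, "II": 2, "III": 3, "IV": 4}


def index_date(stage_records):
    """Return (stage, date) for the earliest entry of the most advanced stage.

    stage_records: list of (stage_label, entry_date) tuples for one patient.
    """
    # Find the most advanced stage label among all entries.
    top_stage = max(stage_records, key=lambda rec: STAGE_RANK[rec[0]])[0]
    # Among entries with that stage, take the earliest date.
    first_entry = min(d for stage, d in stage_records if stage == top_stage)
    return top_stage, first_entry


example_records = [
    ("II", date(2006, 3, 1)),
    ("IV", date(2007, 9, 15)),
    ("IV", date(2007, 5, 2)),
]
example_index = index_date(example_records)
```

In this example the patient has two stage IV entries, and the earlier one (2 May 2007) becomes the index date.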

Patients were excluded if they had multiple primary tumor types, or if their claims ended or their cancer stage changed sooner than 60 days after the index date. To help ensure that the provider responsible for the patient's cancer treatment was consistently reporting claims into SDI's system, patients were also excluded if they had no cancer diagnosis claims from an appropriate specialist within 2 days before or after the index date, had fewer than two visits to the specialist who issued the qualifying claims cancer diagnosis, or if the specialist did not continuously report claims for at least 6 months before and 2 months after the patient's index date. Appropriate specialists were defined as providers with a recorded specialty in oncology or those who had administered chemotherapy to any patients; for prostate cancer, urologists were also considered appropriate specialists. Patients were excluded for lack of pharmacy claims only if they had at least one order for oral chemotherapy recorded in the EMR.

Study variables

The dependent variable was cancer stage as recorded in the EMR; the stage may have been either clinical or pathological, with the distinction not always available in the data. Each patient was classified as metastatic (stage IV) or earlier stage on the basis of the most advanced stage recorded that met all inclusion criteria. Because treatment for metastatic cancer is likely to be initiated very soon after diagnosis of metastatic disease, and cancer progression and death are more likely the longer the duration of follow-up after staging, only the first 60 days after the index staging date were used for determination of indicators of metastatic disease. For all cancer types, patient age, gender, indicators for any secondary neoplasm ICD-9 code (197.xx–198.xx), and number of visits to an oncologist during the first 60 days on or after the index date were noted.

The process of variable selection involved a combination of clinically and empirically driven approaches. All antineoplastic agents dispensed or administered from the index date to 60 days after the index date were noted, as well as supportive care medications administered through injection or infusion (i.e., identified through HCPCS codes rather than pharmacy dispensing records). All diagnoses and procedures that appeared on claims during the same time window were similarly tabulated. Following review from two oncologists, indicator variables for drugs, diagnoses, and procedures were created, with separate sets of variables for breast cancer (Table 2), lung cancer (Table 3), colorectal cancer (Table 4), and prostate cancer (Table 5). Additionally, for each cancer type the ICD-9 codes 199.xx, indicating unspecified cancers that could be either primary or secondary, were observed to cluster with the 197.xx–198.xx secondary tumor codes and hence were added to the collection of codes considered to indicate secondary tumors.
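As an illustration of how one such indicator variable might be derived, the sketch below flags a patient as having a secondary neoplasm code if any claim from the index date through 60 days afterwards carries an ICD-9 code beginning with 197, 198, or 199. The claim layout (a list of date and code pairs) and the function name are hypothetical; only the code ranges and the 60-day window come from the text.

```python
from datetime import date, timedelta

# Code families treated as secondary tumors in the text: 197.xx-198.xx plus 199.xx.
SECONDARY_PREFIXES = ("197", "198", "199")
WINDOW_DAYS = 60


def has_secondary_neoplasm(claims, index_dt):
    """Return True if any claim in the window has a secondary neoplasm code.

    claims: list of (service_date, icd9_code) tuples (hypothetical layout).
    index_dt: the patient's index staging date.
    """
    window_end = index_dt + timedelta(days=WINDOW_DAYS)
    return any(
        index_dt <= service_date <= window_end
        and icd9_code.startswith(SECONDARY_PREFIXES)
        for service_date, icd9_code in claims
    )


example_claims = [
    (date(2008, 1, 10), "174.9"),  # primary breast cancer code
    (date(2008, 2, 1), "198.5"),   # secondary neoplasm code, inside the window
]
example_flag = has_secondary_neoplasm(example_claims, date(2008, 1, 5))
```

Drug and procedure indicators could be built the same way by matching HCPCS, CPT, or National Drug Code values instead of ICD-9 prefixes.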

Statistical analysis

A separate algorithm for each tumor type was developed using classification and regression tree (CART) models, a non-parametric technique allowing combinations of variables without requiring complex high-order interaction terms.[18] CART analyses produce clear and easily interpreted decision trees that classify patients into groups with versus without the dependent variable of interest (in this case, metastatic cancer). Analysis file preparation was performed in SAS® v9.1.3 and CART modeling in R v2.12.1.

To avoid overfitting and to create stable classification trees, each model was constructed using variable importance results from a random forest model, which accounts for competing splits at particular nodes and offers strategies for integrating several suggested models.[19] A random forest is an ensemble method in which many trees are constructed using bootstrap resampling of subjects and variables.

For each tumor type, we divided the full data set into a training set, which was a random sample without replacement of 60% of the full data set, and a test set composed of the remaining 40%. The following procedures were performed 5000 times. From the training set, we selected a random sample with replacement of the same size as the training set, and a random sample of 10 predictor variables. The first split of the classification tree (i.e., the best univariate predictor of metastatic status) was found using this sample of patients and variables. Splitting continued until no more splits were found. Next, taking all patients not included in this random sample, we determined their classification (metastatic versus non-metastatic) as indicated by the tree and stored this classification along with their predictor values.
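The sampling steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' R code; the function names, patient identifiers, and predictor names are hypothetical, and the tree-fitting step itself is omitted.

```python
import random


def split_train_test(patient_ids, train_fraction=0.6, rng=random):
    """Random split without replacement into training and test sets."""
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def bootstrap_draw(train_ids, predictors, n_predictors=10, rng=random):
    """One iteration's sampling: in-bag patients, out-of-bag patients, predictors."""
    # Sample with replacement, same size as the training set.
    in_bag = [rng.choice(train_ids) for _ in train_ids]
    # Patients never drawn in this iteration are later classified by the tree.
    out_of_bag = sorted(set(train_ids) - set(in_bag))
    # Random subset of candidate predictor variables for this tree.
    chosen = rng.sample(predictors, min(n_predictors, len(predictors)))
    return in_bag, out_of_bag, chosen


rng = random.Random(0)                      # seeded only so the sketch is repeatable
all_ids = list(range(1385))                 # e.g. the size of the breast cancer cohort
candidate_vars = [f"var{i}" for i in range(25)]  # placeholder variable names
train_ids, test_ids = split_train_test(all_ids, rng=rng)
in_bag, out_of_bag, chosen = bootstrap_draw(train_ids, candidate_vars, rng=rng)
```

In the study this draw was repeated 5000 times, with a tree grown on each in-bag sample and then applied to the corresponding out-of-bag patients.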

Using only the classes assigned to each patient when they were not included in the random sample used to develop the tree, we counted the number of times across the different trees that the patient was classified as metastatic or non-metastatic. Each patient was assigned to a category based on a majority vote over the set of trees, and sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated. The test set of patients not included in any of the aforementioned steps was taken through the tree to determine their classification. For this test set, we again computed sensitivity, specificity, PPV, and NPV. Tuning parameters such as number of trees in the forest, number of predictors chosen per tree, observation weights, and voting cutoff values were adjusted to optimize the operating characteristics of the forest, especially PPV followed by sensitivity.
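A small sketch of the majority vote and the four operating characteristics is given below. The data structures (a dictionary of out-of-bag votes per patient and a dictionary of true classes) and the default cutoff of 0.5 are assumptions for illustration; the text states only that the voting cutoff was one of the tuning parameters.

```python
def majority_vote(oob_votes, cutoff=0.5):
    """Classify each patient as metastatic (1) if the share of trees voting 1 exceeds cutoff.

    oob_votes: dict mapping patient id -> list of 0/1 out-of-bag classifications.
    Patients with no out-of-bag votes are skipped.
    """
    return {
        pid: int(sum(votes) / len(votes) > cutoff)
        for pid, votes in oob_votes.items()
        if votes
    }


def operating_characteristics(predicted, truth):
    """Sensitivity, specificity, PPV, and NPV from predicted and true 0/1 classes.

    Assumes every denominator is non-zero (both classes present and both predicted).
    """
    tp = sum(1 for p in predicted if predicted[p] == 1 and truth[p] == 1)
    fp = sum(1 for p in predicted if predicted[p] == 1 and truth[p] == 0)
    tn = sum(1 for p in predicted if predicted[p] == 0 and truth[p] == 0)
    fn = sum(1 for p in predicted if predicted[p] == 0 and truth[p] == 1)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Toy example with five patients; values are invented for illustration only.
toy_votes = {"a": [1, 1, 0], "b": [0, 0, 1], "c": [1, 0], "d": [1, 1], "e": [0]}
toy_truth = {"a": 1, "b": 0, "c": 1, "d": 0, "e": 0}
toy_predicted = majority_vote(toy_votes)
toy_metrics = operating_characteristics(toy_predicted, toy_truth)
```

Raising the cutoff makes the forest more conservative about calling a patient metastatic, which tends to trade sensitivity for PPV.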

Finally, we examined the variable importance results, selected the most important variables that demonstrated the greatest efficiency in identifying the two dependent variable groups (metastatic and non-metastatic), and used them to create a single classification tree for the full data set. Sensitivity, specificity, PPV, NPV, and the area under the receiver operating characteristic curve (ROC AUC) were calculated from this final tree model.

RESULTS

Breast cancer

Of 12 601 patients with a diagnosis of breast cancer and a valid stage recorded in the EMR data during the study period, 1385 (11.0%) met all inclusion criteria and no exclusion criteria. Most patients who were excluded lacked a cancer diagnosis from an appropriate specialist near the EMR staging date, indicating that the provider responsible for the patient's cancer treatment was unlikely to have been contributing claims to SDI's claims system.

Of the 1385 qualifying breast cancer patients, 175 (12.6%) were staged as metastatic on their index date. Mean ages for both groups were 59.6 years, with standard deviation (SD) of 14.0 and 12.7 in the metastatic and non-metastatic groups, respectively (Table 1). The variables examined in the CART models included drugs, diagnoses, and procedure-based indicators (Table 2). Most variables were more frequently seen in metastatic patients, except treatment with doxorubicin or cyclophosphamide; diagnosis with benign or in situ neoplasms; breast surgery or lymph node imaging, biopsy, or excision; and a procedure code for non-metastatic cancer.

Table 1. Demographic characteristics of cancer cohorts

Patient characteristic | Metastatic | Not metastatic
Breast cancer | |
Number of patients | 175 | 1210
Age on index date, years | |
18–39 | 15 (8.6%) | 66 (5.5%)
40–49 | 30 (17.1%) | 215 (17.8%)
50–64 | 58 (33.1%) | 488 (40.3%)
65+ | 72 (41.1%) | 441 (36.4%)
Gender | |
Female | 175 (100.0%) | 1210 (100.0%)
Lung cancer | |
Number of patients | 477 | 559
Age on index date, years | |
18–39 | 4 (0.8%) | 3 (0.5%)
40–49 | 38 (8.0%) | 28 (5.0%)
50–64 | 184 (38.6%) | 195 (34.9%)
65+ | 251 (52.6%) | 333 (59.6%)
Gender | |
Male | 268 (56.2%) | 299 (53.5%)
Female | 209 (43.8%) | 260 (46.5%)
Colorectal cancer | |
Number of patients | 215 | 512
Age on index date, years | |
18–39 | 7 (3.3%) | 21 (4.1%)
40–49 | 42 (19.5%) | 65 (12.7%)
50–64 | 72 (33.5%) | 176 (34.4%)
65+ | 94 (43.7%) | 250 (48.8%)
Gender | |
Male | 105 (48.8%) | 281 (54.9%)
Female | 110 (51.2%) | 231 (45.1%)
Prostate cancer | |
Number of patients | 176 | 91
Age on index date, years | |
40–49 | 1 (0.6%) | 1 (1.1%)
50–64 | 46 (26.1%) | 24 (26.4%)
65+ | 129 (73.3%) | 66 (72.5%)
Gender | |
Male | 176 (100.0%) | 91 (100.0%)

Table 2. Breast cancer predictor variables from index date to index date + 60 days

Variable | Metastatic (N = 175) | Not metastatic (N = 1210)
Drugs | |
Any metastatic agent* | 92 (52.6%) | 40 (3.3%)
Capecitabine | 4 (2.3%) | 2 (0.2%)
Lapatinib | 1 (0.6%) | 0 (0.0%)
Epoetin alfa or darbepoetin alfa | 43 (24.6%) | 75 (6.2%)
Diphenhydramine | 42 (24.0%) | 151 (12.5%)
Zoledronic acid (Zometa) or pamidronate | 79 (45.1%) | 28 (2.3%)
Doxorubicin | 7 (4.0%) | 129 (10.7%)
Bevacizumab | 26 (14.9%) | 13 (1.1%)
Cyclophosphamide | 7 (4.0%) | 248 (20.5%)
Gemcitabine | 11 (6.3%) | 8 (0.7%)
Paclitaxel | 55 (31.4%) | 52 (4.3%)
Diagnoses (ICD-9 codes) | |
Secondary or unspecified lymph node neoplasm (196.xx) | 5 (2.9%) | 37 (3.1%)
Secondary neoplasm (not lymph nodes; 197.xx–199.xx) | 90 (51.4%) | 24 (2.0%)
Benign neoplasms (210.xx–229.xx) | 3 (1.7%) | 28 (2.3%)
Carcinoma in situ (230.xx–234.xx) | 2 (1.1%) | 22 (1.8%)
Neoplasms of uncertain behavior (235.xx–238.xx) | 3 (1.7%) | 19 (1.6%)
Anemia (285.xx) | 56 (32.0%) | 117 (9.7%)
Adverse effects of drugs (995.2x, E93.31) | 53 (30.3%) | 144 (11.9%)
Encounter for chemotherapy or long-term medication use (V58.11, V58.69) | 106 (60.6%) | 460 (38.0%)
Procedures (CPT codes) | |
Breast surgery, including mastectomy or reconstruction (19300–19380) | 2 (1.1%) | 70 (5.8%)
Lymph node imaging/biopsy/excision (38500–38792, 78195) | 0 (0.0%) | 62 (5.1%)
Immunoassay for tumor antigen CA 15-3 (86300) | 78 (44.6%) | 233 (19.3%)
Oncology disease status: not metastatic (G9071–G9074) | 0 (0.0%) | 45 (3.7%)
Oncology disease status: metastatic (G9075) | 10 (5.7%) | 4 (0.3%)
Three or more oncologist visits | 145 (82.9%) | 772 (63.8%)

* Agents include capecitabine, lapatinib, pamidronate, zoledronic acid, ixabepilone, gemcitabine, navelbine, and liposomal doxorubicin.
ICD-9, International Classification of Diseases-Ninth Revision; CPT, Current Procedural Terminology.

The CART model that provided the highest PPV in conjunction with adequate sensitivity defined metastatic breast cancer patients as (Figure 1a)

  1. Secondary neoplasm = yes
  2. Secondary neoplasm = no and any metastatic agent = yes and paclitaxel = yes
  3. Secondary neoplasm = no and any metastatic agent = no and bevacizumab = yes
Figure 1. Classification tree for identifying metastatic (a) breast cancer, (b) lung cancer, (c) colorectal cancer, and (d) prostate cancer

Using this model, PPV = 0.75, NPV = 0.95, sensitivity = 0.62, specificity = 0.97, and ROC AUC = 0.82.
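The three paths of the breast cancer tree listed above can be written as nested conditions. The sketch below is only a restatement of those rules; the function and argument names are hypothetical, and each argument is a yes/no indicator for the 60-day window after the index date.

```python
def is_metastatic_breast(secondary_neoplasm, any_metastatic_agent, paclitaxel, bevacizumab):
    """Apply the published breast cancer decision rules to one patient's indicators."""
    # Path 1: any secondary neoplasm code classifies the patient as metastatic.
    if secondary_neoplasm:
        return True
    # Path 2: no secondary neoplasm code, a metastatic agent, and paclitaxel.
    if any_metastatic_agent:
        return paclitaxel
    # Path 3: no secondary neoplasm code, no metastatic agent, but bevacizumab.
    return bevacizumab


# Example: no secondary neoplasm code and no listed metastatic agent, but bevacizumab given.
example_call = is_metastatic_breast(
    secondary_neoplasm=False,
    any_metastatic_agent=False,
    paclitaxel=False,
    bevacizumab=True,
)
```

Patients who match none of the three paths are classified as non-metastatic.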

Lung cancer

A total of 4936 patients had a lung cancer diagnosis and a valid stage recorded in the EMR data during the study period, of whom 1036 (21.0%) qualified for the study (Table 1). Metastatic cancer was more common in the lung cancer cohort than in the breast cancer cohort, with 477 (46.0%) patients identified as metastatic through the EMR data. Metastatic patients had a mean age of 64.4 years (SD 10.4) and non-metastatic 66.2 years (SD 10.2). Drugs, diagnoses, and procedures were again mostly found more frequently in metastatic patients (Table 3). Only the diagnosis of chronic airway obstruction not elsewhere classified and the procedure code indicating non-metastatic cancer status were more common in non-metastatic patients.

Table 3. Lung cancer predictor variables from index date to index date + 60 days

Variable | Metastatic (N = 477) | Not metastatic (N = 559)
Drugs | |
Any metastatic agent* | 128 (26.8%) | 38 (6.8%)
Erlotinib | 9 (1.9%) | 2 (0.4%)
Bevacizumab | 96 (20.1%) | 18 (3.2%)
Pemetrexed | 43 (9.0%) | 22 (3.9%)
Zoledronic acid (Zometa) or pamidronate | 112 (23.5%) | 15 (2.7%)
Pegfilgrastim | 100 (21.0%) | 57 (10.2%)
Diagnoses | |
Malignant neoplasm of brain (191.xx) | 14 (2.9%) | 1 (0.2%)
Secondary neoplasm (not lymph nodes; 197.xx–199.xx) | 195 (40.9%) | 32 (5.7%)
Drug-induced neutropenia (288.03) | 127 (26.6%) | 85 (15.2%)
Chronic airway obstruction NEC (496.xx) | 47 (9.9%) | 105 (18.8%)
Encounter for chemotherapy or long-term medication use (V58.11, V58.69) | 355 (74.4%) | 351 (62.8%)
Procedures | |
Parenteral infusion pump (E0791) | 93 (19.5%) | 75 (13.4%)
Oncology status: not metastatic (G9063–G9065) | 2 (0.4%) | 46 (8.2%)
Oncology status: stage IIIB–IV at diagnosis (G9066) | 48 (10.1%) | 24 (4.3%)
Six or more oncologist visits | 368 (77.1%) | 378 (67.6%)

* Agents include erlotinib, bevacizumab, and pemetrexed.

The lung cancer classification tree identified metastatic patients as (Figure 1b)

  1. Secondary neoplasm = yes
  2. Secondary neoplasm = no and any metastatic agent = yes
  3. Secondary neoplasm = no and any metastatic agent = no and status not metastatic = no and status stage IIIB–IV = yes

For this model, PPV = 0.81, NPV = 0.72, sensitivity = 0.60, specificity = 0.88, and ROC AUC = 0.76.

Colorectal cancer

Out of 4667 patients with colorectal cancer, 727 (15.6%) qualified for the study, 215 (29.6%) of whom were staged as metastatic in the EMR (Table 1). Their mean ages were 61.4 (SD 12.8) and 62.7 years (SD 12.8) in the metastatic and non-metastatic groups, respectively. Only radiation treatment and the procedure code indicating non-metastatic cancer status were more common in non-metastatic than metastatic patients (Table 4).

Table 4. Colorectal cancer predictor variables from index date to index date + 60 days

Variable | Metastatic (N = 215) | Not metastatic (N = 512)
Drugs | |
Any metastatic agent* | 117 (54.4%) | 30 (5.9%)
Bevacizumab | 99 (46.0%) | 24 (4.7%)
Calcium gluconate | 50 (23.3%) | 74 (14.5%)
Leucovorin | 135 (62.8%) | 213 (41.6%)
Epoetin alfa or darbepoetin alfa | 53 (24.7%) | 88 (17.2%)
Dexamethasone | 141 (65.6%) | 213 (41.6%)
Diphenhydramine | 49 (22.8%) | 48 (9.4%)
Pegfilgrastim | 48 (22.3%) | 65 (12.7%)
Cetuximab | 10 (4.7%) | 2 (0.4%)
Irinotecan | 34 (15.8%) | 6 (1.2%)
Diagnoses | |
Secondary neoplasm (not lymph nodes; 197.xx–199.xx) | 60 (27.9%) | 10 (2.0%)
Anemia in neoplastic disease or chemotherapy induced (285.22, 285.3) | 58 (27.0%) | 86 (16.8%)
Drug-induced neutropenia (288.03) | 49 (22.8%) | 63 (12.3%)
Nausea with vomiting (787.01) | 55 (25.6%) | 80 (15.6%)
Adverse effects of drugs (995.2x, E93.31) | 59 (27.4%) | 70 (13.7%)
Encounter for chemotherapy or long-term medication use (V58.11, V58.69) | 153 (71.2%) | 283 (55.3%)
Procedures | |
Radiation treatment (77263–77470) | 15 (7.0%) | 65 (12.7%)
Chemotherapy administration (96400–96417) | 153 (71.2%) | 292 (57.0%)
Parenteral infusion pump (E0791) | 46 (21.4%) | 79 (15.4%)
Oncology status: not metastatic (G9084–G9086) | 0 (0.0%) | 26 (5.1%)
Oncology status: metastatic (G9087) | 12 (5.6%) | 0 (0.0%)
Four or more oncologist visits | 181 (84.2%) | 346 (67.6%)

* Agents include bevacizumab, pamidronate, zoledronic acid, panitumumab, irinotecan, and cetuximab.

The classification tree for colorectal cancer (Figure 1c) identifies metastatic patients as follows:

  1. Any metastatic agent = yes
  2. Any metastatic agent = no and secondary neoplasm = yes

This model results in PPV = 0.80, NPV = 0.87, sensitivity = 0.67, specificity = 0.93, and ROC AUC = 0.80.

Prostate cancer

For selection of prostate cancer patients, urologists as well as oncologists were considered appropriate providers for primary cancer care. Of 1404 patients with prostate cancer and a stage in the EMR data, 267 (19.0%) qualified for the cohort, 176 (65.9%) of whom had metastatic disease (Table 1). Mean (SD) ages were 71.4 years (9.4) for metastatic and 69.8 years (8.7) for non-metastatic patients. All drugs, diagnoses, and procedures from the first 60 days after the index date were more common among metastatic patients, with the exception of the procedure code indicating non-metastatic cancer status (Table 5).

Table 5. Prostate cancer predictor variables from index date to index date + 60 days

Variable | Metastatic (N = 176) | Not metastatic (N = 91)
Drugs | |
Any metastatic agent* | 136 (77.3%) | 22 (24.2%)
Bicalutamide (pharmacy dispensed) | 16 (9.1%) | 4 (4.4%)
Epoetin alfa or darbepoetin alfa | 30 (17.0%) | 8 (8.8%)
Dexamethasone | 38 (21.6%) | 3 (3.3%)
Zoledronic acid (Zometa) or pamidronate | 91 (51.7%) | 6 (6.6%)
Palonosetron | 25 (14.2%) | 1 (1.1%)
Pegfilgrastim | 20 (11.4%) | 2 (2.2%)
Triptorelin | 26 (14.8%) | 5 (5.5%)
Docetaxel | 47 (26.7%) | 7 (7.7%)
Leuprolide | 27 (15.3%) | 5 (5.5%)
Diagnoses | |
Secondary neoplasm (not lymph nodes; 197.xx–199.xx) | 99 (56.3%) | 6 (6.6%)
Anemia (285.2x, 285.3) | 36 (20.5%) | 7 (7.7%)
Drug-induced neutropenia (288.03) | 21 (11.9%) | 2 (2.2%)
Encounter for chemotherapy or long-term medication use (V58.11, V58.69) | 84 (47.7%) | 18 (19.8%)
Procedures | |
Intravenous infusion for therapy, prophylaxis, or diagnosis (96360–96367) | 39 (22.2%) | 3 (3.3%)
Chemotherapy administration (96400–96417) | 91 (51.7%) | 19 (20.9%)
Parenteral infusion pump (E0791) | 19 (10.8%) | 1 (1.1%)
Oncology status: not metastatic (G9077–G9078) | 1 (0.6%) | 4 (4.4%)
Oncology status: metastatic (G9081–G9082) | 12 (6.8%) | 3 (3.3%)
Three or more oncologist visits | 131 (74.4%) | 55 (60.4%)

* Agents include bicalutamide, flutamide, megestrol acetate, docetaxel, leuprolide acetate, triptorelin pamoate, zoledronic acid, pegfilgrastim, gemcitabine HCl, filgrastim, and pamidronate.

The CART model for prostate cancer identified metastatic patients as follows (Figure 1d):

  1. Secondary neoplasm = yes
  2. Secondary neoplasm = no and any metastatic agent = yes

For this model, PPV = 0.86, NPV = 0.67, sensitivity = 0.81, specificity = 0.75, and ROC AUC = 0.82.

DISCUSSION

In an effort to provide the means of constructing comparable groups in CER studies using claims data, this study attempted to construct algorithms to identify metastatic cancer using data available in an open administrative claims database. Various approaches can be taken in algorithm construction depending on the intended uses. PPV is the most relevant statistic when the goal is to assemble a patient subgroup of interest (in this case, patients with metastatic cancer), because it measures the proportion of algorithm-identified patients who truly have the condition. Sensitivity is most important if the algorithm is intended as a screening tool to capture all possible metastatic patients. The models constructed were chosen with the former goal in mind; thus, models were selected that optimized PPV while still maintaining adequate sensitivity.

Each model consisted of clinically relevant variables as indicators of metastatic disease, including secondary neoplasm codes, use of medications typically used for metastatic disease, and in one case, procedure codes indicating metastatic or non-metastatic status. The classification tree for breast cancer obtained adequate PPV (75%) and excellent specificity (97%) but had mediocre sensitivity (62%), indicating that a fairly large proportion of metastatic cases was missed. The PPV of the lung cancer model (81%) was slightly better than the breast cancer model, but the sensitivity was modest (60%). Although the PPV for colorectal cancer was acceptable (80%), sensitivity was below 70%. The prostate cancer model performed better than the other models on the two primary metrics of interest (PPV = 86%, sensitivity = 81%).
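PPV and NPV depend on the proportion of metastatic patients in each cohort as well as on sensitivity and specificity. As a rough consistency check, the sketch below recomputes PPV and NPV from the rounded sensitivity and specificity reported for each final model and the cohort counts in Table 1. The function name is hypothetical, and because the inputs are rounded to two decimals, the outputs are approximations rather than a re-analysis.

```python
def ppv_npv(sensitivity, specificity, prevalence):
    """PPV and NPV implied by sensitivity, specificity, and prevalence (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


# (sensitivity, specificity, metastatic patients, total patients) as reported in the text.
cohorts = {
    "breast": (0.62, 0.97, 175, 1385),
    "lung": (0.60, 0.88, 477, 1036),
    "colorectal": (0.67, 0.93, 215, 727),
    "prostate": (0.81, 0.75, 176, 267),
}

implied = {}
for name, (sens, spec, n_met, n_total) in cohorts.items():
    implied[name] = ppv_npv(sens, spec, n_met / n_total)
    ppv, npv = implied[name]
    print(f"{name}: PPV={ppv:.2f} NPV={npv:.2f}")
```

To two decimals, the implied values agree with the reported PPV and NPV for all four models. The check also shows why the prostate model reaches the highest PPV despite the lowest specificity: about two thirds of that cohort was metastatic.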

Limitations of the study include using data from an EMR and an open claims system, both of which may be incomplete and contain erroneous entries. It is unknown whether the patients included are adequately representative of all cancer patients. Considering that the proportion of patients who were clearly linkable between the EMR and claims was fairly small, some systematic bias in the patients included could be present but would be difficult to evaluate. Despite our efforts to ensure that relevant claims data were present, the claims available for some patients may not have been from the physician providing most of the patient's oncology care. In addition, including ICD-9 codes 199.xx (unspecified cancers that could be either primary or secondary) among the codes considered to indicate secondary tumors could have increased the false positive rate in this study, but as these cases were limited in number, we believe that this impact would be small. The drugs examined in this study were those in use in the selected patient population and may not generalize to different populations. Also, the list of drugs to be considered as treatments for metastatic disease would need to be updated as new treatments are approved and as treatment recommendations change.

Prostate cancer is frequently treated by urologists, especially at early stages. The oncology EMR database is unlikely to provide a representative sample of early stage prostate cancer patients. Still, metastatic prostate cancer is typically treated by oncologists, and the patterns of care for metastatic patients may differ from those for earlier-stage patients by the same factors that would be seen if the latter group were taken from a more representative combination of oncology and urology clinics. Although 2 months of follow-up may not be ideal for describing treatment patterns, patients were included only if they survived the follow-up period. Results may not reflect the patterns of care for patients with very brief survival times after staging. Future studies should investigate the impact of modifying the time interval, which could include shorter or longer follow-up after the staging date and some time prior to the date of staging, when the initial workup was being conducted. Although CART models are flexible and highly appropriate for this type of analysis, other approaches such as machine learning techniques and various regression methods may also be used in an effort to construct the best performing algorithms.

The results of this study make it clear that a definition of metastatic cancer consisting of ICD-9 codes for a primary cancer and a secondary neoplasm will include many patients who are not truly metastatic and will identify only about half of all patients with metastatic disease. Thus, CER using claims data may be hampered by considerable stage misclassification, leading to residual confounding and likely attenuation of the exposure–outcome relationship. By examining additional indicators of metastatic disease, such as chemotherapy agents targeted for metastatic cancer, we have constructed decision rules superior to the secondary neoplasm code alone. Still, there remain many patients whose metastatic cancer is not readily identifiable. For research that depends on accurate determination of metastatic status, it may be necessary to access medical records or other data sources that include staging information. If more detailed medical coding systems such as International Classification of Diseases for Oncology or Systematized Nomenclature of Medicine – Clinical Terms become adopted for use in administrative claims systems in the USA, then the validity of claims data for oncology CER will be greatly enhanced.

CONFLICT OF INTEREST

BLN, MS, and CM are employed by United BioSource Corporation, which received funding from Amgen, Inc. for this study. JLW and JDK are employees of Amgen, which manufactures some of the products investigated in the study, although none of those products were the focus of this research.

KEY POINTS

  • Although cancer stage is a key variable for most oncology studies, stages are not available in claims data.
  • The present study attempted to develop algorithms to identify metastatic cancer in claims, using electronic medical record stage data as the gold standard.
  • Algorithms for breast, lung, colorectal, and prostate cancers included ICD-9 codes for secondary neoplasms and drugs typically used for treating metastatic cancer.
  • The predictive value of the algorithms was insufficiently high to warrant their general use, suggesting that other data sources may be needed for accurate identification of metastatic cancer.

ACKNOWLEDGEMENTS

The authors wish to thank Drs. Helen Collins and Kevin Knopf for their participation in the selection of clinically relevant variables.

REFERENCES

  1. Badger SA, Brant JL, Jones C, et al. The role of surgery for pancreatic cancer: a 12-year review of patient outcome. Ulster Med J 2010; 79: 70–75.
  2. Chi Z, Li S, Sheng X, et al. Clinical presentation, histology, and prognoses of malignant melanoma in ethnic Chinese: a study of 522 consecutive cases. BMC Cancer 2011; 11: 85.
  3. Garcia-Carbonero R, Capdevila J, Crespo-Herrero G, et al. Incidence, patterns of care and prognostic factors for outcome of gastroenteropancreatic neuroendocrine tumors (GEP-NETs): results from the National Cancer Registry of Spain (RGETNE). Ann Oncol 2010; 21: 1794–1803.
  4. Kim MK, Cho KJ, Park SI, et al. Initial stage affects survival even after complete pathologic remission is achieved in locally advanced esophageal cancer: analysis of 70 patients with pathologic major response after preoperative chemoradiotherapy. Int J Radiat Oncol Biol Phys 2009; 75: 115–121.
  5. Lee PC, Mirza FM, Port JL, et al. Predictors of recurrence and disease-free survival in patients with completely resected esophageal carcinoma. J Thorac Cardiovasc Surg 2011; 141: 1196–1206.
  6. Health Care Financing Administration. International Classification of Disease, 9th Revision, Clinical Modification (ICD-9-CM). US Department of Health and Human Services: Washington, DC, 1980.
  7. Freeman JL, Zhang D, Freeman DH, Goodwin JS. An approach to identifying incident breast cancer cases using Medicare claims data. J Clin Epidemiol 2000; 53: 605–614.
  8. Gold HT, Do HT. Evaluation of three algorithms to identify incident breast cancer in Medicare claims data. Health Serv Res 2007; 42: 2056–2069.
  9. Leung KM, Hasan AG, Rees KS, Parker RG, Legorreta AP. Patients with newly diagnosed carcinoma of the breast: validation of a claim-based identification algorithm. J Clin Epidemiol 1999; 52: 57–64.
  10. Penberthy L, McClish D, Manning C, Retchin S, Smith T. The added value of claims for cancer surveillance: results of varying case definitions. Med Care 2005; 43: 705–712.
  11. Ramsey SD, Mandelenson MT, Etzioni R, Harrison R, Smith R, Taplin S. Can administrative data identify incident cases of colorectal cancer? A comparison of two health plans. Health Serv Outcomes Res Methodol 2004; 5: 27–37.
  12. Solin LJ, Legorreta A, Schultz DJ, Levin HA, Zatz S, Goodman RL. Analysis of a claims database for the identification of patients with carcinoma of the breast. J Med Syst 1994; 18: 23–32.
  13. Solin LJ, MacPherson S, Schultz DJ, Hanchak NA. Evaluation of an algorithm to identify women with carcinoma of the breast. J Med Syst 1997; 21: 189–199.
  14. Cooper GS, Yuan Z, Stange KC, Amini SB, Dennis LK, Rimm AA. The utility of Medicare claims data for measuring cancer stage. Med Care 1999; 37: 706–711.
  15. Nattinger AB, Laud PW, Bajorunaite R, Sparapani RA, Freeman JL. An algorithm for the use of Medicare claims data to identify women with incident breast cancer. Health Serv Res 2004; 39: 1733–1749.
  16. Smith GL, Shih YC, Giordano SH, Smith BD, Buchholz TA. A method to predict breast cancer stage using Medicare claims. Epidemiol Perspect Innov 2010; 7: 1.
  17. American Medical Association. Current Procedure Terminology (4th edn). American Medical Association: Chicago, IL, 1991.
  18. Zhang H, Singer B. Recursive Partitioning in the Health Sciences. Springer Verlag: New York, NY, 1999.
  19. Berk RA. Statistical Learning from a Regression Perspective. Springer Science+Business Media: New York, NY, 2008.