Correlating mammographic and pathologic findings in clinical decision support using natural language processing and data mining methods

Authors

  • Tejal A. Patel MD,

    1. Houston Methodist Cancer Center, Houston, Texas
    2. Cancer Research Program, Houston Methodist Research Institute, Houston, Texas
    3. Department of Medicine, Weill Cornell Medicine, New York, New York
  • Mamta Puppala MS,

    1. Department of Informatics Development, Houston Methodist Hospital, Houston, Texas
    2. Department of Systems Medicine and Bioengineering, Houston Methodist Research Institute, Houston, Texas
  • Richard O. Ogunti MBBS,

    1. Department of Informatics Development, Houston Methodist Hospital, Houston, Texas
    2. Department of Systems Medicine and Bioengineering, Houston Methodist Research Institute, Houston, Texas
  • Joe E. Ensor PhD,

    1. Houston Methodist Cancer Center, Houston, Texas
    2. Cancer Research Program, Houston Methodist Research Institute, Houston, Texas
  • Tiancheng He PhD,

    1. Department of Informatics Development, Houston Methodist Hospital, Houston, Texas
    2. Department of Systems Medicine and Bioengineering, Houston Methodist Research Institute, Houston, Texas
  • Jitesh B. Shewale BDS, MPH,

    1. Department of Epidemiology, Human Genetics & Environmental Sciences, University of Texas School of Public Health, Houston, Texas
  • Donna P. Ankerst PhD,

    1. Department of Urology, University of Texas Health Science Center at San Antonio, San Antonio, Texas
    2. Department of Mathematics, Technical University of Munich, Garching, Germany
  • Virginia G. Kaklamani MD, DSc,

    1. Division of Hematology Oncology CTRC, University of Texas Health Science Center San Antonio, San Antonio, Texas
  • Angel A. Rodriguez MD,

    1. Houston Methodist Cancer Center, Houston, Texas
    2. Cancer Research Program, Houston Methodist Research Institute, Houston, Texas
    3. Department of Medicine, Weill Cornell Medicine, New York, New York
  • Stephen T. C. Wong PhD,

    Corresponding author
    1. Cancer Research Program, Houston Methodist Research Institute, Houston, Texas
    2. Department of Informatics Development, Houston Methodist Hospital, Houston, Texas
    3. Department of Systems Medicine and Bioengineering, Houston Methodist Research Institute, Houston, Texas
    4. Department of Radiology, Neurology, and Neuroscience, Weill Cornell Medicine, New York, New York
    5. Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, New York
    • Corresponding authors: Jenny C. Chang, MD, Houston Methodist Cancer Center, 6445 Main St, OPC 24, Houston, TX 77030; Fax: (713) 793-1642; jcchang@houstonmethodist.org; and Stephen T. C. Wong, PhD, Department of Systems Medicine and Bioengineering, Houston Methodist Research Institute, 6670 Bertner Ave, Houston, TX 77030; Fax: (713) 441-7189; stwong@houstonmethodist.org

  • Jenny C. Chang MD

    Corresponding author
    1. Houston Methodist Cancer Center, Houston, Texas
    2. Cancer Research Program, Houston Methodist Research Institute, Houston, Texas
    3. Department of Medicine, Weill Cornell Medicine, New York, New York

Abstract

BACKGROUND

A key challenge in mining electronic health records for mammography research is the preponderance of unstructured narrative text, which severely limits usable output. The imaging characteristics of breast cancer subtypes have been described previously, but without standardization of parameters for data mining.

METHODS

The authors searched the enterprise-wide data warehouse at the Houston Methodist Hospital, the Methodist Environment for Translational Enhancement and Outcomes Research (METEOR), for patients with Breast Imaging Reporting and Data System (BI-RADS) category 5 mammogram readings performed between January 2006 and May 2015 and an available pathology report. The authors developed natural language processing (NLP) software algorithms to automatically extract mammographic and pathologic findings from free text mammogram and pathology reports. The correlation between mammographic imaging features and breast cancer subtype was analyzed using one-way analysis of variance and the Fisher exact test.
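As an illustration of the extraction step, the following is a minimal sketch of keyword-based feature extraction from a free-text mammogram report. It is not the authors' software: the descriptor lists, the `extract_findings` function, and the BI-RADS pattern are assumptions made for this example, and the sketch does not handle negation (for example, "no spiculated mass").

```python
import re

# Hypothetical descriptor vocabulary (assumption); the lexicon used in the
# study is not given in the abstract.
MARGIN_TERMS = ("spiculated", "circumscribed", "obscured", "indistinct", "microlobulated")
CALC_TERMS = ("pleomorphic", "heterogeneous", "amorphous", "punctate")

# Matches forms such as "BI-RADS category 5", "BIRADS: 4", "BI-RADS 5".
BIRADS_RE = re.compile(r"bi-?rads\s*(?:category|cat\.?)?\s*:?\s*([0-6])", re.IGNORECASE)


def extract_findings(report: str) -> dict:
    """Return the BI-RADS category and any descriptors mentioned in the report."""
    text = report.lower()
    match = BIRADS_RE.search(text)
    return {
        "birads": int(match.group(1)) if match else None,
        # Whole-word matching keeps "heterogeneously dense" from counting
        # as a "heterogeneous" calcification descriptor.
        "margins": [t for t in MARGIN_TERMS if re.search(rf"\b{t}\b", text)],
        "calcifications": [t for t in CALC_TERMS if re.search(rf"\b{t}\b", text)],
    }


if __name__ == "__main__":
    sample = "Spiculated mass with pleomorphic calcifications. BI-RADS category 5."
    print(extract_findings(sample))
```

A production system would also need negation detection, section parsing, and validation against manually abstracted reports.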

RESULTS

The NLP algorithm was able to obtain key characteristics for 543 patients who met the inclusion criteria. Patients with estrogen receptor-positive tumors were more likely to have spiculated margins (P = .0008), and those with tumors that overexpressed human epidermal growth factor receptor 2 (HER2) were more likely to have heterogeneous and pleomorphic calcifications (P = .0078 and P = .0002, respectively).
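For readers unfamiliar with how such P values arise, the following is a minimal sketch of a two-sided Fisher exact test on a 2×2 table (for example, receptor status by presence of an imaging feature). The function name and the counts in the usage example are hypothetical; they are not the study's data and do not reproduce the reported P values.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))  # small tolerance for float ties
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical counts: rows = feature present / absent,
    # columns = receptor positive / negative.
    print(fisher_exact_two_sided(3, 1, 1, 3))
```

An exact test of this kind is a common choice when some cells of the table have small counts, where a chi-square approximation can be unreliable.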

CONCLUSIONS

Mammographic imaging characteristics, obtained by automated text search and extraction from mammogram reports using NLP techniques, correlated with pathologic breast cancer subtype. The results of the current study validate previously reported trends assessed by manual data collection. Furthermore, NLP provides an automated means with which to scale up data extraction and analysis for clinical decision support. Cancer 2017;114–121. © 2016 American Cancer Society.
