Construction of an automated screening system to predict breast cancer diagnosis and prognosis

Authors

  • Sou-Young Jin,

    1. Department of Computer Science, School of Medical Science and Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon
  • Jae-Kyung Won,

    1. Molecular Pathology Center, Seoul National University Cancer Hospital, Seoul
    2. Graduate School of Medical Science and Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea
  • Hojin Lee,

    1. Department of Computer Science, School of Medical Science and Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon
  • Ho-Jin Choi

    1. Department of Computer Science, School of Medical Science and Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon

  • *Sou-Young Jin, Jae-Kyung Won, and Hojin Lee contributed equally to this study.

  • This work was accepted and presented at International Conference on Internet (ICONI) 2011 Proceeding.

Dr Jae-Kyung Won, MD, Molecular Pathology Center, Seoul National University Cancer Hospital, Seoul 110-744, Korea. Email: jkwon@kaist.ac.kr

ABSTRACT

Background and aim: Machine learning methods can support clinical decision processes such as pathological diagnosis when applied to datasets of microscopic features. In the present study, the Breast Cancer Wisconsin dataset was used to devise a classification scheme that predicts both diagnosis (benign vs malignant) and prognosis (recurrence vs non-recurrence) by comparing several classification algorithms. Methods: The performance of a two-step classifier, which decides diagnosis and then prognosis in sequence, was compared with that of a one-step multi-class classifier, which assigns all classes simultaneously. Results: For the two-step classifier, the functional trees (FT) algorithm performed best in the first step of classification and Naïve Bayes performed best in the second step. The one-step classifier showed higher accuracy and better prediction of benign and non-recurring cases than the two-step classifier, but it was less accurate in predicting recurring cases, which lowered its sensitivity. Conclusions: We conclude that the two-step classifier with FT and Naïve Bayes is better than the one-step classifier. This work should help in setting up automated screening systems in real clinics and highlights ways to improve accuracy by refining the data and the choice of algorithm in data mining or machine learning processes.
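
The difference between the two designs compared in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: a nearest-centroid classifier stands in for FT and Naïve Bayes, the two-feature data are invented toy values rather than the Breast Cancer Wisconsin dataset, and the three class labels (benign, non-recur, recur) as well as all function names are assumptions made for illustration.

```python
from statistics import mean

# Invented toy data (NOT the Breast Cancer Wisconsin dataset).
ROWS = [[1, 1], [1, 2], [2, 1],      # benign
        [8, 1], [9, 2], [8, 2],      # malignant, non-recurring
        [8, 8], [9, 9], [9, 8]]      # malignant, recurring
LABELS = ["benign"] * 3 + ["non-recur"] * 3 + ["recur"] * 3


def fit_centroids(rows, labels):
    """Fit a nearest-centroid classifier (a stand-in for FT / Naive Bayes)."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    # One centroid per class: the per-feature mean of its members.
    return {label: [mean(col) for col in zip(*members)]
            for label, members in groups.items()}


def predict(centroids, row):
    """Return the label whose centroid is closest to the row."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def fit_two_step(rows, labels):
    """Step 1 learns benign vs malignant; step 2 learns recur vs non-recur."""
    diagnosis = ["benign" if y == "benign" else "malignant" for y in labels]
    step1 = fit_centroids(rows, diagnosis)
    # Step 2 is trained on the malignant cases only.
    malignant = [(r, y) for r, y in zip(rows, labels) if y != "benign"]
    step2 = fit_centroids([r for r, _ in malignant],
                          [y for _, y in malignant])
    return step1, step2


def predict_two_step(model, row):
    """Decide diagnosis first; ask for prognosis only if malignant."""
    step1, step2 = model
    if predict(step1, row) == "benign":
        return "benign"
    return predict(step2, row)


# One-step design: a single multi-class model over all three labels.
one_step_model = fit_centroids(ROWS, LABELS)
# Two-step design: two binary models applied in sequence.
two_step_model = fit_two_step(ROWS, LABELS)
```

Calling predict(one_step_model, row) and predict_two_step(two_step_model, row) on the same cases allows per-class accuracy and sensitivity to be compared between the two designs, which is the kind of comparison the abstract reports.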
