Leveraging external knowledge on molecular interactions in classification methods for risk prediction of patients

Authors

  • Christine Porzelius,

    Corresponding author
    1. Freiburg Center for Data Analysis and Modeling, University of Freiburg, Eckerstraße 1, 79104 Freiburg, Germany
    2. Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg, Stefan-Meier-Straße 26, 79104 Freiburg, Germany
    • Phone: +49-761-203-6687, Fax: +49-761-203-7700
    • These authors have contributed equally to this work.

  • Marc Johannes,

    1. Unit Cancer Genome Research, Division of Molecular Genetics, German Cancer Research Center and National Center for Tumor Diseases, Im Neuenheimer Feld 460, 69120 Heidelberg, Germany
    • These authors have contributed equally to this work.

  • Harald Binder,

    1. Freiburg Center for Data Analysis and Modeling, University of Freiburg, Eckerstraße 1, 79104 Freiburg, Germany
    2. Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg, Stefan-Meier-Straße 26, 79104 Freiburg, Germany
  • Tim Beißbarth

    1. Department of Medical Statistics, University Medical Center Göttingen, Humboldtallee 32, 37037 Göttingen, Germany

Abstract

Classification of patients based on molecular markers, for example into different risk groups, is a modern field in medical research. The aim of this classification is often a better diagnosis or an individualized therapy. The search for molecular markers often relies on extremely high-dimensional data sets (e.g. gene-expression microarrays). However, in situations where the number of measured markers (genes) is intrinsically larger than the number of available patients, standard methods from statistical learning fail to deal correctly with this so-called “curse of dimensionality”. Feature selection or dimension reduction techniques based on statistical models alone also promise only limited success. Several recent methods explore how biological prior knowledge of molecular interactions and known cellular processes can be quantified and incorporated into the feature selection process. This article gives an overview of such current methods and of the databases from which this external knowledge can be obtained. For illustration, two recent methods are compared in detail: a feature selection approach for support vector machines and a boosting approach for regression models. As a practical example, data on patients with acute lymphoblastic leukemia are considered, where the binary endpoint “relapse within first year” is to be predicted.
