Standard Article

Biomarker Discovery: Introduction to Statistical Learning and Integrative Bioinformatics Approaches

Systems Toxicology

Bioinformatics and Chemoinformatics

  1. Dirk Repsilber¹,
  2. Marc Jacobsen²

Published Online: 15 SEP 2011

DOI: 10.1002/9780470744307.gat223

General, Applied and Systems Toxicology


How to Cite

Repsilber, D. and Jacobsen, M. 2011. Biomarker Discovery: Introduction to Statistical Learning and Integrative Bioinformatics Approaches. General, Applied and Systems Toxicology.

Author Information

  1. Genetics and Biometry/Bioinformatics and Biomathematics Group, Research Institute for the Biology of Farm Animals, Dummerstorf, Germany

  2. Bernhard-Nocht Institute for Tropical Medicine, Department of Immunology, Hamburg, Germany



In toxicology, biomarkers are needed for screening, time-series and dilution-series exposure studies in safety evaluation and risk assessment. They need to be easily and reproducibly measurable, and are therefore sought amongst molecular features using OMICs high-throughput technologies in assays of blood and other easily accessible tissues. This chapter presents methods for screening OMICs datasets for candidate biomarkers for classification. We begin by focusing on single-biomarker detection, surveying improvements to the t-test as well as multiplicity corrections suited to this objective. Biomarker panels (biosignatures) are patterns of several combined single features. We describe their detection using three different methods of statistical learning, with a special focus on avoiding overfitting through appropriate use of cross-validation. More sophisticated approaches using gene-set enrichment algorithms, and steps towards integrated bioinformatics analyses, are then explained. Making use of a priori knowledge about regulatory structures (gene groups, correlation structures) may further improve the classification efficiency of the detected biosignatures. As a common thread, we illustrate the analysis options using the well-known Golub gene expression dataset and the corresponding R scripts, enabling readers to reproduce every step on their own desktop.


  • biomarker;
  • feature selection;
  • multivariate signature;
  • cross-validation;
  • diagnosis;
  • prediction;
  • statistical learning;
  • integrative bioinformatics