Data Analysis and Data Mining: An Introduction. Adelchi Azzalini and Bruno Scarpa. Oxford University Press, 2012, ix + 278 pages, £50.00/$79.95, hardcover. ISBN: 978-0-19-976710-6

Table of Contents

  • 1 Introduction
  • 2
  • 3 Optimism, conflicts and trade-offs
  • 4 Prediction of quantitative variables
  • 5 Methods of classification
  • 6 Methods of internal analysis
  • A. Complements of mathematics and statistics
  • B.
  • C. Symbols and acronyms

Readership: Final-year and graduate statistics students, and researchers and practitioners in the analysis of large data sets.

This is a generally careful treatment that combines theory with detailed practical analysis. The R system is used for the analyses. Data sets and other supporting material are available from the book's website.

The introduction (Chapter 1) begins by noting the role of computer technology in collecting and making available, in a wide range of areas, large collections of data. The sheer size of many of the new datasets has implications for the computational tools and approaches. The authors note that data have often been ‘collected for reasons other than statistical analysis’, and are commonly not quite ideal relative to the questions that are of interest. Data on a business's customers will at best relate to present customers, not to the prospective customers who may be the intended target of a marketing exercise. This limits what can be said.

Chapters 2 and 3 discuss a variety of theoretical, practical and computational considerations. Chapter 3 notes the trade-off between bias and variance and introduces tools used in model selection – the training/test data split, cross-validation, and AIC and related information criteria. The theory stays strictly within an independent-errors framework.
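The model-selection idea just described can be made concrete with a small sketch. The book's own analyses are in R; the following is my own Python illustration, using k-fold cross-validation to choose the degree of a polynomial fit, and is not taken from the book.

```python
import numpy as np

# Illustrative sketch: choose a polynomial degree by k-fold cross-validation,
# one of the model-selection tools Chapter 3 introduces.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, 80))
y = np.sin(3 * x) + rng.normal(0, 0.2, 80)   # true signal plus noise

def cv_mse(x, y, degree, k=5):
    """Mean squared prediction error of a degree-`degree` polynomial,
    estimated by k-fold cross-validation."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)               # everything not in this fold
        coef = np.polyfit(x[train], y[train], degree)  # fit on training folds
        pred = np.polyval(coef, x[fold])               # predict held-out fold
        errs.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errs))

scores = {d: cv_mse(x, y, d) for d in range(1, 10)}
best = min(scores, key=scores.get)  # degree with smallest estimated error
```

The point of the exercise is the one the chapter makes: the cross-validated error, unlike the in-sample fit, does not keep improving as the model grows more flexible.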

The poor fit of the theory to many applied problems is given as a reason for careful preliminary exploration of the data. That poor fit extends, however, to a mismatch between the independent-observations assumption and the dependence structures (e.g. spatial, temporal and/or hierarchical) that are common in many types of data. This mismatch may be hard or impossible to detect using standard approaches to data exploration. There is no mention of how the theory that underpins variance estimates, analysis of variance, and other variance-based comparisons may be compromised by one or other form of dependence structure. A reason for using cross-validation or test-set measures of accuracy is surely that, for prediction within the population from which the data are taken, they offer some robustness against failure to model the dependence structure properly. Why not say this?

Chapter 4 describes the extension of linear and generalised linear models to incorporate spline and other nonparametric terms, and then discusses regression trees and neural networks. It introduces the use of analysis of variance for model comparisons. The case studies in the final section include support vector machine and random forest models in their comparisons; the main discussion of these approaches appears in the chapter on classification.

The classification methods discussed in Chapter 5 include, in addition to those just mentioned, multivariate logit models, multinomial regression, k-nearest-neighbour methods, trees, boosting and bagging. Test-data accuracy comparisons are given for several different datasets.
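Comparisons of the kind the chapter reports can be sketched very simply. The following is my own Python illustration of a k-nearest-neighbour classifier scored on held-out test data (the book's analyses use R, and this toy example is not taken from it).

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    """Classify each row of X_new by majority vote among its k nearest
    training points (Euclidean distance)."""
    d = np.linalg.norm(X_train[None, :, :] - X_new[:, None, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]         # indices of k closest points
    votes = y_train[nearest]                        # their class labels
    return np.array([np.bincount(v).argmax() for v in votes])

rng = np.random.default_rng(0)
# Two-class toy data: class 0 centred at (0, 0), class 1 at (3, 3).
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.repeat([0, 1], 100)

perm = rng.permutation(200)                         # random training/test split
train, test = perm[:150], perm[150:]
accuracy = float(np.mean(knn_predict(X[train], y[train], X[test]) == y[test]))
```

Test-set accuracy of this sort is the common currency of the chapter's method comparisons.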

Chapter 6 treats methods of ‘internal analysis’ – distance measures, hierarchical and non-hierarchical clustering, and the use of a graphical-models perspective to examine connections between variables.
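Non-hierarchical clustering of the kind the chapter covers can be illustrated with a minimal k-means sketch. This is my own Python illustration of the general idea, not code from the book (whose analyses are in R).

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's-algorithm k-means: alternate between assigning points
    to their nearest centre and moving centres to their cluster means."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]   # random initial centres
    for _ in range(iters):
        # Assign each point to its nearest centre (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centre to the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centres[j] for j in range(k)])
        if np.allclose(new, centres):                    # converged
            break
        centres = new
    return labels, centres

# Two well-separated blobs should be recovered as two clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centres = kmeans(X, 2)
```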

My only substantial criticism is one that applies also to most other books with ‘data mining’ or ‘statistical learning’ in their titles: there is no acknowledgement of the part that the ‘independent errors’ assumption commonly plays in the poor fit of the theory to many applied problems.