Coping with high dimensionality in massive datasets

Authors

  • Jon R. Kettenring (corresponding author)
    Charles A. Dana Research Institute for Scientists Emeriti (RISE), Drew University, Madison, NJ, USA

Abstract

A massive dataset is characterized by its size and complexity. In its most basic form, such a dataset can be represented as a collection of n observations on p variables. Aggravation or even impasse can result if either number is huge. The more difficult challenge is usually associated with the case of very high dimensionality, or 'big p'. There is a fast-growing literature on how to handle such challenges, but most of it is set in a supervised learning context involving a specific objective function, as in regression or classification. Much less is known about effective strategies for more exploratory data-analytic activities. The purpose of this article is to put into historical perspective much of the recent research on dimensionality reduction and variable selection in such problems. Examples of applications that have stimulated this research are discussed, along with a sampling of the latest methodologies, to illustrate the onslaught of creative ideas that have surfaced. From a practitioner's perspective, the most effective strategy may be to emphasize the role of interdisciplinary teamwork, with decisions on how best to grapple with high dimensionality emerging from a mixture of statistical thinking and consideration of the circumstances of the application. WIREs Comp Stat 2011, 3, 95–103. DOI: 10.1002/wics.141
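
As a minimal sketch of the 'big p' setting the abstract describes (not taken from the article itself): an n × p data matrix with p much larger than n, reduced by principal component analysis, used here only as a stand-in for the unsupervised dimensionality reduction methods the article surveys. The data and parameter choices below are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative 'big p' dataset: n = 100 observations on p = 10,000 variables.
# (Synthetic data; the article discusses real applications.)
rng = np.random.default_rng(0)
n, p = 100, 10_000
X = rng.normal(size=(n, p))

# Unsupervised dimensionality reduction: project onto the k leading
# principal components. With p >> n, a centered data matrix has rank
# at most n - 1, so no more than n - 1 components carry any variance.
k = 10
pca = PCA(n_components=k)
X_reduced = pca.fit_transform(X)   # shape (n, k)

print(X_reduced.shape)                      # (100, 10)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```

That rank bound (at most n − 1 informative components when p >> n) is one concrete face of the dimensionality problem: most of the p coordinate directions are redundant for describing the n observations at hand.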
