## 1. INTRODUCTION

High-dimensional data poses special challenges for data mining in general and for outlier detection in particular. Although several surveys on outlier detection have been published in recent years (see refs. [1–8], to name a few), none of them sketches the difficulties in high-dimensional data or the specialized approaches in this area (though, notably, a recent textbook edition sketches three example approaches [9]). In fact, most approaches to this problem have been proposed only in the past two or three years. Since the development of unsupervised methods for outlier detection in high-dimensional data in Euclidean space appears to be an emerging topic, this survey focuses on it. We hope to help researchers working in this area become aware of other approaches, understand the advancements and the lasting problems, and better identify the true achievements of the various approaches. Although many variants of outlier detection exist, such as supervised versus unsupervised approaches or specialized approaches for specific data types (such as item sets, time series, sequences, or categorical data; see refs. [7,10] for an overview), here we focus on unsupervised approaches for numerical data in Euclidean space.

Although the infamous ‘curse of dimensionality’ has been blamed for many problems and has been used indiscriminately as a motivation for many new approaches, we should try to understand the problems occurring in high-dimensional data in more detail. For example, there is a widespread mistaken belief that every point in high-dimensional space is an outlier. This statement, misleading to say the least, has been suggested as a motivation for the first approach specialized to outlier detection in subspaces of high-dimensional data [11], referring superficially to a fundamental paper on the ‘curse of dimensionality’ by Beyer et al. [12]. Alas, as we will discuss in the following section, it is not as simple as that.
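To make the effect behind this discussion concrete, the following minimal sketch measures the relative contrast between the farthest and the nearest neighbor of a query point, (Dmax − Dmin)/Dmin, for growing dimensionality. It assumes independent, uniformly distributed attributes, which is only one of the settings in which distances concentrate; the function name `relative_contrast` and all parameter values are illustrative choices, not taken from [12].

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (Dmax - Dmin) / Dmin for one random query point.

    Assumption: all attributes are i.i.d. uniform on [0, 1]; real data
    with correlated or irrelevant attributes can behave differently.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # Print the relative contrast for a few dimensionalities.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

Under these assumptions the contrast shrinks as the dimensionality grows, so nearest and farthest neighbors become harder to tell apart by distance alone. This does not by itself imply that every point is an outlier, and data that violates the i.i.d. assumption need not show the same behavior.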

Indeed, this often cited but less often well understood study [12] has recently been reconsidered by several researchers independently in different research areas. We will, therefore, begin our survey by inspecting truths and myths associated with the ‘curse of dimensionality’, study a couple of its effects that may be relevant for outlier detection, and discuss the findings of the renewed interest in this more than 10-year-old study (Section 2). Afterwards, we will discuss different families of outlier detection approaches concerned with high-dimensional data: first, approaches that treat the issues of efficiency and effectiveness in high-dimensional data without specific interest in a definition of outliers with respect to subspaces of the data (Section 3), and second, those that search for outliers specifically in subspaces of the data space (Section 4). In Section 5, we will comment on some open-source tools providing implementations of outlier detection algorithms, and remark on the difficulties of understanding and evaluating the results of outlier detection algorithms. Finally, in Section 6, we conclude the paper.