Self‐supervised learning for outlier detection

The identification of outliers is mainly based on unannotated data and therefore constitutes an unsupervised problem. The lack of labels leads to numerous challenges that do not occur, or occur only to a lesser extent, when annotated data and supervised methods are used. In this paper, we focus on two of these challenges: the selection of hyperparameters and the selection of informative features. To this end, we propose a method to transform the unsupervised problem of outlier detection into a supervised problem. Benchmarking our approach against common outlier detection methods shows clear advantages of our method when many irrelevant features are present. Furthermore, the proposed approach also scores very well in the selection of hyperparameters; that is, it outperforms methods with randomly selected hyperparameters.


RELATED LITERATURE
Outlier detection can be turned into a classification problem and solved by any supervised classifier whenever labels are available for the data.
However, for unlabelled data, this approach is infeasible, which leads to the previously mentioned problems that we mitigate here. We therefore focus on the situation in which annotated data are not available.
The literature in the field of outlier detection is extensive. A comprehensive overview is provided, for example, by Hodge and Austin (2004), Chandola, Banerjee, and Kumar (2009), or Aggarwal (2013a). In this section, we briefly discuss the methods that are particularly relevant to our work.
A variety of unsupervised algorithms are based on distances to the k-nearest neighbours (Angiulli & Pizzuti, 2002; Knorr & Ng, 1997; Ramaswamy, Rastogi, & Shim, 2000). A variation of k-nearest neighbour methods is proposed by Hautamaki, Karkkainen, and Franti (2004) by considering only the neighbourhood graph, not the absolute distances between points. The use of distances is a natural way to find outliers: large distances between points correspond to low density in the surrounding area. For this reason, the search for nearest neighbours can be interpreted as an estimate of the local density. Local variants of neighbourhood-based algorithms such as the local outlier factor (LOF) (Breunig, Kriegel, Ng, & Sander, 2000) set the distance to neighbours in relation to the usual distances of the surrounding regions. This helps to equally weight outliers in areas of different variance since the resulting outlier scores are normalized and not only dependent on the absolute distance between points.
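As an illustration of these distance-based scores, the following sketch computes a kNN-distance score and LOF with scikit-learn; the toy data and the choice of k are our assumptions, not taken from the literature discussed above.

```python
# Sketch: kNN-distance and LOF outlier scores with scikit-learn.
# Toy data and k are illustrative assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # inlier cluster
               [[8.0, 8.0]]])                      # one isolated point

# kNN score: distance to the k-th neighbour (the query point itself is
# included at distance 0 when scoring the training set).
k = 5
nn = NearestNeighbors(n_neighbors=k).fit(X)
dist, _ = nn.kneighbors(X)
knn_score = dist[:, -1]            # larger distance -> more outlying

# LOF normalizes the local density against that of the neighbourhood.
lof = LocalOutlierFactor(n_neighbors=k)
lof.fit(X)
lof_score = -lof.negative_outlier_factor_   # larger -> more outlying
```

Both scores rank the isolated point highest; LOF additionally adjusts for varying density across regions.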
Another class of methods detects outliers via dimensionality reduction. The underlying assumption is that outliers do not conform with the general structure of the data and are therefore not well mapped when the dimension of the data is reduced. The retransformation of the data from the dimension-reduced form into the full dimension is then less successful for outliers. The distance between the original point and its restored variant provides a measure of outlyingness. This approach is also used in the deep learning community by using an autoencoder for dimensionality reduction (Sakurada & Yairi, 2014; Zhou & Paffenroth, 2017; Zong et al., 2018). Variants of the autoencoder such as the variational autoencoder have also proven successful in detecting outliers (An & Cho, 2015; Vasilev et al., 2018).
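The reduce-and-retransform idea can be sketched with PCA standing in for an autoencoder: points are scored by their reconstruction error. The data set below is an illustrative assumption.

```python
# Sketch: reconstruction error as an outlier score. PCA stands in for the
# autoencoder described above; the toy data are an illustrative assumption.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Inliers lie close to a one-dimensional subspace; one point does not.
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t]) + rng.normal(scale=0.05, size=(200, 2))
X = np.vstack([X, [[3.0, -3.0]]])   # outlier off the subspace

pca = PCA(n_components=1).fit(X)
X_restored = pca.inverse_transform(pca.transform(X))
recon_error = np.linalg.norm(X - X_restored, axis=1)  # outlier score
```

The isolated point is restored poorly and therefore receives the largest reconstruction error.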
In addition to the approaches based on distances or dimensionality reduction, there also exist methods based on models for one-class classification. Schölkopf, Platt, Shawe-Taylor, Smola, and Williamson (2001) and Tax and Duin (2004) suggest variants of the support vector machine (SVM) for outlier detection. The deep learning community has taken up this research. Erfani, Rajasegarar, Karunasekera, and Leckie (2016) use deep learning models for feature extraction and a one-class classification for anomaly detection. Recent studies directly transfer the core idea of one-class classification into a model class of neural networks that is suitable for one-class classification (Chalapathy, Menon, & Chawla, 2018; Perera & Patel, 2019; Ruff et al., 2018).
The presence of irrelevant features can quickly lead to the curse of dimensionality such that feature selection methods are also relevant for the area of unsupervised outlier detection. Especially methods which are based on distances to other points suffer from the curse of dimensionality. Beyer, Goldstein, Ramakrishnan, and Shaft (1999) show that the distances converge to the maximum distance in the data set with increasing dimensionality; that is, the variance of the different distances decreases. This can be problematic for outlier detection methods that define outliers through the distance to other points. The authors emphasize that in data sets with 10-15 features, a concentration effect of distances can already occur; that is, the meaningfulness of distances is lost. The implication of these results for the field of outlier detection was examined in more detail by Zimek et al. (2012). They also observe a concentration effect of distances but find that the problem of high dimensionality is particularly challenging when there are features that have no information about the outlyingness of a point.
One possibility to eliminate the irrelevant features is to utilize ensembles of methods that each operate on a subset of the features. Besides mitigating the curse of dimensionality, the use of ensemble methods can also increase the accuracy of outlier detection. As a result, the success of ensemble learning for supervised applications has also influenced the unsupervised area of outlier detection (Aggarwal, 2013b;Aggarwal & Sathe, 2015;Zimek, Campello, & Sander, 2014). An autoencoder that implements an ensemble of classifiers was proposed by Chen, Sathe, Aggarwal, and Turaga (2017). The isolation forest (Liu, Ting, & Zhou, 2008) can be seen as an ensemble since it is based on a random forest. In addition to the ensemble, the isolation forest also provides options for eliminating irrelevant features: By selecting new features for each split (feature bagging), a feature selection takes place without additional costs. Our method can also be considered as an ensemble method since we base the necessary algorithms on random forests. For the same reason, we also have a built-in feature selection mechanism.
The bagging of features can also be conducted without random or isolation forest as suggested by Lazarevic and Kumar (2005). They draw random subsets of features in different iterations and perform outlier detection on that subset. Afterwards, the outlier scores from the different iterations are aggregated for each point. This way, the authors obtain an ensemble of methods that improves the results in the experiments compared to a simple model. Instead of selecting the features randomly, different weights can be assigned to them. For this purpose, assumptions are made about which features may be relevant for the detection of outliers and which may not. Kriegel, Kröger, Schubert, and Zimek (2009b) work with a weighted distance function that assigns each feature a different weight. The weights are determined for each point from a set of reference points, measured by the variance of the features. The assumption is that features with high variance have little meaning about outliers.
The observation that certain feature distributions are less meaningful about outliers also motivates the work of Müller, Schiffer, and Seidl (2011).
They use statistical tests to check whether the feature is uniformly distributed. If so, the feature is removed from the set of relevant features for outlier detection.
While these methods directly address the problem of irrelevant features, it remains open how the hyperparameters can be optimally chosen.
Our work is motivated by the fact that the problems of feature selection and the choice of hyperparameters are easier to solve in supervised learning. For this reason, we aim to transform unsupervised outlier detection into a supervised problem. To our knowledge, self-supervised learning (Doersch, Gupta, & Efros, 2015), previously also called self-taught learning (Raina, Battle, Lee, Packer, & Ng, 2007), has so far rarely been used in the field of outlier detection. Self-supervised learning addresses the problem that for many machine learning applications a great amount of unlabelled data is available, which, however, cannot be used for supervised learning. Self-supervision aims at exploiting the unlabelled data for the training of the classifier. For this purpose, labels are automatically created by the computer and guide the training. How the labels are created depends on the underlying data and the task to be solved.
Previous applications of self-supervised learning mostly use image data (Isola, Zoran, Krishnan, & Adelson, 2015). Recently, outliers have also been associated with adversarial attacks on existing classifiers. In the deep learning community, this is called out-of-distribution detection. It aims at the identification of data points that do not stem from the distribution of the training data. For this purpose, Hendrycks, Mazeika, and Dietterich (2019) and Hendrycks, Mazeika, Kadavath, and Song (2019) use self-supervised learning for outlier detection. These approaches are enhancements of supervised learning that make the classifier more robust to changes in the input. They further require that a classifier, which should be made more robust against outliers at inference time, is already available. These approaches belong to the field of supervised learning, and their goal therefore differs from that of our method.
Our work builds upon the work of Abe, Zadrozny, and Langford (2006), who also use synthetic points to transform outlier detection into a supervised classification problem. Our work provides essential extensions in at least three respects. First, we propose a refined algorithm to generate the artificial outliers. Second, as we find that the choice of hyperparameters can have a strong influence on the performance of outlier detection methods, we propose an algorithm for optimizing the choice of hyperparameters, such that a poor ad hoc selection of hyperparameters is avoided. Third, we show that the proposed method, in contrast to all the other outlier detection methods considered, does not suffer from the curse of dimensionality when noninformative features are present.

Overview
A key element of our method is the transformation of the unsupervised problem of outlier detection into a supervised one. For this purpose, we assume that a data set X exists, which may already be contaminated by outliers. As we focus on the common situation that labels are unavailable, a supervised learning method cannot be used to train a classifier, and, thus, it is unknown which point x_i ∈ X represents an outlier. The raw data set X alone therefore does not allow us to formulate a classification problem.
To transform the problem into a supervised learning problem, points Ñ := {ñ_i, i = 1, …, k} which do not follow the distribution in X are generated. For this, we draw random points from a uniform distribution in different iterations and remove those points that are too similar to the actual points in X. By knowing which points belong to X and which belong to Ñ, we can generate labels y that allow a classifier to learn the difference between X and Ñ. The details for generating the points Ñ are described in Algorithm 1.
If the classifier has learned to detect differences between X and Ñ, this implies that the same classifier can also be used to determine whether a certain point fits into the structure of X. This transfer of the unsupervised problem to a supervised one is therefore suitable for identifying outliers in X. If a classifier has difficulties deciding for a point x_i whether it belongs to X or Ñ, then this point does not follow the natural distribution in X and can, therefore, be identified as an outlier.
To ensure that the classification problem is suitable for outlier detection, the classifier must be chosen appropriately. In our case, we use a random forest because it has hyperparameters that are beneficial for outlier detection and learning decision boundaries, especially on tabular data. The choice of the hyperparameters is based on an adjusted objective function for a random search. The details are given in Sections 3.3 and 3.4 and Algorithm 2.
In summary, our method consists of the following three steps: 1. creation of points Ñ, which are suitable for the discrimination between X and Ñ; 2. search for suitable hyperparameters to make the classification problem between X and Ñ useful for outlier detection; and 3. learning of the decision boundaries for the classification problem and inference of outlier scores for each point.

Generation of data for self-supervised task and inference of outlier scores
The basic idea of our method is to learn a classifier that can distinguish between real points X of the data set and artificially created points Ñ. The points Ñ should fill the feature space evenly and ideally should also be similar to outliers in the data set. This works especially well if the outliers occur in isolation from other observations, which is by definition a property of outliers.
In the following, it is assumed that the data X are normalized to the interval [0, 1]. The artificial data Ñ are generated iteratively by repeatedly sampling points from a uniform distribution and removing some of them. The first step is to find hyperparameters for a classifier that can separate the data X from random noise ñ. The classifier is necessary to remove points in later iterations if they are too similar to the actual data. Each feature ñ_j of ñ is filled with values from a uniform distribution over the interval [0, 1].
After an initial classifier is obtained to separate points X from ñ, further noise ñ is drawn in iterations. After each iteration, those points that the classifier has classified as points in X are removed from ñ. This prevents existing cluster structures in X from being destroyed by the artificially created points. The remaining points are added to Ñ. For our experiments, we repeated the sampling loop 20 times. However, since undesired points are removed after every iteration, one can also repeat the sampling more often. This does not hurt the outlier detection but will take more time to estimate outlier scores. Therefore, we recommend not setting this value too high. The points in Ñ are suitable to convert the outlier detection into a classification. Algorithm 1 describes the details for generating the data. Figure 1 illustrates the results of Algorithm 1. The data were taken from the publication of Kriegel, Kröger, Schubert, and Zimek (2009a). The blue dots belong to the original data X, and the orange dots represent the artificial points Ñ. The points in Ñ are sampled around the clusters in X, while some regions, where points from X and points from Ñ fall close together, mark the border of inlier clusters in X. The outliers are isolated from the rest of the points in X and well surrounded by points in Ñ. This makes the points of Ñ suitable for the classification and the inference of outlier scores.
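The generation loop above can be sketched as follows. This is a minimal illustration in the spirit of Algorithm 1, with a scikit-learn random forest standing in for the initial classifier; all names, the toy cluster, and the parameter values are our assumptions, not the paper's.

```python
# Sketch of the generation of artificial points (in the spirit of Algorithm 1).
# Assumptions: X is normalized to [0, 1]; clf is an initial classifier fitted
# to separate X (label 0) from a first batch of uniform noise (label 1).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def generate_artificial_points(X, clf, n_iterations=20, rng=None):
    """Repeatedly sample uniform noise and keep only those points that the
    classifier does NOT mistake for real data."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    kept = []
    for _ in range(n_iterations):
        noise = rng.uniform(0.0, 1.0, size=(n, d))
        pred = clf.predict(noise)
        kept.append(noise[pred == 1])   # drop points classified as real data
    return np.vstack(kept)

# toy usage with an illustrative tight cluster in the unit square
rng = np.random.default_rng(2)
X = rng.uniform(0.4, 0.6, size=(300, 2))            # inlier cluster
noise0 = rng.uniform(0.0, 1.0, size=(300, 2))       # initial noise batch
clf = RandomForestClassifier(random_state=0).fit(
    np.vstack([X, noise0]), np.r_[np.zeros(300), np.ones(300)])
N_tilde = generate_artificial_points(X, clf, n_iterations=20, rng=3)
```

The filtering step keeps the artificial points away from the centre of the existing cluster while they still fill the rest of the feature space.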
The data generated according to Algorithm 1 are suitable to solve a normal classification problem. The classification between X and Ñ is suitable for the inference of outlier scores. For this purpose, the classifier is asked for the probability that a point x_i is a point from Ñ. Thus, it is classified according to whether the point x_i was artificially created or whether it actually corresponds to the distribution in X. The greater the classifier estimates the probability P(Ñ | x_i) that x_i ∈ Ñ, the higher is the outlier score of that point.
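This inference step can be sketched as follows, assuming a fitted scikit-learn classifier with class 0 for X and class 1 for the artificial points; the data and names are illustrative assumptions.

```python
# Sketch: outlier scores as the predicted probability of the class
# "artificial". Toy data; class 0 = real points, class 1 = artificial points.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.uniform(0.4, 0.6, size=(200, 2))               # real data
N = rng.uniform(0.0, 1.0, size=(200, 2))               # artificial points
clf = RandomForestClassifier(random_state=0).fit(
    np.vstack([X, N]), np.r_[np.zeros(200), np.ones(200)])

X_test = np.vstack([X, [[0.05, 0.95]]])                # add one isolated point
outlier_scores = clf.predict_proba(X_test)[:, 1]       # P(artificial | x_i)
```

The isolated point lies in a region occupied only by artificial points and therefore receives a high outlier score.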

Hyperparameter optimization
The optimization of the hyperparameters for the inference of the outlier scores is based on a random search. A random search tests different combinations of hyperparameters, whereby the specific value for a hyperparameter is chosen randomly. The user specifies a probability distribution from which random values for hyperparameters are drawn. The best combination of hyperparameters is found using a scoring function. In many cases, this can be the accuracy or the area under the curve (AUC) value. The final choice is the combination of hyperparameters that has achieved the highest value in the scoring function.
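A plain random search of this kind can be sketched as follows; the search space and the toy scoring function are our illustrative assumptions, not the paper's.

```python
# Sketch: a generic random search over hyperparameter distributions.
# Search space and toy scoring function are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(5)

def random_search(score_fn, n_trials=500):
    """Draw hyperparameter combinations from user-specified distributions
    and keep the combination with the highest score."""
    best_score, best_params = -np.inf, None
    for _ in range(n_trials):
        params = {
            "n_estimators": int(rng.integers(50, 500)),
            "max_depth": int(rng.integers(2, 20)),
        }
        s = score_fn(params)
        if s > best_score:
            best_score, best_params = s, params
    return best_params, best_score

# toy scoring function with a known optimum at max_depth = 10
toy_score = lambda p: -abs(p["max_depth"] - 10)
best_params, best_score = random_search(toy_score)
```

With enough trials, the search recovers the optimum of the scoring function without enumerating the whole grid.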
For the optimization of the classifier's hyperparameters in Algorithm 2, we use an adjusted scoring function. The reason is that the classifier should be chosen so that it is not able to distinguish perfectly between X and Ñ. A perfect classification is useless for outlier detection since the outliers in X are also correctly assigned to X. However, our goal is to classify the outliers in X as points in Ñ in order to achieve outlier detection. This is illustrated in Figure 2.
The scoring function for optimizing the hyperparameters follows these principles and is shown in Algorithm 2. For effective optimization, an assumption is necessary about the percentage of points in X representing outliers. We denote this value by c. For the illustration here, we assume a value of c = 5%. First, we draw a new combination of hyperparameters and classify the points according to whether they belong to X or Ñ. Then, we consider the outlier scores for the points in X: predicted_scores := f_i(X). For the predicted_scores, we look at the 95% (= 1 − c) quantile and call it the p-score. In the same way, we also consider the minimum outlier score that was predicted (m-score). The score for the present combination of hyperparameters is the difference between the p-score and the m-score.

FIGURE 1 2D illustration for X and Ñ generated according to Algorithm 1. The data contain inliers in the clusters and outliers isolated around the clusters. The blue dots belong to the original data (X), and the orange dots (Ñ) were generated according to Algorithm 1. No generated point falls near the centre of already existing clusters in X, while the points of Ñ fill the whole feature space evenly. The even distribution of the points is important for the classifier's ability to interpolate the space between points well. The data were taken from Kriegel et al. (2009a).
This scoring is motivated by the objective to achieve heterogeneous outlier scores. While a minimum outlier score of 0 is desirable for many points, some points should also have large outlier scores. This is represented by the minimum and quantile of outlier scores. The bigger the difference between minimum and quantile, the more heterogeneous are the outlier scores, and we may assume that some points are isolated and can, therefore, be considered as outliers (Figure 3).
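A minimal sketch of this scoring, under the assumption that predicted outlier scores for the points in X are already available; the function name and the toy score vectors are ours.

```python
# Sketch of the adjusted scoring function: difference between the (1 - c)
# quantile (p-score) and the minimum (m-score) of the predicted outlier
# scores. Names and toy data are illustrative.
import numpy as np

def score_hyperparameters(predicted_scores, c=0.05):
    """Larger value = more heterogeneous outlier scores."""
    p_score = np.quantile(predicted_scores, 1.0 - c)   # e.g. 95% quantile
    m_score = np.min(predicted_scores)
    return p_score - m_score

# heterogeneous scores (a few isolated points) beat homogeneous ones
heterogeneous = np.r_[np.zeros(90), np.full(10, 0.9)]
homogeneous = np.full(100, 0.5)
```

A hyperparameter combination producing the heterogeneous scores would be preferred, since it separates a few isolated points from the bulk of the data.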

Choice of hyperparameters to optimize
A random forest has hyperparameters that describe, on the one hand, the growth of the individual trees and, on the other hand, the composition of the individual trees to form the random forest. In our experiments, we found some hyperparameters to be important for performance, while others had little influence.

FIGURE 2 A perfect classifier would classify x_j as a point in X, whereas from the outlier perspective, the classifier should predict x_j to be part of Ñ. Therefore, we need a custom scoring function for the random search.

The trees in the forest are fully grown until the required minimum number of samples is undercut. The pruning in the following describes a cost-complexity pruning and weighs the depth of the tree against the achieved accuracy. Details can be found in Breiman, Friedman, Olshen, and Stone (1984). Let T̃ denote the set of all leaves of a tree T, α the pruning parameter, and R(T) the misclassification cost. During pruning, a subtree of the tree is searched that minimizes the criterion R_α(T) = R(T) + α|T̃|. The hyperparameter α is the one that is determined by hyperparameter optimization. In other words, pruning leads to decision trees with the highest possible accuracy, while model complexity, measured by the number of leaves, remains low.

FIGURE 4 Results on different benchmarking data sets. We report the AUC value for different settings of hyperparameters. For every method, 10 different choices of hyperparameters are randomly drawn, whereas for our method, we search for optimal hyperparameters based on Algorithm 2. One can clearly see that the choice of hyperparameters highly affects the detection rate of outliers.
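In scikit-learn, this minimal cost-complexity pruning is exposed through the `ccp_alpha` parameter of tree-based models. The following sketch, with illustrative data, shows that a larger pruning parameter yields a smaller tree; the concrete value of `ccp_alpha` is our assumption.

```python
# Sketch: cost-complexity pruning via scikit-learn's ccp_alpha, which plays
# the role of the pruning parameter alpha. Data and alpha are illustrative.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
# larger alpha penalizes the number of leaves, yielding a smaller subtree
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
```

The pruned tree trades a small amount of training accuracy for a considerably lower number of leaves.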

Data and benchmark algorithms
In our experiments, we use different data sets that are publicly available. The data sets are taken from the collection of benchmarks published by Campos et al. (2015). The details of the data sets are listed in Table 1. We test different scenarios. The first scenario is the pure detection of outliers without modifying the data set. The second scenario simulates the detection of outliers when features are present that have no relevance for outlier detection. The modelling of irrelevant features is based on the work of Zimek et al. (2012). They keep some meaningful features for outlier detection and fill the remaining features with values of a uniform distribution. We follow this setup in our work. We take the data set with its full features and then add more features. We fill these features with values that follow a uniform or normal distribution in equal parts. Each time we extend the dimension of the data set with irrelevant features, we measure the detection rate of outliers.
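This construction of irrelevant features can be sketched as follows; the function name and the toy data are ours, not from the paper.

```python
# Sketch of the noise-feature setup: alternately append normally and
# uniformly distributed columns to a data set. Names and data are ours.
import numpy as np

def add_irrelevant_features(X, n_new, rng=None):
    """Append n_new noise columns, alternating Gaussian and uniform."""
    rng = np.random.default_rng(rng)
    cols = []
    for j in range(n_new):
        if j % 2 == 0:
            cols.append(rng.normal(size=(len(X), 1)))    # Gaussian noise
        else:
            cols.append(rng.uniform(size=(len(X), 1)))   # uniform noise
    return np.hstack([X] + cols)

X = np.random.default_rng(6).normal(size=(100, 13))   # e.g. 13 real features
X_noisy = add_irrelevant_features(X, n_new=4, rng=7)
```

The original features are left untouched; only uninformative columns are appended, so any drop in detection rate is attributable to the added noise.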

Results without artificial features
First, we look at the results without additional features. In Figure 4, 10 different runs are shown, with different hyperparameters selected randomly for each run. For KNN, the number of nearest neighbours k was varied, as well as for the LOF. For the isolation forest, the number of trees available to isolate the points is varied. The one-class SVM (OC-SVM) takes different values for the kernel bandwidth. The kernel used is the radial basis function. The autoencoder varies the number of hidden layers and, thus, the dimension in the bottleneck. Our method does not vary the hyperparameters but searches for the hyperparameters as described in Algorithm 2 initialized with different seeds.
It can be seen that the detection rate of outliers for many methods depends on how the hyperparameters are chosen. Especially when outliers are detected using the k-nearest neighbours, it is noticeable that the variance of the results is relatively small. This can be explained by the fact that larger k values lead to more robust estimates of the distance: the distance to the single nearest neighbour is exposed to a larger variance than the distance obtained with larger k. Although the LOF is likewise based on distances, the additional normalization of densities that the method performs is not beneficial on these data sets. Choosing the correct hyperparameter for LOF is much more important than for KNN. The isolation forest also exhibits relatively low variance in results, just like the OC-SVM and the autoencoder. The variance in our method can be explained mainly by the fact that for each iteration, a new set of random points is generated, which can have varying usefulness for outlier detection.

FIGURE 5 The additional features reduce the ability to detect outliers in most of the cases. Our method, due to the self-supervised strategy to detect outliers, remains robust in the ability to detect outliers. The experiments were repeated 10 times, and the median of the runs is shown.
Nevertheless, the detection of outliers is successful, and the random effects in the proposed algorithm do not matter much. Overall, the different methods, including ours, are on a similar level in terms of outlier detection.
An exception is the data set WPBC. This data set presents serious difficulties for a wide range of methods; our method clearly outperforms the other methods on this data. It is noticeable that the proportion of outliers for WPBC deviates significantly from the assumed 5%. Thus, although the hyperparameters are optimized for a 5% proportion of outliers, we find very satisfactory results. This confirms our observations from different data sets, which we do not report here, that the search for hyperparameters within reasonable limits is not sensitive to the suspected percentage of outliers.

Results with artificial features
In the next step, we simulate the detection of outliers if there are irrelevant features in the data set. For this purpose, we add features to the data sets in several iterations. We fill these features with values that are drawn alternately from a Gaussian and a uniform distribution. Using the HeartDisease data set as an example, this results in the following. The data set originally consists of 13 features. In the first iteration, we add a new feature and fill it with values from a Gaussian distribution. In the second iteration, we add another feature and fill it with values from a uniform distribution. In the third iteration, the values again follow a normal distribution, and so on (Figure 5).
The ability of the different methods to find outliers in this setup decreases as the number of noise features increases. It can be observed that the different methods suffer differently in their ability to identify outliers. In the setup without artificially added features, the different methods have approximately similar levels of outlier detection performance. This is independent of the fact that the methods are based on different assumptions: the autoencoder works with dimensionality reduction, while KNN and LOF detect outliers using nearest neighbours.
On the one hand, the accuracy of the outlier detection in this noisy setup depends on which method is used. On the other hand, the structure of the data also has an influence: the behaviour of the methods differs across data sets. This can be seen from the two data sets HeartDisease and Lymphography. While the detection rate on HeartDisease gets worse for all methods, it is especially the OC-SVM on Lymphography that rapidly shows an undesired behaviour in the detection of outliers. The isolation forest also suffers from the additional dimensions, but less severely. Other methods show robust behaviour towards the additional noise in the data. Our method benefits from its design: due to the self-supervised learning, the additional noisy features do not reduce the ability to detect outliers. This is because the noisy features do not contain information about the clusters in the data set. Consequently, our method clearly outperforms the competitors in this setup, even on the challenging data set WPBC.

CONCLUSIONS
In this article, we present a method to transform unsupervised outlier detection into a supervised problem. The classifier from the supervised problem is suitable to make predictions about the outlier score of points. Furthermore, we present an algorithm to determine the hyperparameters for the classifier appropriately.
In benchmarking with other state-of-the-art methods for tabular data, our method provides at least comparable results. In the relevant case when noninformative features are present, for example, in big data applications, the strengths of the proposed method become apparent. Due to the self-supervised learning, the classifier is able to identify common structures in the data. This makes it robust against features that do not provide information about clusters in the data. In the benchmark with other methods in the presence of noninformative features, our method is clearly superior. We have made the implementation of our code publicly available and published it as an installable package on Python's package index.
The data that support the findings of this study are openly available on GitHub at https://github.com/JanDiers/self-supervised-outlier. They are taken from benchmarks used by Campos et al. (2015) and may also be accessed here: https://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/. Campos et al. (2015) preprocessed the data according to the steps described in their paper. Originally, the data are taken from the UCI Machine Learning Repository at http://doi.org/10.17616/R3T91Q.

ACKNOWLEDGEMENT
Open access funding enabled and organized by Projekt DEAL.