Fully automated unsupervised artefact removal in multichannel electroencephalogram using wavelet-independent component analysis with density-based spatial clustering of application with noise

Electroencephalography (EEG) is a method for recording electrical activity arising from the cortical surface of the brain, which has found wide application not just in clinical medicine, but also in neuroscience research and studies of Brain-Computer Interfaces (BCI). However, EEG recordings often suffer from distortions due to artefactual components that degrade the true EEG signals. Artefactual components are any unwanted signals recorded in the EEG spectrum that originate from sources other than the neurophysiological activity of the human brain. Examples of the origin of artefactual components include eye blinking, facial or scalp muscle activity, and electrode slippage. Techniques for automated artefact removal such as the Wavelet Transform and Independent Component Analysis (ICA) have been used to remove or reduce the effect of artefactual components on the EEG signals. However, detecting or identifying the signal artefacts to be removed presents a great challenge, as EEG signal properties vary between individuals and age groups. Techniques that rely on some arbitrarily defined threshold often fail to identify accurately the signal artefacts in a given dataset. In this study, a method is proposed using unsupervised machine learning coupled with Wavelet-ICA to remove EEG artefacts. Using Density-Based Spatial Clustering of Application with Noise (DBSCAN), a classification accuracy of 97.9% is achieved in identifying artefactual components. DBSCAN achieved excellent and robust performance in identifying artefactual components during the Wavelet-ICA process, especially in consideration of the low-density nature of typical artefactual signals. This new hybrid method provides a scalable and unsupervised solution for automated artefact removal that should be applicable to a wide range of EEG data types.


| INTRODUCTION
Electroencephalography (EEG) is a non-invasive method to record dynamics of brain activity using electrodes attached loosely to the scalp. Typically, EEG signals are contaminated by signal artefacts arising from various sources, notably due to eye blinking or the activity of facial and scalp muscles, or any electrode movements. Filtration of these artefacts from the EEG signals is necessary prior to further processing in brain-computer interface (BCI) or other neuroscience research applications. However, filtering away artefacts in either the time or frequency domains often results in substantial loss of information relevant to cerebral activities [1,2]. Therefore, it is advisable to employ filtration methods optimised to retain as much as possible of the signal arising from the cerebral activities of interest. This is important to ensure that only EEG signals of cerebral origin are the inputs for analytical applications, particularly those requiring an uninterrupted time series of EEG signals.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
Methods based on Blind Source Separation (BSS), especially those using Independent Component Analysis (ICA) coupled with Discrete Wavelet Transform (DWT), comprise the standard approach for separating artefactual components from multichannel EEG signals [3-7]. This general approach takes advantage of the independence between artefacts and cerebral sources, which facilitates the separation of EEG signals into their independent components (ICs). The artefact removal process then identifies ICs containing artefactual signals, which can then be filtered out [8,9]. However, thresholding or visual inspection of the processed ICs is required to distinguish the artefactual ICs from the ICs actually containing the cerebral activities of interest [10]. Unfortunately, the thresholding methods commonly used to separate artefactual ICs are rather ineffective, because EEG noise properties vary between individuals and, furthermore, artefacts can be admixed with real cerebral signals in the raw EEG recording in many different ways [11].
Supervised machine learning using a Support Vector Machine (SVM) is an approach to determine the optimum hyperplane for separating different classes of data such as the artefactual and non-artefactual ICs [3,12,13]. This method significantly improves the classification accuracy and robustness of the artefact detection procedure [14]. However, supervised machine-learning algorithms such as SVM require a large amount of labelled training data to support a reliable classification model. In practice, acquired EEG data can be too sparse to suffice in training the model, or alternatively, manually labelling sufficient EEG training data may be prohibitively difficult.
In this study, we propose a fully automated method to detect artefactual ICs using an unsupervised machine-learning algorithm, without any requirement for training a classification model. In particular, we employed Density-Based Spatial Clustering of Application with Noise (DBSCAN) as the unsupervised machine-learning algorithm to detect the artefactual ICs automatically. DBSCAN is robust for this particular application, since the targeted artefactual ICs in the EEG signals often have a lower density compared with the non-artefactual ICs. This study evaluates the performance of DBSCAN relative to that of a conventional supervised machine-learning algorithm and a thresholding method, as described in previous studies [3,11].

| EEG recording
EEG signals were acquired using g.USBamp (g.tec) from 11 healthy volunteers, who had given informed consent to participate in a protocol approved by the local ethics committee (University Malaya Medical Centre Ethical Clearance: 20,156-1404). The electrodes were placed in accordance with the 10-20 system, with a total of 16 electrodes corresponding to FP1, FP2, F3, Fz, F4, T7, C3, Cz, C4, T8, P3, Pz, P4, O1, Oz, and O2. The ground electrode was fixed at FPz, while the reference point was fixed at the left earlobe (A1). The scalp impedance was kept below 5 kΩ and the recordings were conducted with a sampling rate of 256 Hz. A notch filter of 50 Hz (Butterworth, order 4) and a band pass filter of 0.5-100 Hz (Butterworth, order 8) were applied by default during the recording. The subjects were instructed to maintain a natural upright sitting position with eyes open for a recording session lasting up to 30 min. Eye blink artefacts were registered in the EEG signals following involuntary eye blink activities. An example of a recorded eye blink artefact is shown in Figure 1.

| Wavelet multiresolution analysis
Wavelet Multiresolution Analysis (WMA) describes the process of applying the dyadic Discrete Wavelet Transform (DWT) to decompose a signal into its wavelet components and subsequently reconstruct the denoised signal [15,16]. We applied WMA to the EEG signals and retained only the frequency bands of main interest, that is, the delta (0.5-4 Hz), theta (4-8 Hz), alpha (8-12 Hz), and beta (12-32 Hz) bands [17,18]. DWT entails the sequential application of low- and high-pass filters to decompose the EEG signal into multiple wavelet components as described by

$$y_{\mathrm{low}}[n] = \sum_{k} x[k]\, g[2n - k], \qquad y_{\mathrm{high}}[n] = \sum_{k} x[k]\, h[2n - k]$$

where $g$ and $h$ represent the low- and high-pass filters, respectively. The process of DWT is also shown schematically in Figure 2(a) and that of inverse DWT in Figure 2(b). In Figure 2(a), $x[k]$ represents a channel of discrete EEG signals simultaneously passed through a low-pass filter, $g[n]$, and a high-pass filter, $h[n]$. This process is repeated $n$ times to decompose the EEG signal into $n$ levels of wavelet detail, $D_1, D_2, \dots, D_n$, and an approximation component, $A_n$. Inverse DWT is applied in a similar but reversed sequence by recombining the wavelet details and the approximation into a channel of discrete EEG signal, $x'[k]$, as shown in Figure 2(b). During the WMA process, only the wavelet details and approximation corresponding to the frequency bands of interest are ultimately retained.
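The filter-and-downsample cascade described above can be illustrated with a minimal sketch. For brevity it assumes the two-tap Haar filters rather than the db8 wavelet used in this study, and the function names (`dwt_level`, `wavedec`, `waverec`) are hypothetical; it is meant only to show the structure of the decomposition and its inverse, not the implementation used here.

```python
import math

S = 1.0 / math.sqrt(2.0)  # Haar filter coefficient (assumption: Haar, not db8)

def dwt_level(x):
    """One DWT level: low-pass/high-pass filtering followed by downsampling by 2."""
    half = len(x) // 2
    approx = [(x[2 * i] + x[2 * i + 1]) * S for i in range(half)]  # low-pass branch
    detail = [(x[2 * i] - x[2 * i + 1]) * S for i in range(half)]  # high-pass branch
    return approx, detail

def idwt_level(approx, detail):
    """Inverse of dwt_level: recombine one approximation/detail pair."""
    x = []
    for a, d in zip(approx, detail):
        x.append((a + d) * S)
        x.append((a - d) * S)
    return x

def wavedec(x, levels):
    """Repeat the decomposition on the approximation; returns A_n and [D_1..D_n]."""
    approx, details = list(x), []
    for _ in range(levels):
        approx, d = dwt_level(approx)
        details.append(d)
    return approx, details

def waverec(approx, details):
    """Inverse DWT: recombine from the deepest level back to the signal."""
    for d in reversed(details):
        approx = idwt_level(approx, d)
    return approx

x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]   # toy signal, length 2**3
a3, ds = wavedec(x, 3)
x_rec = waverec(a3, ds)                           # matches x up to rounding
```

Setting selected entries of `ds` to zero before calling `waverec` corresponds to discarding the frequency bands that are not of interest.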

| Independent component analysis
ICA separates multivariate signals, $x$, into their statistically independent and non-Gaussian components, $\hat{s}$. ICA exploits the independence between the artefactual components and the true EEG signals that originate from cerebral activities of interest to separate the artefactual components [19]. This approach has been widely used in separating artefactual components from the EEG signals for further removal processes [6,20,21]. Valid application of ICA requires that several assumptions be met. The relationship between the multivariate signals, $x$, and the source components, $\hat{s}$, is then satisfied by the equation

$$x = A\hat{s}$$

where $A$ is the unknown mixing matrix, which is to be estimated by using the ICA algorithms. These same source components, $\hat{s}$, are also known as the independent components (ICs). The reconstruction of source components into the multivariate signals is known as inverse ICA, which is accomplished by applying the estimated mixing matrix to the source components.
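As an illustration of the linear mixing model and of the reconstruction step, the following sketch mixes two hypothetical sources into two channels and then rebuilds the channels after suppressing the artefactual source. The mixing matrix and the source values are invented for the example and are assumed to be known; in practice the mixing matrix has to be estimated by an ICA algorithm, which this sketch does not attempt.

```python
def mix(A, S):
    """Apply a mixing matrix A (channels x sources) to sources S (sources x samples)."""
    n_samples = len(S[0])
    return [
        [sum(A[i][k] * S[k][t] for k in range(len(S))) for t in range(n_samples)]
        for i in range(len(A))
    ]

# Hypothetical mixing matrix and sources (illustrative values only)
A = [[1.0, 0.5],
     [0.3, 1.0]]
S = [[0.1, -0.2, 0.15, 0.05],   # cerebral-like source
     [0.0, 5.0, 0.0, 0.0]]      # blink-like artefactual source

X = mix(A, S)                   # contaminated two-channel recording

# Suppress the artefactual source, then reconstruct the channels
S_clean = [S[0], [0.0] * len(S[1])]
X_clean = mix(A, S_clean)
```

In the proposed pipeline the artefactual IC is not zeroed outright but passed through wavelet artefact removal before reconstruction; zeroing is used here only to keep the example short.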

| Density-based spatial clustering of application with noise
Density-Based Spatial Clustering of Application with Noise (DBSCAN) is an unsupervised machine-learning algorithm for clustering of data into multiple sets of clusters [22]. DBSCAN exploits the differing density distributions of data within the entire data set to separate it into different clusters. The data points are clustered based on density, whereby any set of data points with similar density distribution is classified as one cluster. Data points in high-density regions are typically classified as a group, while data points in low-density regions are typically classified as noise. The input parameters of DBSCAN are the radius of the circle, ε, and the minimum number of data points, minPTS, that must fall within the radius to be classified as a cluster. A cluster consists of core and border points that satisfy the following properties:
(1) The core points within a cluster are mutually connected, that is, each has at least minPTS data points falling within the defined radius.
(2) Any border point falling within the defined radius of any core point of the cluster is assigned as a member of the cluster.
The unsupervised machine-learning algorithm presents the advantage that it does not require training using manually labelled data, in contrast to the supervised machine-learning algorithm.
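The core/border/noise rules above can be sketched in a few lines of plain Python. This is a brute-force illustration with hypothetical two-dimensional feature points, not the implementation used in this study; it assumes that a point counts itself among its neighbours and that noise is marked with the label -1.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks low-density points."""
    n = len(points)
    labels = [None] * n

    def neighbours(i):
        # All points (including i itself) within radius eps of point i
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE            # may later be relabelled as a border point
            continue
        cluster += 1                     # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:   # j is also a core point: keep expanding
                queue.extend(j_nbrs)
    return labels

# Hypothetical feature points: a dense group plus two isolated outliers
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
          (3.0, 3.0), (-4.0, 2.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
```

In this toy example the five nearby points form one cluster and the two isolated points receive the noise label, which is the behaviour the proposed method relies on when flagging candidate artefactual ICs.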

| Proposed automated artefact removal process
The process of automated artefact removal is depicted in Figure 3, wherein the EEG signals were recorded as described in Section 2.1. Initially, the EEG signals are passed through a WMA using DWT with eight levels and the mother wavelet of db8, this being the eighth member of the Daubechies wavelet family [17]. During the inverse DWT process, we retain only the frequency bands of interest falling in the range of 0.5-32 Hz, corresponding to the well-established and physiologically relevant frequency bands noted above, that is, the delta (0.5-4 Hz), theta (4-8 Hz), alpha (8-12 Hz) and beta (12-32 Hz) bands. These bands of activity have particular importance in studies of human pathophysiology such as neuropsychiatric disorders [23]. The frequency bands of interest correspond to the wavelet details from D3 to D8 of the decomposed wavelet components.
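The correspondence between the retained 0.5-32 Hz range and the detail levels D3 to D8 follows from the dyadic band splitting at a sampling rate of 256 Hz. The short sketch below makes that arithmetic explicit; the helper names are hypothetical and the bands are nominal, since practical wavelet filters do not have ideal cut-offs.

```python
def detail_bands(fs, levels):
    """Nominal frequency band (Hz) covered by each detail level D1..Dn."""
    bands = {}
    for level in range(1, levels + 1):
        high = fs / 2 ** level          # upper edge of the band at this level
        low = fs / 2 ** (level + 1)     # lower edge of the band at this level
        bands[f"D{level}"] = (low, high)
    return bands

def levels_to_keep(fs, levels, f_min, f_max):
    """Detail levels whose nominal band lies entirely inside [f_min, f_max]."""
    return [name for name, (low, high) in detail_bands(fs, levels).items()
            if low >= f_min and high <= f_max]

bands = detail_bands(256, 8)            # e.g. D1 spans 64-128 Hz, D8 spans 0.5-1 Hz
kept = levels_to_keep(256, 8, 0.5, 32)  # detail levels inside 0.5-32 Hz
```

With these settings, D1 and D2 (32-128 Hz) and the final approximation (below 0.5 Hz) fall outside the range of interest.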
After passing through the WMA process, the EEG signals are decomposed into their ICs using the ICA algorithm [24]. Statistical features of interest as detailed in our previous work [3], in particular the kurtosis, variance, range and Shannon's Entropy, are then extracted from the ICs. These features of interest were selected based on their potential for discriminating artefactual and non-artefactual components of the complete signal [3]. The artefactual components that arise from eye blinking and other muscle activities typically have a high amplitude, exceeding the amplitude range of EEG signals. Therefore, the features corresponding to the changes in amplitude are selected as the input to the machine-learning algorithm to distinguish the ICs with and without artefactual components.
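A minimal sketch of the four features is given below. The exact definitions used in this study are those of our previous work [3]; the versions here are common textbook forms chosen as assumptions (population moments, non-excess kurtosis, and a histogram-based entropy with an arbitrary bin count), and the sample values are invented.

```python
import math
from collections import Counter

def variance(x):
    """Population variance."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

def kurtosis(x):
    """Fourth standardised moment (3 for a Gaussian); assumes non-zero variance."""
    m = sum(x) / len(x)
    return sum((v - m) ** 4 for v in x) / len(x) / variance(x) ** 2

def value_range(x):
    """Peak-to-peak amplitude."""
    return max(x) - min(x)

def shannon_entropy(x, bins=8):
    """Entropy in bits of an equal-width histogram of x (bin count is an assumption)."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins or 1.0      # avoid a zero bin width for constant input
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in x)
    n = len(x)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Hypothetical IC segments: one clean, one with a blink-like deflection
clean = [0.1, -0.2, 0.15, -0.05, 0.0, 0.1, -0.1, 0.05]
blink = clean[:3] + [5.0] + clean[4:]

features_clean = (kurtosis(clean), variance(clean), value_range(clean), shannon_entropy(clean))
features_blink = (kurtosis(blink), variance(blink), value_range(blink), shannon_entropy(blink))
```

Each IC then contributes one four-dimensional feature vector to the clustering step.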
The unsupervised machine-learning algorithm, DBSCAN, is then applied to identify the ICs with artefactual components to be removed. The parameters of the DBSCAN algorithm are determined beforehand using a subset of the data, with the minPTS value adjusted according to the number of data points.
FIGURE 3 Proposed automated artefact removal process using Wavelet-independent component analysis and density-based spatial clustering of application with noise

Identified ICs with artefactual components are passed through a wavelet artefact removal process as described previously [6,11,25]. Here, the wavelet components with coefficients exceeding the universal value of wavelet denoise are deemed to be artefacts and are removed. The universal value, $K$, for wavelet denoise is calculated as

$$K = \sqrt{2 \log n}\,\sigma \qquad (2.10)$$

where $n$ is the length of data and

$$\sigma = \frac{\mathrm{median}(|W(j,k)|)}{0.6745} \qquad (2.11)$$

represents the magnitude of the neuronal wide band signal [17,26]. $W(j,k)$ refers to the wavelet components with scale factor $k$ and time parameter $j$. Finally, inverse DWT and inverse ICA are applied to reconstruct the filtered EEG signals with artefacts removed.
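Equations (2.10) and (2.11) translate directly into code. The sketch below assumes a natural logarithm in (2.10) and applies the threshold to a single list of hypothetical wavelet coefficients; the function names are illustrative only.

```python
import math
from statistics import median

def universal_threshold(coeffs):
    """K = sqrt(2 log n) * sigma, with sigma = median(|W|) / 0.6745."""
    n = len(coeffs)
    sigma = median(abs(w) for w in coeffs) / 0.6745
    return math.sqrt(2.0 * math.log(n)) * sigma   # natural log is an assumption

def remove_artefact_coeffs(coeffs):
    """Zero every coefficient whose magnitude exceeds the universal value K."""
    k = universal_threshold(coeffs)
    return [0.0 if abs(w) > k else w for w in coeffs]

# Hypothetical wavelet coefficients of an artefactual IC
coeffs = [0.2, -0.1, 0.3, -0.25, 0.15, 9.0, -0.2, 0.1]
K = universal_threshold(coeffs)
cleaned = remove_artefact_coeffs(coeffs)
```

Because σ is based on the median, a few very large artefactual coefficients barely raise the threshold, so they are removed while the smaller coefficients are left unchanged.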

| Filtered clean EEG signals
After passing through the proposed automated artefact removal process, clean EEG signals are obtained as the final output. An example of a five-second epoch of filtered, clean EEG signals corresponding to the five-second epoch in Figure 1 is shown in Figure 4. It can be seen that the eye-blink artefacts in the second and fifth segments of the five-second epoch are effectively removed, while only the frequency bands of interest from 0.5 to 32 Hz are retained.

| Evaluation on the accuracy of DBSCAN clustering algorithm
The classification accuracy of the DBSCAN algorithm in identifying ICs containing artefactual components is evaluated by the following formulae

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$ represents the number of True Positive, $TN$ the True Negative, $FP$ the False Positive, and $FN$ the False Negative events. Figure 5(a) and (b) show the instances of classification for an EEG dataset containing 704 data points pooled from the 11 subjects. In this example, the parameters of DBSCAN were selected as ε = 2.6 and minPTS = 19. The symbol * denotes the instances of misclassification.
We noticed that the artefactual ICs typically exist in low density regions and were detected as noise by the DBSCAN algorithm. The classification result of the DBSCAN algorithm is summarised in Table 1. The DBSCAN algorithm achieved a sensitivity of 94.1%, specificity of 98.1%, and an overall accuracy of 97.9%.
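For reference, the three measures reported above can be computed from the confusion counts as in the sketch below. The counts in the example are hypothetical and are not taken from Table 1.

```python
def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity and overall accuracy from confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),                 # artefactual ICs correctly flagged
        "specificity": tn / (tn + fp),                 # clean ICs correctly retained
        "accuracy": (tp + tn) / (tp + tn + fp + fn),   # all correct decisions
    }

# Hypothetical counts for illustration only
metrics = classification_metrics(tp=45, tn=90, fp=10, fn=5)
```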
We compared this result with the classification accuracy achieved by a supervised machine-learning algorithm using a Support Vector Machine (SVM), and likewise by an arbitrarily defined thresholding algorithm. The results, summarised in Table 2, show that the classification accuracy of the DBSCAN algorithm was superior to that obtained with the thresholding algorithm, but was marginally lower than that of the supervised machine-learning algorithm using SVM. However, we emphasise that the application of DBSCAN as an unsupervised machine-learning algorithm does not require any previous manual labelling of the data, nor does it require a training process using a large amount of data. Table 2 shows the performance of the DBSCAN algorithm as compared with the SVM and thresholding algorithms.

| Correlation coefficients
We computed the correlation coefficients between the filtered EEG signals and their raw signals before filtering to compare the robustness of the filtering processes in retaining the structure of the EEG signals. The correlation coefficient is described by the following equation:

$$\rho_{x,y} = \frac{E[(x - \mu_x)(y - \mu_y)]}{\sigma_x \sigma_y}$$

where $\mu_x$ and $\mu_y$ are the expected values of $x$ and $y$, while $\sigma_x$ and $\sigma_y$ are the standard deviations of $x$ and $y$, respectively. Table 3 shows the average correlation coefficient (± standard deviation) for the proposed hybrid method compared against the other methods. We observe that the proposed method achieved an average correlation coefficient of 0.947, a result similar to those of the thresholding algorithm and the SVM. The DBSCAN algorithm successfully filtered most of the target artefactual components while retaining a high correlation coefficient with the unfiltered raw EEG signals. Indeed, the DBSCAN algorithm had a lower sensitivity and specificity as compared with the SVM, thereby missing some of the marginal artefactual ICs with features that were at the borderline with non-artefactual ICs. As a result, the DBSCAN algorithm retained a marginally lower proportion of the true EEG signals while still being successful in removing the major artefactual components contaminating the EEG signals. We conclude that the proposed method using DBSCAN has performed very well and retained a reasonable proportion of the raw EEG signals compared with standard methods.
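A minimal sketch of the correlation computation between a raw and a filtered channel is shown below. It uses population statistics and hypothetical sample values, and assumes that neither signal is constant.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length signals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

# Hypothetical raw and filtered samples from one channel
raw = [0.1, -0.2, 0.15, 2.0, 0.0, 0.1, -0.1, 0.05]
filtered = [0.1, -0.2, 0.15, 0.3, 0.0, 0.1, -0.1, 0.05]
r = pearson(raw, filtered)
```

A value close to 1 indicates that the filtering has preserved the overall structure of the raw signal.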

| DISCUSSION
We have described a fully automated system for the removal of EEG artefacts using wavelet-ICA and an unsupervised machine-learning algorithm, DBSCAN. The DBSCAN method is particularly suitable for this application as the ICs with artefactual components typically exist at lower density as compared with the ICs that specifically pertain to the cerebral activities of interest. We selected statistical features that best describe the discrepancy between ICs with and without artefactual components. In this case, the features were range, variance, kurtosis and Shannon's Entropy, as they typically take higher values when an artefactual component is present in the contaminated ICs [3]. Although DBSCAN achieved a slightly lower accuracy as compared with the SVM in this test of real EEG data, DBSCAN is an unsupervised machine-learning algorithm for data clustering and consequently does not require prior labelling of the data or availability of a training data set. This property can bring a logistical advantage in the handling of new types of data. The appropriate parameters of DBSCAN are selected by testing on a similar set of sample data. Here, the pair of input values of ε and minPTS is selected by screening through several permutations to maximise the classification accuracy. Following this procedure, the value of minPTS is linearly scaled to the sample size of the data. For example, in the present instance of 704 samples, we used 19 as the value of minPTS; doubling the sample size to 1408 samples would require doubling minPTS to a value of 38.
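The linear scaling of minPTS with sample size can be written as a one-line helper. The reference values (704 samples, minPTS = 19) are those reported above; the function name, the rounding to the nearest integer and the lower bound of 1 are assumptions made for this sketch.

```python
def scale_min_pts(n_samples, ref_samples=704, ref_min_pts=19):
    """Scale minPTS linearly with the number of samples (rounding is an assumption)."""
    return max(1, round(ref_min_pts * n_samples / ref_samples))

min_pts_same = scale_min_pts(704)      # the reference case
min_pts_double = scale_min_pts(1408)   # twice the reference sample size
```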
Classification using our unsupervised machine-learning algorithm proved to have significant advantages over the supervised machine-learning algorithm. As noted above, an unsupervised machine-learning algorithm does not require labelled data, which is a potentially useful consideration when handling a large data volume, for which manual labelling may be impractical. Furthermore, the unsupervised machine-learning algorithm does not require a training step that depends upon a large amount of training data. This is particularly important when data are scarce, as might occur for EEG recordings in certain clinical populations, which would limit the amount of unused data remaining for analysis by the trained model. Therefore, our proposed method lends itself to large-scale applications, such as the task of filtering multichannel EEG signals with large sample size.
The DBSCAN algorithm also achieved a high correlation coefficient between the filtered signals and raw EEG signals, despite removal of the artefactual components. One might argue that the DBSCAN algorithm has a lower sensitivity and specificity as compared with the SVM, and thereby had a lower performance in the selective removal of artefactual components. However, upon further examination, we noticed that most of the eye blink artefacts were indeed removed, while only those artefacts under borderline classification remained in the filtered EEG signals. These borderline cases are typically difficult to distinguish by visual inspection of the EEG signals. It remains to be determined whether retaining these borderline signals would be detrimental to the interpretation of EEG signals, or whether it might rather be helpful to retain them as much as possible in the EEG signals. However, resolving this question is beyond the scope of this study, which aims only to introduce the method of filtering EEG signals using Wavelet-ICA and DBSCAN.

| CONCLUSION
We propose and test a fully automated EEG artefact removal system using an unsupervised machine-learning algorithm, DBSCAN, in conjunction with wavelet-ICA. In the analysis pipeline, wavelet-ICA is used to separate and remove the artefactual components within the ICs, while DBSCAN is used to identify the artefactual components in an unsupervised manner. Although the proposed method achieved a slightly lower classification accuracy than did its supervised counterpart, its adequately high performance was obtained without any requirement for data labelling or training of a classification model. The proposed method also achieved a high correlation coefficient comparable with the other methods presented. This method should be readily scalable to filter EEG signals of substantial sample size.