Deep learning with convolutional neural networks for EEG decoding and visualization

Abstract Deep learning with convolutional neural networks (deep ConvNets) has revolutionized computer vision through end‐to‐end learning, that is, learning from the raw data. There is increasing interest in using deep ConvNets for end‐to‐end EEG analysis, but a better understanding of how to design and train ConvNets for end‐to‐end EEG decoding and how to visualize the informative EEG features the ConvNets learn is still needed. Here, we studied deep ConvNets with a range of different architectures, designed for decoding imagined or executed tasks from raw EEG. Our results show that recent advances from the machine learning field, including batch normalization and exponential linear units, together with a cropped training strategy, boosted the deep ConvNets decoding performance, reaching at least as good performance as the widely used filter bank common spatial patterns (FBCSP) algorithm (mean decoding accuracies 82.1% FBCSP, 84.0% deep ConvNets). While FBCSP is designed to use spectral power modulations, the features used by ConvNets are not fixed a priori. Our novel methods for visualizing the learned features demonstrated that ConvNets indeed learned to use spectral power modulations in the alpha, beta, and high gamma frequencies, and proved useful for spatially mapping the learned features by revealing the topography of the causal contributions of features in different frequency bands to the decoding decision. Our study thus shows how to design and train ConvNets to decode task‐related information from the raw EEG without handcrafted features and highlights the potential of deep ConvNets combined with advanced visualization techniques for EEG‐based brain mapping. Hum Brain Mapp 38:5391–5420, 2017. © 2017 Wiley Periodicals, Inc.


A.2 FBCSP implementation
As in many previous studies (Lotte et al., 2007), we used regularized linear discriminant analysis (RLDA) as the classifier, with shrinkage regularization (Ledoit and Wolf, 2004). To decode multiple classes, we used one-vs-one majority weighted voting: we trained an RLDA classifier for each pair of classes, summed the classifier outputs (scaled to be in the same range) over all pairwise classifiers for each class, and picked the class with the highest sum (Chin et al., 2009; Galar et al., 2011).
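For illustration, the following is a minimal sketch of such a one-vs-one scheme using scikit-learn's shrinkage-regularized LDA (the lsqr solver with shrinkage='auto' uses Ledoit-Wolf shrinkage); the feature matrix X, the labels y (assumed to be integers 0..n_classes-1), and the per-classifier output rescaling are placeholders, not the published implementation.

import numpy as np
from itertools import combinations
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_one_vs_one(X, y):
    # Train one shrinkage-regularized LDA classifier per pair of classes.
    classifiers = {}
    for a, b in combinations(np.unique(y), 2):
        mask = np.isin(y, [a, b])
        clf = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto')
        clf.fit(X[mask], y[mask])
        classifiers[(a, b)] = clf
    return classifiers

def predict_one_vs_one(classifiers, X, n_classes):
    # Sum the (rescaled) pairwise classifier outputs and pick the class with the highest sum.
    votes = np.zeros((len(X), n_classes))
    for (a, b), clf in classifiers.items():
        scores = clf.decision_function(X)                    # signed distance to the decision boundary
        scores = scores / (np.max(np.abs(scores)) + 1e-12)   # rescale to a common range
        votes[:, b] += scores                                # positive scores favor class b
        votes[:, a] -= scores                                # negative scores favor class a
    return np.argmax(votes, axis=1)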
FBCSP is typically used with feature selection, since a few spatial filters from a few frequency bands often suffice to reach good accuracies, whereas using many or even all spatial filters often leads to overfitting (Blankertz et al., 2008; Chin et al., 2009). We used a classical measure for preselecting spatial filters, the ratio of the corresponding power features between the two classes extracted by each spatial filter (Blankertz et al., 2008).
Additionally, we performed a feature selection step on the final filter bank features, selecting features using an inner cross-validation on the training set; see the published code for details.
In the present study, we designed two filter banks adapted to the two datasets to capture the most discriminative motor-related band power information. In preliminary experiments on the training set, overlapping frequency bands led to higher accuracies, as also proposed by Sun et al. (2010). As the bandwidth of physiological EEG power modulations typically increases in higher frequency ranges (Buzsáki and Draguhn, 2004), we used frequency bands with 6 Hz width and 3 Hz overlap in frequencies up to 13 Hz, and bands of 8 Hz width and 4 Hz overlap in the range above 10 Hz. Frequencies above 38 Hz only improved accuracies on one of our datasets, the so-called High-Gamma Dataset (see Section 2.7, where we also describe the likely reason for this difference, namely that the recording procedure for the High-Gamma Dataset, but not for the BCI competition datasets, was specifically optimized for the high frequency range). Hence the upper limit of the used frequencies was set to 38 Hz for the BCI competition datasets, while the upper limit for the High-Gamma Dataset was set to 122 Hz, close to the Nyquist frequency, thus allowing FBCSP to also use information from the gamma band.
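For illustration, a short sketch that generates such overlapping bands; the lowest frequency and the handling of the upper edge are placeholders, so the exact band edges of the published filter bank may differ.

def build_filter_bank(f_end, low_start=4):
    # 6 Hz wide bands with 3 Hz overlap (i.e., 3 Hz steps) up to 13 Hz,
    # 8 Hz wide bands with 4 Hz overlap (4 Hz steps) above 10 Hz, up to f_end.
    bands = []
    low = low_start
    while low + 6 <= 13:
        bands.append((low, low + 6))
        low += 3
    low = 10
    while low + 8 <= f_end:
        bands.append((low, low + 8))
        low += 4
    if bands[-1][1] < f_end:      # clip a final band at the upper limit if needed
        bands.append((low, f_end))
    return bands

print(build_filter_bank(38))    # BCI competition datasets
print(build_filter_bank(122))   # High-Gamma Dataset (close to Nyquist at 250 Hz)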
As a sanity check, we compared the accuracies of our FBCSP implementation to those published in the literature for the same BCI competition IV dataset 2a (Sakhavi et al., 2015) and found very similar performance: 67.59% for our implementation vs. 67.01% for theirs on average across subjects (p>0.7, Wilcoxon signed-rank test; see Result 1 for more detailed results). This underlines that our FBCSP implementation, including our feature selection and filter bank design, was indeed a suitable baseline for the evaluation of our ConvNet decoding accuracies.
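Such a per-subject comparison can be done, for example, with SciPy; the accuracy arrays below are hypothetical placeholders, not the actual per-subject results.

from scipy.stats import wilcoxon

# Hypothetical per-subject accuracies of the two FBCSP implementations (9 subjects)
acc_ours   = [0.82, 0.55, 0.88, 0.61, 0.50, 0.47, 0.86, 0.80, 0.59]
acc_theirs = [0.81, 0.54, 0.89, 0.62, 0.49, 0.48, 0.85, 0.79, 0.56]

stat, p = wilcoxon(acc_ours, acc_theirs)
print(f"Wilcoxon signed-rank test: W={stat}, p={p:.3f}")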

A.3 Residual network architecture
In total, the ResNet has 31 convolutional layers, a depth at which ConvNets without residual blocks started to show convergence problems in the original ResNet paper (He et al., 2015). In layers where the number of channels is increased, we padded the incoming feature map with zeros to match the new channel dimensionality for the shortcut, as in option A of the original paper (He et al., 2015).

Table S2: Residual network architecture hyperparameters. Number of kernels, kernel size and output size for all subparts of the network. Output size is always time x height x channels. Note that channels here refers to input channels of a network layer, not to EEG channels; EEG channels are in the height dimension.

Layer/Block | Number of Kernels | Kernel Size | Output Size
Output size is only shown if it changes from the previous block. The second convolution and all residual blocks used ELU nonlinearities. Note that in the end we had seven outputs in the time dimension, i.e., predictions for the four classes (7x1x4 final output size). In practice, when using cropped training as explained in Section 2.5.4, we had 424 predictions per trial and used their mean to predict the trial label.
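As a minimal illustration of how a single trial label is obtained from the many per-crop outputs (the array shapes are hypothetical):

import numpy as np

# Hypothetical per-crop class probabilities for one trial:
# 424 predictions (cropped training) x 4 classes
crop_probs = np.random.rand(424, 4)
crop_probs /= crop_probs.sum(axis=1, keepdims=True)

# Trial prediction: mean over all crop predictions, then argmax over classes
trial_probs = crop_probs.mean(axis=0)
predicted_class = int(np.argmax(trial_probs))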

A.4 Optimization and early stopping
Adam is a variant of stochastic gradient descent designed to work well with high-dimensional parameters, which makes it suitable for optimizing the large number of parameters of a ConvNet (Kingma and Ba, 2014). The early stopping strategy that we use throughout this study, developed in the computer vision field, splits the training data into a training fold and a validation fold: training first runs until the accuracy on the validation fold has not improved for a predefined number of epochs; afterwards, training continues on the combined training and validation data, starting from the parameters that gave the best validation accuracy, and ends when the loss on the validation fold drops to the training-fold loss reached at the end of the first phase.
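A minimal sketch of such a two-phase early-stopping loop is shown below; train_epoch, evaluate, the model state methods and the fold objects are hypothetical placeholders, and the concrete stopping criteria in the published code may differ.

def fit_with_early_stopping(model, train_fold, valid_fold, max_epochs=800, patience=80):
    # Phase 1: train on the training fold until validation accuracy stops improving.
    best_acc, best_state, best_train_loss, epochs_since_best = -1.0, None, None, 0
    for _ in range(max_epochs):
        train_loss = train_epoch(model, train_fold)           # hypothetical helper
        valid_loss, valid_acc = evaluate(model, valid_fold)   # hypothetical helper
        if valid_acc > best_acc:
            best_acc, best_state = valid_acc, model.get_state()
            best_train_loss = train_loss
            epochs_since_best = 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:
            break

    # Phase 2: restore the best parameters and continue on train+validation data
    # until the validation loss reaches the training loss of the best phase-1 epoch.
    model.set_state(best_state)
    combined_fold = train_fold + valid_fold
    for _ in range(max_epochs):
        train_epoch(model, combined_fold)
        valid_loss, _ = evaluate(model, valid_fold)
        if valid_loss <= best_train_loss:
            break
    return model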

A.5.1 EEG spectral power topographies
To visualize the class-specific EEG spectral power modulations, we computed band-specific envelope-class correlations in the alpha, beta and gamma bands for all classes of the High-Gamma Dataset. The group-averaged topographies of these maps could be readily compared to our input-feature unit-output network correlation maps, since, similar to the power-class correlation map described in Section 2.6.2, we computed correlations of the moving average of the squared envelope with the actual class labels, using the receptive field size of the final layer as the moving-average window size. Since this is a ConvNet-independent visualization, we did not subtract any values of an untrained ConvNet. We show the resulting scalp maps for the four classes separately and did not average over them. Note that these computations were only used for the power topographies shown in Figure 14 and did not enter the decoding analyses described in the preceding sections.
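A rough sketch of such a band-specific envelope-class correlation for a single electrode; the filter order, band edges, window length and variable names are placeholders.

import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def envelope_class_correlation(x, labels, fs, band, window_len):
    # x:          1D signal of one electrode (e.g., concatenated trials)
    # labels:     per-sample class labels (e.g., +1/-1 for one class vs. the rest)
    # band:       (low, high) frequency band in Hz
    # window_len: moving-average window in samples (receptive field of the final layer)
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype='bandpass')
    band_signal = filtfilt(b, a, x)
    squared_env = np.abs(hilbert(band_signal)) ** 2
    smoothed = np.convolve(squared_env, np.ones(window_len) / window_len, mode='valid')
    aligned_labels = labels[window_len - 1:]           # align labels with the 'valid' windows
    return np.corrcoef(smoothed, aligned_labels)[0, 1]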

A.6 Dataset details
The BCI competition IV dataset 2a is a 22-electrode EEG motor-imagery dataset, with 9 subjects and 2 sessions, each with 288 four-second trials of imagined movements per subject (movements of the left hand, the right hand, the feet and the tongue). The training set consists of the 288 trials of the first session, the test set of the 288 trials of the second session.
The BCI competition IV dataset 2b is a 3-electrode EEG motor-imagery dataset with 9 subjects and 5 sessions of imagined movements of the left or the right hand; the last 3 sessions included online feedback. The training set consists of the approx. 400 trials of the first 3 sessions (408.9±13.7, mean±std), the test set of the approx. 320 trials (315.6±12.6, mean±std) of the last two sessions.
Our "High-Gamma Dataset" is a 128-electrode dataset (of which we later only use 44 sensors covering the motor cortex, (see Section 2.7.1), obtained from 14 healthy subjects (6 female, 2 left-handed, age 27.2±3.6 (mean±std)) with roughly 1000 (963.1±150.9, mean±std) four-second trials of executed movements divided into 13 runs per subject. The four classes of movements were movements of either the left hand, the right hand, both feet, and rest (no movement, but same type of visual cue as for the other classes). The training set consists of the approx. 880 trials of all runs except the last two runs, the test set of the approx. 160 trials of the last 2 runs. This dataset was acquired in an EEG lab optimized for non-invasive detection of highfrequency movement-related EEG components (Ball et al., 2008;Darvas et al., 2010). Such high-frequency components in the range of approx. 60 to above 100 Hz are typically increased during movement execution and may contain useful movement-related information (Crone et al., 1998;Hammer et al., 2016;Quandt et al., 2012 and (4.) full optical decoupling: All devices are battery powered and communicate via optic fibers.
Subjects sat in a comfortable armchair in the dimly lit Faraday cabin. The contact impedance from electrodes to skin was typically reduced to below 5 kOhm using electrolyte gel (SUPER-VISC, EASYCAP GmbH, Herrsching, GER) and blunt cannulas. Visual cues were presented on a monitor outside the cabin, visible through the shielded window. The distance between the display and the subjects' eyes was approx. 1 m. A fixation point was attached at the center of the screen. The subjects were instructed to relax, fixate the fixation point, and keep as still as possible during the motor execution task. Blinking and swallowing were restricted to the inter-trial intervals. The electromagnetic shielding, combined with the comfortable armchair, the dimly lit Faraday cabin and the relatively long 3-4 second inter-trial intervals (see below), was used to minimize artifacts produced by the subjects during the trials.
The tasks were as follows. Depending on the direction of a gray arrow shown on a black background, the subjects had to repetitively clench their toes (downward arrow), perform sequential finger-tapping with their left (leftward arrow) or right (rightward arrow) hand, or relax (upward arrow). The movements were selected to require little proximal muscular activity while still being complex enough to keep subjects involved. Within the 4-s trials, the subjects performed the repetitive movements at their own pace, which had to be maintained as long as the arrow was shown. Per run, 80 arrows were displayed for 4 s each, with 3 to 4 s of continuous random inter-trial interval. The order of presentation was pseudo-randomized, with all four arrows being shown every four trials. Ideally, 13 runs were performed to collect 260 trials of each movement and rest. The stimuli were presented and the data recorded with BCI2000 (Schalk et al., 2004).
The experiment was approved by the ethical committee of the University of Freiburg.
The Mixed Imagery Dataset (MID) was obtained from 4 healthy subjects (3 female, all right-handed, age 26.75±5.9 (mean±std)) with a varying number of trials (S1: 675, S2: 2172, S3: 698, S4: 464) of imagined movements (right hand and feet), mental rotation and mental word generation. All details were the same as for the High-Gamma Dataset, except that a 64-electrode subset was used for recording, recordings were not performed in the electromagnetically shielded cabin (thus possibly better approximating conditions of real-world BCI usage), and trials varied in duration between 1 and 7 seconds. The dataset was analyzed by cutting out time windows of 2 seconds with 1.5-second overlap from all trials longer than 2 seconds (S1: 6074 windows, S2: 21339, S3: 6197, S4: 4220), and both methods were evaluated using the accuracy of the predictions for all 2-second windows of the last two runs of roughly 130 trials (S1: 129, S2: 160, S3: 124, S4: 123).
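For illustration, a small sketch of cutting out such overlapping windows from a single trial; names and shapes are placeholders.

import numpy as np

def cut_windows(trial, fs, window_s=2.0, overlap_s=1.5):
    # trial: 2D array, channels x time; 2 s windows with 1.5 s overlap -> 0.5 s stride
    window_len = int(window_s * fs)
    stride = int((window_s - overlap_s) * fs)
    n_samples = trial.shape[1]
    return [trial[:, start:start + window_len]
            for start in range(0, n_samples - window_len + 1, stride)]

# e.g., a hypothetical 4-second trial at 250 Hz with 64 channels -> 5 windows
windows = cut_windows(np.random.randn(64, 1000), fs=250)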

A.7 EEG preprocessing
We resampled the High-Gamma Dataset to 250 Hz, i.e., the same sampling rate as the BCI competition datasets, to be able to use the same ConvNet hyperparameter settings for both datasets. To ensure that the ConvNets only have access to the same frequency range as FBCSP, we low-pass filtered the BCI competition datasets to below 38 Hz. In case of the 4–f_end Hz data, we highpass-filtered the signal as described in Section 2.7.1 (for the BCI competition datasets, we bandpass-filtered to 4–38 Hz, so the previous lowpass-filter step was merged with the highpass-filter step). Afterwards, for both sets, for the ConvNets, we performed electrode-wise exponential moving standardization with a decay factor of 0.999, computing exponential moving means and variances for each channel and using these to standardize the continuous data. Formally,

x'_t = (x_t − µ_t) / sqrt(σ²_t),    µ_t = 0.001 · x_t + 0.999 · µ_{t−1},    σ²_t = 0.001 · (x_t − µ_t)² + 0.999 · σ²_{t−1},

where x'_t and x_t are the standardized and the original signal for one electrode at time t, respectively. As starting values for these recursive formulas, we set the first 1000 mean values µ_t and the first 1000 variance values σ²_t to the mean and the variance of the first 1000 samples, which were always completely inside the training set (so we never used future test data in our preprocessing). Some form of standardization is a commonly used procedure for ConvNets; exponential moving standardization has the advantage that it is also applicable in an online BCI. For FBCSP, this standardization always worsened accuracies in preliminary experiments, so we did not use it there. We also did not use the standardization for our visualizations, to ensure that it does not make the visualizations harder to interpret. Overall, this minimal preprocessing without any manual feature extraction ensured that our end-to-end pipeline could in principle be applied to a large number of brain-signal decoding tasks.
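A minimal sketch of this electrode-wise exponential moving standardization, following the formulas above; the small epsilon guarding against division by zero and the variable names are our own additions.

import numpy as np

def exponential_moving_standardize(data, decay=0.999, init_block=1000, eps=1e-4):
    # data: 2D array, time x electrodes.
    # For t < init_block, mean and variance are fixed to the statistics of the
    # first init_block samples; afterwards they are updated recursively.
    factor_new = 1.0 - decay
    mean = np.mean(data[:init_block], axis=0)
    var = np.var(data[:init_block], axis=0)
    standardized = np.empty_like(data, dtype=np.float64)
    for t in range(data.shape[0]):
        if t >= init_block:
            mean = factor_new * data[t] + decay * mean
            var = factor_new * (data[t] - mean) ** 2 + decay * var
        standardized[t] = (data[t] - mean) / np.maximum(np.sqrt(var), eps)
    return standardized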
We also only minimally cleaned the datasets to remove extreme high-amplitude recording artifacts: our cleaning only removed trials in which at least one channel had a value outside ±800 µV. We kept trials with lower-amplitude artifacts, as we assumed these trials might still contain useful brain-signal information. As described in Sections 2.6 and 3.5, we used visualization of the features learned by the ConvNets to verify that they learned to classify brain signals and not artifacts. Furthermore, for the High-Gamma Dataset, we used only those sensors covering the motor cortex: all central electrodes (45), except the Cz electrode, which served as the recording reference electrode. Interestingly, using all electrodes led to worse accuracies for both the ConvNets and FBCSP, which may be a useful insight for the design of future movement-related decoding/BCI studies. Any further data restriction (trial- or channel-based cleaning) never led to accuracy increases for either of the two methods when averaged across all subjects. For the visualizations, we used all electrodes and common average re-referencing to investigate spatial distributions across the entire scalp.
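A minimal sketch of this amplitude-based trial rejection; the array layout is hypothetical.

import numpy as np

def reject_high_amplitude_trials(trials, threshold_uv=800.0):
    # trials: 3D array, trials x channels x time, in microvolts.
    # Keep only trials in which no channel exceeds +-800 uV at any time point.
    keep = np.all(np.abs(trials) <= threshold_uv, axis=(1, 2))
    return trials[keep], keep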

A.8 Software implementation and hardware
We performed the ConvNet experiments on GeForce GTX Titan Black GPUs with 6 GB memory. The machines had Intel(R) Xeon(R) E5-2650 v2 CPUs @ 2.60 GHz with 32 cores (which were never fully used, as most computations were performed on the GPU) and 128 GB RAM. FBCSP was computed on an Intel(R) Xeon(R) E5-2650 v2 CPU @ 2.60 GHz with 16 cores and 64 GB RAM. We implemented our ConvNets using the Lasagne framework (Dieleman et al., 2015); preprocessing of the data and FBCSP were implemented with the Wyrm library (Venthur et al., 2015). The code used in this study is available at https://github.com/robintibor/braindecode/.