The goal of explaining black boxes in EEG seizure prediction is not to explain models' decisions

Abstract Many state‐of‐the‐art methods for seizure prediction using the electroencephalogram are based on machine learning models that are black boxes, weakening clinicians' trust in them for high‐risk decisions. Seizure prediction is a multidimensional time‐series problem tackled by continuous sliding‐window analysis and classification. In this work, we critically review which explanations increase trust in models' decisions for predicting seizures. We developed three machine learning methodologies to explore their explainability potential, offering different levels of model transparency: a logistic regression, an ensemble of 15 support vector machines, and an ensemble of three convolutional neural networks. For each methodology, we quasi‐prospectively evaluated performance in 40 patients (testing data comprised 2055 hours and 104 seizures). We selected patients with good and poor performance to explain the models' decisions. Then, with grounded theory, we evaluated how these explanations helped specialists (data scientists and clinicians working in epilepsy) to understand the obtained model dynamics. We obtained four lessons for better communication between data scientists and clinicians. We found that the goal of explainability is not to explain the system's decisions but to improve the system itself. Model transparency is not the most significant factor in explaining a model decision for seizure prediction. Even when using intuitive and state‐of‐the‐art features, it is hard to understand brain dynamics and their relationship with the developed models. We achieved an increase in understanding by developing, in parallel, several systems that explicitly deal with signal dynamics changes, which helped develop a complete problem formulation.


Feature Details
We justify here the use of our features: statistical moments, Hjorth parameters, relative spectral power, wavelet coefficient energy, and spectral edge frequency. We only used linear features. These mathematical techniques use the signal's phase/frequency and amplitude information and comply with the linearity property. When extracting linear features, one assumes quasi-stationarity of the EEG signal within each time window produced by the data segmentation step of the signal pre-processing stage.

Statistical Moments
Several studies [1][2][3][4][5][6][7][8] report the use of statistical moments for characterising the amplitude distribution of the EEG time series. The first moment is the mean, the second moment is the variance, the third moment is the skewness, and the fourth moment is the kurtosis. The skewness is zero for symmetric amplitude distributions and non-zero for asymmetric distributions. The kurtosis measures the relative peakedness or flatness of an amplitude distribution [2]. These statistical measures have shown significant changes during the pre-ictal period compared to the inter-ictal state [3,5]. A decrease in the variance and an increase in the kurtosis were observed in the pre-ictal period when compared with interictal data [4,9].
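As an illustration, the four moments can be computed per window with NumPy and SciPy; the window length and sampling rate below are placeholders for illustration, not the study's settings:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def statistical_moments(window):
    """First four statistical moments of an EEG window's amplitude distribution."""
    return {
        "mean": np.mean(window),       # 1st moment
        "variance": np.var(window),    # 2nd moment
        "skewness": skew(window),      # 3rd moment: 0 for symmetric distributions
        "kurtosis": kurtosis(window),  # 4th moment: relative peakedness/flatness
    }

# Example on a synthetic 5-second window "sampled" at 256 Hz
rng = np.random.default_rng(0)
window = rng.normal(size=5 * 256)
moments = statistical_moments(window)
```

For a standard normal window, the mean, skewness, and (excess) kurtosis are all close to zero and the variance is close to one.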

Hjorth Parameters
Hjorth parameters are three time-domain measures of brain activity that have been used in seizure anticipation. These parameters intend to constitute a clinically helpful tool for quantitatively describing the electroencephalogram (EEG) [2]. The activity, mobility, and complexity are measures of mean power, root-mean-square frequency, and root-mean-square frequency spread, respectively. Authors have reported a significant increase in the mobility and complexity of the EEG during the pre-ictal period [3,5].
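A minimal sketch of the three parameters using discrete differences as derivatives (the sine-wave example is illustrative):

```python
import numpy as np

def hjorth_parameters(x):
    """Hjorth activity, mobility, and complexity of a 1-D signal window."""
    dx = np.diff(x)    # first derivative (discrete difference)
    var_x, var_dx, var_ddx = np.var(x), np.var(dx), np.var(np.diff(dx))
    activity = var_x                                    # mean power
    mobility = np.sqrt(var_dx / var_x)                  # rms frequency
    complexity = np.sqrt(var_ddx / var_dx) / mobility   # rms frequency spread
    return activity, mobility, complexity

# A pure sine has a single frequency, so its complexity is close to 1
t = np.arange(0, 2, 1 / 256.0)
activity, mobility, complexity = hjorth_parameters(np.sin(2 * np.pi * 10 * t))
```

Complexity compares the signal's bandwidth to that of a pure sine, which is why the sine example returns a value near one.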

Relative Spectral Power
Different physiological and pathological processes are reflected by activity in different frequency ranges of the power spectrum of the EEG [2]. The relative spectral power of a band can be calculated by first computing the Power Spectral Density (PSD) of the time series within a time window. It is essential to mention that the calculation of the PSD assumes the signal in each window is short enough to be considered quasi-stationary and long enough to capture the brain's low-frequency activity. Although there are several methods for spectral estimation, the simplest consists of applying the Fast Fourier Transform (FFT) and then averaging its squared coefficients that belong to the frequency band of interest. A normalised spectral power provides a more robust measure of the fluctuations in a patient's daily life to reveal the pre-ictal state [3]. Authors have reported that brain activity increases or decreases at specific frequency bands during certain events. Before seizure onset, there may be a power transfer from the lower to the higher frequencies [3][4][5][10][11][12]. For example, Mormann et al. [5] showed a decrease in Delta band power, coupled with a relative power decrease in the other sub-bands. Bandarabadi et al. [10] showed that relative combinations of sub-band spectral powers across channel pairs might track gradual changes preceding seizures.
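A sketch of relative band power using Welch's PSD estimate (the 0.5-40 Hz normalisation range, the band edges, and the synthetic alpha-band signal are illustrative assumptions):

```python
import numpy as np
from scipy.signal import welch

def relative_band_power(x, fs, band, total=(0.5, 40.0)):
    """Power in `band` divided by the power in the `total` range, via Welch's PSD."""
    freqs, psd = welch(x, fs=fs, nperseg=min(len(x), int(2 * fs)))
    band_mask = (freqs >= band[0]) & (freqs < band[1])
    total_mask = (freqs >= total[0]) & (freqs < total[1])
    return psd[band_mask].sum() / psd[total_mask].sum()

# A 10 Hz sine (alpha band) should concentrate its power in the 8-13 Hz band
fs = 256
t = np.arange(0, 5, 1 / fs)
x = np.sin(2 * np.pi * 10 * t)
alpha_ratio = relative_band_power(x, fs, band=(8, 13))
```

Because the band is a subset of the normalisation range, the ratio lies between 0 and 1; for the pure alpha tone it is close to 1.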

Wavelet Coefficients Energy
The wavelet transform is a time-frequency domain transform that can be an alternative to the FFT, as it decomposes the signal into different resolution levels according to different frequency ranges [1,3,4]. In other words, wavelets provide a time-variant decomposition adapted to the signal, capturing minor details and sudden changes by providing higher frequency resolution for lower frequencies and higher time resolution for higher frequencies. With the signal decomposed, it is possible to compute several measures using the wavelet coefficients, as in the case of signal energy. By computing the energy of the signals originating from the decomposition, a measure of the energy in different frequency ranges can be obtained [3,4,13].
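The text does not specify the mother wavelet or decomposition depth, so the following sketch uses a hand-rolled orthonormal Haar decomposition for self-containment (5 levels at 256 Hz, giving detail bands of roughly 64-128, 32-64, 16-32, 8-16, and 4-8 Hz):

```python
import numpy as np

def haar_energies(x, level=5):
    """Energy of the coefficients at each level of a multilevel Haar DWT.

    Returns energies ordered [approximation, detail_level, ..., detail_1],
    i.e. from the lowest to the highest frequency range. Requires len(x)
    divisible by 2**level.
    """
    a = np.asarray(x, dtype=float)
    details = []
    for _ in range(level):
        d = (a[0::2] - a[1::2]) / np.sqrt(2)  # high-pass branch
        a = (a[0::2] + a[1::2]) / np.sqrt(2)  # low-pass branch
        details.append(np.sum(d ** 2))
    return [np.sum(a ** 2)] + details[::-1]

# A 10 Hz sine at 256 Hz should concentrate energy in the 8-16 Hz detail level
fs = 256
t = np.arange(0, 4, 1 / fs)              # 1024 samples
x = np.sin(2 * np.pi * 10 * t)
energies = haar_energies(x)              # [a5, d5, d4, d3, d2, d1]
```

Since the Haar transform is orthonormal, the per-level energies sum to the total signal energy, which makes them directly comparable across frequency ranges.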

Spectral Edge Frequency
Usually, most of the EEG signal's power lies within the frequency band from 0 Hz up to 40 Hz [2]. The spectral edge frequency (SEF) quantifies the power contained in the lower frequencies relative to a given threshold: it is the frequency below which x per cent of the overall signal power is contained, and it is frequently used in seizure prediction [1][3][4][5]. As studies have found a power transfer from low to high frequencies during the pre-ictal stage, SEF may also be capable of capturing these dynamics.
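A sketch of the SEF computation from a Welch PSD; the 50% threshold (SEF50) and the 40 Hz cap are illustrative choices, not necessarily the study's:

```python
import numpy as np
from scipy.signal import welch

def spectral_edge_frequency(x, fs, percent=0.5, fmax=40.0):
    """Frequency below which `percent` of the signal power up to `fmax` is contained."""
    freqs, psd = welch(x, fs=fs, nperseg=min(len(x), int(2 * fs)))
    mask = freqs <= fmax
    freqs, psd = freqs[mask], psd[mask]
    cumulative = np.cumsum(psd) / np.sum(psd)           # normalised cumulative power
    return freqs[np.searchsorted(cumulative, percent)]  # first bin reaching `percent`

# For a 10 Hz sine, SEF50 should sit at (or next to) 10 Hz
fs = 256
t = np.arange(0, 5, 1 / fs)
sef50 = spectral_edge_frequency(np.sin(2 * np.pi * 10 * t), fs, percent=0.5)
```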

Decorrelation time
The autocorrelation function is essential in describing a system's dynamics: it finds repeating patterns, such as periodic signals hidden behind noise, and identifies a fundamental frequency masked by its harmonics. The autocorrelation function compares each point of the time series with other, lagged points. When the autocorrelation function has a value close to zero, the time series is uncorrelated, meaning there is no similarity between observations as a function of the time lag; when the opposite occurs, its value is close to one. Decorrelation time is defined as the first zero-crossing of the autocorrelation function; it provides information about the typical time scale of the data variability and can be interpreted as a way to detect time-series stationarity [3]. Authors found that a decrease in the decorrelation time may indicate a preictal period [3,5].
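A sketch of the first zero-crossing of the autocorrelation function; the noise/sine comparison is illustrative (white noise decorrelates almost immediately, a 2 Hz sine only after roughly a quarter period, 0.125 s):

```python
import numpy as np

def decorrelation_time(x, fs):
    """Lag (in seconds) of the first zero-crossing of the autocorrelation function."""
    x = x - np.mean(x)
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0, 1, 2, ...
    nonpositive = np.where(acf <= 0)[0]
    # Fall back to the window length if the ACF never crosses zero
    return nonpositive[0] / fs if nonpositive.size else len(x) / fs

fs = 256
rng = np.random.default_rng(1)
t_noise = decorrelation_time(rng.normal(size=fs * 5), fs)
t_sine = decorrelation_time(np.sin(2 * np.pi * 2 * np.arange(0, 5, 1 / fs)), fs)
```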

Machine Learning Methodologies
During training, we performed a grid search over the training seizures to find the preictal period (from 30 to 60 minutes). For the logistic regression and the SVM pipelines, we included the search for the number k of selected features. For the logistic regression, we selected the best k features through the F-test, a filter method that calculates the ratio between variance values [14]. For the SVMs, we used an embedded forest of trees to select the best set of features for the following reasons: i) it is computationally light when compared to other embedded methods, and ii) since it is stochastic, it adds another layer of complexity, which is desired as we wanted to retrieve explanations from complex methodologies. Finally, in the SVM pipeline, we also searched for the C cost value. For each set of search parameters (preictal period, number of features, and C value) and each of the three training seizures, we obtained a fitness value using the geometric mean of sample sensitivity and sample specificity: √(sensitivity × specificity). We obtained the geometric mean for each fold as follows: training on seizures #1 and #2 and validating on #3, training on seizures #2 and #3 and validating on #1, and training on seizures #1 and #3 and validating on #2. Then, we averaged the three geometric means to get the fitness value of a given set of search parameters. We then selected the set with the highest fitness value.
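A minimal sketch of this fitness computation and the selection of the best parameter set; the per-fold sensitivity/specificity values and the two candidate parameter sets below are hypothetical, only the geometric-mean-and-average rule mirrors the text:

```python
import numpy as np

def seizure_fold_fitness(per_fold_sensitivity, per_fold_specificity):
    """Fitness of one parameter set: mean over the three leave-one-seizure-out
    folds of the geometric mean sqrt(sensitivity * specificity)."""
    gmeans = np.sqrt(np.asarray(per_fold_sensitivity) *
                     np.asarray(per_fold_specificity))
    return float(gmeans.mean())

# Hypothetical validation results for two candidate parameter sets
best = max(
    {"preictal_min": 30, "k": 10, "C": 1.0,
     "fit": seizure_fold_fitness([0.8, 0.7, 0.9], [0.9, 0.8, 0.7])},
    {"preictal_min": 60, "k": 20, "C": 0.1,
     "fit": seizure_fold_fitness([0.6, 0.5, 0.7], [0.9, 0.9, 0.8])},
    key=lambda cfg: cfg["fit"],
)
```

The geometric mean penalises imbalanced sensitivity/specificity pairs more than the arithmetic mean, which is why it is a common fitness choice for imbalanced seizure data.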
In Figures 1, 2, and 3 we show the full details of all pipelines. For the logistic regression feature selection (ANOVA F-test), we used scikit-learn's SelectKBest(f_classif, k=n_features) function. For the feature selection in the SVM pipeline, we used scikit-learn's RandomForestClassifier(max_depth=10, random_state=42, n_estimators=100). We used the linear_model.LogisticRegression and svm.LinearSVC functions from scikit-learn to train our logistic regression and SVM models.
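A minimal sketch of those scikit-learn calls on synthetic data; the data, k=10, and C=1.0 are illustrative placeholders, only the estimators and their listed hyperparameters come from the text:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 30))        # 200 windows x 30 features (synthetic)
y = rng.integers(0, 2, size=200)      # interictal (0) / preictal (1) labels

# ANOVA F-test filter selection for the logistic regression pipeline
X_lr = SelectKBest(f_classif, k=10).fit_transform(X, y)
lr = LogisticRegression(max_iter=1000).fit(X_lr, y)

# Embedded forest-of-trees selection for the SVM pipeline
forest = RandomForestClassifier(max_depth=10, random_state=42,
                                n_estimators=100).fit(X, y)
top = np.argsort(forest.feature_importances_)[-10:]  # keep the 10 most important
svm = LinearSVC(C=1.0, max_iter=10000).fit(X[:, top], y)
```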

CNN architecture
We provide here the code to construct our network architecture. We also present a graphical version in Figure 4. We constructed three distinct convolutional parts, each consisting of two convolutional 2D layers. The use of three convolutional parts was a common architecture found in the literature [15][16][17]. We found the filter sizes and values with an a priori grid search procedure. We used convolutional layers with stride instead of max pooling layers, as strided convolutions let the model learn to reduce dimensionality instead of just performing a fixed operation.
After each convolutional part, we applied a 2D spatial dropout layer, an activation layer, and batch normalisation. Our dropout ratio was high (0.50) to avoid overfitting. We used spatial dropout instead of regular dropout since pixel localisation is essential: neighbouring pixels correlate with each other as there is a spatial relationship. Therefore, spatial dropout drops entire feature maps instead of single elements, which helps the model generalise. Batch normalisation layers stabilise training by re-centring and re-scaling. We used swish [18] as the activation function instead of Rectified Linear Units (ReLU) to handle the dying neuron effect (when many ReLU neurons output a zero value, which mainly happens when a negative bias is learned).
After the convolutional parts, we used a global average pooling 2D layer, followed by a dropout layer and a densely-connected layer. Lastly, we used a softmax layer to convert the vector of values into a probability distribution.
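A hedged Keras sketch of the architecture described above. The input shape, filter counts, and kernel sizes were found by the authors' grid search and are not given in this excerpt, so the values below are placeholders; only the layer ordering follows the text:

```python
from tensorflow.keras import layers, models

def build_cnn(input_shape=(128, 128, 1), n_filters=(16, 32, 64), n_classes=2):
    """Three convolutional parts, each: two Conv2D layers (the second strided,
    replacing max pooling), then SpatialDropout2D, swish, and batch norm."""
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters in n_filters:
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.Conv2D(filters, 3, strides=2, padding="same")(x)  # strided conv instead of pooling
        x = layers.SpatialDropout2D(0.5)(x)   # drops whole feature maps
        x = layers.Activation("swish")(x)     # avoids the dying-ReLU effect
        x = layers.BatchNormalization()(x)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_cnn()
```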

Other developed explanations and/or more details
In the interviews, we also provided seizure onset times, seizure classification, EEG brain activity pattern, and vigilance state at seizure onset. On the feature level, we presented: i) beeswarm summary plots of SHAP values [19] (Figure 2.b)), which displayed an information-dense summary of how the features impacted the model's output; ii) Partial Dependence Plots (PDPs) [20] and Individual Conditional Expectation (ICE) [21] plots (Figure 2.c)), which show the interaction between the target response and each input feature; and iii) the logistic regression coefficients, for the logistic regression model (Figure 2.d)). We also discussed the chosen electrodes and the features' expected behaviour.
We developed a feature-based explanation inspired by calibration plots [22] and scatter plots (Figure 2.e)). While a calibration plot shows the average predicted probability in each bin (x-axis) against the ratio of positive predictions (y-axis), we plotted the feature value (x-axis) against its probability of seizure occurrence (Firing Power value, y-axis). In this plot, we also coloured interictal and preictal samples differently. This plot thus allowed us to visually inspect the classifier's behaviour in discriminating individual interictal and preictal windows, and to observe their temporal relation through the Firing Power values.
For the logistic regression model, we plotted the Firing Power in black. For the 15 SVMs and three CNNs, we plotted all models' Firing Power in grey and the voting decision in black. Then, for moments we considered to be of interest (Figure 2.g)), we plotted the respective points over the Firing Power scatter plot and provided counterfactual explanations [23].

More details about some lessons
Here we provide more details about some of the lessons. We also present another lesson: "Explain your system at different levels". Although important, we decided to move this lesson to the Supplementary Material as we believe it is intuitive to all readers.
Additional lesson: explain your system at different levels
We advise researchers to divide and order their explanation reasoning into levels of increasing granularity or, in other words, from the feature and model levels to explaining specific events.
• Feature level: show and discuss important features, namely their expected added value, time-window, and correspondent electrode. Also, show each feature's model influence quantitatively. We provided feature influence through beeswarm summary plots of SHAP values, PDP and ICEs, and by analysing the regression coefficients for the logistic regression model. We could also analyse the models' hyperparameters, such as the support vectors from the trained SVM, as suggested during the interviews.
• Model level: show how well the model differentiates independent data samples. Graphical explanations are more appealing than only providing fitting metrics and confusion matrices. We depicted an adapted version of calibration and scatter plots to present the data points' geometric distribution and the model's discrimination behaviour.
• Overall system functioning: provide a visual overview of the system functioning across time and integrate complementary information. Besides providing performance metrics, such as SS and FPR/h, a visual overview provided more information, such as the alarm distribution and the Firing Power over time.
We also provided information such as the sleep-wake cycle and time-stamps, which helped to interpret classifier decisions.
• System functioning over specific moments: provide deeper explanations concerning potentially decisive moments. We could not inspect explanations for all moments, as we dealt with long recordings. People tend to pay attention to and inspect only some moments, particularly when models fail: i) false alarms, ii) unpredicted seizures, or iii) Firing Power peaks that almost led to false alarms.
Discussing features is important: more details
Spectral bands' relative power was the most discussed feature. Although these were the most appropriate measures to discuss with clinicians, we noticed differences in their conceptualisation between data scientists and clinicians. While clinicians assess spectral band power by visually inspecting the EEG and looking for spikes, ripples, and other patterns, data scientists use mathematical tools, such as Fourier decomposition. Despite these computational differences with respect to the clinicians' interpretation, spectral bands' relative power enables extracting relevant information from long-term recordings.
Interictal and preictal concepts differ between data scientists and clinicians: more details
Understanding these differences is extremely important, as ML model development is based on the provided data. Data scientists deliver data samples to ML algorithms, in which a sample is usually a 5-second window or mathematical measures (features) computed in that window. Each window is labelled as either preictal or interictal. Then, the ML algorithm trains a model that best discriminates these data samples. Furthermore, other technical aspects also need to be considered. Although the data scientists' preictal period may be longer than the clinicians' (a fixed period of usually several minutes up to some hours before a seizure), it is still a rare event, causing an extreme data imbalance. Also, although this approach is established in the literature, it is hard for clinicians to understand that data scientists must define a fixed preictal period for each patient.

Interview script
During the presentation, the interviewees were allowed to ask questions about any technical aspects or further details. We wanted these interviews to be organic and informal. Thus, we only asked questions when the interviewees had not spoken about certain topics. In the following list, we show some of the questions asked when the participants did not mention these topics beforehand. We did not necessarily approach these topics in this order.
• What do you think about the presented features?
• Do these features give you enough information?
• What do you think about the sequence of explanations? What do you think about their grouping and order?
• What do you think about the provided explanations? Which ones were more useful? Which ones were less useful?
• Are all these explanations too many?
• What do you think about the explanations about false alarms (patients 8902 and 93402)?
• What do you think about providing circadian and sleep-wake cycle information?
• What do you think about analysing the different 15 SVMs' curves?
• Is there any issue or mistrust related to the models when we go from a logistic regression to an ensemble of SVMs or CNNs?
• What are the limitations of the presented explanations?
• Would you like to have any other explanation?
In the end, we always asked the participants if they wanted to say anything else before we finished.

[Figure caption: no alarm was raised during the preictal period due to the refractory period of each alarm. When analysing all available hours from that seizure, the Firing Power only started to rise when the patient went to sleep. Thus, all false alarms are close to each other and occur in the last four hours before the seizure onset.]

[Figure caption: an influence of these transitions is visible as small Firing Power peaks. Close to midnight, a seizure occurred.]

Statistical validation
We had to verify when the number of patients with a given characteristic was statistically significant, in other words, when it was above a theoretical chance-level value. We used the binomial distribution [24] to model the number of successes in a sample of size n drawn with replacement from a population of size N. The binomial distribution is a good approximation when N is much larger than n, and it is the basis for the binomial test of statistical significance. Our use of the binomial distribution was inspired by previous works [25][26][27].
Thus, to understand whether the number of patients i) whose EEG forecasting was better than, equal to, or worse than the circadian forecasting, or ii) whose circadian cycle influenced alarms and seizures, was significant, we used the cumulative binomial distribution (binomcdf) to find the maximum number of successes attainable at chance level. As we do not know the probability of success of these events, we reasonably assumed p=0.05. As we looked for a significance level α=0.05 and assumed p=0.05, we looked for a number of successes whose probability was under 0.025, applying a Bonferroni correction [28] (inspired by multiple-comparison testing). Although we did not perform multiple-comparison tests, we felt the need to apply this correction since we assumed the p value.
By inspecting Figure 11, we can consider that four successes out of 40 is the maximum attributable to chance level, since y(5)=0.014 and y(6)=0.003; we considered more than four successes to be above chance.

S11. Graph used to assess statistical significance, where y=1-binomcdf(x,n,p): y is the obtained probability (0-1), binomcdf is the cumulative binomial distribution, n is the number of patients (40), p is the probability of success, and x is a vector from 0 to 40 with step 1.
Concerning patients with at least one seizure presenting a determined characteristic (good Firing Power behaviour but an unpredicted seizure, or predictive sleep-wake transitions when the EEG model failed), we made some adaptations to this method. For each patient, we had to account for the possibility of a particular characteristic occurring in at least one seizure by chance. Thus, for each patient, we assumed p=0.05 and calculated the probability of having success in at least one seizure by chance, where n was the number of tested seizures for that patient. We obtained an average interpatient probability of 0.123.
In Figure 12, we provide an example for a patient with three tested seizures, where the probability of success by chance is 0.142. Using a rationale similar to that of Figure 11, including an additional factor in the Bonferroni correction (the probability should be lower than 0.0125), with n=40, x=0:1:40, and p=0.123, we obtained Figure 13. By inspecting this figure, we can consider that nine successes out of 40 is the maximum attributable to chance level, since y(9)=0.020 and y(10)=0.007; we considered more than nine successes to be above chance.

S13. Graph used to assess statistical significance, where y=1-binomcdf(x,n,p): y is the obtained probability (0-1), binomcdf is the cumulative binomial distribution, n is the number of patients (40), p is the probability of success (0.123), and x is a vector from 0 to 40 with step 1.
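The chance-level thresholds described above can be reproduced with scipy.stats.binom; this is a sketch of the reasoning in the text (y(x)=1-binomcdf(x,n,p), threshold = largest x with y(x) ≥ α), not the authors' code:

```python
from scipy.stats import binom

def chance_level_threshold(n=40, p=0.05, alpha=0.025):
    """Largest number of successes x still attributable to chance, i.e. the
    largest x with y(x) = 1 - binomcdf(x, n, p) >= alpha."""
    x = 0
    while 1 - binom.cdf(x, n, p) >= alpha:
        x += 1
    return x - 1

# Patient-level analysis: n=40, assumed p=0.05, Bonferroni-halved alpha=0.025
threshold_patients = chance_level_threshold()
# Seizure-level analysis: average interpatient p=0.123, alpha=0.0125
threshold_seizures = chance_level_threshold(p=0.123, alpha=0.0125)
# Probability of at least one chance success among 3 tested seizures (Figure 12)
p_patient_3_seizures = 1 - (1 - 0.05) ** 3
```

With the values reported in the text, these thresholds come out as 4 and 9 successes respectively, and the three-seizure patient probability as roughly 0.142.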