Vowel Sound Synthesis from Electroencephalography during Listening and Recalling

Recent advances in brain imaging technology have furthered our knowledge of the neural basis of auditory and speech processing, often via contributions from invasive brain signal recording and stimulation studies conducted intraoperatively. Herein, an approach for synthesizing vowel sounds directly from scalp-recorded electroencephalography (EEG), a noninvasive neurophysiological recording method, is demonstrated. Given cortical current signals derived from the EEG acquired while human participants listen to and recall (i.e., imagine) two vowels, /a/ and /i/, sound parameters are estimated by a convolutional neural network (CNN). The speech synthesized from the estimated parameters is sufficiently natural to achieve recognition rates >85% during a subsequent sound discrimination task. Notably, the CNN identifies the involvement of the brain areas mediating the "what" auditory stream, namely the superior temporal, middle temporal, and Heschl's gyri, demonstrating the efficacy of the computational method in extracting auditory-related information from neuroelectrical activity. Differences in cortical sound representation between listening and recalling are further revealed, such that the fusiform, calcarine, and anterior cingulate gyri contribute during listening, whereas the inferior occipital gyrus is engaged during recollection. The proposed approach can expand the scope of EEG in decoding auditory perception, which requires high spatial and temporal resolution.

surgery. In addition, it might be possible to reveal subtle but important differences between presented and recalled sound stimuli if neural activity could be sampled over the entire brain, including areas and circuits extending well beyond the primary auditory cortex but still operating synergistically with it. Electroencephalography (EEG) and magnetoencephalography (MEG) are noninvasive methods for recording whole-brain activity that attain high temporal resolution at the expense of spatial resolution. Because of this coarse-grained acquisition, even in their highest-density implementations, it is generally assumed that EEG and MEG do not convey information sufficiently specific for synthesizing speech or other sounds. These inherent limitations, however, can be alleviated by solving the inverse problem under suitable constraints. For example, we have shown the efficacy of a distributed source localization method based on a variational Bayesian approach. [6] Using EEG cortical current sources (EEG-CCS) estimated by this method, we succeeded in decoding differences in wrist electromyographical activity, [7] vowels, [8] and finger movements [9] from the scalp-recorded EEG, and further revealed the contribution of physiologically plausible brain areas to decoding. Therefore, it might be possible to synthesize presented and recalled speech sounds from the EEG for the first time.
In this proof-of-concept study, we demonstrate a method for synthesizing vowel-speech sounds that were physically presented or recalled by combining EEG-CCS estimation, a compact sound representation based on mel-cepstral coefficients (MCC), and a convolutional neural network (CNN). We hypothesized that, insofar as the synthesized sounds are sufficiently clear to allow discrimination between the vowels, the obtained decoders (i.e., synthesizers) would extract information primarily from the auditory cortical areas encompassed in a functional stream known as the "what" stream, a ventral auditory pathway that mediates the recognition of auditory information. [10][11][12][13][14][15][16][17] In addition, obtaining viable CNN decoders could provide novel insights into differences in the neural representation of presented and recalled sounds. The eventual synthesis of intelligible speech from the EEG would potentially deepen our understanding of auditory perception and open the way to a range of hypothetical future applications such as brain-computer interfaces.

Successful Synthesis of Presented and Recalled Auditory Stimuli
We recorded EEG signals from human participants during listening (0-0.3 s from stimulus onset) and recalling (1.0-2.0 s from onset) of speech sounds for the vowels /a/ and /i/. In addition, white noise was delivered as a nonspeech condition, during which participants were asked not to recall anything after the sound was presented. Thereafter, EEG-CCS signals were estimated for each segment (i.e., vertex) on the cortical surface from the relatively low-density EEG time-series (32 channels) combined with a structural magnetic resonance imaging (MRI) image, sensor position data, and hierarchical priors from fMRI activations (Figure 1). Vocal tract filter parameters known as the MCC, alongside the vocal cord frequency (i.e., pitch), were extracted from the original speech sounds to provide a compact and perceptually motivated representation (see "Speech Representation and Synthesis" section in Experimental Section and Figure 2A). [18,19] Differences between vowel phonemes are robustly encoded by the MCC, and the pitch was matched because the speech was sampled from a single voice. Therefore, the MCC parameters for presented and recalled sounds were estimated from the EEG-CCS using a 5-layer CNN. We eventually synthesized both sounds based on the estimated MCCs and the original vocal cord frequency (see "Speech Representation and Synthesis" section in Experimental Section and Figure 2B).

Figure 1. Overview of the EEG-CCS estimation workflow. Vertices for EEG-CCS are shown as pink dots on the cortical surface in the MRI image (left-bottom panel). The corresponding time-series were estimated by entering EEG sensor signals into an inverse filter. The latter was estimated by combining structural MRI, fMRI, EEG sensor position data, and variance information from resting-period EEG time-series. The number of EEG-CCS vertices effectively used varied depending on the activated areas delineated by the fMRI analysis, which supplied hierarchical priors to the VBMEG method. EEG-CCS during the listening and recalling periods were separately estimated for 150 trials.

As shown in Figure 3A, the temporal dynamics of the first 5 MCC orders were compared between the original and synthesized sounds through the coefficient of determination (R²). Histograms were drawn to study the distribution of R² over the population of 10 participants (Figure 3B). The median R² value across all participants was 0.92 for both the listening and recalling periods. Higher R² values tended to be obtained for the vowel /a/ (blue bars) than for the vowel /i/ (yellow bars). In addition to attaining high R² values at the level of MCC dynamics, the synthesized sounds were remarkably natural, as shown in Movie S1, Supporting Information. However, a minority of trials did not yield clear sound, even though their R² values were relatively high (e.g., around 0.7) and the MCC patterns of the original and synthesized sounds were quite similar. Consequently, to evaluate the synthesis performance auditorily, we asked 19 further participants to listen to the synthesized sounds and classify them as /a/ or /i/ (see "Experimental Design" section in Experimental Section). More than 77% of the listening and 79% of the recalling trials yielded R² values >0.85 and >0.80, respectively; the corresponding recognition rates exceeded 80%, as shown in Figure 3B. Notably, the recognition rate for synthesis based on EEG-CCS during recollection reached 96%, higher than that for listening.
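For concreteness, the R² comparison between original and estimated MCC trajectories can be sketched as below; the array shapes, trial count, and noise level are illustrative assumptions, not the study's data.

```python
# A minimal sketch of the R^2 evaluation between original and CNN-estimated
# MCC time-courses. Shapes and the synthetic data are assumptions for
# illustration only.
import numpy as np
from sklearn.metrics import r2_score

def mcc_r2(mcc_original: np.ndarray, mcc_estimated: np.ndarray) -> float:
    """Coefficient of determination over flattened (n_orders, n_frames) MCCs."""
    return r2_score(mcc_original.ravel(), mcc_estimated.ravel())

# Synthetic stand-in: 150 trials, first 5 MCC orders, 40 frames (assumed).
rng = np.random.default_rng(0)
original = rng.standard_normal((150, 5, 40))
estimated = original + 0.2 * rng.standard_normal(original.shape)
r2_per_trial = [mcc_r2(o, e) for o, e in zip(original, estimated)]
print(f"median R^2 across trials: {np.median(r2_per_trial):.2f}")
```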

Differences in Brain Area Involvement during Listening and Recalling
The high performance attained allowed us to identify the brain areas that contributed to MCC estimation by analyzing the feature values assigned to each EEG-CCS during CNN training (see "Feature Value Extraction from the CNN" section in Experimental Section). For EEG-CCS estimation, ≈20 000 vertices (Figure 1) were initially allocated at 2-3 mm spacing on the cortical surface as identified on the structural MRI image. The corresponding spatially deconvolved time-series were estimated from the 32 EEG sensor signals by solving the inverse problem given suitable priors and under optimal constraints. [6,20] This method enhances the spatial resolution enough to identify the brain areas contributing to MCC estimation. The number of estimated time-series varied between 13 and 663 depending on the extent of individual fMRI activation areas, but crucially, as shown in Table 1, this did not systematically affect the estimation performance (r = 0.1 for listening, r = −0.2 for recalling). Figure 4 shows exemplary comparisons of regional contributions to the listening- and recalling-phase sound estimation in the form of normalized feature values, which were extracted from the fifth layer of the CNN (Figure S1, Supporting Information). We assumed that the participants with higher R² values would best reveal the regional activations purposefully subserving auditory perception of the stimuli, separating them from epiphenomenal (i.e., ancillary rather than essential) engagement. As shown in Table 1, Participant 5 had the highest R² value of 0.96 during both the listening- and recalling-phase estimations, whereas Participant 9 had the lowest R² value in the listening-phase and the third-highest R² in the recalling-phase estimation.

Table 1 (caption). Leave-one-out cross-validation was carried out to evaluate the estimated MCCs: 149 trials were input into the five-layer CNN, and an estimated MCC was calculated for the remaining trial. Based on the MCC estimation, all sounds were synthesized assuming the original pitch from the vowel /a/ to perform the human sound discrimination task. All of the synthesized sounds were used for the sound discrimination task.
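The check that vertex count did not drive performance amounts to a simple Pearson correlation; a sketch follows, with placeholder values standing in for the per-participant entries of Table 1.

```python
# Hedged sketch: correlate the number of EEG-CCS vertices per participant
# with the per-participant R^2. The arrays below are placeholders, not the
# actual Table 1 values.
import numpy as np
from scipy.stats import pearsonr

n_vertices = np.array([13, 47, 120, 305, 663, 88, 210, 31, 150, 400])
r2_listening = np.array([0.88, 0.93, 0.90, 0.91, 0.96,
                         0.89, 0.92, 0.94, 0.85, 0.90])

r, p = pearsonr(n_vertices, r2_listening)
print(f"r = {r:.1f}")  # the paper reports r = 0.1 for listening
```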
Considering that these performance levels would imply a physiologically significant engagement of multiple cortical areas, we anecdotally investigated the relationship between performance level and activation topography based on data from all participants. During the listening period, the right fusiform gyrus (FFG), the calcarine fissure and surrounding cortex (CAL), and the anterior cingulate and paracingulate gyri (ACC) showed high feature values in the participants attaining higher R² (see Figure 4, left-upper panel, P5, pink-colored areas), but not in those with lower R² (see Figure 4, left-lower panel, P9, pink-colored areas). On the other hand, during recollection, the inferior occipital gyrus (IOG) coherently contributed in participants with higher R² (see Figure 4, right panels, P5 and P9, blue-colored area). The superior temporal gyrus (STG), middle temporal gyrus (MTG), and Heschl's gyrus (HES) provided a comparable contribution under both conditions (see Figure 4, yellow-colored areas). Despite the binaural stimulation, hemispheric lateralization varied among participants, and individual differences in the topography of the representation were visible even between participants yielding comparable R² values. We next queried which regions would, on average, preferentially represent the speech sounds as opposed to white noise. To address this question, the ratio of the cumulative feature values between the vowels and all three sounds was calculated, pooled over all participants, and spatially integrated over each brain region. The areas over which it was highest (top 10%) were, identically for listening and recalling, the right STG, right MTG, right middle frontal gyrus (MFG), left precentral gyrus (PreCG), left MTG, left Rolandic operculum (ROL), and left PreCG, in descending order.
An additional noteworthy finding is that the high temporal resolution of EEG allowed us to elucidate temporal transitions in the contributing areas. Figure S2, Supporting Information, shows the feature-value spatiotemporal transitions for each EEG-CCS, presented as a topographical map for Participant 5. The spatial patterns were similar among the first three periods (0-0.3, 0.3-0.6, and 0.6-0.9 s), but the layout changed drastically in the last period (1.0-2.0 s), hallmarking the recalling period.
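Window-wise averaging of per-vertex feature values, as used for the topographies in Figure S2, could look like the sketch below; the layout of the feature array (one value per vertex per EEG sample) is an assumption for illustration.

```python
# Minimal sketch: average per-vertex feature values within the four analysis
# windows (0-0.3, 0.3-0.6, 0.6-0.9, and 1.0-2.0 s). The feature array layout
# is an assumption.
import numpy as np

FS = 256  # EEG sampling rate in Hz
WINDOWS = [(0.0, 0.3), (0.3, 0.6), (0.6, 0.9), (1.0, 2.0)]

def window_topographies(features: np.ndarray) -> np.ndarray:
    """features: (n_vertices, n_samples) -> (4 windows, n_vertices)."""
    return np.stack([
        features[:, int(t0 * FS):int(t1 * FS)].mean(axis=1)
        for t0, t1 in WINDOWS
    ])
```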

Discussion
We succeeded in synthesizing auditorily presented and imagined sounds of two vowels and white noise from scalp-recorded EEG for the first time. The obtained vowel sounds were sufficiently natural to be distinguished by naïve listeners, allowing us to further investigate the differences in the brain areas that contributed to their representation during listening and recalling. While only two vowels, namely /a/ and /i/, were considered at present, the methodology could be generic and well suited for extension to a broader set of phonemes.

Successful Extraction of the Brain Areas Contributing to Auditory Processing
The areas attracting high feature values for both the presented and recalled vowel sounds encompassed the STG, MTG, and HES, consistent with a previous fMRI study that classified vowels. [14] The HES contains the primary auditory cortex [21] and accordingly appeared to respond with overall similar currents to the vowels and the nonspeech sound, whereas the STG [16,22] and MTG [23] are constituent parts of the "what" stream. [17] Therefore, our finding that these three areas contribute to sound synthesis during listening as well as recalling is physiologically plausible and demonstrates the effectiveness of the proposed method in the investigation of auditory processing.
We sought to further understand the differences in regional involvement between listening and recalling. The FFG, CAL, and ACC showed high contributions during the listening period in participants with high R², as seen in the representative result of P5 in Figure 4. On the other hand, the IOG contributed more significantly during recollection, considering both the consistent first rank of P5 and the marked rise of P9 in the R² ranking from the lowest to the third position. The existing literature supports the role of the FFG in the processing of phonological information [24,25] and speech perception, [26] and notes the existence of connections between the HES and the CAL, ACC, and STG, [27,28] which in turn point to plausible interactions between the auditory and visual cortices. These interactions suggest that performing the experimental task in our study could have recruited a cross-modal sensory system. We speculate that the more substantial contribution of the IOG during recalling indicates a need for auxiliary visual processing during the recollection and imagination of vocalized sounds. A potential reason for CAL involvement may lie in the task design, wherein, before and after the listening and recall periods, the participants fixated on a cross whose color turned from white during the rest period to red during the pretrial period. The ACC engagement plausibly reflects preparation for recollection, as this area is known to be involved in attention, [29] memory, [30,31] and cross-modal sensation, [32] as well as auditory hallucinations [33] and auditory-verbal memory function. [34]

Combining CNN and EEG-CCS for Accurate Sound Synthesis
Accurate MCC estimation provided fruitful information with which to investigate auditory perception processing in individual brains and was, in turn, enabled by the combined use of EEG-CCS and a CNN. In this study, we used a 5-layer CNN to decode neural activity, estimate the time-course of the MCC, and thereafter synthesize speech. Although no previous study of this kind has succeeded in synthesizing listened and recalled sounds, previous ECoG investigations have resorted to the mel-frequency cepstral coefficients (MFCC) and other deep-learning methods for synthesizing words or sentences while participants read them aloud. [3][4][5] These encompass RNNs, [5] bidirectional long short-term memory (bLSTM) RNNs, [4] and densely connected CNNs (DenseNets). [3] It therefore appears commonplace to use RNNs in decoding time-series signals.
Here, however, we used CNNs, which are generally preferred for processing images and other stationary patterns: this choice was motivated by the fact that our decoding targets were sounds of fixed time-length, each corresponding to an individual vowel. For synthesizing longer and more complex sounds such as words or entire sentences, other deep-learning methods might be preferable. The other key technique enabling the high synthesis performance was EEG-CCS. To explicitly illustrate the importance of EEG-CCS in this context, we also attempted to directly supply EEG sensor signals as input features (Figure S3, Supporting Information). Although many trials attained R² values >0.80, this approach led to performance levels systematically inferior to EEG-CCS. [7,8] One might trivially attribute this finding to the limited number of EEG sensors (i.e., 32 channels). However, after comparing the number of selected vertices for each participant's EEG-CCS with the corresponding speech synthesis performance (Table 1), we observed that Participants 2 and 8 attained superior performance with only 47 and 31 vertices, respectively. In line with expectations, this confirms that the importance of EEG-CCS estimation lay in spatially deconvolving the signals, thus providing a specificity of association between time-series and brain areas/functions inherently not available to the raw scalp-recorded signals. [6][7][8]

Limitations of the Computational Method
The high synthesis performance points to a potential role of the method in enabling the kind of brain-computer interfaces (BCIs) that the previous ECoG studies aimed at. [3][4][5] However, several limitations have to be overcome.
First, there is a need to clarify whether the methodology also works for other vowels and consonants, as well as for entire words and eventually phrases, using EEG signals recorded from a larger population. In addition, there is a need for robust decoders that, once trained, consistently provide high performance reproducible over sessions separated by days or even weeks. Other possible limitations hindering generalization and deployment may reside in the requirement for MRI data and in the number of EEG sensors needed to estimate EEG-CCS. However, for Variational Bayesian Multimodal Encephalography (VBMEG), fMRI-based priors are optional, and one could estimate EEG-CCS even without structural MRI information by assuming a predetermined leadfield matrix based on a standard brain. As an example, we succeeded in decoding the differences between eight finger movements without fMRI-based priors. [9] Therefore, one could assess the necessity of MRI data depending on the specific application scenario.

Future Practical Applications
The information shown in Figure 4 suggests the existence of individual differences in the representation of auditory stimuli. One could synthesize sounds based on selected cortical regions of interest (ROIs): the quality of the resulting material replayed from the MCC time-courses would provide a practical and viable index of whether the corresponding neural activity contains relevant information. Therefore, by changing the ROIs and the timing of the EEG signals, one may investigate individual auditory processing in a spatiotemporal manner, for example, tracing how auditory information flows along the "what" stream. The method might be applicable to uncommunicative people with disorders of consciousness such as vegetative and locked-in states. Using EEG to synthesize sounds from neuroelectrical activity may also provide new windows onto the brain areas involved in central hearing impairment, auditory hallucinations, and tinnitus.

Conclusion
This study demonstrated the synthesis of individual vowels from the EEG, a precursor to eventually synthesizing speech in full. The speech sounds synthesized during listening and recalling were sufficiently natural to achieve recognition rates >85%, and the decoder extracted signals from brain areas encompassed in the "what" auditory stream. The physiologically plausible decoders further revealed different localization between the listening- and recalling-phase auditory processes. This approach may enable future applications toward investigating individual auditory processing.

Experimental Section
Experimental Design: This study leveraged the combined EEG and MRI datasets of ten healthy participants recorded in our previous research. [8] Based on the data in Table 1, Cohen's d values of 14.6 and 15.9 were observed for the listening and recalling phases, respectively. As this indicated a very large effect size, even this relatively small population yielded a power close to unity. [35] Nineteen healthy listeners with normal hearing (4 females and 15 males, 23-61 years old, average 33 years) who had not participated in the EEG experiment took part in the sound discrimination task. They listened via headphones (PX/H, Bowers & Wilkins, Worthing, UK) to the synthesized sounds one by one and answered which vowel (/a/ or /i/) they thought they had heard. The synthesized sounds from the listening and recalling periods were merged, shuffled, and presented to them. White-noise epochs were excluded, as their classification was trivial. Each listener heard the sounds drawn from a random sample of five individual datasets, ensuring that all datasets were equally included while minimizing mental fatigue.
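The effect-size and power reasoning can be reproduced along the following lines; the effect size is taken as given (d ≈ 14.6), and the use of statsmodels' one-sample t-test power model is an assumption about the exact calculation, for which the paper cites [35].

```python
# Sketch of the power argument: with Cohen's d around 14.6, even n = 10
# yields power ~ 1. The one-sample t-test power model is an assumption.
import numpy as np
from statsmodels.stats.power import TTestPower

def cohens_d(x: np.ndarray, y: np.ndarray) -> float:
    """Cohen's d with a pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                     / (nx + ny - 2))
    return (x.mean() - y.mean()) / pooled

print(TTestPower().power(effect_size=14.6, nobs=10, alpha=0.05))  # ~1.0
```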
The experimental protocols for the EEG and sound discrimination experiments were approved by the ethics committee of the Tokyo Institute of Technology, Japan (approval no. 2014040). The experimental protocols for the fMRI experiments were approved by the ethics committee of the National Center of Neurology and Psychiatry, Japan (Approval No. A2014-020). Written informed consent was obtained from each participant prior to the experiment, and the experiments were conducted in accordance with the Declaration of Helsinki.
EEG and fMRI Data Acquisitions: The continuous EEG was recorded while ten participants listened to two vowel sounds, /a/ and /i/, and subsequently recalled them. White noise was also delivered as a nonspeech condition, and the participants did not recall anything under this condition; the purpose of its inclusion was to investigate and confirm the CNN performance in dissociating widely diverse MCC patterns. The vowel sounds of a male speaker were sampled from the Tohoku University-Matsushita Isolated Word Database (Speech Resources Consortium, National Institute of Informatics, Tokyo, Japan). The sound duration was 400 ms, including 100 ms of silence. During the EEG experiment, the participants listened to and recalled the sound only once per epoch, followed by a 2-3 s rest period, whereas during the fMRI experiment, they listened to and recalled the sounds six times in a block (Figure 1). The three sounds were presented in pseudorandom order, and the number of trials was identical for all three conditions (i.e., 50 trials each for the EEG experiments and 30 trials each for the fMRI experiments). To reduce artifacts in the EEG recording, eye blinking was allowed only during the rest period, while a white fixation cross was presented. To reduce the risk of electromagnetic contamination of the EEG, the auditory stimuli were presented exclusively by means of in-ear headphones (Image S4i, Klipsch Inc., Indianapolis, USA), whose cables were routed well away from those of the electrode cap. The raw EEG signals were visually inspected, and the absence of macroscopic transduction artifacts was confirmed. [36] Furthermore, the zero-lag correlation between the averaged EEG responses at all electrodes, up-sampled to the audio sample rate, and the actual auditory stimulus waveforms was assessed to evaluate whether it was statistically significantly higher for the vowel sound being listened to than for the other one. To enhance the probability of detecting a possible transduction artifact, this operation was repeated considering each raw audio track, low-pass filtered as the EEG signals, as well as its envelope. No significant effects were found (paired t-test, p > 0.28). The EEG signals were recorded at a sampling rate of 256 Hz from 32 locations according to the extended international 10-20 system using g.LADYbird active sensors connected to a g.USBamp amplifier/digitizer system (g.tec Medical Engineering, Graz, Austria). These locations were: Fp1, Fp2, AF3, AF4, Fz, F3, F4, F7, F8, FC1, FC2, FC5, FC6, Cz, C3, C4, T7, T8, CP1, CP2, CP5, CP6, Pz, P3, P4, P7, P8, PO3, PO4, O1, O2, and Oz. The signals from both earlobes were also recorded. All channels were band-pass filtered from 0.5 to 100 Hz during recording. After the EEG experiments, the 3D coordinates of the sensors on the scalp were measured using a Polaris Spectra optical tracking system (Northern Digital Inc., Ontario, Canada) with three reference points (the nasion and the left and right preauricular points).
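The transduction-artifact check described above amounts to a zero-lag correlation between the trial-averaged EEG, up-sampled to the audio rate, and the stimulus waveform; a sketch follows, in which the audio sampling rate and array shapes are assumptions.

```python
# Hedged sketch of the transduction-artifact check. The audio sampling rate
# and array shapes are assumptions for illustration.
import numpy as np
from scipy.signal import resample
from scipy.stats import pearsonr

EEG_FS, AUDIO_FS = 256, 16000  # AUDIO_FS is an assumed rate

def zero_lag_corr(avg_eeg: np.ndarray, stimulus: np.ndarray) -> np.ndarray:
    """Per-channel zero-lag Pearson r between up-sampled EEG and audio.

    avg_eeg: (n_channels, n_eeg_samples) trial-averaged EEG
    stimulus: (n_audio_samples,) stimulus waveform at AUDIO_FS
    """
    n_up = int(avg_eeg.shape[1] * AUDIO_FS / EEG_FS)
    eeg_up = resample(avg_eeg, n_up, axis=1)  # up-sample EEG to the audio rate
    n = min(n_up, len(stimulus))
    return np.array([pearsonr(ch[:n], stimulus[:n])[0] for ch in eeg_up])
```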
As discussed, MRI acquisition is not essential for estimating EEG-CCS but was carried out to maximize accuracy. Using a 3 Tesla Magnetom Trio MRI scanner equipped with an 8-channel array coil (Siemens, Erlangen, Germany), functional images were acquired with a T2*-weighted gradient-echo, echo-planar imaging sequence with the following parameters: repetition time (TR) = 3 s; echo time (TE) = 30 ms; flip angle (FA) = 90°; field of view (FoV) = 192 × 192 mm; matrix size = 64 × 64; 43 slices; slice thickness = 3 mm; 118 volumes. Two series of anatomical images were acquired using T1-weighted magnetization-prepared rapid gradient-echo sequences with sagittal and axial scans using the following parameters: for the sagittal scans, TR = 2 s; TE = 4.38 ms; FA = 8°; FoV = 256 × 256 mm; matrix size = 256 × 256; 224 slices; slice thickness = 1 mm; for the axial scans, TR = 2 s; TE = 4.38 ms; FA = 8°; FoV = 192 × 192 mm; matrix size = 192 × 192; 160 slices; slice thickness = 1 mm. The axial scan coverage was precisely the same as that of the functional volumes, whereas the sagittal images ensured coverage of the entire brain and head, fulfilling the requirements for specifying individual brain models (see next section). The functional images were coregistered to the sagittal image via the axial image to enhance coregistration accuracy.
Cortical Current Source Estimation: We used the VBMEG toolbox version 1.0 (ATR Neural Information Analysis Laboratories, Japan; available at http://vbmeg.atr.jp/?lang=en) running on MATLAB R2012b to estimate the EEG-CCS from the sensor signals. As shown in Figure 1, the EEG-CCS signals were calculated by applying an inverse filter to the EEG sensor signals. As detailed in the study by Yoshimura et al., [8] the inverse filter was estimated separately for each participant by conjointly considering the variance in the baseline part of the EEG signals, the fMRI activations, the EEG sensor coordinates, and the structural MRI.
Each T1-weighted sagittal scan was used to define the positions of cortical surface vertices for a brain model and to construct a three-layer head model. The former consisted of the 3D coordinates of 20 000 CCS vertices defined on a polygon model, such that the dipole vertices were equidistantly distributed on, and oriented perpendicular to, the cortical surface, as dictated by the geometrical structure of the sulci and gyri. The three-layer head model was used to estimate a forward model (i.e., a "leadfield matrix") that projected the 20 000 CCSs to the 32 sensors by considering not only geometry but also conductivity differences between the scalp, skull, and cerebrospinal fluid. The EEG sensor positions were also coregistered as XYZ coordinates in the sagittal image space, and the leadfield matrix was estimated. This process was completed using functions from the VBMEG toolbox.
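VBMEG's hierarchical Bayesian estimation runs in MATLAB and is not reproduced here; as a conceptual stand-in only, the sketch below shows the generic linear inverse that maps sensor signals to cortical currents given a leadfield, using simple minimum-norm (Tikhonov) regularization rather than VBMEG's priors.

```python
# Conceptual stand-in for an inverse filter: minimum-norm (Tikhonov)
# regularization, NOT VBMEG's hierarchical Bayesian method.
import numpy as np

def minimum_norm_inverse(leadfield: np.ndarray, lam: float) -> np.ndarray:
    """Return W (n_vertices, n_sensors) such that currents J = W @ M.

    leadfield: forward model L with shape (n_sensors, n_vertices).
    """
    L = leadfield
    n_sensors = L.shape[0]
    gram = L @ L.T + lam * np.eye(n_sensors)  # regularized sensor-space Gram
    return L.T @ np.linalg.solve(gram, np.eye(n_sensors))

# Usage: currents = minimum_norm_inverse(L, lam=1e-2) @ eeg_epoch
```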
The areas activated during the fMRI experiments were identified using the SPM8 software (Wellcome Department of Cognitive Neurology, UK; available at http://www.fil.ion.ucl.ac.uk/spm). T-values for two contrasts, namely vowel /a/ > vowel /i/ and vowel /i/ > vowel /a/, were calculated, applying thresholds that ranged between 1.3 and 3.1 depending on the individual participant. The threshold was the same as that chosen in our previous research because it performed best in the classification of covert vowel articulation. [8] The brain-area activity information from the two contrasts was merged using an "OR" criterion and entered into the inverse filter estimation in VBMEG as a Bayesian prior.
Prior to source localization, all EEG signals were further low-pass filtered at a cutoff frequency of 45 Hz and re-referenced to the average of the earlobes. Fifty epochs per sound were extracted, time-locked to the auditory stimulation onset. The brain model, the leadfield matrix, the fMRI activations, and the variance in the baseline part of the epoched EEG signals were conjointly used to estimate the inverse filters within a hierarchical Bayesian framework. [6,20] The conditions and parameters used to estimate these filters were the same as those specified in a previous study, selected through leave-one-out nested cross-validation. [8] Because we included fMRI activation information as a hierarchical Bayesian prior, the locations of the estimated CCSs and the number of vertices differed across participants, [8] as shown in Table 1.
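The preprocessing chain (45 Hz low-pass, earlobe re-referencing, epoching) can be sketched as follows; the filter order and epoch length are assumptions.

```python
# Minimal preprocessing sketch: zero-phase low-pass at 45 Hz, re-reference
# to the earlobe average, then epoch at stimulus onsets. Filter order and
# epoch length are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

FS = 256  # EEG sampling rate in Hz

def preprocess(eeg: np.ndarray, earlobes: np.ndarray,
               onsets: np.ndarray, epoch_s: float = 2.0) -> np.ndarray:
    """eeg: (n_ch, n_samples); earlobes: (2, n_samples); onsets: sample indices."""
    b, a = butter(4, 45 / (FS / 2), btype="low")
    filtered = filtfilt(b, a, eeg, axis=1)        # zero-phase 45 Hz low-pass
    rereferenced = filtered - earlobes.mean(axis=0)
    n = int(epoch_s * FS)
    return np.stack([rereferenced[:, o:o + n] for o in onsets])
```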
Speech Representation and Synthesis: We used the Speech Signal Processing Toolkit (SPTK, available at http://sp-tk.sourceforge.net) to perform speech parameter extraction and synthesis. The SPTK software allowed extracting pitch values and low-order mel-cepstral coefficients from the original waveforms, as well as synthesizing the estimated sounds given the original pitch values (Figure 2A). Because the two vowel sounds were sampled from the same speaker's speech, we only estimated the MCC via the CNN. During preliminary analyses, we determined that fourth-order MCCs were sufficient for auditory sound discrimination (refer to Movie S1, Supporting Information).
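In Python, an equivalent extraction/resynthesis loop can be sketched with pysptk, an open-source wrapper around SPTK routines; the frame length, hop size, pitch search range, and all-pass constant alpha below are assumptions rather than the study's settings.

```python
# Hedged sketch of MCC/pitch extraction and MLSA resynthesis with pysptk
# (a Python wrapper of SPTK routines). All analysis parameters here are
# assumptions, not the paper's settings.
import numpy as np
import pysptk
from pysptk.synthesis import MLSADF, Synthesizer

FRAME_LEN, HOP, ORDER, ALPHA = 1024, 80, 4, 0.42  # 4th-order MCC as in the text

def analyze_resynthesize(x: np.ndarray, fs: int) -> np.ndarray:
    x = x.astype(np.float64)
    n_frames = 1 + (len(x) - FRAME_LEN) // HOP
    idx = np.arange(FRAME_LEN)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = x[idx] * pysptk.blackman(FRAME_LEN)   # windowed analysis frames
    mc = pysptk.mcep(frames, ORDER, ALPHA)         # mel-cepstral coefficients
    pitch = pysptk.swipe(x, fs, HOP, min=60, max=240, otype="pitch")
    excitation = pysptk.excite(pitch, HOP)         # pulse/noise excitation
    b = pysptk.mc2b(mc, ALPHA)                     # MLSA filter coefficients
    synth = Synthesizer(MLSADF(order=ORDER, alpha=ALPHA), HOP)
    return synth.synthesis(excitation[:n_frames * HOP], b)
```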
Convolutional Neural Network: We used a 5-layer CNN, whose structure is shown in Figure S1, Supporting Information, as a regression model for MCC estimation. This network was trained and evaluated in Keras, a neural network library, with a TensorFlow backend, [37] running on a Linux workstation with dual CPUs (Intel Xeon E5-2687W) and dual GPGPUs (NVIDIA Tesla K40). Root-mean-square error was chosen as the loss function, and adaptive moment estimation (Adam) [38] was used as the optimizer with an initial learning rate of 0.0001. As we instantiated a rectified linear unit (ReLU) as the activation function of each layer, He initialization [39,40] was used for the corresponding weights. The model was trained in a leave-one-out cross-validation manner, and its performance was finally evaluated using the coefficient of determination.
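A Keras sketch in the spirit of this setup follows; since Figure S1 is not reproduced here, the layer widths, kernel sizes, and input shape are assumptions, while the RMSE loss, Adam with a 0.0001 learning rate, ReLU activations, and He initialization match the text.

```python
# Hedged Keras sketch of a 5-convolution-layer regression CNN. Layer widths,
# kernel sizes, and input shape are assumptions; loss, optimizer, activation,
# and initialization follow the text.
import tensorflow as tf

N_VERTICES, N_SAMPLES, N_TARGETS = 250, 77, 5 * 40  # assumed shapes

def rmse(y_true, y_pred):
    return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))

def build_model() -> tf.keras.Model:
    model = tf.keras.Sequential([tf.keras.Input((N_VERTICES, N_SAMPLES, 1))])
    for _ in range(5):  # five convolution layers
        model.add(tf.keras.layers.Conv2D(
            32, kernel_size=(1, 5), padding="same", activation="relu",
            kernel_initializer="he_normal"))
    model.add(tf.keras.layers.GlobalAveragePooling2D())
    model.add(tf.keras.layers.Dense(N_TARGETS, kernel_initializer="he_normal"))
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss=rmse)
    return model
```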
Feature Value Extraction from the CNN: One of the benefits of using EEG-CCS is that we could identify the brain areas corresponding to the individual signals. To leverage this opportunity, we calculated mean feature values for individual CCSs from the output of the 5th convolution layer across all 150 cross-validation folds (i.e., 50 trials for each of the three sounds). Because the size of this layer was the number of vertices × 67 convoluted timepoints × 32 filters, the feature values were calculated by averaging over the timepoint and filter dimensions for each vertex per sound. The vertices were labeled according to the automated anatomical labeling (AAL) atlas, [41] and examples of the mean feature values are shown in Figure 4.
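Under the hypothetical model sketched in the previous section, extracting per-vertex mean feature values from the fifth convolution layer could look like this; averaging the ReLU activations over the timepoint and filter dimensions follows the description, while the model structure remains an assumption.

```python
# Sketch of per-vertex mean feature values from the 5th convolution layer,
# assuming the hypothetical model above and the reported layer shape
# (n_vertices x 67 timepoints x 32 filters).
import numpy as np
import tensorflow as tf

def mean_feature_values(model: tf.keras.Model, x: np.ndarray) -> np.ndarray:
    """x: (n_trials, n_vertices, n_samples, 1) -> (n_vertices,) mean features."""
    conv_layers = [l for l in model.layers
                   if isinstance(l, tf.keras.layers.Conv2D)]
    feature_model = tf.keras.Model(model.input, conv_layers[4].output)
    feats = feature_model.predict(x)   # (trials, vertices, time, filters)
    return feats.mean(axis=(0, 2, 3))  # average over trials, time, filters
```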