#### Spike detection

The signals were recorded with multi-channel electrodes at a sampling frequency *ω*_{s} of 20 kHz. They first underwent band-pass filtering to remove the slowly changing local field potential and high-frequency fluctuations. In this study, we compared two types of band-pass filters. The classical window method (CWM) employed a finite impulse response filter derived by taking the difference between two sampling functions with different cut-off frequencies. We used finite impulse response filters rather than infinite impulse response filters because, although the latter are generally faster, they show frequency-dependent phase responses that make accurate detection of spike peaks difficult. Figure 2A shows the CWM filter for the sampling rate *ω*_{s} (inset) and its frequency–response property. The band-pass range, order and window function of the filter are 800 Hz–3 kHz, 50 and Hamming type, respectively. Figure 2B displays the frequency–response property of our finite impulse response filter constructed from a Mexican hat (MXH)-type wavelet for the same sampling frequency (inset). The filter has band-pass frequencies around *ω*_{p} = 2 kHz and its order is only 26. The wavelet is given as *ψ*(*l*) = (1 − (*l*/*s*)²) exp(−*l*²/(2*s*²)) with *s* = 0.25 × *ω*_{s}/*ω*_{p}, where *s* is the time length normalized by *ω*_{s} and *l* is the sampling index (integer). As the two filters are symmetric with respect to time 0, they introduce no phase delays. We note that the MXH filter, with 27 sampled values (including the origin), is computationally less costly than the CWM filter with 51 sampled values. Nevertheless, the MXH filter works as efficiently as the CWM filter in low-cut filtering.
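As a concrete illustration, the MXH kernel can be constructed directly from the wavelet definition. The function names below are ours, as is the final DC-removal step (subtracting the small residual mean of the truncated kernel so that constant offsets are fully rejected); a minimal NumPy sketch:

```python
import numpy as np

def mexican_hat_kernel(fs=20_000.0, fp=2_000.0, order=26):
    """Symmetric FIR kernel sampled from a Mexican-hat wavelet.

    s = 0.25 * fs / fp, evaluated at integer indices l = -order/2 .. order/2
    (27 taps for order 26, including the origin).
    """
    s = 0.25 * fs / fp
    l = np.arange(-(order // 2), order // 2 + 1)
    h = (1.0 - (l / s) ** 2) * np.exp(-l ** 2 / (2.0 * s ** 2))
    # Remove the tiny DC residual of the truncated kernel (our addition),
    # so that constant offsets and slow LFP drifts are fully suppressed.
    return h - h.mean()

def bandpass(x, kernel):
    # The kernel is even-symmetric about l = 0, so filtering is zero-phase
    # and spike peak times are preserved.
    return np.convolve(x, kernel, mode="same")
```

Because the kernel is even-symmetric, the filtered trace incurs no phase delay, which is the property the text emphasizes for accurate peak detection.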

After the band-pass filtering, spikes were detected by amplitude thresholding. As the recorded spikes have negative peaks, the threshold was set to −4*σ* unless otherwise stated, where the SD of noise was estimated as *σ* = median(|*x*|)/0.6745 from the band-passed signal *x* (Hoaglin *et al.*, 1983; Quiroga *et al.*, 2004). The discrete spike waveform detected by each channel was interpolated with quadratic splines, and the precise spike-firing time was defined as the time of the greatest negative peak among all detected spikes in all channels. A spike in general exhibits slightly different peak times at different channels. To avoid detecting the same spike more than once, the waveforms detected within a time window of 0.5 ms were regarded as the same spike.
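The detection step can be sketched as follows. The function name `detect_spikes` and the dead-time merging of nearby peaks are our rendering of the procedure described above (the quadratic-spline refinement of peak times is omitted); the noise SD uses the robust median estimator of Quiroga *et al.* (2004):

```python
import numpy as np

def detect_spikes(x, fs=20_000.0, k=4.0, dead_time_ms=0.5):
    """Amplitude thresholding on a band-passed trace x (negative peaks).

    Returns peak indices and the robust noise SD estimate
    sigma = median(|x|) / 0.6745.
    """
    sigma = np.median(np.abs(x)) / 0.6745
    thr = -k * sigma
    below = x < thr
    # Candidate peaks: sub-threshold samples that are local minima.
    idx = np.where(below[1:-1] & (x[1:-1] <= x[:-2]) & (x[1:-1] <= x[2:]))[0] + 1
    # Merge candidates closer than the dead time, keeping the deepest one,
    # so the same spike is never detected twice.
    dead = int(dead_time_ms * 1e-3 * fs)
    peaks = []
    for i in idx:
        if peaks and i - peaks[-1] < dead:
            if x[i] < x[peaks[-1]]:
                peaks[-1] = i
        else:
            peaks.append(i)
    return np.array(peaks), sigma
```

With the default settings this reproduces the −4*σ* threshold and 0.5-ms merging window of the text.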

Spike detection is the first step in spike sorting and is considered to affect the quantity of sorted spikes. Lowering the detection threshold enables the detection of more spikes. However, most of the detected spikes with small amplitudes are finally grouped into a contaminated cluster, hence adding no valid spike trains. Therefore, detecting more spikes does not necessarily increase the number of spikes that are suitable for further analysis.

#### Feature extraction

At first glance, spike clustering should be more efficient in a higher-dimensional feature space, as more information on the spike waveforms is available. In practice, however, the number of clusters is underestimated when the dimension is increased beyond a certain value. The difficulty arising from the high dimensionality of the data space is called ‘the curse of dimensionality’ (Bishop, 2006), and it should be mitigated by eliminating redundant data information.

In this study, we reduced the dimension of the feature space either by extracting the principal components or by selecting the coefficients of the WT of spike waveforms. In the PCA, the raw data were first filtered by a 300th-order 200-Hz high-pass finite impulse response filter with a Hamming window function. The high filter order effectively eliminated the DC component, a potential obstacle in spike clustering, from the filtered signals at a relatively small computational cost. The filtered data were resampled at 20 kHz, from 0.5 ms before to 1.05 ms after each detected peak time (equivalently, sampling points in the interval [−10 : 21]), such that point 0 coincides with the peak time. Thus, 128-dimensional (four electrodes × 32 points) data were available for each spike. We then extracted 12 principal components from these 128-dimensional data by using PCA.
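A sketch of this PCA pathway, assuming the function names `highpass_filter` and `extract_features` (ours) and SciPy's `firwin` for the 300th-order Hamming-window high-pass design:

```python
import numpy as np
from scipy.signal import firwin

def highpass_filter(x, fs=20_000.0):
    # 300th-order (301-tap) 200-Hz high-pass FIR with a Hamming window.
    taps = firwin(301, 200.0, window="hamming", pass_zero=False, fs=fs)
    return np.convolve(x, taps, mode="same")

def extract_features(traces, peaks, n_comp=12):
    """traces: (4, T) high-passed channels; peaks: spike peak indices.

    Each spike is the window [-10 : 21] around its peak on all four
    channels (4 x 32 = 128 dimensions); PCA keeps n_comp components.
    """
    wins = np.stack([traces[:, p - 10:p + 22].ravel() for p in peaks])
    wins -= wins.mean(axis=0)          # centre the data before PCA
    _, _, vt = np.linalg.svd(wins, full_matrices=False)
    return wins @ vt[:n_comp].T        # principal-component scores
```

The SVD of the centred waveform matrix gives the principal axes directly, avoiding an explicit covariance matrix.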

The PCA, however, is not necessarily useful for clustering, as PCA merely extracts the dimensions exhibiting a large variance in the data distribution, whereas clustering is most effectively executed in the dimensions in which the data distribution exhibits multiple sharp peaks rather than a single broad peak. Therefore, another spike-sorting algorithm employed the WT for extracting the characteristic features of spike waveforms. The raw unfiltered data were resampled at 20 kHz, from 0.5 ms before to 1.05 ms after each detected peak time (equivalently, sampling points in the interval [−10 : 21]), such that point 0 coincides with the peak time. Note that the WT requires no preparatory filtering that depends on an empirical choice of cut-off frequency. We then applied the multi-resolution analysis to the spike waveform (Halata *et al.*, 2000; Quiroga *et al.*, 2004) obtained from each channel and derived its time–frequency coefficients. We used Haar’s wavelet (Haar, 1910; Mallat, 1998) and the Cohen-Daubechies-Feauveau 9/7 (CDF97) wavelet (Cohen *et al.*, 1992; Daubechies, 1992). After the multi-resolution analysis, we obtained a one-dimensional distribution of each coefficient over the ensemble of spikes recorded with each channel.
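The Haar multi-resolution analysis of one 32-sample channel can be written in a few lines; `haar_mra` is our name, and the orthonormal normalization by √2 (which preserves signal energy across levels) is the standard one:

```python
import numpy as np

def haar_mra(x):
    """Haar multi-resolution analysis of a length-2^n signal.

    Returns the final approximation followed by the detail coefficients
    from the coarsest to the finest level (32 values for 32 samples).
    """
    x = np.asarray(x, dtype=float)
    coeffs = []
    while len(x) > 1:
        a = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # approximation (low-pass)
        d = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # detail (high-pass)
        coeffs.append(d)
        x = a
    coeffs.append(x)                              # final approximation
    return np.concatenate(coeffs[::-1])
```

Because the transform is orthonormal, the 32 coefficients carry exactly the information of the 32 waveform samples, merely re-expressed at different time–frequency scales.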

A feature is useful for separating units only if it has a multi-modal distribution, i.e. a distribution with more than one peak. We therefore reduced the dimensionality of the data by selecting the wavelet coefficients with multi-modal distributions. We evaluated each coefficient by applying the RVB clustering algorithm to the distribution of that coefficient, computing *F*_{2} − *F*_{1}, where *F*_{1} and *F*_{2} are objective functions that rate the goodness-of-fit of the model with one or two clusters, respectively (see Fig. 3). We then selected the 22 coefficients with the largest values of *F*_{2} − *F*_{1}. Note that the explicit number of peaks need not be known for this purpose, even if the distribution is better modeled with more than two peaks. To remove redundancy among the extracted features, we further reduced the number of coefficients by using PCA. Our analysis of simultaneous extracellular/intracellular recording data suggested that the present spike clustering is most accurate in a feature dimension of about 8–20 (data not shown). In this study, the dimension was fixed at 12. On the electrophysiological datasets that we analyzed, these coefficients accounted for 98% of the variance of the selected wavelet coefficients. This reduction was crucial for suppressing both the computational load and the error rate in spike clustering. Thus, spikes of the individual neurons were represented in the 12-dimensional feature space spanned by these coefficients.
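The *F*_{2} − *F*_{1} ranking can be approximated without the full RVB machinery. As a hedged stand-in, the sketch below scores each coefficient by the log-likelihood gain of a two-Gaussian EM fit over a single Gaussian; this is a simpler proxy than the Student's-*t* VB objective used in the paper, but it ranks clearly bimodal coefficients above unimodal ones in the same spirit:

```python
import numpy as np
from scipy.special import logsumexp

def bimodality_score(c, n_iter=50):
    """Proxy for F2 - F1: log-likelihood gain of a two-Gaussian EM fit
    over a single Gaussian, for one coefficient across all spikes."""
    c = np.asarray(c, dtype=float)
    n = len(c)
    # Single-Gaussian log-likelihood (maximum-likelihood mean/variance).
    v1 = c.var() + 1e-12
    ll1 = -0.5 * n * (np.log(2.0 * np.pi * v1) + 1.0)
    # Two-component EM, initialized at the lower/upper quartiles.
    mu = np.array([np.percentile(c, 25), np.percentile(c, 75)])
    var = np.array([v1, v1])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        logp = (np.log(w) - 0.5 * np.log(2.0 * np.pi * var)
                - (c[:, None] - mu) ** 2 / (2.0 * var))
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)          # responsibilities
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r * c[:, None]).sum(axis=0) / nk
        var = (r * (c[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-12
    logp = (np.log(w) - 0.5 * np.log(2.0 * np.pi * var)
            - (c[:, None] - mu) ** 2 / (2.0 * var))
    ll2 = logsumexp(logp, axis=1).sum()
    return ll2 - ll1
```

Coefficients would then be ranked by this score and the top ones retained, mirroring the selection of the 22 best coefficients described above.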

The mixture of factor analyzers is known to be a powerful method for overcoming the curse of dimensionality, as it enables feature extraction and clustering in the original data dimension (Görür *et al.*, 2004). In our preliminary studies, however, solving the mixture of factor analyzers was time consuming and required the accurate estimation of many parameters, which often prevented reliable convergence to a reasonably good solution. Therefore, we do not consider the mixture of factor analyzers in the present study. Our open software ‘EToS’, however, provides the mixture of factor analyzers as an option so that users can test it with their data.

#### Clustering

Let *p*(*x*_{n}, *z*_{n} = *k*|*θ*, *m*) be the conditional probability that the *n*-th data point takes a value *x*_{n} and belongs to the *k*-th cluster with probability *α*_{k}, where *θ* = {*α*_{1},..., *α*_{m}, *β*_{1},... *β*_{m}} represents the set of parameters characterizing the clusters and *m* is the number of clusters. In this study, we fit the clusters with a normal mixture model *p*(*x*_{n}, *z*_{n} = *k*|*θ*, *m*) = *α*_{k}*N*(*x*|*β*_{k}) and Student’s *t* mixture model *p*(*x*_{n}, *z*_{n} = *k*|*θ*, *m*) = *α*_{k}*T*(*x*|*β*_{k}), where *N*(*x*|*β*_{k}) and *T*(*x*|*β*_{k}) represent normal and Student’s *t*-distributions, respectively, and the normalized cluster sizes *α*_{k} should satisfy ∑_{k = 1}^{m} *α*_{k} = 1. For the normal distribution, *β*_{k} = {*μ*_{k}, ∑_{k}}, where *μ*_{k} and ∑_{k} are the mean and covariance of the distribution fit to cluster *k*, respectively. For the Student’s *t*-distribution, *β*_{k} = {*ν*_{k}, *μ*_{k}, ∑_{k}}, where *ν*_{k} is the number of degrees of freedom of the distribution. EM and VB methods were tested in parameter estimation. Thus, we compared the performance of the following four combined algorithms: normal EM (NEM), Student’s *t* EM [robust EM (REM)], normal VB (NVB) and Student’s *t* VB (RVB).
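For reference, the component densities can be evaluated directly. Below is a sketch of the multivariate Student's-*t* log-density (function name ours), whose heavy tails reduce the influence of outlying spikes relative to the normal density, which is what makes the *t* mixture "robust":

```python
import numpy as np
from scipy.special import gammaln

def mvt_logpdf(x, mu, Sigma, nu):
    """Log-density of a d-dimensional Student's t-distribution T(x | beta_k)
    with location mu, scale matrix Sigma and nu degrees of freedom."""
    d = len(mu)
    diff = np.asarray(x, dtype=float) - mu
    L = np.linalg.cholesky(Sigma)              # Sigma = L L^T
    sol = np.linalg.solve(L, diff)
    maha = sol @ sol                           # squared Mahalanobis distance
    logdet = 2.0 * np.log(np.diag(L)).sum()
    return (gammaln((nu + d) / 2.0) - gammaln(nu / 2.0)
            - 0.5 * (d * np.log(nu * np.pi) + logdet)
            - 0.5 * (nu + d) * np.log1p(maha / nu))
```

As *ν* → ∞ this converges to the normal log-density, so the *t*-mixture models (REM/RVB) contain the normal-mixture models (NEM/NVB) as a limiting case.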

Basic algorithms of NEM, REM, NVB and RVB were described in Dempster *et al.* (1977), Peel & McLachlan (2000), Attias (1999) and Archambeau & Verleysen (2007), respectively. The correct number of clusters is usually unknown. In the conventional EM method, we first calculate the responsibilities *ρ*_{nk} = *p*(*z*_{n} = *k*|*x*_{n}, *θ*^{(t)}, *m*) for the parameters *θ*^{(t)} fixed at step *t*, and then determine the revised parameters by maximizing the expected complete-data log-likelihood ∑_{n}∑_{k} *ρ*_{nk} log *p*(*x*_{n}, *z*_{n} = *k*|*θ*, *m*) over *θ* for the given data set {*x*_{1}, ..., *x*_{N}}, where *N* is the number of data points. *θ*^{(t + 1)} is then set to this value and the above procedure is repeated until a stable solution is obtained for a given value of *m*. Data point *x*_{n} is classified into the cluster that has the largest value of *ρ*_{nk}. If, however, this value is smaller than a critical value *z*_{th}, the spike is regarded as not belonging to any cluster and is discarded. The solutions obtained for various values of *m* are examined with the minimum message length (MML) criterion (Wallace & Freeman, 1987; Figueiredo & Jain, 2000; Shoham *et al.*, 2003). Namely, we calculate the following penalized log-likelihood for different values of *m*

- *F*_{m} = ∑_{n = 1}^{N} log *p*(*x*_{n}|*θ*, *m*) − (*N*_{p}/2) ∑_{k = 1}^{m} log(*N* *α*_{k})  (1)

where *N*_{p} is the number of parameters per component distribution (see Supporting information, Appendix S1). The second term penalizes solutions with large *m*, i.e. many clusters. The value of *m* that maximizes *F*_{m} is chosen.
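Model selection over *m* then reduces to comparing penalized scores. A minimal sketch, assuming the penalty takes the form (*N*_{p}/2) ∑_{k} log(*N α*_{k}) with *N*_{p} parameters per component (the constant terms of the full MML criterion are omitted, and both function names are ours):

```python
import numpy as np

def mml_score(loglik, alphas, n_params_per_comp, n):
    """Penalized log-likelihood: fit term minus (N_p/2) * sum_k log(N * alpha_k)."""
    alphas = np.asarray(alphas, dtype=float)
    return loglik - 0.5 * n_params_per_comp * np.sum(np.log(n * alphas))

def best_m(candidates, n_params_per_comp, n):
    """candidates: {m: (loglik_m, alphas_m)}; returns the m maximizing F_m."""
    return max(candidates,
               key=lambda m: mml_score(*candidates[m], n_params_per_comp, n))
```

With equal fit quality, the extra log(*N α*_{k}) terms of a larger model lower its score, so the criterion prefers fewer clusters unless the added components genuinely improve the likelihood.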

The VB is a general technique for solving for the posterior probability distribution of continuous variables. It calculates an approximate distribution of the posterior, assuming that the probability variables are mutually independent. This assumption significantly reduces the computational cost. Thus, in VB, we alternately renew the probability distributions of the latent variables *z* and the parameters *θ* according to

- *q*(*z*_{n} = *k*) ∝ *ρ̃*_{nk} = exp[∫ d*θ* *q*(*θ*) log *p*(*x*_{n}, *z*_{n} = *k*|*θ*, *m*)]  (2)

- *q*(*θ*) ∝ *p*(*θ*|*m*) exp[∑_{*z*} *q*(*z*) log *p*(*x*, *z*|*θ*, *m*)]  (3)

for a given prior distribution *p*(*θ*|*m*). Here, *q*(*z*) and *q*(*θ*) are estimates of the probability distributions *p*(*z*|*x*, *m*) and *p*(*θ*|*x*, *m*) that we are iteratively improving. The most adequate model is the one with the number of clusters that maximizes the lower bound of the log-evidence, which can be approximated by a penalized log-likelihood as (Takekawa & Fukai, 2009)

- *F*_{m} = ∑_{n = 1}^{N} log ∑_{k = 1}^{m} *ρ̃*_{nk} − KL[*q*(*θ*)||*p*(*θ*|*m*)]  (4)

where *ρ̃*_{nk}, defined in Equation (2), is the generalized likelihood that the *n*-th data point *x*_{n} belongs to cluster *k*, and the Kullback-Leibler divergence of the prior distribution and the test function is defined as

- KL[*q*(*θ*)||*p*(*θ*|*m*)] = ∫ d*θ* *q*(*θ*) log[*q*(*θ*)/*p*(*θ*|*m*)]  (5)

To avoid convergence to local maxima in solving NEM, REM, NVB and RVB, we introduced an additional trick, the deterministic annealing method (Ueda & Nakano, 1998; Katahira *et al.*, 2008). In this method, we introduce a ‘temperature parameter’ *β* and replace *ρ*_{nk} with *ρ*_{nk}^{β}/∑_{k′} *ρ*_{nk′}^{β} in the above calculations. Initially, *β* < 1. Small values of *β* smooth away the local maxima that may trap the iterative solution and thus make convergence to the global maximum easier. The value of *β* was renewed at each step *t* according to *β* = 0.01 × 1.05^{t} until *β* > 1 and was thereafter kept at *β* = 1. Initially, the number of components *m* should be sufficiently large, and the algorithm may subsequently eliminate redundant components until this number converges. To ensure convergence to an optimal solution, we erased the smallest cluster and compared the penalized log-likelihood between the eliminated and previous models. Calculations for a given *m* were repeated until a convergence criterion on the penalized log-likelihood was satisfied, and the model with the larger log-likelihood was employed. This process was repeated until the eliminated model was rejected. The details of the algorithm and the prior for VB are described in the supporting Appendix S1.
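The annealing update itself is a one-liner; a sketch with our function names, in which *β* < 1 flattens the responsibilities toward uniform and the schedule follows *β* = 0.01 × 1.05^{t} capped at 1:

```python
import numpy as np

def annealed_responsibilities(rho, beta):
    """Replace each responsibility rho_nk by rho_nk**beta, renormalized over
    clusters; beta < 1 flattens the assignments and smooths the objective."""
    r = np.power(np.clip(rho, 1e-300, None), beta)
    return r / r.sum(axis=1, keepdims=True)

def beta_schedule(t):
    """Temperature schedule beta = 0.01 * 1.05**t, capped at beta = 1."""
    return min(0.01 * 1.05 ** t, 1.0)
```

At *β* = 0 every cluster receives equal responsibility; as *β* rises toward 1, the assignments sharpen back to the ordinary EM/VB updates.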