Audio super resolution via deep pyramid wavelet convolutional neural network

In this letter, a pyramid wavelet convolutional neural network for audio super resolution is presented. Since the audio signal is non-stationary, previous convolutional neural network based approaches may fail to capture the details: these methods usually focus on the global approximation error and thus produce over-smooth results. To cope with this issue, it is suggested to predict the wavelet coefficients of the audio signal, and to reconstruct the signal from these coefficients stage by stage. The prediction errors of the wavelet coefficients are included in the loss function to force the model to capture the detail components. Experimental results show that the approach, trained on the public VCTK dataset, achieves more appealing results than state-of-the-art methods.

Related work: ASR is a basic problem in audio signal processing. Various methods have been proposed to solve the ASR problem, such as Gaussian mixture model based approaches [10, 11, 12] and hidden Markov model based approaches [13, 14]. As DNNs show powerful generalisation performance in many machine learning applications, such as computer vision and natural language processing, many DNN based approaches have been developed for ASR [1, 7, 15, 16].
The CNN is a typical DNN structure that shows powerful performance in many computer vision tasks, and it was first employed in image super resolution (ISR). Dong et al. [17] used a fully convolutional neural network, SRCNN, to perform ISR and achieved state-of-the-art performance; they also proved that SRCNN is equivalent to sparse-coding based methods. Kim et al. [18] adopted residual learning to build a much deeper network that makes use of context information. Huang et al. [19] showed that image super resolution can benefit from wavelet information.
For ASR, CNN based approaches are also reported to outperform traditional methods. The CNN based ASR methods can be roughly divided into two classes: time domain based and frequency domain based. Methods of the former class usually adopt an architecture similar to ISR, using the raw audio signal as input to predict the high-resolution signal. Kuleshov et al. [1] proposed a CNN based model with a bottleneck architecture; they adopt skip connections for residual learning and use a pixel shuffling layer for upsampling. Kim et al. [7] adopted a network similar to [1] as the generator of a GAN based method, and use an unsupervised feature loss to improve quality. The latter class focuses on reconstructing the spectrogram. Schmidt et al. [16] proposed a long short-term memory (LSTM) based method, which employs the LSTM to estimate the spectral envelope. Our model is time domain based, and uses the wavelet coefficients to improve the estimation of detailed information.

Method:
Discrete wavelet transform: Given a discrete signal x ∈ R^L, the Mallat algorithm [20] gives a fast filtering implementation of the wavelet decomposition, which is formulated as

c = (x ∗ g) ↓ 2,  d = (x ∗ h) ↓ 2,  (1)

where g and h are the low-pass and high-pass filters of the chosen wavelet transform, respectively, c and d are the approximation and detail components, respectively, and ↓ 2 denotes subsampling with stride 2. Figure 2 illustrates the Haar decomposition of an audio segment. Similarly, the Mallat algorithm formulates the inverse discrete wavelet transform (IDWT) as

x = (c ↑ 2) ∗ g̃ + (d ↑ 2) ∗ h̃,  (3)

where g̃ and h̃ are the synthesis filters and ↑ 2 denotes upsampling by inserting zeros.

Network architecture: To perform more precise super resolution of a low-resolution signal x, we propose to predict the wavelet coefficients of the desired high-resolution signal stage by stage. Figure 1 shows the architecture of our method when the scaling factor is 4. We first use a resblock to extract features of the original low-resolution signal x, then pass the feature maps to two independent convolutional layers to predict the approximation and detail components of the next scale, and finally reconstruct the signal of the next scale; repeating these steps, we reconstruct the high-resolution signal recursively. The PWCNN is a fully convolutional neural network composed of three components:
• Embedding net. The embedding net consists of resblocks, whose architecture is shown in Figure 1. The embedding nets extract features of the raw input data and introduce non-linearity into our model.
• Convolutional layer. An audio signal generally has at most two channels, but we usually need more feature maps to model the input signal; the convolutional layers are located between the input and the embedding nets to match the channel counts.
• IDWT net. As shown in Equation (3), to perform the IDWT we first upsample the input signal by filling zeros, then convolve it with fixed filters. This can be implemented simply by a transposed convolution.
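As an illustration (not the letter's code), a one-level Haar analysis/synthesis pair matching Equations (1) and (3) can be sketched in a few lines; for the Haar wavelet, the filter-and-subsample steps reduce to pairwise sums and differences:

```python
import numpy as np

def dwt_haar(x):
    """One-level Haar DWT: filter with g, h and subsample by 2 (Equation (1))."""
    c = (x[0::2] + x[1::2]) / np.sqrt(2.0)  # approximation component
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # detail component
    return c, d

def idwt_haar(c, d):
    """Inverse Haar DWT: upsample with zeros and filter (Equation (3))."""
    x = np.empty(2 * len(c))
    x[0::2] = (c + d) / np.sqrt(2.0)
    x[1::2] = (c - d) / np.sqrt(2.0)
    return x

x = np.array([4.0, 2.0, 5.0, 5.0])
c, d = dwt_haar(x)
assert np.allclose(idwt_haar(c, d), x)  # perfect reconstruction
```

The same analysis step, applied recursively to c, yields the multi-level pyramid the network predicts stage by stage.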
We empirically set the kernel size of all convolutional layers to three, except in the IDWT nets, and use zero-padding to keep the size of the feature maps unchanged.
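The IDWT net described above can be realised as a single transposed convolution with frozen filters. The sketch below is a minimal PyTorch version for the Haar case; the filter values are an assumption for illustration (the letter fixes the filters but does not list them):

```python
import torch
import torch.nn as nn

class HaarIDWT(nn.Module):
    """IDWT net sketch: zero-upsampling followed by fixed filtering,
    implemented as one ConvTranspose1d with frozen Haar synthesis filters."""
    def __init__(self):
        super().__init__()
        self.up = nn.ConvTranspose1d(2, 1, kernel_size=2, stride=2, bias=False)
        s = 2.0 ** 0.5
        w = torch.tensor([[[1.0 / s, 1.0 / s]],    # synthesis low-pass filter
                          [[1.0 / s, -1.0 / s]]])  # synthesis high-pass filter
        self.up.weight = nn.Parameter(w, requires_grad=False)

    def forward(self, c, d):
        # c, d: (batch, 1, L/2) -> reconstructed signal of shape (batch, 1, L)
        return self.up(torch.cat([c, d], dim=1))
```

Because the weights are frozen, this layer performs an exact IDWT rather than a learned upsampling.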

Loss function: Given a dataset D = {(x_i, y_i)}_{i=1}^N, where x_i are low-resolution signals and y_i are the corresponding ground-truth high-resolution signals, we denote by ŷ_i the predicted high-resolution signal. Let ĉ_i^j and d̂_i^j be the j-level wavelet coefficients of ŷ_i, and c_i^j, d_i^j be the j-level wavelet coefficients of y_i. The loss function of our method combines two parts,

L = L_rec + Σ_{j=1}^{J} ν_j L_wave^j,

where J = log2 S ∈ N+ is the level of the wavelet decomposition, S is the scale of super resolution, and ν_j > 0 are balance coefficients. L_rec measures the reconstruction error between ŷ_i and y_i; L_wave^j measures the prediction error of the detail components and forces the network to predict correct wavelet coefficients.
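A minimal PyTorch sketch of such a two-part loss is given below. The mean-squared-error norm is an assumption for illustration; the letter does not state which norm is used:

```python
import torch

def pwcnn_loss(y_hat, y, d_hats, ds, nu=1.0):
    """Sketch of a two-part loss: a reconstruction term plus wavelet
    detail-coefficient errors weighted by nu (here a single scalar for
    simplicity; per-level weights nu_j are set to 1 in the experiments)."""
    l_rec = torch.mean((y_hat - y) ** 2)
    l_wave = sum(nu * torch.mean((dh - d) ** 2) for dh, d in zip(d_hats, ds))
    return l_rec + l_wave
```

During training, `d_hats` would hold the detail components predicted at each pyramid stage and `ds` the ground-truth coefficients obtained by applying the DWT to y.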

Experiments:
Dataset: To evaluate our method, we perform several experiments on the VCTK dataset [21]. The VCTK dataset contains speech recordings of 109 native English speakers; all recordings were collected with a 96 kHz microphone and downsampled to 48 kHz via STPK. Note that the length L of the original data may not be divisible by the scale S, so we simply crop the first L − mod(L, S) samples of the original data, use the cropped audio as the high-resolution signal, and obtain the low-resolution data by low-pass filtering and downsampling. Following a procedure similar to [1], we construct two datasets from the original VCTK dataset for the single-speaker and multi-speaker tasks. For the single-speaker task, we use the first 223 recordings of the first speaker for training and the last 8 recordings for testing. For the multi-speaker task, we use the recordings of the first 99 speakers for training and test on the remaining recordings.
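The pair-construction step above can be sketched as follows. The moving-average filter is a hypothetical stand-in, since the letter does not specify which low-pass filter is used:

```python
import numpy as np

def make_pair(y, S):
    """Hypothetical preprocessing sketch: crop so the length is divisible
    by the scale S, then derive the low-resolution input by low-pass
    filtering (a simple moving average here) and subsampling by S."""
    L = len(y)
    y = y[: L - L % S]                       # keep the first L - mod(L, S) samples
    kernel = np.ones(S) / S                  # placeholder anti-alias filter
    smooth = np.convolve(y, kernel, mode="same")
    x = smooth[::S]                          # downsample by the scale factor
    return x, y
```

In practice one would replace the moving average with a proper anti-aliasing filter (e.g. a windowed-sinc low-pass) before subsampling.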
Metrics: In this letter, we adopt two frequently-used metrics to evaluate the quality of super resolution. The signal-to-noise ratio (SNR) is defined as

SNR = 20 log10 ( ‖y‖₂ / ‖ŷ − y‖₂ ),

where y is the ground-truth signal and ŷ is the prediction. SNR measures the quality of super resolution in the time domain.
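The SNR definition above translates directly into code:

```python
import numpy as np

def snr(y_hat, y):
    """Signal-to-noise ratio in dB: 20 * log10(||y|| / ||y - y_hat||)."""
    return 20.0 * np.log10(np.linalg.norm(y) / np.linalg.norm(y - y_hat))
```

For example, a prediction whose error norm is one tenth of the signal norm scores 20 dB.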
In the frequency domain, we take the log-spectral distance (LSD) as the metric,

LSD = (1/T) Σ_{t=1}^{T} sqrt( (1/K) Σ_{k=1}^{K} ( X(t, k) − X̂(t, k) )² ),

where X and X̂ denote the log-power STFT spectra of x and x̂, respectively, and t and k index the time frames and frequency bins. In this letter, we use a Hamming window with a window length of 2048 and no overlap.
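A sketch of the LSD under the STFT settings stated above (Hamming window, length 2048, no overlap); the log-power spectrum and the small stabilising constant are assumptions:

```python
import numpy as np

def lsd(y_hat, y, n_fft=2048):
    """Log-spectral distance with a Hamming window and non-overlapping frames."""
    def log_spec(s):
        n = len(s) // n_fft                          # number of full frames
        frames = s[: n * n_fft].reshape(n, n_fft) * np.hamming(n_fft)
        # log-power spectrum; epsilon avoids log(0)
        return np.log10(np.abs(np.fft.rfft(frames, axis=1)) ** 2 + 1e-10)
    X_hat, X = log_spec(y_hat), log_spec(y)
    # RMS over frequency bins, then mean over frames
    return np.mean(np.sqrt(np.mean((X - X_hat) ** 2, axis=1)))
```

A perfect prediction gives an LSD of 0; over-smooth outputs that miss high-frequency energy inflate the per-bin log-spectral differences and hence the LSD.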
Results: We conduct two experiments to compare our model with other baseline methods. We implement our model in PyTorch and empirically set ν_j = 1 for all j. We use the Adam optimiser to train our model with a learning rate of 10⁻⁵. The data within a mini-batch must have identical length, so we randomly crop segments with a length of 8000 samples for training. In the testing phase, we can use the full audio to evaluate the model. Table 1 illustrates the performance comparison for the multi-speaker task. The experimental results show that our approach outperforms the other methods. Table 2 shows the performance comparison for the single-speaker task; we observe that the SNR of our method is lower than that of SSR-GAN when the scaling factor is 4, but the LSD of our method is more appealing. The reason our model achieves a better LSD is that our method focuses on recovering the high-frequency components of the ground-truth audio signal, hence the recovered spectrogram is more precise. Figure 4 illustrates the spectrograms of the original audio and the SR audios on the ×4 SR task. We find that our method has a stronger spectrogram estimation capability, while SSRNet tends to produce an oversmooth spectrogram, which is why our model achieves a better LSD. Figure 5 compares the convergence curves of the L1 loss and the L2 loss under the same settings.

Conclusion and future work: In this letter, we propose a pyramid wavelet CNN (PWCNN) to solve the ASR problem. We propose to use the pyramid architecture to estimate the wavelet decomposition coefficients progressively, and to use the estimation error of the wavelet coefficients as part of the loss function to learn the estimation of detailed information. Experimental results show that our method achieves better performance than previous methods for both ×2 and ×4 ASR tasks.