Hyperspectral image classification using 3D-2D CNN

Recent works have shown that deep-learning methods based on convolutional neural networks can achieve high accuracy when applied to computer vision tasks such as object detection, segmentation and classification, particularly on hyperspectral images. However, the existing methods have long training times. To reduce the training time and increase the accuracy, this paper proposes a new combined 3D-2D convolutional neural network for hyperspectral image classification. For this purpose, a 3D fast learning block (a depthwise separable convolution block and a fast convolution block) followed by a 2D convolutional neural network was introduced to extract spectral-spatial features. Four datasets were used for the experiments, and the results show that the proposed method achieves excellent results on both small and large training data compared with existing methods. The proposed method increased the overall accuracies by 2% on the UP and KSC datasets while significantly reducing the training time on the IP and KSC datasets. It increased all accuracies by at least 6% on the IP, KSC and UP datasets when compared to some state-of-the-art methods. It also considerably reduced the training and testing time on the IP and KSC datasets.


INTRODUCTION
Hyperspectral images (HSI) are images acquired by sensor technologies. Advances in these technologies are the main reason for their recent wide applicability in various fields such as agriculture [1], healthcare [2], road segmentation [3], urban planning, vehicle navigation [4] and ocean research [5]. The task most commonly required by these applications is HSI classification. However, images captured by these sensor technologies have a high spatial-spectral resolution, which leads to the curse of dimensionality problem [6]. This greatly affects supervised classification methods, in which the size of the training set may be insufficient to accurately derive the statistical parameters, leading the classifier to quickly overfit [7]. For these reasons, the classification process is challenging.
A recent survey has shown that HSI classification can be tackled using handcrafted feature extraction and learning-based feature extraction techniques [8]. In the past few decades, handcrafted feature description has been used in many HSI classification approaches. Yang and Qian [9] proposed a joint collaborative representation model with the use of a locally adaptive dictionary. Zhang and Qi [10], however, employed a mathematical morphological method to extract extended multi-attribute profiles, with sparse multinomial logistic regression as the classifier. Li et al. [11] employed local binary patterns to extract local image features and used the efficient extreme learning machine with a simplified structure as the classifier. The results were better than those of previous methods using the support vector machine (SVM) [12], in which the authors applied SVM to detect aflatoxin in corn kernels from fluorescence HSI.
Most of the traditional feature extraction methods mentioned above are based on handcrafted feature engineering and shallow architectures; therefore, they mainly focus on features such as edges, texture and rotation angle. However, the invariance and robustness of such features are limited, and they cannot easily handle high intra-class variability and low inter-class variability [13,14]. To solve this problem, deep learning (DL), a subset of machine learning, has been used for HSI classification because of its capability to learn features automatically. Among DL architectures, the convolutional neural network (CNN) has shown promising results on computer vision [15] tasks such as image classification, object detection and image segmentation. As a result of this progress, it has been introduced into HSI classification [16]. Chen et al. [17] first proposed a DL framework to jointly extract spatial-spectral features using stacked auto-encoders, principal component analysis and logistic regression to obtain the classification result. The main drawback of their work was the long training time, but more importantly, they proved the potential of DL for HSI classification. The following year, Hu et al. [18] introduced the CNN into hyperspectral classification using only the spectral information. In 2016, Chen et al. [19] presented a deep feature extraction technique based on a 3D CNN with combined regularization for effective spectral-spatial feature extraction from HSI.
In 2017, Zhong et al. [20] proposed a deep spectral-spatial residual network (SSRN) using ResNet-style blocks with different depths and widths for HSI classification. In 2018, Wang et al. [21] proposed a fast dense spectral-spatial convolution network (FDSSC) for HSI classification. They used different convolution kernel sizes to extract spectral and spatial features separately, with a dimension-reduction layer to handle the high dimensionality. However, a drawback of some of the above methods is their increased computational complexity. To address this, Roy et al. [22] proposed a hybrid spectral CNN (HybridSN) for HSI classification. They used a spectral-spatial 3D CNN followed by a 2D CNN for spatial-spectral feature extraction. Inspired by papers [21,22] and to tackle some of the problems mentioned above, this study aims to build a new 3D-2D CNN model for HSI classification that is deeper, lighter, more accurate and faster. It takes advantage of the depthwise separable convolution proposed in 2017 by Howard et al. [23] for building deep, lightweight neural networks, and of the 1 × 1 × 1 convolution proposed by Lin et al. [24] in their work "Network In Network". The proposed 3D-2D CNN model includes a fast learning block, a dimension-reduction block and a 2D convolution layer.
The proposed model presents the following characteristics, which make it different from the models mentioned above:
• It uses a 3D fast learning block, which makes the model more robust and efficient by introducing a 3D depthwise separable convolution block and a fast convolution block.
• The network parameters are fewer compared to the existing methods, which reduces overfitting.
• The training time is much shorter while producing state-of-the-art performance by using the proposed fast convolution block.
The rest of this paper is organized as follows. Section 2 presents some theories behind the proposed model. Section 3 presents the proposed method for HSI classification. The experiments done are reported in Section 4. The HSI classification results of the proposed model are presented and the performance and training time compared with other classification methods are discussed in Section 5. Section 6 provides a summary, as well as suggestions for future work.

BACKGROUND
This section introduces some terms used in our work.

2D CNN/3D CNN
The standard 2D convolution only uses features from the spatial dimension. The neuron value $v_{i,j}^{x,y}$ at a spatial position $(x, y)$ in the $j$th feature map of the $i$th layer is obtained by the following equation:

$$v_{i,j}^{x,y} = a\left(b_{i,j} + \sum_{m}\sum_{p=0}^{H_i-1}\sum_{q=0}^{W_i-1} w_{i,j,m}^{p,q}\, v_{(i-1),m}^{(x+p),(y+q)}\right)$$

where $a$ denotes the activation function, $b_{i,j}$ is the bias parameter of the $j$th feature map in the $i$th layer, $m$ indexes over the set of feature maps in the $(i-1)$th layer connected to the current $j$th feature map, $w_{i,j,m}^{p,q}$ is the weight at position $(p, q)$ connected to the $m$th feature map, and $H_i$ and $W_i$ are the height and the width of the spatial convolution kernel (Figure 1).
In contrast, the 3D convolution captures both spatial and spectral features by convolving a 3D kernel with 3D data. The neuron value $v_{i,j}^{x,y,z}$ at a position $(x, y, z)$ of the $j$th feature map in the $i$th layer is obtained using the following equation:

$$v_{i,j}^{x,y,z} = a\left(b_{i,j} + \sum_{m}\sum_{p=0}^{H_i-1}\sum_{q=0}^{W_i-1}\sum_{s=0}^{S_i-1} w_{i,j,m}^{p,q,s}\, v_{(i-1),m}^{(x+p),(y+q),(z+s)}\right)$$

where $a$ denotes the activation function, $b_{i,j}$ is the bias parameter of the $j$th feature map in the $i$th layer, $m$ indexes over the set of feature maps in the $(i-1)$th layer connected to the current $j$th feature map, $w_{i,j,m}^{p,q,s}$ is the weight at position $(p, q, s)$ connected to the $m$th feature map, $H_i$ and $W_i$ are the height and the width of the spatial convolution kernel, and $S_i$ is the size/depth of the 3D kernel along the spectral dimension.
The computational cost of a 3D convolution operation is

$$k \times k \times k \times c_F \times c_G \times l_G \times w_G \times h_G,$$

where $l_F \times w_F \times h_F \times c_F$ and $l_G \times w_G \times h_G \times c_G$ are the input and output sizes, respectively, with $l$, $w$ and $h$ representing the length, width and height, and $c_F$, $c_G$ are the number of channels before and after the convolution. The kernel $K$ here is of size $k \times k \times k \times c_G \times c_F$, where $k$ is the filter side length.

3D depthwise separable convolution
As in the original paper [23], the 3D depthwise separable convolution is a factorized convolution which divides the standard 3D convolution into a 3D depthwise convolution and a 1 × 1 × 1 convolution, called the 3D pointwise convolution, which later combines the outputs of the 3D depthwise convolution. This factorization significantly reduces the computation and the model size.
The computational cost of such a decomposition is

$$k \times k \times k \times c_F \times l_G \times w_G \times h_G + c_F \times c_G \times l_G \times w_G \times h_G,$$

where $l_F \times w_F \times h_F \times c_F$ and $l_G \times w_G \times h_G \times c_G$ are the input and output sizes, respectively, with $l$, $w$ and $h$ representing the length, width and height, and $c_F$, $c_G$ are the number of channels before and after the convolution. The kernels here are the $k \times k \times k \times c_F$ depthwise kernel and the $1 \times 1 \times 1 \times c_F \times c_G$ pointwise kernel. Compared with the standard 3D CNN, the computational cost becomes

$$\frac{k^{3} c_F\, l_G w_G h_G + c_F c_G\, l_G w_G h_G}{k^{3} c_F c_G\, l_G w_G h_G} = \frac{1}{c_G} + \frac{1}{k^{3}}.$$

The amount of calculation is thus reduced by about eight to nine times.
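As a quick numerical illustration of these formulas (a sketch only; the kernel size, channel counts and output size below are arbitrary assumptions rather than values from this work), the following Python snippet computes both costs and their ratio, which equals 1/c_G + 1/k^3:

```python
def conv3d_cost(k, c_f, c_g, l_g, w_g, h_g):
    """Multiply-accumulate count of a standard 3D convolution."""
    return k * k * k * c_f * c_g * l_g * w_g * h_g

def separable_conv3d_cost(k, c_f, c_g, l_g, w_g, h_g):
    """Depthwise 3D convolution plus 1 x 1 x 1 pointwise convolution."""
    depthwise = k * k * k * c_f * l_g * w_g * h_g
    pointwise = c_f * c_g * l_g * w_g * h_g
    return depthwise + pointwise

# Illustrative values (assumptions): 3x3x3 kernel, 32 -> 64 channels, 11x11x16 output.
std = conv3d_cost(3, 32, 64, 11, 11, 16)
sep = separable_conv3d_cost(3, 32, 64, 11, 11, 16)
print(sep / std)  # 1/64 + 1/27 ≈ 0.053
```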

PROPOSED METHOD
A 2D CNN is not a good extractor of spectral feature maps, whereas a 3D CNN can extract both spatial and spectral feature maps [25]; however, because of the increase in parameters, it sometimes faces optimization problems such as gradient explosion and overfitting, so a large training sample is needed, which limits the application of 3D CNNs [26]. The model proposed in this study combines a 3D CNN and a 2D CNN to extract good spectral and spatial feature maps from the HSI at a low cost. The proposed model uses stacked 3D convolution layers (a Conv3D fast learning block) followed by a dimension-reduction block, which consists of a Conv3D, a reshaping operation and another Conv3D; the output feature maps from that block are then reshaped and fed to a Conv2D layer to learn more spatial features. The output of the Conv2D layer is flattened and passed to the first fully connected layer, and a dropout layer is added before the last fully connected layer. The 3D fast learning CNN block of the proposed model has a much lower computational cost and is faster than a normal 3D CNN block because of the depthwise separable convolution and the fast convolution block it contains. The architecture of the proposed method is shown in Figure 2.
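To make the 3D-to-2D hand-off concrete, the following is a minimal PyTorch sketch (not the authors' implementation; all shapes and layer sizes are assumptions for illustration) of how the 3D feature maps can be reshaped so that the spectral depth is folded into the channel axis before the Conv2D layer:

```python
import torch
import torch.nn as nn

# Hypothetical 3D feature maps: (batch, channels, spectral depth, height, width).
x3d = torch.randn(8, 16, 4, 11, 11)
b, c, d, h, w = x3d.shape

# Fold the spectral depth into the channel axis so a 2D convolution can be applied.
x2d = x3d.reshape(b, c * d, h, w)             # (8, 64, 11, 11)
conv2d = nn.Conv2d(c * d, 64, kernel_size=3)  # learns further spatial features
out = conv2d(x2d)                             # (8, 64, 9, 9)
```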

3D Fast learning block
As mentioned by LeCun et al. [15], the deeper a CNN is, the better its performance becomes. However, increasing the depth is computationally costly.
To solve this problem, a 3D fast learning block is proposed, which consists of a depthwise separable convolution block and a fast convolution block. The depthwise separable convolution block contains a depthwise separable 3D convolution followed by batch normalization, a ReLU activation function, a (1 × 1 × 1) 3D convolution, batch normalization and a ReLU activation function. The fast convolution block consists of a max-pooling layer with kernel size a × a × a, where a = 3, and a 1 × 1 × 1 convolution (an idea inspired by the inception module, which uses a 1 × 1 convolution before applying 3 × 3 and 5 × 5 kernels and max pooling). This fast learning block extracts important spectral as well as spatial features from the HSI data quickly, owing to the block design and the choice of kernel sizes.
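A minimal PyTorch sketch of these two blocks is given below (this is not the authors' original implementation; the 'same' padding, the channel counts and the use of batch normalization inside the fast convolution block are assumptions consistent with the description above):

```python
import torch.nn as nn

class DepthwiseSeparableConv3d(nn.Module):
    """Depthwise separable convolution block: depthwise 3D conv -> BN -> ReLU
    -> 1x1x1 pointwise 3D conv -> BN -> ReLU."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv3d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels)
        self.bn1 = nn.BatchNorm3d(in_channels)
        self.pointwise = nn.Conv3d(in_channels, out_channels, kernel_size=1)
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

class FastConvBlock(nn.Module):
    """Fast convolution block: a x a x a max pooling (a = 3) followed by a
    1x1x1 convolution, in the spirit of the inception module."""
    def __init__(self, in_channels, out_channels, a=3):
        super().__init__()
        self.pool = nn.MaxPool3d(kernel_size=a, stride=1, padding=a // 2)
        self.conv = nn.Conv3d(in_channels, out_channels, kernel_size=1)
        self.bn = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(self.pool(x))))
```

The fast learning block can then be obtained by chaining the two, for example nn.Sequential(DepthwiseSeparableConv3d(1, 24), FastConvBlock(24, 24)), where the channel counts are again only illustrative.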

Handling high dimensionality and overfitting
Stacking 3D convolutions leads to the extraction of high-dimensional features (curse of dimensionality). In the proposed model, this problem is solved at two levels: at the first level, a 1 × 1 × 1 convolution is used before each of the two proposed convolution blocks; at the second level, the output feature map from the fast learning block is passed through the dimension-reduction block proposed in [21] with ReLU activation functions. The dimension-reduction block contains two 3D conv layers with valid padding and a reshaping operation. Let the feature map of the preceding block be of size r × r × b; it is passed through the first 3D conv layer with kernel size 1 × 1 × b and kernel number p (b > p), which generates p feature maps of size r × r × 1. After the reshaping operation, these p feature maps become one channel of size r × r × p. Finally, the second 3D conv layer with kernel size a × a × p and kernel number n changes the feature maps to size s × s × 1. This process reduces the spatial size, the number of channels and the high dimensionality of the data blocks. To reduce overfitting, batch normalization [27] and the ReLU activation function are applied in each block of the proposed model, and dropout is applied before the final fully connected layer.
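A possible PyTorch sketch of this dimension-reduction block is shown below (again a sketch, not the authors' code: the input is assumed to be a single-channel cube of shape (batch, 1, b, r, r), and p, a and n take illustrative values):

```python
import torch.nn as nn

class DimensionReductionBlock(nn.Module):
    """Two valid-padding 3D convolutions with a reshaping operation in between,
    following the description above (after [21])."""
    def __init__(self, b, p=64, a=3, n=128):
        super().__init__()
        self.conv1 = nn.Conv3d(1, p, kernel_size=(b, 1, 1))  # 1 x 1 x b over the spectral axis
        self.conv2 = nn.Conv3d(1, n, kernel_size=(p, a, a))  # a x a x p over the reshaped cube
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (batch, 1, b, r, r) -> p feature maps of size r x r x 1
        x = self.relu(self.conv1(x))           # (batch, p, 1, r, r)
        # Reshape: the p channels become one channel of spectral depth p.
        x = x.permute(0, 2, 1, 3, 4)           # (batch, 1, p, r, r)
        # Second conv: n feature maps of size s x s x 1, with s = r - a + 1.
        return self.relu(self.conv2(x))        # (batch, n, 1, s, s)
```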
The model architecture is detailed in Table 1.

Datasets
The first one is the Indian Pines (IP) dataset. It has a spatial dimension of 145 × 145 pixels and 220 spectral bands in the wavelength range of 400-2500 nm, of which 20 spectral bands covering the water absorption region were removed. The dataset was acquired in 1992 (north-western Indiana, USA) by the AVIRIS spectrometer and has 16 ground-truth classes. For more information on these classes, refer to Table 2. The second one is the Kennedy Space Center (KSC) dataset. It was acquired by the AVIRIS spectrometer over Florida in 1996. This dataset has a 512 × 614 × 176 pixel spatial-spectral dimension and 13 ground-truth classes. Refer to Table 3 for more information about these classes.
The third one is the University of Pavia (UP) dataset. The dataset was acquired by the ROSIS sensor in 2001 (Pavia, Italy). It has a 610 × 340 × 103 pixel spatial-spectral dimension and nine ground-truth classes. More information about these nine classes is given in Table 4.
The fourth one is the Salinas Scene (SA) dataset. The dataset was acquired by the AVIRIS spectrometer over the Salinas valley (California, USA). It has a 512 × 217 × 224 pixel spatial-spectral dimension and 16 ground-truth classes. More information is given in Table 5.

Experiment setup
Four datasets, namely Indian Pines, Kennedy Space Center, University of Pavia and SA, were used in this study. To build a stable architecture, training-sample proportions of 5%, 10% and 20% and window sizes of 7 × 7, 9 × 9, 11 × 11, 13 × 13 and 15 × 15 were tested on all datasets, excluding SA, which was not tested with 15 × 15 because of hardware limitations. The effect of different parameters (training sample, window size, model without the depthwise separable block) on the model performance was assessed. Based on the results reported in Figure 3, a training sample of 20% was chosen, as all the methods achieved better results with it; a window size of 11 × 11 for IP, UP and SA and 15 × 15 for KSC were used, following the FDSSC [21] and HybridSN [22] comparison-study parameters. The batch size of the model was 32, and the RMSprop optimizer was adopted to make the proposed model converge quickly. The model was trained for 50 epochs. In order to obtain the best model, the initial data were divided into a training set, a test set and a validation set. All the training and validation processes were under the control of early stopping and a dynamic learning rate. The overall accuracy (OA), average accuracy (AA) and kappa coefficient [28] were used as the evaluation metrics in the experiments.
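A hedged sketch of this training setup in PyTorch is given below (the authors' framework is not stated, and the initial learning rate, scheduler settings and early-stopping patience are not specified in the text, so the values here are assumptions; model, train_loader and val_loader are assumed to exist):

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

def train(model, train_loader, val_loader, epochs=50, patience=10):
    """Batch size 32 is set in the data loaders; RMSprop, a dynamic learning
    rate and early stopping drive the training, as described in the text."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.RMSprop(model.parameters(), lr=1e-3)
    scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=3)
    best_loss, wait = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for cubes, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(cubes), labels)
            loss.backward()
            optimizer.step()
        # Validation pass controls the dynamic learning rate and early stopping.
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for cubes, labels in val_loader:
                val_loss += criterion(model(cubes), labels).item()
        scheduler.step(val_loss)
        if val_loss < best_loss:
            best_loss, wait = val_loss, 0
        else:
            wait += 1
            if wait >= patience:
                break  # early stopping
```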

RESULTS AND ANALYSIS
In these experiments, the same parameters (window size, training sample) were selected for all the methods for a fair comparison.
First, the classification accuracy of the proposed model was compared with some state-of-the-art models, and the comparison results are shown in Table 6.
From Table 6, it can be seen that the proposed model outperforms the other two models in terms of accuracy. Table 6 also shows that HybridSN achieved the best training and testing times on all datasets; meanwhile, the proposed method achieved the best performance regarding the OA (greater than 99%), AA and kappa coefficient (greater than 99%) on all datasets, except on SA, where FDSSC achieved the best performance. Second, the impact of the training sample on the proposed model is reported in Table 7 and Figure 4. From Table 7, it can be seen that the accuracy increases as the training sample increases, which makes the training sample an important parameter for the model. Table 7 also shows that the proposed model achieved more than 99% OA on all datasets with a 20% training sample (the reason behind its selection for the other experiments). In addition, the results in Table 7 show that the proposed method is capable of achieving excellent results on the UP and SA datasets with both small and large training samples, while producing good results on the IP and KSC datasets with a large training sample (20%). From Table 8, it can be seen that the training time doubled when the training sample was doubled on UP and SA. We can therefore deduce that the more training samples are used, the more training time and the less testing time are needed; conversely, fewer training samples require less training time and more testing time.
In HSIs, neighbouring pixels are considered when extracting the spatial feature representation [29]. However, selecting the neighbouring pixels is challenging: a smaller neighbourhood causes insufficient receptive fields [30], whereas with a larger window size too many neighbouring pixels are included in the calculation, resulting in over-smoothed classification maps (especially when the pixels are at the corners or edges of a category) [31], and category spatial boundaries are not preserved. In Table 9, different window sizes were assessed on the proposed model. The results on the UP and KSC datasets show that the accuracy increased with increasing window size; however, for IP and SA, the accuracy increased with the window size up to a point and then decreased.
In Table 10, the model without the depthwise separable convolution block was assessed. For that experiment, a training sample of 20% and a window size of 11 × 11 for UP and SA and 23 × 23 for KSC and IP were selected, taking into consideration the conclusions of papers [21,22] and the results in Figure 1. With only the fast convolution block, better classification accuracy was achieved on both UP and KSC, with accuracies at least 2% higher than FDSSC and HybridSN. Meanwhile, shorter training and testing times were obtained on the IP and KSC datasets. The results proved that, without the depthwise separable convolution, the proposed model with only the fast convolution block is capable of considerably reducing the training time while achieving high classification accuracy.
In Table 11, the performance of the proposed model without the dimension-reduction block and the depthwise separable convolution block was assessed on the IP and KSC datasets. The same parameters used for Table 10 were used in this experiment as well. The results in Table 11 proved that the proposed model still performed better than FDSSC and HybridSN on IP in terms of OA and kappa accuracy, and in terms of training time on both IP and KSC, even without the dimension-reduction block and the depthwise separable convolution block. This also demonstrated the robustness of the fast convolution block. The proposed method increased the OA, AA and kappa coefficient by at least 6% on IP, KSC and UP when compared to some state-of-the-art methods. Also, the proposed fast convolution block alone is capable of achieving higher accuracy in less training time on IP and KSC when compared to FDSSC and HybridSN. Figure 6 shows the classification maps of the proposed model with 23 × 23 and 9 × 9 (IP, UP) spatial neighbourhood sizes, respectively. For the UP dataset, it can be seen that the small spatial neighbourhood size resulted in some spatial information loss.

CONCLUSIONS
In this work, a new 3D-2D CNN was proposed for HSI classification using four widely used datasets. Specifically, a fast learning block was proposed, which allowed us to build a deep model with less computational cost. The experimental results over the four datasets showed that the proposed model outperforms some recent state-of-the-art models. On the one hand, the proposed fast convolution block has a low computational cost and can achieve better results on both large and small training data. On the other hand, it increased the overall accuracies by 2% on UP and KSC while significantly reducing the training time on IP and KSC. The proposed model increased the OA, AA and kappa by at least 6% on IP, KSC and UP when compared to some state-of-the-art methods. Also, it considerably reduced the training and testing time on IP and KSC when the fast convolution block alone is involved.
For future work, the accuracy of the proposed model could still be improved by using more training data and by exploring pretrained models in order to further reduce the computational cost.