One-dimensional convolutional neural networks for high-resolution range profile recognition via adaptive feature recalibration and automatic channel pruning

High-resolution range profile (HRRP) has obtained intensive attention in radar target recognition, and convolutional neural networks (CNNs) are among the predominant approaches to HRRP recognition problems. However, most CNNs are designed by rule of thumb and suffer from high computational complexity. Aiming at enhancing the channels of a one-dimensional CNN (1D-CNN) to extract efficient structural information of targets from HRRP while reducing the computational complexity, we propose a novel framework for HRRP-based target recognition built on a 1D-CNN with channel attention and channel pruning. By introducing an aggregation-perception-recalibration (APR) block for channel attention into the 1D-CNN backbone, the channels in each 1D convolutional layer can adaptively learn to recalibrate the extracted features, enhancing the structural information captured from HRRP. To avoid rule-of-thumb design and reduce the computational complexity of the 1D-CNN, we propose a new method that incorporates the global best leading artificial bee colony (GBL-ABC) algorithm to prune the original network, based on the lottery ticket hypothesis, in an automatic and heuristic manner. Extensive experimental results on the measured data illustrate that the proposed algorithm achieves a superior recognition rate by combining APR and GBL-ABC simultaneously.


| INTRODUCTION
High-resolution range profile (HRRP) is the superposition of wideband radar echoes of target scattering centers in the corresponding range cells, representing a one-dimensional projection of the scattering intensity distribution onto the radar line-of-sight (RLOS). It not only provides the geometric and structural characteristics of the target but also contains abundant discriminative information for target recognition. Compared with two-dimensional synthetic aperture radar and inverse synthetic aperture radar images, which have also obtained intensive attention in radar target recognition, HRRP has the advantages of easy acquisition and storage as well as low computational complexity. Therefore, it is widely used in the recognition of a variety of radar targets, such as ballistic missiles,1-4 ships,5-7 tanks,8,9 airplanes,10-14 and so on, and it has become a hotspot in the community of radar automatic target recognition (RATR).
In recent years, deep learning has been among the prevalent techniques for target recognition tasks and is extensively utilized for HRRP recognition due to its powerful feature expressive ability. Du et al.15 utilize a variational auto-encoder to learn discriminative latent features of HRRP by incorporating the label information and take the multilayer perceptron (MLP) as a sufficient statistic to improve recognition performance. Pan et al.16 take advantage of t-distributed stochastic neighbor embedding (t-SNE) and synthetic sampling to acquire a balanced HRRP data set, and then propose a discriminant deep belief network for recognition. Considering the dependence across range cells, Liu et al.17 introduce an attention mechanism to a bidirectional self-recurrent neural network to take advantage of the temporal information of HRRP, thus achieving accurate identification under time-shifting.
Although the aforementioned models based on fully connected (FC) networks have made remarkable progress in extracting effective and robust features, they are weak at capturing the structural information among range cells layer by layer, since HRRP reflects the distribution of scatterers in the target along the range dimension. In FC networks, the feature maps collected in posterior layers are just the weighted and biased summations of those in anterior layers. The connectivity of convolutional neural networks (CNNs),18 by contrast, is local and weight-shared, which makes them well suited to capturing such structural information.
(2) The proposed APR block for channel attention utilizes an MLP to adaptively learn the correlations between the channels in each conv1d layer and then enhances the structural information of targets in the range cells by layer-wise feature recalibration, thus improving the recognition rate. The APR block has low computational complexity as well.
(3) The proposed GBL-ABC for channel pruning can not only reduce the computational complexity of the 1D-CNN but also improve the recognition rate for HRRP recognition tasks by automatically searching for an optimal subnetwork within the original network, even though the pruning phase is relatively time-consuming.
(4) By introducing channel attention and channel pruning to the 1D-CNN simultaneously, the proposed method achieves superior recognition performance and has good generalizability to some extent.
The rest of this paper is organized as follows. In Section 2, we first review related works and provide preliminaries. Then the structure of the proposed model and the GBL-ABC algorithm for automatic channel pruning are presented in detail in Section 3. In Section 4, several experiments are conducted on the simulated HRRP data set to evaluate our proposed method and we report the experimental results. Finally, we draw conclusions in Section 5.

| HRRP-based target recognition
For an electrically large target, for example, the decoys and warheads in the midcourse of a ballistic missile, the target size is far larger than the wavelength of the high-resolution radar and the microwaves scatter at high frequencies,30 so we can use the scattering center model to describe the electromagnetic characteristics of the target. As shown in Figure 1, an HRRP sample consists of the amplitudes of the coherent summations of the complex returns from target scatterers in each range cell along the RLOS. Concretely, the coherent summation of the complex returns of the mth range cell in the nth HRRP sample can be described as follows:

$$s_n(m) = \sum_{i=1}^{V_m} \sigma_{m,i} \exp\!\left(j\left(-\frac{4\pi R_{m,i}(n)}{\lambda} + \varphi_{m,i}\right)\right),$$

where $V_m$ denotes the number of target scatterers in the mth range cell, $\lambda$ denotes the wavelength of the high-resolution radar, $\sigma_{m,i}$ and $\varphi_{m,i}$ represent the strength and the initial phase of the ith scatterer in the mth range cell, respectively, and $R_{m,i}(n)$ denotes the radial distance between the radar and the ith scatterer for the nth returned echo. Furthermore, for the nth returned echo with D range cells, we describe the corresponding HRRP sample, for the sake of simplicity, as follows:

$$x_n = \big[\,|s_n(1)|, |s_n(2)|, \ldots, |s_n(D)|\,\big]^{\mathrm{T}}.$$

Besides, to avoid amplitude-scale sensitivity, the amplitude of each range cell in this paper is normalized as follows:

$$\tilde{x}_n = \frac{x_n}{\|x_n\|_2}.$$

For a data-driven radar HRRP target recognition task, which is also regarded as cooperative target recognition,1,5,6,10,11,15,17,20-22,31-38 we are provided with an independently and identically distributed set of labeled HRRP samples $\mathcal{D}_{\text{train}}$ for training and a set $\mathcal{D}_{\text{test}}$ for evaluation, where $\tilde{y}_i$ is the predicted label in the evaluating stage of a test sample $x_i \in \mathcal{D}_{\text{test}}$.
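To make the scattering-center model concrete, the NumPy sketch below evaluates the reconstructed Equations (1)-(3) for one echo. The function name, the argument layout, and the l2 form of the normalization are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def hrrp_sample(sigma, phi, R, wavelength):
    """Amplitude of the coherent per-cell sums (Equations (1)-(2) sketch).

    sigma, phi, R are length-D lists; entry m holds arrays with the strength,
    initial phase, and radial distance of every scatterer in range cell m.
    """
    x = np.array([
        np.abs(np.sum(sigma[m] * np.exp(1j * (-4.0 * np.pi * R[m] / wavelength
                                              + phi[m]))))
        for m in range(len(sigma))
    ])
    # Equation (3) as reconstructed above: l2 normalization to remove
    # amplitude-scale sensitivity (assumed form)
    return x / np.linalg.norm(x)
```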

| Optimal 1D-CNN structure
Channel pruning is among the predominant approaches to reduce the computational complexity of model inference and is widely used to find an optimal structure for a deep neural network.39-41 Recent research reveals that the most important step of channel pruning is to find an optimal number of channels for each layer in a CNN.27,28 In this paper, we use a deep 1D-CNN interleaved with conv1d layers and channel attention blocks for HRRP recognition and try to find an optimal number of channels in an automatic manner.
Denote the pretrained network with L conv1d layers as $\mathcal{N}$, with structure $C = [c_1, c_2, \ldots, c_L]$ and filter set $W$, where $c_j$ is the number of channels in the jth layer, and denote $C' = [c'_1, c'_2, \ldots, c'_L]$ and $W'$ as the network structure and filter set of the pruned network $\mathcal{N}'$, respectively; obviously $c'_j \le c_j$. In this paper, the aim of pruning is to find a subnetwork $\mathcal{N}'$ with an optimal combination $C'$ trained on $\mathcal{D}_{\text{train}}$ that achieves the best accuracy on $\mathcal{D}_{\text{test}}$. To this end, we formulate the problem of pruning the 1D-CNN as follows:

$$C'^{*} = \arg\max_{C'} \ \mathrm{acc}\big(\mathcal{N}'(C', W'); \mathcal{D}_{\text{test}}\big),$$

where $\mathrm{acc}(\cdot)$ denotes the evaluating accuracy on $\mathcal{D}_{\text{test}}$ for the pruned model $\mathcal{N}'$ with structure $C'$. The evaluating accuracy is defined as the ratio of correctly classified samples to the total number of tested samples.
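As a minimal sketch of the objective above, the evaluating accuracy $\mathrm{acc}(\cdot)$ can be computed in PyTorch as follows; the helper name and the data-loader interface are assumptions.

```python
import torch

@torch.no_grad()
def acc(model, test_loader, device="cpu"):
    """Evaluating accuracy: correctly classified samples / total tested samples."""
    model.eval()
    correct = total = 0
    for x, y in test_loader:
        pred = model(x.to(device)).argmax(dim=1)   # predicted class per sample
        correct += (pred == y.to(device)).sum().item()
        total += y.numel()
    return correct / total
```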

| METHOD
The framework of the proposed method, 1D-CNN with channel attention and channel pruning (CNN1D-CACP), for HRRP recognition is illustrated in Figure 2. As mentioned above, our method includes two stages, namely the training stage and the evaluating stage. There are three procedures in the training stage, namely pretraining, pruning, and fine-tuning, which are discussed in detail in Section 3.4. In the evaluating stage, we utilize a softmax classifier for classification and evaluation.

| 1D-CNN
A deep CNN usually stacks a series of convolutional layers with nonlinear activation functions and down-sampling pooling layers so that it can extract hierarchical features with theoretically global receptive fields. Akin to this manner, we take advantage of a deep 1D-CNN to capture hierarchical features of HRRP for classification; the basic stacked block of the deep 1D-CNN, termed the conv1d block in this paper, is shown in Figure 3. The conv1d block consists of three operations, namely the conv1d, batch normalization, and pooling operations.
Suppose $X^{(l)} \in \mathbb{R}^{d^{(l)} \times c^{(l)}}$ is a multichannel input of the lth conv1d layer, where $d^{(l)}$ represents the dimensionality of the feature maps and $c^{(l)}$ denotes the number of channels in the lth layer. For the kth filter at the lth layer, $k \in \{1, 2, \ldots, c^{(l+1)}\}$, a conv1d operation with stride length $r^{(l)}$ applies a filter $W^{(l,k)} \in \mathbb{R}^{m^{(l)} \times c^{(l)}}$ as follows:

$$X^{(l+1,k)} = \delta\big(W^{(l,k)} * X^{(l)} + b^{(l,k)}\big),$$

where $*$ denotes the convolution operator, $b^{(l,k)}$ denotes the bias for the kth feature map, and $\delta(\cdot)$ is a nonlinear activation function. In this paper, the Mish42 function is applied to the 1D-CNN as the activation function, which is defined as follows:

$$\delta(x) = x \tanh\big(\varsigma(x)\big),$$

where $\varsigma(x) = \ln(1 + e^{x})$ refers to the softplus43 activation. More specifically, the value of the conv1d operation at position $i$ is given by

$$X^{(l+1,k)}_{i} = \delta\left(\operatorname{sum}\Big(W^{(l,k)} \odot X^{(l)}_{\,i r^{(l)} : i r^{(l)} + m^{(l)}}\Big) + b^{(l,k)}\right),$$

where $\odot$ refers to the Hadamard product and $m^{(l)}$ is the window size of the filters in the lth conv1d layer. Accordingly, the dimensionality of the output feature map $X^{(l+1,k)}$ after the conv1d operation is

$$d^{(l+1)} = \left\lfloor \frac{d^{(l)} + 2p^{(l)} - m^{(l)}}{r^{(l)}} \right\rfloor + 1,$$

where $p^{(l)}$ is the amount of implicit zero-padding on both sides of the input feature map. From Equation (10), we can see that all conv1d operations in each channel share the same conv1d filter, which can fuse structural information along the range dimension of HRRP samples with channel-wise information within local receptive fields.
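A minimal PyTorch sketch of the conv1d block in Figure 3 follows, assuming illustrative values for the window size, stride, padding, and pooling window; nn.Mish is available in recent PyTorch versions (otherwise x * tanh(softplus(x)) can be substituted).

```python
import torch.nn as nn

class Conv1dBlock(nn.Module):
    """conv1d -> batch normalization -> Mish -> max-pooling (Figure 3 sketch).

    The window size m, stride r, padding p, and pooling window w here are
    illustrative defaults, not the paper's configuration.
    """
    def __init__(self, in_channels, out_channels, m=3, r=1, p=1, w=2):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size=m,
                              stride=r, padding=p)
        self.bn = nn.BatchNorm1d(out_channels)
        self.act = nn.Mish()          # Mish(x) = x * tanh(softplus(x))
        self.pool = nn.MaxPool1d(w)

    def forward(self, x):             # x: (batch, c_l, d_l)
        return self.pool(self.act(self.bn(self.conv(x))))
```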

| Batch normalization
Batch normalization (BN)44 is widely deployed in deep neural networks to accelerate training by reducing internal covariate shift and to improve generalization by regulating the distribution of the inputs to each layer. We deploy batch normalization after each conv1d layer so that each layer's output has a mean of 0 and a variance of 1. Suppose the output of a conv1d layer is $X^{(k)}$ for the kth channel; we normalize each dimension as follows:

$$\hat{X}^{(k)} = \frac{X^{(k)} - \mathrm{E}\big[X^{(k)}\big]}{\sqrt{\mathrm{Var}\big[X^{(k)}\big] + \varepsilon}},$$

where the expectation and variance are calculated over the mini-batch from the training data set and $\varepsilon$ is a constant added to the mini-batch variance for numerical stability. After that, two learnable parameters, $\gamma^{(k)}$ and $\beta^{(k)}$, are introduced to scale and shift the normalized value:

$$Y^{(k)} = \gamma^{(k)} \hat{X}^{(k)} + \beta^{(k)}.$$

| Pooling
The pooling layer downsamples the feature map, reducing dimensionality while retaining useful information layer by layer; max-pooling is adopted in this paper. For the kth input feature map $X^{(l,k)}$ in the lth pooling layer, the value of the max-pooling operation at position $i$ can be calculated by

$$X^{(l+1,k)}_{i} = \max_{0 \le t < w^{(l)}} X^{(l,k)}_{\,i w^{(l)} + t},$$

where $w^{(l)}$ is the window size of the pooling operator. Therefore, the dimensionality of the output feature map $X^{(l+1,k)}$ after the pooling operation is

$$d^{(l+1)} = \left\lfloor \frac{d^{(l)}}{w^{(l)}} \right\rfloor.$$

| Channel attention
Different channels produce different feature maps in the conv1d block, and traditional deep models treat these features as being of equal importance independently. In fact, as each feature map reflects unique structural information of targets in HRRP, the channels differ in importance.45 Hence, the features should be recalibrated to improve the quality of the representations. To this end, we attach an APR block to the tail of the conv1d block, through which the network is capable of learning to make use of global information to selectively and adaptively highlight representative features and suppress less informative ones. Suppose the output of the lth conv1d block is $X^{(l+1)}$, calculated by Equation (15); for simplicity, $X^{(l+1)}$ is denoted by $\chi$. As shown in Figure 4, the APR block, which consists of three operations, namely aggregation, perception, and recalibration, is embedded into the tail of the conv1d block.

| Aggregation
The mission of the APR block is to figure out the correlations between different channels; to this end, the output feature $\chi$ of the anterior conv1d block is exploited. Since $\chi$ is generated by the conv1d block within local receptive fields, it does not carry global spatial information from outside these regions in the corresponding feature map.
To tackle this problem, we aggregate the global spatial information outside the local receptive fields by a global max-pooling operation. For the cth channel, we generate a statistic by

$$z_c = \max_{1 \le i \le d} \chi_c(i),$$

where $d$ is the dimensionality of the cth feature map $\chi_c$ and $z_c$ is an aggregated statistic expressive of the cth channel. By doing this, we acquire a global statistic, that is, the max value in this paper, of the whole feature map in the corresponding channel.

| Perception
To perceive the dependencies between different channels based on the statistics gained from the aggregation operation, we take advantage of an MLP, which can learn a nonlinear relationship between the channels. We further divide the MLP into three kinds of layers for the convenience of the parameter-setting description: a dimensionality-reduction (DR) layer with a reduction ratio $\mu$, representative enhancing (RE) layers, and a dimensionality-increasing (DI) layer. The perception operation can be described as follows:

$$u = W_{\mathrm{DI}}\,\delta\Big(W_{\mathrm{RE}}^{(\eta)}\,\delta\big(\cdots\,\delta\big(W_{\mathrm{RE}}^{(1)}\,\delta(W_{\mathrm{DR}}\,z)\big)\big)\Big),$$

where $\delta$ denotes the Mish activation function and the DI layer returns $u$ to the channel dimension of the global statistics $z$. The RE part is a series of FC layers sharing the same number of parameters to enhance the representation of the interaction between channels, and the number of FC layers in the RE part is parameterized by $\eta$. The choices of $\eta$ and $\mu$ will be discussed in Section 4.3. We deploy the Mish activation function between every two FC layers as well.

| Recalibration
To recalibrate the features extracted by each conv1d block, a simple gating mechanism with a sigmoid activation is first applied to the output of the perception operation, $V_c = \sigma(u_c)$. After that, the final output of the APR block is obtained by rescaling $\chi$:

$$\tilde{\chi}_c = F_{\mathrm{rec}}(\chi_c, V_c) = V_c \, \chi_c,$$

where $\sigma$ refers to the sigmoid activation function, $F_{\mathrm{rec}}$ refers to channel-wise multiplication, and $\tilde{\chi} = [\tilde{\chi}_1, \tilde{\chi}_2, \ldots, \tilde{\chi}_c]$. The APR block transforms the original output of the conv1d block into a set of weights that are tightly related to the different channels and can be dynamically adjusted during training. Since the recalibration of each channel depends on its correlation to the others, the APR block can be deemed a kind of self-attention mechanism. Owing to the RE layers, the APR block can learn how to recalibrate the channels by perceiving their latent relationships, thus achieving the goal of adaptive channel recalibration.
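The following PyTorch module sketches the APR block under the equations reconstructed above; the class name and the exact arrangement of the DR/RE/DI perception MLP are illustrative assumptions.

```python
import torch
import torch.nn as nn

class APRBlock(nn.Module):
    """Aggregation-perception-recalibration sketch (Figure 4): global
    max-pool aggregation, a DR/RE/DI perception MLP with reduction ratio mu
    and eta RE layers, and sigmoid-gated channel-wise recalibration."""
    def __init__(self, channels, mu=8, eta=2):
        super().__init__()
        hidden = max(channels // mu, 1)
        layers = [nn.Linear(channels, hidden), nn.Mish()]      # DR layer
        for _ in range(eta):                                   # RE layers
            layers += [nn.Linear(hidden, hidden), nn.Mish()]
        layers += [nn.Linear(hidden, channels)]                # DI layer
        self.mlp = nn.Sequential(*layers)

    def forward(self, chi):                  # chi: (batch, c, d)
        z = chi.amax(dim=2)                  # aggregation: global max-pool
        v = torch.sigmoid(self.mlp(z))       # perception + gating, V = sigma(u)
        return chi * v.unsqueeze(2)          # recalibration: channel rescale
```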

| Channel pruning
Since there are $\prod_{i=1}^{L} c_i$ candidates for the pruned network, walking through all of them can be very time-consuming; thus, shrinking the combinations should be considered to limit the structure search. At the same time, the number of network connections after pruning should be as small as possible based on the principle of Occam's razor. To that effect, we impose a constraint on Equation (7):

$$c'_i \le \lceil \alpha c_i \rceil, \quad i = 1, 2, \ldots, L,$$

where $\alpha \in (0, 100\%]$ is a pre-given maximum threshold proportion of the preserved channels in the pruned model $\mathcal{N}'$; when $\alpha c_i < 1$, we set $c'_i = 1$ to preserve one channel. To avoid randomly reinitializing the weights of $\mathcal{N}'$, which would increase computation cost, the weights of $\mathcal{N}'$ are directly inherited from $\mathcal{N}$. With the hyper-parameter $\alpha$, the number of candidates for searching can be significantly decreased to $\prod_{i=1}^{L} \lceil \alpha c_i \rceil$. Thus, the aim of pruning is transformed into finding an optimal channel number $c'_i$, not more than $\alpha c_i$ in the ith layer, that maintains or exceeds the performance of the pretrained model. Furthermore, to avoid enumerating every candidate of the pretrained network, we integrate a heuristic algorithm, GBL-ABC, to prune the pretrained network by searching for an optimal structure automatically. More evocatively, a network structure $C'$ is deemed a nectar source in GBL-ABC. The pseudocode of the proposed method for automatic channel pruning is described in Algorithm 1: counters are set to zero and $K$ nectar sources $\{C'_j\}_{j=1}^{K}$ are randomly initialized, and for each of them a subnetwork $\mathcal{N}'_j$ with filter set $W'_j$ is randomly selected according to $C'_j$ via Algorithm 2; in each cycle, a new nectar source $G'_j$ is generated via Equation (20), a subnetwork is selected from $\mathcal{N}$ according to $G'_j$ with its filter set generated via Algorithm 2, and both $C'_j$ and $G'_j$ are trained for $\Omega$ epochs to calculate their fitness via Equation (21). Algorithm 1 is further refined into four phases below.

| Initialization phase (Lines 1-2)
In this phase, a set of nectar sources that represent K candidates of the pruned structure are randomly generated subject to the constraint in Equation (19), and for each of them a subnetwork is randomly selected from $\mathcal{N}$ via Algorithm 2.

| Employed bee phase (Lines 4-15)
An employed bee hunts a new nectar source $G'_j$ for each nectar source $C'_j$. In particular, the ith element of $G'_j$ is generated from three randomly selected neighbor elements. The search equation, which originates from the differential evolution algorithm, is denoted as follows:

$$g'_{ji} = c'_{ai} + F_i \big(c'_{bi} - c'_{ci}\big),$$

where $a \ne b \ne c \ne j$ and $F_i$ is a mutation scale related to each element. In this paper, $F_i$ is independently generated by a Gaussian distribution with a mean value of 0.5 and a standard deviation of 0.1.
After producing a new nectar source, GBL-ABC applies a greedy selection principle between the nectar sources $C'_j$ and $G'_j$ according to their fitness, which is defined as below:

$$\mathrm{fit}(C'_j) = \mathrm{acc}\big(\mathcal{N}'_j(C'_j, W'_j); \mathcal{D}_{\text{test}}\big).$$

| Onlooker bee phase (Lines 16-31)

In this phase, we use the global best position of the nectar sources to act as a guide for the onlooker bees to update $\{C'_j\}_{j=1}^{K}$. Replacing Equation (20), the search equation is updated as follows:

$$g'_{ji} = gbest_{ji} + M_i \big(c'_{ai} - c'_{bi}\big),$$

where $gbest_{ji}$ is the current global best position, $c'_{ai}$ and $c'_{bi}$ are randomly selected neighbors of $c'_{ji}$ with $a \ne b \ne j$, and $M_i$ is generated by a Gaussian distribution with a mean value of 0.0 and a standard deviation of 2.0. Besides, a nectar source $C'_j$ is chosen with a probability related to its fitness via a roulette wheel selection strategy. This probability in our implementation is defined as follows:

$$p_j = \frac{\mathrm{fit}(C'_j)}{\sum_{k=1}^{K} \mathrm{fit}(C'_k)}.$$

Obviously, the better the fitness of $C'_j$, the higher the probability of $C'_j$ being selected. This makes the algorithm find an optimal nectar source automatically and progressively.

| Scout bee phase (Lines 32-38)
A scout bee prevents the algorithm from being trapped in a local optimum. If a nectar source $C'_j$ has not been updated for more than $M$ trials, the scout bee reinitializes it to generate a new nectar source.
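The sketch below outlines the GBL-ABC search over channel numbers under the reconstructed Equations (19), (20), and (22). For brevity it merges the employed and onlooker moves into one sweep and omits the roulette wheel selection, so it is an illustrative outline rather than the paper's Algorithm 1; the fitness callback is assumed to build, briefly train, and evaluate the subnetwork.

```python
import random

def gbl_abc(C, alpha, K, T, M, fitness):
    """Heuristic channel-number search (Algorithm 1 outline, K >= 4).

    C: pretrained channel numbers per layer; fitness(C_prime) returns the
    evaluating accuracy of the subnetwork with structure C_prime.
    """
    L = len(C)
    upper = [max(int(alpha * c), 1) for c in C]          # Equation (19) bound
    sources = [[random.randint(1, u) for u in upper] for _ in range(K)]
    fits = [fitness(s) for s in sources]
    trials = [0] * K
    clip = lambda v, u: min(max(int(round(v)), 1), u)
    for _ in range(T):
        gbest = sources[max(range(K), key=lambda j: fits[j])]
        for j in range(K):
            a, b, c = random.sample([k for k in range(K) if k != j], 3)
            # employed bee move, Equation (20): F_i ~ N(0.5, 0.1)
            cand = [clip(sources[a][i] + random.gauss(0.5, 0.1)
                         * (sources[b][i] - sources[c][i]), upper[i])
                    for i in range(L)]
            # onlooker bee move, Equation (22): M_i ~ N(0.0, 2.0)
            cand2 = [clip(gbest[i] + random.gauss(0.0, 2.0)
                          * (sources[a][i] - sources[b][i]), upper[i])
                     for i in range(L)]
            for g in (cand, cand2):                      # greedy selection
                f = fitness(g)
                if f > fits[j]:
                    sources[j], fits[j], trials[j] = g, f, 0
                else:
                    trials[j] += 1
            if trials[j] > M:                            # scout bee phase
                sources[j] = [random.randint(1, u) for u in upper]
                fits[j], trials[j] = fitness(sources[j]), 0
    best = max(range(K), key=lambda j: fits[j])
    return sources[best], fits[best]
```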

| Construction of the proposed model
As shown in Figure 3, the proposed method CNN1D-CACP uses four stacked conv1d-APR blocks to extract hierarchical features of HRRP samples, that is, L = 4. Afterward, a Flatten operation is applied to convert the multidimensional tensor of hierarchical features into a 1D vector suitable as input to the following softmax classifier. Given an input HRRP sample $x_i$, the softmax function is adopted to predict the label vector:

$$\hat{y}_i = \mathrm{softmax}\big(f(x_i; \theta)\big),$$

where $\theta = [\theta_1, \theta_2, \ldots, \theta_L]$ represents all the parameters to be trained in the network. In the evaluating stage, we use Equation (6) for predictions.
As for the training stage, we adopt the cross-entropy loss as the cost function:

$$\mathcal{L}(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{s=1}^{S} \mathbf{1}\{y_i = s\} \log \hat{y}_{i,s},$$

where $m$ is the number of samples in each mini-batch and $\mathbf{1}\{\cdot\}$ is an indicator function with $\mathbf{1}\{\text{true}\} = 1$ and $\mathbf{1}\{\text{false}\} = 0$.
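Putting the pieces together, a hedged sketch of the overall network and its training objective might look as follows, reusing the Conv1dBlock and APRBlock sketches above. The default channel numbers and the input length are illustrative, the spatial length is assumed to halve at each max-pool, and the softmax is folded into PyTorch's CrossEntropyLoss.

```python
import torch.nn as nn

class CNN1D_CACP(nn.Module):
    """Four stacked conv1d-APR blocks + Flatten + linear classifier (sketch).

    channels and d_in are illustrative; in the paper they come from the
    configuration C and the dimensionality recursions above.
    """
    def __init__(self, channels=(32, 64, 128, 256), num_classes=5, d_in=256):
        super().__init__()
        blocks, c_prev, d = [], 1, d_in
        for c in channels:
            blocks += [Conv1dBlock(c_prev, c), APRBlock(c)]
            c_prev, d = c, d // 2            # halved by each max-pool (w = 2)
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Linear(c_prev * d, num_classes)

    def forward(self, x):                    # x: (batch, 1, D)
        h = self.features(x).flatten(1)      # Flatten operation
        return self.classifier(h)            # logits; softmax applied in loss

# cross-entropy objective; CrossEntropyLoss applies log-softmax internally
criterion = nn.CrossEntropyLoss()
```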
As shown in Figure 2, the training stage is divided into three phases:

| Pretraining phase
We first initialize the network $\mathcal{N}$ by Kaiming initialization46 and then train it for several epochs before conducting pruning. The RAdam47 optimizer, a state-of-the-art mini-batch stochastic gradient descent algorithm, is incorporated for training according to Equation (5).

| Pruning phase
Note that the fitness used by the GBL-ABC algorithm refers to the evaluating accuracy on the test data set; we therefore also train each pruned model $\mathcal{N}'$ for several epochs with RAdam to calculate its fitness. As elaborated in Algorithm 2, to avoid training subnetworks from scratch, we randomly pick $c'_j$ filters from the pretrained model $\mathcal{N}$, which serve as the initialization of the jth conv1d layer of the subnetwork $\mathcal{N}'$.

| Fine-tuning phase
To achieve better accuracy, we lastly fine-tune the pruned model for more epochs, using RAdam as well.
As mentioned above, all pruned networks in the pruning phase are trained for $\Omega$ epochs to calculate their fitness via Equation (21), so we can estimate a lower bound on the extra training epochs during the pruning phase to analyze the complexity of GBL-ABC. Since each nectar source represents a pruned network, the extra training epochs in the initialization phase, the employed bee phase, and the onlooker bee phase are each $K \times \Omega$ during one cycle of the pruning phase; the extra training epochs in the scout bee phase are hard to calculate precisely. As a whole, over $T$ cycles we should train equivalently for at least $T \times 3 \times K \times \Omega$ extra epochs during the pruning phase, which is relatively time-consuming. For deep learning models, however, reducing the computational complexity of the model to save inference time in the evaluating stage is more important. As we introduced the hyper-parameter $\alpha$ to control the maximum number of channels of the pruned networks, the model parameters can be reduced according to the lottery ticket hypothesis, thus reducing the computational complexity of the model.

| Measured data
We use the electromagnetic simulation software FEKO to simulate radar echoes from five midcourse ballistic targets, that is, S = 5; the physical characteristics of these targets are shown in Figure 5. Based on high-frequency asymptotic theory, the physical optics method is adopted for the simulation.48 The FEKO simulation parameters are set as follows: the azimuth angle is 0-180° with an angle step of 0.05°, the pitch angle is 0°, the center frequency is 10 GHz with a start frequency of 9.5 GHz and an end frequency of 10.5 GHz, and the number of frequency sampling points is 128. Finally, the default optimal mesh size and horizontal polarization are adopted.
Finally, 18,005 samples are obtained through processing by inverse fast Fourier transform,49 where each target contributes 3601 HRRP samples at different azimuth angles and each sample is a 256-dimensional vector, that is, D = 256. We randomly split these samples into two data sets: a testing data set with 3601 samples and a training data set with 14,404 samples, each containing the same number of samples per target. The HRRP simulation results of the different targets at an azimuth of 90° are illustrated in Figure 6.
Except for ML-ELM and SAE-ELM, all neural networks mentioned above are trained to minimize the cross-entropy loss for 200 epochs, and the softmax classifier is adopted in these networks for classification as well. We use RAdam to minimize the cross-entropy loss with 32 samples in each mini-batch, a learning rate of 10−3, and standard hyper-parameter values. All models are implemented with the deep learning framework PyTorch55 1.15.1 and the programming language Python 3.7 on an x86-64 Fedora PC with an Intel Core i5-9600K CPU @ 3.7 GHz, 24 GB RAM, and an NVIDIA GTX 1060 GPU with CUDA 10.2 accelerated computation.

| Impact of model hyper-parameters
In this section, we conduct a series of experiments to analyze the influence of the hyper-parameters on performance in order to choose suitable network settings for the proposed method.
The performance metrics for measuring network compression in this paper include the total channel number, the floating-point operations (FLOPs), and the parameter number of the model. The average evaluating accuracy (Acc) of the different models is also provided. We provide ablation studies for the following hyper-parameters:

| The influence of C
To evaluate the efficiency of the proposed method in different 1D-CNN model configurations, the channel number of the pretrained model in each conv1d layer, represented by $C = [c_1, c_2, c_3, c_4]$, is considered. More specifically, we take into account five model configurations, starting from $C_1 = [20, 40, 80, 160]$. Table 1 depicts both the evaluating accuracy and channel number of the models in the different architectures and configurations. As shown in Table 1, channel pruning and channel attention each consistently improve performance across the model configurations, and we see a further increase in accuracy when channel pruning and channel attention are combined simultaneously, which proves the effectiveness of the proposed method.
Moreover, we plot the training and validation curves of the five architectures with the model configuration $C_5$ in Figure 7. From Figure 7, we can see that CNN1D-CA, CNN1D-CP, and CNN1D-CACP consistently surpass CNN1D during training on both the training and testing data sets, and CNN1D-CACP shows the best performance. Since the channel number has been reduced after pruning, this provides evidence that GBL-ABC can remove the redundancies that lead to overfitting. Figure 8 shows the Giga FLOPs (GFLOPs) and the parameter number of the models in the different architectures and configurations. From Figure 8, we have two observations. On one hand, compared with the vanilla CNN1D, the GFLOPs and parameter numbers increase only slightly in CNN1D-CA, providing evidence that the APR block proposed in this paper has a small additional computational cost. On the other hand, compared with the vanilla CNN1D and CNN1D-CA, the GFLOPs and parameter numbers of CNN1D-CP and CNN1D-CACP decrease more obviously after pruning as the channel number of the pretrained model increases, proving that a model with more channels has more redundancies and that the proposed method can automatically find and remove the pointless parameters. Based on the phenomena above, we also find that the proposed model CNN1D-CACP is capable of consistently improving the evaluating accuracy of the vanilla CNN1D while reducing the computational complexity under different network configurations, with the help of the channel attention mechanism to recalibrate the learned features and the channel pruning mechanism to find an optimal network structure. To some extent, this proves that our solution has good generalizability across different network configurations.
Since CNN1D-CACP achieves the best evaluating accuracy with the configuration $C_5$, we use $C_5$ as the default model configuration for all experiments. Under the configuration $C_5$, CNN1D-CACP significantly reduces the channel number, the computational complexity, and the parameter number to 699/1500 ≈ 46.6%, 9.35/21.89 ≈ 42.71%, and 0.58/2.11 ≈ 27.49% of those of CNN1D, respectively, even though it has been equivalently trained for at least $T \times 3 \times K \times \Omega = 20 \times 3 \times 10 \times 2 = 1200$ extra epochs in the pruning phase.

| The influence of η
With the model configuration $C_5$, we analyze the influence of the introduced hyper-parameter $\eta$, which represents the number of FC layers in the RE part. We conduct experiments for a range of different $\eta$ values, and the results are shown in Table 2. As can be seen from Table 2, the FLOPs and the parameter number increase along with $\eta$, and the nonpruned models achieve the best accuracy when $\eta = 0$ and $\eta = 2$. In the pruned models, the accuracy first increases and then decreases, achieving its best when $\eta = 2$. Since the pruned models all have lower computational complexity than the nonpruned ones, we focus on the best accuracy of the pruned model and therefore set $\eta = 2$ in this paper.

| The influence of μ

Table 3 depicts the accuracy and model complexity for different values of $\mu$, which represents the reduction ratio of the APR block. From Table 3, it is clear that a larger $\mu$ leads to increased computational complexity and slightly decreased accuracy of the nonpruned models. The proposed method can automatically find the best pruned structure as well, and setting $\mu = 8$ achieves the best accuracy of the pruned models in this paper. Therefore, we set $\mu = 8$ by default.

| The influence of α
The maximum threshold proportion $\alpha$ was introduced in Equation (19) as a hyper-parameter that allows us to constrain the proportion of preserved channels during channel pruning. To investigate its influence, we conduct a series of experiments with the default settings mentioned above for $\alpha$ varying from 10% to 100%; the comparison is shown in Figure 9. From Figure 9, we can see that a larger $\alpha$ leads to less reduction of computational complexity, and there are small fluctuations of the evaluating accuracy as $\alpha$ increases. When $\alpha$ = 70%, the model achieves the best accuracy, and we use this value of $\alpha$ in the following experiments. For a deeper analysis of channel pruning, we display the block-wise pruning results of CNN1D-CP (pruned from CNN1D) and CNN1D-CACP (pruned from CNN1D-CA) in Figure 10. As shown in Figure 10, the pruning rates differ across blocks even though the max ratio of preserved channels in each conv1d layer is set to the same value of 70%, providing evidence that the proposed method can automatically find the best structure to promote evaluating accuracy.

| Recognition performance
In this section, the proposed method is compared with the other deep neural methods mentioned in Section 4.2, and the comparison results are depicted in Table 4. As can be seen from Table 4, all deep neural networks recognize all samples of decoy 1 in the test data set correctly, but the CNNs achieve superior recognition rates on the other four targets. Since CNNs can capture the structural information among range cells better than fully connected networks, the CNNs also outperform the fully connected networks in total accuracy. In contrast to the vanilla CNN1D, the embedded APR block for channel attention (CNN1D-CA) mainly improves the recognition rate of the warhead target, while the improvement from GBL-ABC channel pruning (CNN1D-CP) mainly comes from the improved recognition rates of decoy 2 and the warhead. By combining these two techniques simultaneously (CNN1D-CACP), the recognition rates of three targets, namely decoy 2, decoy 4, and the warhead, are all improved. Although the two techniques reduce the recognition rate of decoy 3, it remains higher than 99%. We can explain the improvements in two aspects. On one hand, the channel attention technique can recalibrate the features extracted by different channels to focus on the information most relevant to different targets. On the other hand, the channel pruning technique can remove the redundant channels that have obviously negative impacts on extracting discriminative information related to different targets; in other words, it can search for a better network structure. Moreover, the two techniques proposed in this paper mutually reinforce each other. As a whole, they greatly enhance the ability of feature extraction across channels.

| Feature visualization
To investigate the discriminative properties of the features extracted by the different methods, we utilize the t-SNE algorithm to map the extracted features into two dimensions. Figure 11 compares the separability of the original test data set, SDAE, SAE-ELM, Bi-GRU, Bi-LSTM, and the proposed CNN1D-CACP. As can be seen from Figure 11, decoy 1 always maintains a discriminative distance from the other targets in both the original data and all compared deep neural networks; the deep neural networks chiefly improve the separability of the other four targets.
What's more, CNN1D-CACP produces a more discriminative feature space than the other methods, since it is capable of searching for better CNN structures and enhancing the effective relations between channels.
To provide an assessment of the APR blocks in CNN1D-CACP, we study the activations $V_c = \sigma(u_c)$, which are used to recalibrate the features extracted by the conv1d blocks in Equation (18). We use all samples in the test data set and examine the distribution of their activations to study the function of this attention mechanism. In Figure 12, we plot the average activations of different targets at various depths, and kernel density estimation (KDE) is adopted to calculate their distribution density. From Figure 12, we can see that the activation distributions of the different targets in the first APR block are approximately the same, but they differ considerably in the following blocks. To explain, the earlier layers learn the general features of different targets, while the later layers learn features that are more specific to each class. Furthermore, since different samples have different activations, the APR block has the merit of automatically finding the relationships between channels and adaptively enhancing the learned structural features of HRRP.

| Effect of noise
In this section, we compare CNN1D-CACP with the other methods under different signal-to-noise ratios (SNRs) to evaluate noise robustness and generalizability.15 The SNR (dB) is defined as follows:

$$\mathrm{SNR} = 10 \log_{10} \frac{\frac{1}{L}\sum_{l=1}^{L} P_l}{P_{\mathrm{Noise}}},$$

where $L$ refers to the number of range cells (here L = 256), and $P_l$ and $P_{\mathrm{Noise}}$ denote the power of the original signal in the lth range cell and the power of the noise in each range cell, respectively. We add simulated white noise to all the samples in the testing data set with SNR varying from 10 dB to 40 dB. Figure 13 shows the average recognition rate of the methods versus SNR. As can be seen, CNN1D-CACP achieves consistently higher recognition rates than the other methods, proving that the proposed method is more robust to noise in measured HRRP data. This is because the proposed CNN1D-CACP can efficiently extract the structural discriminative features along the range dimension in HRRP while the fully connected networks cannot. Besides, since all deep neural networks in Figure 13 are trained on the noise-free training data set and tested on noisy testing data sets with different SNRs, this also proves that CNN1D-CACP has better generalizability to noisy data sets to some extent.
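Under the SNR definition reconstructed above, noise at a target SNR can be added as in the following sketch; the function name is illustrative.

```python
import numpy as np

def add_noise(x, snr_db):
    """Add white Gaussian noise to an HRRP sample at a target SNR (dB):
    mean signal power per range cell over noise power per range cell."""
    signal_power = np.mean(x ** 2)                        # (1/L) * sum_l P_l
    noise_power = signal_power / (10 ** (snr_db / 10))    # P_Noise from SNR
    noise = np.random.randn(*x.shape) * np.sqrt(noise_power)
    return x + noise
```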

| CONCLUSIONS
This paper presents a novel approach, named CNN1D-CACP, for HRRP-based target recognition. We use a deep neural network interleaved with conv1d, batch normalization, and max-pooling layers to extract hierarchical features of HRRP. To enhance the structural information of targets extracted from HRRP, we design an attention block with low computational complexity to recalibrate the learned features adaptively. By introducing the GBL-ABC algorithm for network pruning, the proposed method can automatically find a subnetwork with lower computational complexity and a higher recognition rate than the original network. Extensive experimental results on measured data demonstrate that the proposed model shows superior performance in classification accuracy, separability in the feature space, and robustness to noise compared with the other deep neural networks. It is noteworthy that although the proposed method has the merits mentioned above, further studies are required to reduce the computational complexity of the pruning phase.