High‐Resolution Range Profile Sequence Recognition Based on Transformer with Temporal–Spatial Fusion and Label Smoothing

Radar high-resolution range profile (HRRP) is widely used in radar automatic target recognition due to its advantages of easy availability, convenient processing, and small storage requirements. Current recognition methods for HRRP sequences focus mainly on temporal information and therefore cannot fully utilize the temporal and spatial information contained in HRRP sequences. Moreover, most of these methods fail at long-range modeling and global information extraction of HRRP sequences. To solve the above problems, an HRRP sequence recognition method based on a transformer with temporal-spatial fusion and label smoothing (TSF-transformer-LS) is proposed. TSF-transformer-LS contains temporal transformer blocks and spatial transformer blocks, which extract deep global features of HRRP sequences in the time domain and the space domain, respectively. An attention fusion mechanism is then developed to realize the adaptive fusion of temporal and spatial features. Moreover, label smoothing is used to add noise to sample labels, which alleviates the overfitting of the transformer caused by the large amount of noise hidden in HRRPs in real scenes. Experiments on the standard MSTAR dataset show that the proposed method outperforms other methods in recognition performance. Furthermore, the effectiveness and interpretability of the method are explored.


Introduction
Radar automatic target recognition (RATR) is of great importance to applications in the fields of national air defense, aerospace, and target detection and surveillance. [1] Radar HRRP has become a hotspot for experts and scholars developing radar automatic target recognition technologies due to its rich information on target structure characteristics, easy availability, convenient processing, and small storage requirements. [2-11] When the radar moves relative to the target, it acquires echo information from multiple azimuths, and the HRRPs of consecutive azimuths constitute the HRRP sequence. As the azimuth changes, the HRRP sequence shows a certain tendency of temporal pattern change. Consequently, the HRRP sequence contains the temporal information of the HRRPs between consecutive azimuths, and the dynamic temporal features of the target can be effectively extracted through correlation modeling of the HRRP sequence. Moreover, each HRRP in the sequence also contains rich spatial structure information in the adjacent range cells of the radar echo. By modeling the correlation between range cells, the spatial structure features of the target can be effectively extracted. However, HRRP sequence recognition is currently carried out mainly on the basis of temporal information alone, without considering the spatial information between range cells; temporal-spatial fusion information is rarely used to improve recognition performance.
To make full use of the temporal information in an HRRP sequence and the spatial information between HRRP range cells, our study focuses on feature extraction of HRRP sequences. Both the temporal information in an HRRP sequence and the spatial information of HRRP range cells are considered simultaneously to comprehensively extract the temporal-spatial features contained in HRRP sequences.
HRRP sequence recognition is a special class of multivariate time sequence classification problem that aims to classify HRRP sequences. Moreover, since HRRPs in real scenes contain large noisy regions, it is important to suppress the adverse effects of these noisy regions and to enhance the ability to perceive and extract global information, which helps to exploit the highly distinguishable information in the target region. With the development of deep learning technologies, many deep learning networks have been applied to HRRP sequence recognition, such as the convolutional neural network (CNN) [12,13] and the recurrent neural network (RNN). [14,15] CNN can effectively extract the local correlation in an HRRP sequence through convolutional kernels with shared parameters, but the size of the convolutional kernels restricts the receptive field of CNN and its ability to extract global information. RNN processes HRRP sequence data serially through a chain structure and can effectively extract the temporal information of adjacent HRRPs thanks to the memory of recurrent neural network cells. However, the memory capacity of RNN cells is limited, which leads to the loss of important information when processing a long sequence and negatively affects the extraction of global information. Transformer [25,26] can use the multihead self-attention mechanism to adaptively extract multiple long-range relationships in sequence data: by calculating the relevance between sequence elements, it dynamically adjusts its attention, increases the weight of the important key features so that key areas receive more attention, suppresses the adverse effects of redundant sequence data, and thus significantly improves the ability to extract the global information of the sequence.
To fully utilize the temporal-spatial fusion information of HRRP sequences, enhance the ability to extract global information, and effectively extract highly distinguishable features of the target region, we propose a temporal-spatial fusion transformer with label smoothing (TSF-transformer-LS) method for HRRP sequence recognition. The TSF-transformer-LS method models the long-range temporal information in an HRRP sequence using the temporal transformer and represents the long-range spatial information between HRRP range cells using the spatial transformer. Furthermore, we put forward an attention fusion mechanism to realize the adaptive fusion of deep global features in time and space, so that target recognition is accomplished with robust temporal-spatial fusion features. Moreover, to solve the overfitting problem of the transformer in the high-noise environment of real scenes, we adopt label smoothing to add label noise and enhance the generalization performance of the transformer. In this article, we use MSTAR, the standard dataset of RATR, and its variant dataset as benchmarks to verify the recognition performance of the proposed method in real scenes. In addition, we compare the proposed method with other remarkable baseline methods to highlight its recognition performance and robustness. In particular, by visualizing the attention maps of the multihead attention, which consists of several self-attention modules, we demonstrate the effectiveness and interpretability of the self-attention mechanism applied to HRRP sequence recognition. This article introduces the transformer with temporal-spatial fusion to HRRP sequence recognition, which can facilitate future research.
The contributions of this article are as follows.
1) To enhance the recognition performance of HRRP sequences, we propose a recognition method based on TSF-transformer-LS that fully utilizes the deep global information of HRRP sequences in the time domain and the space domain, achieving better recognition accuracy and robustness than other remarkable methods.
2) For more efficient fusion of temporal and spatial features, a novel attention fusion mechanism is proposed, which adaptively and dynamically assigns attention weights to temporal and spatial features based on the data to achieve more efficient feature fusion.
3) The interpretability of applying the multihead attention mechanism to HRRP sequence recognition and its effectiveness for temporal-spatial feature extraction are verified by visualizing the temporal-spatial attention maps and feature extraction maps.
The remaining sections of this article are organized as follows. In Section 2, current work related to HRRP sequence recognition and applications of the transformer is briefly introduced. In Section 3, we introduce the architecture and the important components of the proposed method. In Section 4, the advantages, effectiveness, and rationality of the proposed method are analyzed on the basis of the relevant validation experiments. In Section 5, we summarize the work of this article and indicate directions for future research.

Related Work
Since traditional HRRP recognition methods rely on manual feature extraction, researchers in the past devoted themselves to improving recognition accuracy by mining highly discriminable manual features. For example, Zhang et al. [27] proposed a recognition method based on bispectral feature vectors, which selects the bispectrum with the maximum separability between categories as the feature vector of the signal, avoiding the undesirable effects caused by harmful bispectra. Du et al. [28] proposed a recognition method based on a double-distribution composite statistical model, which divides range cells into three statistical categories according to the number of dominant scattering points in the range cells of the scattering center model and then models the echoes of the different categories of range cells with the corresponding distribution forms to accomplish recognition tasks. Molchanov et al. [29] proposed a recognition method based on micro-Doppler bicoherence features, which extracts cepstral coefficients from the micro-Doppler contributions in radar echoes and computes the classification features using bicoherence estimation. Timothy et al. [30] proposed a recognition method based on the hidden Markov model (HMM), which extracts six power spectrum features from high-resolution (HRR) radar signal amplitude versus target distance profiles using an HMM. In summary, manual feature extraction requires extensive a priori knowledge as support and relies strongly on the subjective factors of the researchers. It therefore has a limited ability to represent features, resulting in low recognition accuracy.
To overcome the limitations of manual feature extraction, researchers introduced machine learning approaches to HRRP recognition. Lei et al. [31] proposed a recognition method based on support vector machines. Different classifier confidence levels were defined according to the distance between classifiers given by the confusion matrix; the values and posterior probabilities of the support vector machines were then integrated into basic probability assignments to obtain a recognition method that combines support vector machines with evidence theory. Li et al. [32] proposed an extreme learning machine (ELM)-based recognition method that introduces the L21 norm to reduce the undesirable effects of noisy data points and outliers, making the ELM model more stable. Machine learning methods achieve automatic feature extraction, but the above methods extract features by considering the HRRP as a whole, which ignores the correlation and temporal information between HRRPs and results in information loss. Moreover, the recognition performance still needs to be improved because these methods cannot extract deep features.
With the development of deep learning techniques, they are widely used for classification, forecasting, and clustering. [33,34] In particular, CNNs and recurrent neural networks are widely used in HRRP recognition. For example, Xiang et al. [5] proposed a recognition method based on a 1D CNN, which extracts valid structure information of targets in the HRRP through the 1D CNN, and introduced an aggregation-perception-recalibration module for feature enhancement. A 1D CNN can effectively extract the local correlation of HRRPs but ignores the temporal information in an HRRP sequence. Zhang et al. [35] proposed a stacked long short-term memory (LSTM) network based on the attention mechanism for HRRP sequence recognition, using the stacked LSTM to extract deep temporal features of an HRRP sequence and assigning attention weights to the temporal features at different moments. However, for long sequences, an RNN will lose important target features because of memory loss, and the original important information may be lost when the network is stacked deeply. To make full use of the multifeature information of HRRP sequences, researchers extract more stable and robust fusion features and work on more efficient fusion decision-making methods. [36,37] Zeng et al. [7] proposed a recognition method based on a multi-input convolutional gated recurrent unit (GRU), considering both time-dependent and multidomain features of HRRPs to further enhance the target representation. Wan et al. [38] proposed a recognition method based on a CNN and a bidirectional RNN (BiRNN), using the CNN to mine the spatial correlation of HRRP data and then the BiRNN to mine the temporal correlation, which makes full use of the temporal and spatial features to accomplish recognition. Feng et al.
[39] proposed a multichannel spatio-temporal feature fusion method, which fuses spatial features extracted by CNNs with temporal features extracted by recurrent neural networks. Fusion performance is optimized by an attention module, and superior recognition performance is achieved. Unfortunately, a CNN limited by the size of its convolutional kernels can only effectively extract local information and has limitations in global feature extraction for long sequences. In addition, HRRP sequences in real scenes contain a lot of noise, under which local information impairs the generalization of the model. With the birth of the transformer framework, the long-range information of sequences can be modeled by self-attention modules. Different weights are given to the sequence adaptively by calculating the correlation within an HRRP sequence, which effectively extracts the global information of the sequence and pays more attention to the important information in the target region. Xu et al. [40] proposed a spatial-temporal transformer-based traffic forecasting method that combines temporal and spatial features to improve the accuracy of long-term traffic prediction using long-range spatial and temporal information, achieving excellent performance in the field of traffic forecasting. Chen et al. [41] utilized the advantages of the Swin transformer in feature extraction, integrated unmixing theory into an attention-based model to make full use of the temporal and spatial information of the image, and verified the effectiveness and superiority of the method. Wang et al.
[42] proposed a CNN bidirectional encoder representations from transformers (BERT) method for HRRP recognition, which represents the local spatial structure of the target through a convolutional module and captures the long-term temporal dependence within the HRRP using the multihead self-attention mechanism of the BERT module. In addition, a cost function considering both recognition and rejection ability is designed, and the validity of the method is verified through experiments. Therefore, the transformer can be effectively utilized to extract temporal and spatial features, and the advantage of temporal-spatial fusion features can achieve feature complementarity and extract higher resolution detail features.
Inspired by transformer-related work, we propose a method named TSF-transformer-LS to simultaneously extract the long-range correlations in the time domain and the space domain of HRRP sequences, which effectively enhances the ability to extract important information from the target region and suppresses the undesirable effects of noisy regions, improving recognition performance through temporal-spatial fusion features. In addition, label smoothing is adopted to add label noise and enhance the generalization performance of the transformer. In particular, the residual connections in the transformer avoid the information loss of deep networks. Experiments on MSTAR show that the proposed method achieves higher recognition performance and better robustness.

Proposed Method
In this section, the overall structure of the TSF-transformer-LS and the working principle of important modules are introduced.

TSF-Transformer-LS Architecture
The overall structure of the proposed TSF-transformer-LS consists of three parts, namely, the temporal transformer module, the spatial transformer module, and the temporal-spatial fusion and recognition module. The temporal transformer module (left) and the spatial transformer module (right) are shown in the left part of Figure 1, and the temporal-spatial fusion and recognition module is located above the two transformer modules. We use only the encoder of the transformer framework for feature extraction to adapt it to HRRP sequence recognition; both the temporal transformer module and the spatial transformer module consist of stacked encoders. We extract deep global features in the time and space domains, respectively, through the stacked encoders of the temporal transformer module and the spatial transformer module and then use the temporal and spatial features as input for feature fusion and recognition.
We assume the sequence X = [x_1, x_2, …, x_n]^T is used as input, where the dimension of the sequence is n × c, n is the length of the HRRP sequence, and c is the channel dimension of each element in the sequence. First, the elements in the sequence are mapped from c dimensions to d_model dimensions through a high-dimensional linear mapping as

Y_token = X W_e

where Y_token is the sequence after linear mapping, whose channel dimension is d_model, and W_e ∈ ℝ^(c × d_model) is the weight matrix of the linear mapping. Through the high-dimensional linear mapping, information can be aggregated into a high-dimensional representation, which facilitates feature extraction by the self-attention mechanism. In addition, position encoding adds a position information matrix X_p ∈ ℝ^(n × d_model), which is learned during training, to Y_token, so that we obtain the high-dimensional sequence Y = Y_token + X_p with position information.
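The embedding step above can be sketched numerically. This is a minimal illustration, not the authors' implementation: the sizes n = 32, c = 128, and d_model = 64 are example values, random weights stand in for the learned W_e, and the learnable position matrix is zero-initialized here.

```python
import numpy as np

rng = np.random.default_rng(0)

n, c, d_model = 32, 128, 64          # sequence length, channel dim, model dim (example sizes)

X = rng.standard_normal((n, c))      # input HRRP sequence, one row per HRRP
W_e = rng.standard_normal((c, d_model)) * 0.02   # linear-mapping weight matrix
X_p = np.zeros((n, d_model))         # learnable position matrix (zero-initialized here)

Y_token = X @ W_e                    # high-dimensional token representation
Y = Y_token + X_p                    # add position information

print(Y.shape)                       # (32, 64)
```

During training, X_p would be updated by backpropagation together with W_e; the sketch only shows the forward shapes.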
Then, Y is used as the input of the encoders of the transformer module, and the stacked encoders extract deep global features of the sequence data. In Figure 1, N encoders are stacked for feature extraction and output the deep global features O. The temporal transformer outputs temporal global features O_T, while the spatial transformer outputs spatial global features O_S. After that, we propose an attention fusion mechanism to realize the fusion of the temporal features O_T and the spatial features O_S. The temporal and spatial features are concatenated, mapped by a fully connected layer, and activated by softmax to obtain the weights, which depend on the outputs; therefore, in the test phase the weights are still dynamically adjusted with the samples. The weights of the temporal and spatial features are calculated as

A_F = softmax(FC([O_T, O_S]))

To achieve the adaptive fusion of the temporal and spatial features, the temporal and spatial features are weighted with the attention weights A_F and concatenated, so that the temporal-spatial fusion features are obtained as

O_Fusion = [A_F(1) · O_T, A_F(2) · O_S]

The temporal-spatial fusion features O_Fusion contain rich temporal and spatial features with higher discriminability; finally, the temporal-spatial fusion features are used as input to the softmax classifier for classification.
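The attention fusion mechanism can be sketched as follows. This is our reading of the description (one scalar weight per branch, followed by weighted concatenation); the feature dimension d = 64, the random fully connected weights W_f, and the pooled per-branch feature vectors are all illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
d = 64
O_T = rng.standard_normal(d)                   # temporal global feature vector
O_S = rng.standard_normal(d)                   # spatial global feature vector

W_f = rng.standard_normal((2 * d, 2)) * 0.02   # fully connected layer producing two scores
A_F = softmax(np.concatenate([O_T, O_S]) @ W_f)  # attention weights for the two branches

# weight each branch with its attention weight, then concatenate into the fusion feature
O_fusion = np.concatenate([A_F[0] * O_T, A_F[1] * O_S])

print(A_F)                # two weights that sum to 1
print(O_fusion.shape)     # (128,)
```

Because the weights come from a softmax over sample-dependent scores, they adapt to each test sample rather than being fixed after training, matching the text above.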

Encoder Architecture
The encoder is the key component of the transformer module; it consists of a multihead attention module, a feed forward network (FFN), layer normalization, and residual connections.
The multihead attention is shown schematically in the right part of Figure 1. For single-head attention, the high-dimensional sequence Y = [y_1, y_2, …, y_n] with embedded position information is linearly mapped to obtain a query vector Q, a key vector K, and a value vector V, where d_h denotes the dimension of attention head h. The attention weight matrix A ∈ ℝ^(n × n) between sequence elements is calculated from Q, K, and V as

A = softmax(Q K^T / √d_h)

We obtain the weighted sequence at the output of the attention module by weighting the value vectors with the attention weight matrix, so the output of single-head attention is

head_h = A V

The multihead attention module processes the input sequence with H self-attention modules and concatenates the individual outputs as

MH(X) = [head_1, …, head_H] W_MH

where the weight matrix is W_MH ∈ ℝ^(H·d_h × d_model), and MH(X) is the output of the multihead attention module. The multihead attention module splits the sequence of high-dimensional vectors embedded with position information into query vectors, key vectors, and value vectors by linear mapping. Then, the correlation between the sequence elements is calculated from the query and key vectors to obtain the attention weight matrix. After that, the value vectors are weighted with the attention matrix to obtain the output of each single-head attention module; finally, the outputs of the single-head attention modules are merged to obtain the output of the multihead attention module.
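The single-head and multihead computations above can be sketched with standard scaled dot-product attention. The sizes (n = 32, d_model = 64, H = 4, d_h = 16) and the random projection matrices are illustrative stand-ins for learned parameters.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def single_head(Y, W_q, W_k, W_v):
    Q, K, V = Y @ W_q, Y @ W_k, Y @ W_v
    d_h = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_h))   # n x n attention weight matrix, rows sum to 1
    return A @ V                          # value vectors weighted by attention

def multi_head(Y, heads, W_mh):
    outs = [single_head(Y, *w) for w in heads]     # run H heads independently
    return np.concatenate(outs, axis=-1) @ W_mh    # concatenate and re-project

rng = np.random.default_rng(2)
n, d_model, H, d_h = 32, 64, 4, 16
Y = rng.standard_normal((n, d_model))
heads = [tuple(rng.standard_normal((d_model, d_h)) * 0.1 for _ in range(3))
         for _ in range(H)]                        # (W_q, W_k, W_v) per head
W_mh = rng.standard_normal((H * d_h, d_model)) * 0.1

out = multi_head(Y, heads, W_mh)
print(out.shape)   # (32, 64)
```

Each row of A is a probability distribution over all n HRRPs, which is what allows every position to attend to the whole sequence in one step.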
After being processed by the residual connection and layer normalization, the output and input of the multihead attention module are fused as the input of the FFN. In general, the FFN contains three layers: the first layer maps the input to a high-dimensional space, the second layer uses a nonlinear activation to enhance the nonlinear representation of the features, and the third layer performs the dimensionality reduction; the process is

FFN(x) = σ(x W_1) W_2

where σ(·) is the nonlinear activation function. Residual connections and layer normalization also follow the FFN, so the output of the encoder is calculated as

Z_l = LN(Y + MH(Y)),  O_l = LN(Z_l + FFN(Z_l))

where O_l is the output of encoder l, W_1 and W_2 are the weight matrices of the two dimensional-transformation layers, respectively, and LN(·) is the layer normalization function. The stacked encoders are serially connected, each taking the output of the previous encoder as input, and finally output the deep global features of the input sequence.
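One encoder layer (residual connection, layer normalization, and FFN) can be sketched as below. This is a simplified forward pass: an identity function stands in for the multihead attention module to keep the sketch short, ReLU is assumed as the activation, and the random W_1, W_2 replace learned weights.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def ffn(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2   # up-project, ReLU, down-project

def encoder_layer(Y, attn, W1, W2):
    Z = layer_norm(Y + attn(Y))           # residual + layer norm around attention
    return layer_norm(Z + ffn(Z, W1, W2)) # residual + layer norm around FFN

rng = np.random.default_rng(3)
n, d_model, d_ff = 32, 64, 256
Y = rng.standard_normal((n, d_model))
W1 = rng.standard_normal((d_model, d_ff)) * 0.05
W2 = rng.standard_normal((d_ff, d_model)) * 0.05

# identity "attention" stands in for the multihead module in this sketch
O = encoder_layer(Y, lambda x: x, W1, W2)
print(O.shape)   # (32, 64)
```

Because the output shape equals the input shape, N such layers can be stacked serially, each taking the previous output as input, exactly as described for the stacked encoders.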

Temporal Transformer
Consisting of HRRPs with consecutive azimuths, the HRRP sequence varies dynamically with the change of azimuth. The HRRP sequences of different targets present diverse trends and patterns, which can effectively be used as features for recognition. However, the commonly used CNN- and RNN-based methods show shortcomings in extracting global temporal information, which limits recognition performance. Therefore, we propose a transformer-based method for extracting temporal information.
To extract the temporal information in the HRRP sequence, we use the HRRP sequence R = [r_1, r_2, …, r_n]^T as input, where n = 32 is the length of the HRRP sequence and r_i represents the HRRP data of azimuth i, which contains 128 range cells. The temporal transformer then models the temporal correlation in the HRRP sequence by calculating the correlation between HRRPs, which can effectively represent the global temporal relationships of the HRRP sequence by extracting the dynamic temporal change information. Moreover, the temporal transformer consists of N layers of encoders, so the deep features of HRRP sequences can also be mined effectively. In summary, the temporal transformer can effectively utilize the temporal features, with deep and global information of the HRRP sequence, for recognition.

Spatial Transformer
Similarly, the amplitude information of the range cells represents the intensity of the echoes received when the radar detects the target and contains rich information about the spatial structure of the target. Moreover, HRRP data in real scenes contain a large number of noisy regions, which can severely obscure the distinguishable information of the target region. To further enhance the information extraction ability for HRRP sequences, we extract the global spatial information of the HRRP sequence with the spatial transformer, which can adaptively extract the significant information of the target region and suppress the adverse effects of noisy regions.
To extract the spatial information between the range cells of the HRRP sequence, we perform a transpose operation on the HRRP sequence R and then use R^T as input. The length of the range cell sequence is 128, and the dimension of the elements in the sequence is 32. The spatial transformer models the spatial correlation between range cells by calculating the long-range correlation between them, thus adaptively extracting the spatial information of the target region and suppressing the influence of noisy regions. Moreover, the spatial transformer also consists of N layers of encoders, so the deep features of HRRP sequences can be mined effectively. In summary, the spatial transformer can effectively utilize the spatial features, with deep and global information of the HRRP sequence, for recognition.

Label Smoothing
The transformer is a deep network based on the self-attention mechanism, which requires a large number of training samples to extract highly distinguishable information about the target. However, because of the limited number of HRRP sequence samples and the fact that HRRPs in real scenes are in a complex noise environment, there are still some differences among HRRPs of the same target, which easily leads to overfitting when using the transformer for HRRP sequence recognition. To overcome the overfitting of the proposed method, we adopt the label smoothing regularization strategy to add noise to the labels, which avoids the model's over-reliance on the training samples and enhances the generalization performance of the proposed method in real scenes.
Label smoothing adds noise to the labels through soft one-hot encoding, thus reducing the weight of the true sample labels in calculating the loss and finally suppressing overfitting. [43] After label smoothing is added, the probability distribution of the soft one-hot labels is

p_i = 1 − ε if i is the true category, and p_i = ε/(K − 1) otherwise

where K is the total number of categories of the multi-category task, i is the category index, and ε is a hyperparameter. When the predicted and true-value labels are compared, the cross-entropy loss function is calculated as

Loss = −Σ_{i=1}^{K} p_i log q_i

where p_i is the probability of the true value and q_i is the probability of the predicted value.
As we know, the neural network is optimized toward small loss values during training, but over-reliance on the training set reduces the generalization performance for the recognition of HRRP sequences in real scenes. The loss function after label smoothing avoids overconfidence of the network and softens the penalty of the loss. The loss for each category is calculated as

Loss_i = −(1 − ε) log q_i if i is the true category, and Loss_i = −(ε/(K − 1)) log q_i otherwise

Therefore, during training, minimizing the cross-entropy loss between the predicted and true values yields the optimal prediction distribution, whose logits are

z_i* = log((K − 1)(1 − ε)/ε) + α if i is the true category, and z_i* = α otherwise

where K is the total number of categories in the multiclassification task, ε is a hyperparameter, and α is an arbitrary real number.
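The label smoothing computation above can be sketched directly. The values K = 10 (the number of MSTAR categories), ε = 0.1, and the uniform prediction used for illustration are example choices, not the paper's training configuration.

```python
import numpy as np

def smooth_labels(y, K, eps):
    """Soft one-hot label: 1 - eps on the true category, eps/(K - 1) elsewhere."""
    p = np.full(K, eps / (K - 1))
    p[y] = 1.0 - eps
    return p

def cross_entropy(p, q):
    """Cross-entropy between the (smoothed) label p and the prediction q."""
    return -np.sum(p * np.log(q))

K, eps = 10, 0.1
p = smooth_labels(3, K, eps)           # true category is 3
q_uniform = np.full(K, 1.0 / K)        # a maximally uncertain prediction, for illustration

print(p.sum())                         # still sums to 1 (up to float rounding)
print(cross_entropy(p, q_uniform))
```

Because the smoothed target never assigns probability 1 to any class, the loss can no longer be driven to zero by making one logit arbitrarily large, which is what softens the penalty and discourages overconfident predictions.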
From the probability distribution, it can be concluded that label smoothing regularization can increase the tolerance to the existence of errors in the true and predicted values, making the generalization performance of the model enhanced.

Results and Analysis
In this section, first, the construction of the MSTAR sequence dataset is introduced. Then, the recognition performance and robustness of the proposed method on the MSTAR dataset are verified by comparison with several remarkable methods. In particular, we explore the interpretability and effectiveness of the proposed method by visualizing the attention maps of the multihead attention. Finally, the rationality of the method and the effectiveness of the feature extraction are verified through ablation experiments.

Dataset
The MSTAR dataset is a standard dataset widely used for RATR. Its data source is a high-resolution spotlight synthetic aperture radar operating in the X-band with a resolution of 0.3 m × 0.3 m and HH polarization. The MSTAR dataset includes 10 categories of targets, such as T72, BMP2, and BTR70. The data with a pitch angle of 17° are used as the training set, and the data with a pitch angle of 15° are used as the test set. The azimuth angles of all targets cover 0-360°. Dataset 1 consists of the original MSTAR dataset: the training set includes 2747 SAR images, and the test set includes 2426 SAR images. To further investigate the robustness of the method to variant targets, which are modified versions of a target that differ somewhat from the prototype, we add four variant targets, BMP2 (SN-9563), BMP2 (SN-C21), T72 (SN-812), and T72 (SN-S7), to the test set, which constitutes dataset 2. The training set of dataset 2 is the same as that of dataset 1, and its test set is about 36% larger than that of dataset 1: dataset 2 includes 2747 SAR images in the training set and 3203 SAR images in the test set. In this article, the SAR images are converted into HRRP sequences according to the method specified in ref. [44], and the MSTAR sequence datasets are composed as shown in Tables 1 and 2.
The conversion steps are as follows: the data are first converted into a complex SAR image, an inverse fast Fourier transform (IFFT) is then carried out along the azimuth dimension of the complex SAR image, and the data obtained along the range dimension form the HRRP complex sequence. The HRRP sequence is then obtained by taking the modulus of the HRRP complex sequence. From each complex SAR image, 100 HRRP samples can be obtained, and averaging every 10 HRRPs yields 10 average HRRP samples, so the original MSTAR dataset includes 27 470 HRRP samples in the training set and 24 260 HRRP samples in the test set.
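The conversion steps above (IFFT along azimuth, modulus, averaging every 10 HRRPs) can be sketched as follows. A random 100 × 128 complex array stands in for a real MSTAR complex SAR image, whose exact formation follows ref. [44]; the sketch only shows the shape bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(4)

# a hypothetical 100 x 128 complex SAR image (azimuth x range) standing in for MSTAR data
sar = rng.standard_normal((100, 128)) + 1j * rng.standard_normal((100, 128))

# IFFT along the azimuth dimension, then take the modulus to obtain 100 HRRPs
hrrp = np.abs(np.fft.ifft(sar, axis=0))          # shape (100, 128)

# average every 10 consecutive HRRPs -> 10 average HRRP samples per image
avg_hrrp = hrrp.reshape(10, 10, 128).mean(axis=1)

print(hrrp.shape, avg_hrrp.shape)   # (100, 128) (10, 128)
```

With 10 average HRRPs per image, the 2747 training images yield the 27 470 training HRRP samples stated above.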
Assuming that the length of the generated HRRP sequence is L (L ≤ 50), the sliding window algorithm for generating HRRP sequences is shown in Algorithm 1.
We use the sliding window method to process the HRRP data according to the steps shown in Figure 2.
As shown in Figure 2, the azimuth angle of 360° is divided into 50 azimuth blocks, each covering 7.2°. The sampling interval of each SAR image is 1°, and each SAR image is processed into 10 average HRRP samples, so the sampling interval of each average HRRP sample is 0.1°. In many studies, denoised and enhanced samples are used as model input, which leads to poor generalization of the model. Visualization of some samples of the MSTAR sequence dataset is shown in Figure 3.
As shown in Figure 3, where Figure 3a-d are sequence data and Figure 3e-h are HRRPs, it can be observed that MSTAR, as a real-scene dataset, contains a large amount of noisy information, which makes feature extraction more difficult.
The parameter setting of the proposed method is shown in Table 3.

Recognition Performance
To verify the effectiveness of the proposed method, we select nine remarkable methods, namely, LSTM, [40] LSTM-FCN, [45] GRU, GRU-FCN, [46] 1D-CNN, TCN, [47] MLSTM-FCN, gMLP, [48] and XCM, [49] as the baseline methods for comparison experiments. The model structures and parameters of the baseline methods are set according to the literature so that their performance is ensured. The recognition accuracy of the ten methods on MSTAR sequence dataset 1 is shown in Table 4.
As shown in Table 4, the TSF-transformer-LS method has the highest average accuracy and the best recognition performance for the ten targets, demonstrating the effectiveness of extracting the global information of HRRP sequences with the self-attention mechanism. Moreover, the proposed method achieves the optimal recognition performance on 7 out of 10 targets, TCN and MLSTM-FCN achieve the optimal recognition performance on 5 out of 10 targets, and the other methods achieve optimal recognition performance on at most 3 targets; the optimal recognition results are in bold in the table. The proposed method improves the average recognition accuracy by 17.19% compared with the gMLP method, by more than 3.79% compared with the recurrent neural network methods LSTM and GRU, by more than 0.71% compared with the convolutional methods 1D-CNN, TCN, and XCM, and by more than 0.29% compared with the LSTM-FCN, GRU-FCN, and MLSTM-FCN methods, which use fusion features. Besides the proposed method, LSTM-FCN, GRU-FCN, and MLSTM-FCN also perform better than LSTM, GRU, TCN, XCM, 1D-CNN, and gMLP, because they all achieve target recognition through temporal-spatial fusion features. In contrast, LSTM, GRU, TCN, and XCM use only the temporal features between HRRPs, 1D-CNN uses the local features of HRRPs, and gMLP processes the HRRP sequence as a whole, so they cannot make full use of the temporal and spatial information contained in the HRRP sequence. That is, temporal-spatial fusion features are more effective for HRRP sequence recognition. Meanwhile, the proposed method generates fusion features more efficiently through the attention fusion mechanism and therefore performs better than the other fusion methods.

Algorithm 1. HRRP sequence generation algorithm.
Step 1: Arrange the azimuth blocks in order. To obtain the same number of HRRP sequences, L − 1 of the previous azimuth blocks are appended after the 50th block, so that the sliding-window data contain a total of 50 + L − 1 azimuth blocks.
Step 2: Following the order of azimuth blocks, take the first HRRP sequence from the 1st to the Lth block, yielding an HRRP sequence of length L.
Step 3: Slide the window down and repeat Step 2 until the whole block has been taken.
Step 4: Return to the 1st HRRP of the azimuth block, and repeat Steps 2 and 3 from the 2nd to the (L + 1)th azimuth block in the order of azimuth blocks.
Step 5: Repeat Step 4 until all 50 azimuth blocks have been taken, obtaining the same number of HRRP sequences as the HRRP data.

According to the accuracy in Figure 4, the recognition performance of the 10 methods on the 10 categories of targets can be observed intuitively. Compared with the gMLP, GRU, TCN, and LSTM methods, whose recognition accuracies fluctuate considerably, the proposed method has more stable accuracy and outstanding recognition performance. Because the proposed method extracts global features with the transformer, adaptively focuses on important features, and performs recognition with complementary temporal-spatial fusion features, its recognition performance is better, more stable, and more reliable.
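The sliding-window procedure of Algorithm 1 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it treats each azimuth block as a single item, and reads Step 1 as circular padding with the first L − 1 blocks so that the number of output sequences equals the number of blocks.

```python
def generate_hrrp_sequences(blocks, L):
    """Sketch of Algorithm 1: sliding-window HRRP sequence generation.

    `blocks` is an ordered list of azimuth blocks (illustratively, one
    HRRP per block). Circular padding with the first L-1 blocks yields
    as many length-L sequences as there are blocks.
    """
    n = len(blocks)                    # e.g., 50 azimuth blocks
    padded = blocks + blocks[:L - 1]   # Step 1: n + L - 1 blocks in total
    # Steps 2-5: slide a window of L consecutive blocks over the padded list
    return [padded[i:i + L] for i in range(n)]
```

With 5 blocks and L = 2, the last window wraps around and reuses the first block, so the output count matches the input count.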

Robustness Experiments
Radar target recognition is mainly aimed at noncooperative targets, which often have multiple modified variants. To verify the robustness of the proposed method for variant target recognition, we add four variant targets to the test set; dataset 2 has about 36% more variant samples in the test set than dataset 1. We then conduct robustness experiments on dataset 2. The recognition accuracy is shown in Table 5. As shown in Table 5, adding variant targets in dataset 2 causes some decrease in the average recognition accuracy of the various methods on the ten targets. The TSF-transformer-LS method still has the highest average accuracy and the best recognition performance on the ten targets, demonstrating its effectiveness in extracting the global information of HRRP sequences with the self-attention mechanism. Moreover, the proposed method achieves the optimal recognition performance on 7 of the 10 targets, MLSTM-FCN achieves it on 5 of the 10 targets, and the other methods achieve it on 4 or fewer targets; the optimal recognition results are shown in bold in the table. In particular, the average recognition accuracy of gMLP decreases by 18.55%, indicating that gMLP extracts only shallow target features and is therefore poorly robust to variant targets. GRU decreases by 4.14%, indicating that the features extracted by GRU are less stable, because its structure is simpler than that of LSTM. 1D-CNN decreases by 1.90% because of its limitation to local information. MLSTM-FCN, TCN, LSTM-FCN, XCM, LSTM, and GRU-FCN all decrease by less than 1.50%; all of them except TCN and LSTM utilize fusion features. In contrast, the accuracy of the proposed method decreases by only 0.10% compared with dataset 1, indicating that the proposed method extracts more stable and robust features. Moreover, label smoothing reduces over-reliance on the training data and enhances the generalization of the method, so that small changes in the target structure do not lead to significant degradation in recognition performance; this effectiveness is verified in the ablation experiments in Table 7. Meanwhile, we note that dataset 2 has about 36% more variant samples than dataset 1, concentrated on the BMP2 and T72 targets, which constitutes an imbalanced dataset. The experimental results show that the proposed method is more robust to both variant targets and imbalanced datasets. That is, the results illustrate that global features covering both the time domain and the space domain are more robust for HRRP sequence recognition and improve generalization performance.
The recognition performance of the ten methods on dataset 2 is shown in Figure 5. Compared with dataset 1, the recognition performance of all methods decreases to varying degrees. In contrast, the proposed method still maintains high accuracy and stable recognition performance on the 10 categories of targets, demonstrating the robustness of the temporal-spatial fusion features and the effectiveness and stability of the proposed method on variant samples.

Temporal-Spatial Feature Fusion Experiments
In feature fusion tasks under the transformer framework, add fusion and concatenate fusion are the fusion methods mainly used. Although both temporal and spatial global features can be used in the recognition task, it is not appropriate to fuse the features only by summation or same-dimensional concatenation without considering the correlation between them. The attention fusion proposed herein learns the correlation between the two features through an attention mechanism and assigns weights to the temporal and spatial features, realizing more effective adaptive fusion. The recognition results of the three fusion methods on dataset 2 are shown in Table 6.
From Table 6, the accuracy of the proposed attention fusion is 0.42% higher than that of add fusion and 0.41% higher than that of concatenate fusion. Meanwhile, attention fusion achieves the optimal recognition result on seven targets, more than either of the other two fusion methods; the optimal recognition results are shown in bold in the table. Add fusion sums the temporal and spatial features, which destroys their original characteristics and loses important information from the original features. Concatenate fusion merely merges the temporal and spatial features; although the original features are preserved, the importance of the temporal and spatial features is obviously not equal. The experiments show that using the attention mechanism to assign different weights to the temporal and spatial features represents the correlation between them more effectively and yields more effective temporal-spatial fusion features.
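The contrast between the three fusion strategies can be made concrete with a small sketch. This is an illustrative simplification, not the paper's network: the scoring vector `w` stands in for the learned parameters of the attention fusion module, and the features are plain vectors rather than transformer outputs.

```python
import math

def attention_fusion(f_temporal, f_spatial, w):
    """Adaptive fusion of a temporal and a spatial feature vector.

    `w` plays the role of a learned scoring vector (a hypothetical
    stand-in for the trained fusion parameters). Each feature gets a
    scalar score, a softmax turns the scores into fusion weights, and
    the output is the weighted sum -- unlike add fusion (fixed equal
    weights) or concatenate fusion (no weighting at all).
    """
    s_t = sum(x * v for x, v in zip(f_temporal, w))   # temporal score
    s_s = sum(x * v for x, v in zip(f_spatial, w))    # spatial score
    m = max(s_t, s_s)                                  # stable softmax
    e_t, e_s = math.exp(s_t - m), math.exp(s_s - m)
    a_t, a_s = e_t / (e_t + e_s), e_s / (e_t + e_s)   # weights sum to 1
    fused = [a_t * t + a_s * s for t, s in zip(f_temporal, f_spatial)]
    return fused, (a_t, a_s)
```

When the two scores are equal, the fusion degenerates to add fusion with weights 0.5/0.5; in training, the weights adapt to whichever feature is more informative for the sample.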

Attention Feature Map Analysis Experiment
The proposed method enhances global feature extraction through a multihead attention mechanism, extracting important temporal and spatial features adaptively. To further investigate the effectiveness of the self-attention mechanism in HRRP sequence recognition, we select test samples and output their attention maps, the intersample dynamic time warping (DTW) map, and the L2 distance map. Since there may be offsets between the range cells of HRRP sequences, we use DTW to calculate the similarity between range cells and the L2 distance to calculate the temporal similarity between HRRPs. The validity and interpretability of the proposed method are analyzed on the sample data. As shown in Figure 6, the spatial transformer models the spatial correlation between the range cells of HRRP sequences. According to the attention map of range cells (Figure 6a) and the DTW map between range cells (Figure 6b), different attention weights are assigned to different range cells; the parts with higher weights are mainly concentrated around range cell 50, and the parts with larger DTW values are assigned higher attention weights. The temporal transformer models the temporal correlation between the HRRPs of a sequence. According to the attention map of the HRRP sequence (Figure 6c) and the L2 distance map between HRRPs (Figure 6d), HRRPs at different azimuths are assigned different attention weights, and the parts with larger L2 distances are assigned larger weights. A greater DTW distance indicates less similarity with other range cells and thus more discriminative features, so the assigned attention weight is greater; similarly, a greater L2 distance indicates less similarity between HRRPs, meaning that the HRRP contains more discriminative features, so its attention weight is greater. Therefore, the multihead attention mechanism mainly focuses on the data with the least similarity and the most distinguishing features in HRRP sequences. This tentatively suggests that the self-attention mechanism applied to HRRP sequence classification is closely related to similarity measures; we will explore this correlation in more detail in future work.
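The DTW similarity used for the analysis above is the standard dynamic-programming formulation; a minimal sketch (absolute-difference local cost, no windowing constraint, which may differ from the exact variant used in the experiments) is:

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1D profiles.

    Unlike the L2 distance, DTW aligns the two sequences before
    accumulating cost, so it tolerates the offsets that can occur
    between range cells of HRRP sequences.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of: insertion, deletion, match
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

For a profile shifted by one cell, DTW can recover a perfect alignment (distance 0) where the pointwise L2 distance would not, which is exactly why it is the appropriate similarity measure between range cells here.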

Ablation Experiments
To verify the effectiveness of the temporal-spatial fusion features compared with single temporal and spatial features, we build transformer methods based on the temporal and spatial features respectively, and we also verify the effectiveness of label smoothing in the proposed method. The recognition accuracy of the ablation experiments is shown in Table 7. According to Table 7, the recognition accuracy of TSF-transformer, which uses temporal-spatial fusion features, is improved by 0.21% compared with temporal-transformer and by 0.41% compared with spatial-transformer, indicating that temporal and spatial features complement each other and that temporal-spatial fusion features have a feature-enhancement effect, which is superior to single temporal or spatial features. When the label smoothing mechanism is used to improve TSF-transformer, the recognition accuracy improves by a further 0.91%, indicating that label smoothing can somewhat overcome the overfitting of the proposed method caused by the high noise of real scenes and further enhance the recognition performance of TSF-transformer.
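The label smoothing used in the ablation follows the standard formulation: the one-hot label is softened so that the true class keeps probability 1 − ε and the remaining ε is spread uniformly over all classes. A minimal sketch (ε = 0.1 is a typical choice, not necessarily the paper's exact setting):

```python
def smooth_labels(y, num_classes, eps=0.1):
    """Soften a one-hot label for class index `y`.

    True class: (1 - eps) + eps / num_classes; every other class:
    eps / num_classes. The result is still a valid distribution,
    but the target is no longer a hard 0/1 vector, which discourages
    the network from becoming overconfident on noisy HRRP labels.
    """
    base = eps / num_classes
    return [base + (1.0 - eps) if k == y else base for k in range(num_classes)]
```

For the ten MSTAR targets with ε = 0.1, the true class target becomes 0.91 and each wrong class 0.01, so the cross-entropy loss keeps a small gradient toward the non-target classes instead of driving their probabilities to zero.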
To verify the effectiveness of the improved method for feature extraction, we visualize five sets of features using the t-distributed stochastic neighbor embedding (t-SNE) algorithm, which reduces the features to 2D. As shown in Figure 7, the original feature distribution of the ten categories of targets has a high degree of sample overlap and poor separability (Figure 7a). Compared with Figure 7a, the feature distribution after extraction by the spatial transformer has better separability but smaller intercategory distances and larger intracategory distances (Figure 7b). The feature distribution of the temporal transformer has smaller intracategory distances and a certain degree of separation between the target categories, with few overlaps and sample distribution deviations (Figure 7c). Compared with Figure 7a-c, the feature distribution has better separability, smaller intraclass distances, and essentially no overlap (Figure 7d). The distribution in Figure 7e clearly outperforms the others, with better aggregation of samples within the same category and more distinct, well-separated distributions between categories, demonstrating the effectiveness of the proposed method for feature extraction.

Conclusion
In this article, we propose an HRRP sequence recognition method, TSF-transformer-LS. The proposed method extracts deep global temporal and spatial features in the time domain and space domain with temporal blocks and spatial blocks, respectively. Then, an attention fusion mechanism is proposed to fuse the temporal and spatial features. We adopt label smoothing regularization to relieve the overfitting caused by noise in real scenes. Experiments on the MSTAR standard dataset and its variant dataset show that the recognition performance and robustness of the proposed method are significantly improved. TSF-transformer-LS extracts more discriminable features, has more stable recognition performance across all target categories than other methods, and has superior generalization performance for variant targets. In addition, we tentatively analyze the interpretability of the multihead attention mechanism in the proposed method and verify the effectiveness of the proposed method for feature extraction. Although the temporal-spatial fusion features effectively improve recognition performance, the lightweight level and feature sparsity of the model need further improvement. Building on the existing work, we will further improve the recognition performance of the transformer on limited HRRP samples and utilize temporal-spatial fusion features more efficiently in a more lightweight way.

Figure 1 .
Figure 1.Structure of TSF-transformer (left) and schematic overview of the multihead attention module (right).
model for real scenes. In this research, only energy normalization is applied to the HRRP data, because in real scenes the data of individual angles are often lost due to aircraft motion. Therefore, the missing data in the MSTAR dataset are not interpolated, making the data more consistent with real scenes. After processing by the sliding-window algorithm, dataset 1 yields 27 470 HRRP sequence samples in the training set and 24 260 in the test set, and dataset 2 yields 27 470 HRRP sequence samples in the training set and 32 030 in the test set.

Figure 4 .
Figure 4. Accuracy of the ten targets in MSTAR dataset 1. The numbers on the x-axis represent the 10 targets in dataset 1; the numerical order is consistent with the order of the categories in dataset 1.

Figure 5 .
Figure 5. Accuracy of the 10 targets in MSTAR dataset 2. The numbers on the x-axis represent the 10 targets in dataset 2; the numerical order is consistent with the order of the categories in dataset 2.

Table 2 .
MSTAR sequence dataset 2. Columns: category of training set, training set (17°), category of test set, test set (15°).

Table 3 .
Parameter settings of the proposed method.

Table 4 .
Recognition accuracy of the comparison experiments on dataset 1 (%).

Table 5 .
Recognition accuracy of the robustness comparison experiments on dataset 2 (%).

Table 6 .
Recognition accuracy of the comparison experiments on feature fusion.

Table 7 .
Recognition accuracy of ablation experiments.