Learning view‐invariant features using stacked autoencoder for skeleton‐based gait recognition

Hossen Asiful Mustafa, Institute of Information and Communication Technology, Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh. Email: hossen_mustafa@iict.buet.ac.bd

Abstract
Human gait recognition in a multicamera environment is a challenging task in biometrics because of the large pose and illumination variations among different views. In this work, to address the problem of view variation, we present a novel stacked autoencoder for learning discriminative view-invariant gait representations. Our autoencoder can efficiently and progressively translate skeleton joint coordinates from any arbitrary view to a common canonical view without requiring prior estimation of the view angle or covariate type and without losing temporal information. We then construct a discriminative gait feature vector by fusing the encoded features with two other spatiotemporal gait features and feed it into the main recurrent neural network. Experimental evaluations on the challenging CASIA A and CASIA B gait datasets demonstrate that the proposed approach outperforms other state-of-the-art methods on single-view gait recognition. In particular, the proposed method achieved 46.31% and 33.86% average correct class recognition rates on the ProbeBG and ProbeCL probe sets of the CASIA B dataset while considering view variation, which is 0.3% and 30.68% higher than the previous best-performing methods, respectively. Furthermore, in cross-view recognition, our method shows better results than other state-of-the-art methods when the view-angle variation is larger than 36°.


| INTRODUCTION
Gait recognition is a behavioural biometric modality that identifies a person based on his or her walking pattern. Owing to its noninvasive nature, it offers recognition at a distance without the cooperation or awareness of the subject. In addition, gait recognition can be conducted reliably and accurately on low-resolution images using simple equipment, whereas other biometrics may not provide the required accuracy under these conditions. Because of these potential advantages, it is likely the only viable identification method for person re-identification in a multicamera video surveillance environment. It also has broad application prospects in access control, human-computer interaction, forensic identification, and so forth. Moreover, gait analysis can be employed in medical applications such as abnormality detection [1] and the automatic classification of gait impairments [2,3] for aided diagnosis.
The past few decades have witnessed significant improvements in gait recognition algorithms under controlled environmental setups [4]. However, gait recognition remains an open research problem, especially in real-world multicamera environments, owing to large variations in camera view angle, pose, and illumination, and to subjects' intrinsic variations such as changes in walking speed, clothing, and carrying conditions. Among these covariate factors, variation in view is the most difficult one because it simultaneously creates intraclass variation and interclass confusion. Approaches proposed in the literature to address this view variation problem can be grouped into three categories: (1) reconstructing the three-dimensional (3D) structure of the human body [5], (2) extracting handcrafted view-invariant features [6], and (3) learning cross-view projections [7,8] to normalise gait features by transforming one view into a more common canonical view, such as the side view. However, reconstructing a complete 3D structure of the human body often requires multiple calibrated cameras, which is impractical in real-world surveillance applications. In this work, we propose a view-invariant feature extraction technique that fuses the second and third categories of classic approaches: we employ an autoencoder to extract view-invariant gait features from the 3D skeleton information of the target.
To extract discriminative features for gait recognition, various input data types such as human silhouettes [9], optical flow [10], and dense trajectories [11] have been extensively employed in the literature; however, these are not robust to the covariate factors mentioned earlier and often require costly equipment. In contrast, it has been shown that skeleton information [12,13] does not depend on changes in illumination and body appearance and hence is less affected by variations in covariate factors. Moreover, the dynamics of the human body skeleton convey temporal information about gait and can therefore represent the gait pattern effectively. However, missing joint information in human poses, caused by occlusion and low-resolution images, often introduces errors when modelling the skeleton sequence. Nevertheless, with recent deep learning-based body pose estimation algorithms [14,15] and proper preprocessing, we can now model different human body parts accurately, which significantly improves the performance of skeleton-based gait recognition.
Recurrent neural networks (RNNs) and their variants, such as long short-term memory units (LSTMs) and gated recurrent units (GRUs) [16] have achieved promising performance in many sequence-based tasks such as speech recognition [17]. Because a gait is considered to be a series of walking postures, it can form a time series of skeleton sequences. Therefore, in many works, RNNs have also been successfully employed to model the skeleton sequences of a gait pattern [1,18].
An autoencoder is an unsupervised learning algorithm designed to learn an identity function by reconstructing the original input at the output. Therefore, besides learning an efficient encoding [19], it also learns how to generate output similar to the input from the reduced encoding. In this research, we employed an RNN-based stacked autoencoder to reconstruct the skeleton joint coordinates from any arbitrary view in a targeted view, in contrast to many existing algorithms [20,21] that transform the gait features of an entire sequence from one view to another at the cost of losing temporal information. Hence, we believe that employing an autoencoder to translate skeleton information across views is more effective than previous methods. In addition, we can extract view-invariant features from any view using a single uniform model without prior estimation of the view angle or of the type of covariate variation, such as the carrying or clothing condition.
The main idea of our proposed method is to develop an RNN-based stacked autoencoder to extract view-invariant gait features from 3D skeleton data. In addition, we considered two handcrafted gait features: motion features and the lengths of body parts. We then fused them with the encoded view-invariant features to construct a discriminative low-dimensional gait feature vector for robust cross-view gait recognition. We considered these spatiotemporal descriptors because, in our experiments, we found that they greatly improved gait recognition performance. To model the temporal dynamics associated with the gait features, we designed a simple and efficient two-layer bidirectional GRU (BiGRU) architecture. An extensive experimental analysis on two challenging benchmark datasets demonstrates the effectiveness of our proposed method in both single-view and cross-view gait recognition.
The main contributions of this work are as follows: • We introduce a technique for learning view-invariant features from 3D skeleton information using a single stacked RNN-based autoencoder that can progressively and efficiently translate skeleton joint coordinates from any arbitrary view to the side view angle.
• We employ only a single uniform model to translate skeleton information while keeping both spatial and temporal information among the frames. Furthermore, our proposed method requires no prior estimation of the view angle or other type of covariate variations.
• We also design a simple discriminative RNN model based on a two-layer BiGRU network, which is efficient in skeleton-based gait recognition.
• An extensive experimental evaluation on two challenging benchmark gait datasets shows that our proposed skeleton-based method achieved state-of-the-art performance on cross-view and outperformed current best approaches in single-view gait recognition.
The rest of this work is organised as follows: an overview of related work on gait recognition is presented in Section 2. A detailed description of the proposed method, along with the steps for extracting view-invariant features from the stacked autoencoder, forming a low-dimensional discriminative gait feature vector, and training the main RNN network for gait recognition, is presented in Section 3. In Section 4, we demonstrate the experimental evaluation of the proposed approach on two benchmark datasets under different experimental settings. We then discuss the overall contributions and limitations of our proposed method in Section 5. Finally, we mention some possible future research directions in Section 6.

| RELATED WORKS
In this section, we briefly discuss existing literature that is closely related to our work.

| Traditional methods
Traditional approaches to gait recognition can be divided into two categories based on the feature extraction technique applied to gait video frames: model-based and appearance-based. Model-based approaches [5,12,13,18] are often built with a structural model and a motion model to capture both the static and the dynamic information of gait. Although these methods can handle some covariate factors efficiently, their performance depends greatly on the accurate modelling of body parts; therefore, they often require relatively high-resolution images.
In contrast to model-based methods, much of the existing work on gait recognition [7-9] is appearance-based. The common framework of these methods is to compute a gait descriptor, for example the gait energy image (GEI) [9], by aligning the human silhouettes extracted from video data in the spatial domain and then averaging them in the temporal domain over an entire gait cycle. Owing to its simplicity and relative robustness to segmentation errors, the GEI is considered the most effective gait representation for appearance-based methods. However, GEI-based methods are sensitive to covariates such as the subject's body appearance and shape; when the shape and appearance of the human body change substantially, their performance degrades severely. Moreover, the extraction of human silhouettes is affected by changes in lighting. Therefore, these methods are not completely robust to covariate variations. To address these issues, Maryam et al. [22] proposed a method to learn view-invariant GEIs based on non-negative matrix factorisation. They modelled gaits at different view angles as a Gaussian distribution and then used a joint Bayesian function to transform the gait features into a low-dimensional space.

| Deep learning for gait recognition
Deep learning algorithms such as convolutional neural network (CNN)-based gait recognition methods [6,10,23,24] have gained increasing popularity owing to their ability to learn features automatically from given training images. In [10], Wolf et al. employed 3D convolutions for multi-view gait recognition. To make the model colour invariant, they formulated a special type of input with three channels. The first channel of the input was the RGB-image converted to greyscale, and for the second and third channels, optical flow in the x and y directions were, respectively, employed. Shiraga et al. [6] proposed GEINet, an eight-layer CNN network, to learn gait features directly from GEI.
Wu et al. [23] conducted cross-view gait recognition by learning the similarity of the subject's GEI pair using three deep CNN networks. Xu et al. [24] learned an appropriate nonrigid transformation by transforming a pair of inputs: that is, both the probe and the gallery from different views into their intermediate view using a CNN-based pairwise spatial transformer. Then, the gait features were fed into the subsequent recognition network to obtain gait recognition. Yu et al. [25] employed a generative adversarial network (GAN) to learn invariant features for gait recognition. They further improved the GAN-based method by adopting a multi-loss strategy to optimise the network efficiently [26].

| Skeleton-based gait recognition
There has been huge interest in studying deep learning-based approaches for real-time pose estimation from images and videos [14], which mainly involves estimating the locations of body parts. To recognise multi-person poses in real time, Cao et al. [14] developed a deep CNN-based regression method to estimate the association between different anatomical parts in an RGB image. Besides RGB images, a body pose can be estimated from other forms of input data such as silhouettes [27] and depth images [28]. Ding et al. [27] provided an efficient estimation of the body's key points from silhouette images. However, most state-of-the-art human pose estimation algorithms require colour RGB images, because this is the most prevalent input format in many computer vision tasks.
Martinez et al. [15] proposed a fully connected (FC) network with residual connections that takes a 2D pose as input and predicts the 3D pose with a regression loss from a single image. This simple model performs well on several state-of-the-art benchmark datasets. In this work, we took their pretrained 3D-pose baseline model [15] to predict the 3D pose from the 2D pose information estimated by the OpenPose algorithm [14].
Parallel to the advancement of pose estimation algorithms, skeleton-based gait recognition has attracted increasing attention. In many of these gait recognition approaches [1,12,13], RNNs have been successfully employed to model the temporal sequences of skeletal data. For example, Liao et al. [12] constructed a pose-based temporal-spatial network (PTSN) to extract spatiotemporal features that were robust to variations in clothing and carried items. They used two different kinds of networks in their model: an LSTM network to extract temporal features from pose sequences and a CNN to extract spatial features from static gait pose frames. They also introduced a model-based gait recognition method, PoseGait [13], which employs 3D skeleton joint information as input to the network. They concatenated four different kinds of features at the input level, some of which are handcrafted features extracted based on human prior knowledge, to form a spatiotemporal feature vector. Finally, they trained a seven-layer CNN architecture for the CASIA B [29] dataset and a 20-layer CNN architecture for the CASIA E dataset. Jun et al. [1] developed a two-stage framework for skeleton-based abnormal gait recognition: in the first stage, an RNN-based autoencoder extracts robust gait features; in the second stage, a deep RNN-based discriminative model is employed for classification.

| Cross-view gait recognition
Many approaches [8,20,21] to cross-view gait recognition have been proposed in the literature. They can be classified mainly into two categories: generative and discriminative. Generative approaches are based on transforming gait features from one view to another, or on normalising gait features by translating one view to a more common canonical view. For example, Makihara et al. [20] introduced a singular value decomposition-based view transformation model for gait analysis with frequency-domain features. In Wang and Yan [30], an ensemble learning-based algorithm was proposed that used gait features based on the area average distance to effectively reduce the sensitivity to view-angle variation.
In contrast, discriminative approaches are designed to learn view-invariant subspaces by directly optimising the discrimination capability rather than projecting gait features into another common space. For example, Hu et al. [8] proposed a view-invariant discriminative projection (ViDP) method that extracts view-invariant features using a unitary linear projection, with the optimal projection found according to the data geometry. Thus, owing to the unitary nature of ViDP, cross-view gait recognition was performed without estimating the view angles. Although these approaches have been employed extensively in the literature, they perform poorly, especially under large view variations, because of the difficulty of finding robust view-invariant subspaces.
Deep autoencoders have also been employed to learn view-invariant features. By stacking multiple autoencoders progressively, any arbitrary view can be transformed into a more common view using only one model. For example, Kan et al. [31] proposed a stacked progressive autoencoder (SPAE) in which multiple shallow autoencoders were stacked to convert non-frontal face images to frontal ones progressively and thereby learn robust features for multi-view face recognition. Furthermore, Yu et al. [21] proposed a stacked seven-layer autoencoder to synthesise gait features for invariant feature extraction. The first two layers of their stacked autoencoder handled clothing and carrying variations, respectively, whereas the remaining layers transformed GEIs of any view into the side view.
In this study, similar to Yu et al. [21], we employ a stacked autoencoder to progressively translate all skeletal data into the side view angle for view-invariant feature extraction. In addition, we extract different kinds of spatiotemporal features and fuse them to form a discriminative gait feature vector. However, unlike Yu et al. [21], our proposed architecture is composed of three encoder layers that translate 3D skeleton data into the side view angle along the pose variation manifold.

| PROPOSED METHOD
The workflow of the proposed framework is illustrated in Figure 1. In this section, we will discuss the proposed method and its main components in detail.

Figure 1. Overview of the proposed framework for robust cross-view gait recognition. First, two-dimensional (2D) human poses are estimated from the gait video frames using an improved OpenPose [14] algorithm. The 3D-pose baseline algorithm [15] is then employed to estimate 3D skeleton information from the 2D data. Next, a stacked autoencoder learns the view-invariant gait features; the encoded features extracted from the top layer of the stacked autoencoder are fused with other spatiotemporal features to construct a 54D gait feature vector. The discriminative feature vector is then fed into the main recurrent neural network, which identifies the subject.

| Collecting pose information
Let $v_i$ represent the $i$th subject's RGB gait video and $s_i$ the $i$th subject's ID. Using the 3D-pose baseline algorithm [15], we obtain a mapping from the RGB data to the body skeleton:

$$v_i \rightarrow P_i = \left\{ p_i^{(1)}, p_i^{(2)}, \ldots, p_i^{(F_i)} \right\}$$

Here, $p_i^{(f)}$ represents the $i$th subject's skeleton information at frame $f$, and $F_i$ represents the total number of frames in sample video $i$.
Each skeleton $p_i^{(f)}$ consists of a list of 3D joint coordinates:

$$p_i^{(f)} = \left\{ \left( x_j, y_j, z_j \right) \mid j \in J \right\}$$

where $j$ represents the joint index and $J$ is the set of 32 body skeleton joints defined by the pose estimation algorithm [15]. Because not all of the joints on the body skeleton play a significant role in the gait pattern, we did not translate all of the joint coordinates into the side view angle. Rather, we searched for joints with a strong and discriminative gait representation capacity. In our experiment, we selected a total of 12 body joints from the left and right legs, hip, neck, shoulder, and head areas. All available body joints and the selected joints are shown in Figure 2(a) and 2(b), respectively. Consequently, for a single frame we obtain a 36D input skeletal vector $x$:

$$x = \left[ x_1, y_1, z_1, \ldots, x_{12}, y_{12}, z_{12} \right]^\top \in \mathbb{R}^{36}$$
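To make the construction of this per-frame vector concrete, the sketch below selects the chosen joints and flattens their 3D coordinates. The joint indices are hypothetical placeholders, since the exact index mapping of the 32-joint layout is not reproduced here.

```python
import numpy as np

# Hypothetical indices of the 12 selected joints (head, neck, shoulders, hips,
# knees, ankles, ...); the real mapping depends on the 32-joint layout used by
# the 3D-pose baseline model.
SELECTED_JOINTS = [0, 1, 2, 5, 8, 11, 12, 13, 14, 15, 25, 26]

def frame_to_vector(skeleton_3d):
    """Flatten the (32, 3) skeleton of one frame into a 36D input vector."""
    skeleton_3d = np.asarray(skeleton_3d)
    assert skeleton_3d.shape == (32, 3)
    return skeleton_3d[SELECTED_JOINTS].reshape(-1)  # 12 joints * 3 coords = 36
```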

| Handling missing joint information
One of the most challenging tasks for the pose estimation algorithm is to estimate the pose of a subject who is completely or partially occluded. This scenario often causes the algorithm to fail to estimate one or more joint coordinates. To make the proposed gait algorithm robust and accurate, we have to address the problem of missing joint information carefully. In this study, we employed the following strategies, which have proven effective in addressing the missing joint problem:
• The centre of the hip joint is considered to be the origin of the coordinate system. If the origin cannot be located owing to missing hip joints, the frame is rejected.
• If more than two body joints are missing between the knee and hip joints of both legs, the frame is rejected.
• Persistently missing joints are estimated by exploiting the symmetry between the left and right sides of the body.
• In all other cases, a position of [0.0, 0.0] is assigned to the joint that is not located in the frame.
With this technique, if the hip joint or more than two body joints are occluded for a significant duration in a particular gait video, the proposed algorithm may fail owing to the large number of missing frames. Section 4.3.7 provides an experimental analysis in this regard; the experiment demonstrates that our proposed method remains robust when the degree of occlusion is less than 30%.
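A minimal sketch of these missing-joint rules is given below; the joint indices, the left/right symmetry map, and the use of NaN to mark a missing joint are illustrative assumptions rather than the exact implementation.

```python
import numpy as np
from typing import Optional

HIP_CENTRE = 0                      # assumed index of the hip-centre joint
LEG_JOINTS = [1, 2, 4, 5]           # assumed indices of the left/right hip and knee joints
MIRROR = {1: 4, 4: 1, 2: 5, 5: 2}   # assumed left/right symmetric joint pairs

def clean_frame(joints: np.ndarray) -> Optional[np.ndarray]:
    """Apply the missing-joint rules to one frame; return None when the frame is rejected."""
    missing = np.isnan(joints).any(axis=1)      # a joint is 'missing' if any coordinate is NaN
    if missing[HIP_CENTRE]:                     # origin cannot be located
        return None
    if missing[LEG_JOINTS].sum() > 2:           # too many hip/knee joints missing
        return None
    out = joints.copy()
    for j in np.where(missing)[0]:
        m = MIRROR.get(j)
        if m is not None and not missing[m]:    # recover from the symmetric joint
            out[j] = out[m]
        else:
            out[j] = 0.0                        # otherwise fall back to a zero position
    return out
```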

| Normalisation
It is important to normalise the skeletal data with regard to the subject's position in the frame. Because in most gait videos people walk past a fixed camera, the size of the subject's skeleton changes owing to the continuously changing distance between the subject and the camera. Therefore, we have to normalise the gait sequence, that is, keep the subject's size constant in every frame for improved performance. To eliminate variations in the size of the body skeleton, we transformed the 3D coordinates of all of the body joints into a new coordinate system whose origin $(x_o, y_o)$ is the middle of the hip joint ($J_o$):

$$\bar{x}_j = x_j - x_o, \qquad \bar{y}_j = y_j - y_o$$

Here, $(\bar{x}_j, \bar{y}_j)$ is the new coordinate of $(x_j, y_j)$. In fact, this normalisation has a huge impact on the robustness of the gait recognition algorithm. First, it allows a fair comparison between different subjects' skeletons by reducing the effect of variations in camera position. Second, skeleton sizes become homogeneous across different camera settings and proximities. Thus, it makes the system robust to zooming, camera position, and subject location.
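A minimal sketch of this hip-centred normalisation, assuming the joints of a frame are stored as an (N, 3) array and the index of the hip centre is known:

```python
import numpy as np

def hip_centre_normalise(joints: np.ndarray, hip_idx: int = 0) -> np.ndarray:
    """Shift all joint coordinates so that the hip centre becomes the origin."""
    return joints - joints[hip_idx]   # broadcasts the (3,) hip position over all (N, 3) joints
```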

| Proposed autoencoder architecture
An autoencoder is an unsupervised neural network that aims to reconstruct inputs by minimising the difference between the input and the output. It generally consists of two parts: the encoder and the decoder. In the encoding phase, the model learns a latent space representation of the input, whereas in the decoding phase, the model reconstructs the input from the latent space.
LSTM-based autoencoders have been successfully employed for the unsupervised learning of video representations [32] and for robust gait feature extraction [1]. In this work, to extract view-invariant gait features, we developed an autoencoder based on the bidirectional LSTM [33]. As shown in Figure 3, our proposed autoencoder consists of a two-layer bidirectional LSTM (BiLSTM) in both the encoder and the decoder stages. From the extensive experimental evaluation described in Section 4.3.5, the optimal size of the latent space representation of our proposed autoencoder was found to be 30. The first and second BiLSTM layers in the encoder stage contain 60 and 30 LSTM cells, respectively. The decoder stage has the same architecture, but in reverse order. By stacking multiple layers, we obtain a stacked autoencoder that can learn a more complex coding than a simple autoencoder; Figure 4 shows a stacked autoencoder with three encoding layers and one decoding layer.

Figure 3. Schematic diagram of the proposed autoencoder. The input of the autoencoder is the three-dimensional skeleton information, and at the output the autoencoder tries to reconstruct the skeleton information in the side view angle. In the encoder part, the first and the second bidirectional long short-term memory unit layers have a total of 60 and 30 cells, respectively. The decoder part has a similar architecture, but in reverse order.
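The following Keras sketch illustrates one such BiLSTM autoencoder. The sequence length, the split of the reported "60 and 30 cells" across the two directions, and the remaining layer options are assumptions, not the exact configuration used in our experiments.

```python
from tensorflow.keras import layers, models

SEQ_LEN, FEAT_DIM = 28, 36   # frames per window and per-frame skeleton size (assumed here)

def build_bilstm_autoencoder():
    """Two BiLSTM layers in the encoder (60 and 30 cells in total), mirrored in the decoder."""
    inputs = layers.Input(shape=(SEQ_LEN, FEAT_DIM))
    # Encoder: the '60 and 30 cells' are interpreted as totals over both directions.
    x = layers.Bidirectional(layers.LSTM(30, return_sequences=True))(inputs)
    latent = layers.Bidirectional(layers.LSTM(15, return_sequences=True))(x)  # 30D latent per frame
    # Decoder: same architecture in reverse order, reconstructing the side-view skeleton.
    x = layers.Bidirectional(layers.LSTM(15, return_sequences=True))(latent)
    x = layers.Bidirectional(layers.LSTM(30, return_sequences=True))(x)
    outputs = layers.TimeDistributed(layers.Dense(FEAT_DIM))(x)
    encoder = models.Model(inputs, latent, name="encoder")
    autoencoder = models.Model(inputs, outputs, name="bilstm_autoencoder")
    return autoencoder, encoder
```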

| Formulation of stacked autoencoder
Because the gait view angle in CASIA B varies from 0° to 180°, we chose the side view angle (90°) as the standard view, because it contains the most dynamic information about gait; it is also easier to translate the joint coordinates from any arbitrary view to 90°. In our experiment, we found that employing a single autoencoder to translate the skeleton information from a rare view (the front or back view) to the side view does not achieve good performance. For example, the variation in pose between two view angles such as 36° and 90° is so large that it would be difficult for a single autoencoder to translate the joint coordinates between them effectively. We also found that a single autoencoder can efficiently handle a small variation in pose, that is, a variation within a range of 36°. Therefore, we can translate from a view angle of 36° to a view angle of 72° with minimal drifting error, instead of translating directly to 90°. Now, as suggested in Yu et al. [21], if we stack several autoencoders progressively, we can map gait videos at any view into the side view. Thus, our first autoencoder maps the skeleton information within a 36° view-angle variation range, for example, from a view angle of 0° to a view angle of 36° and from a view angle of 180° to a view angle of 144°, while keeping the other input gait skeletons unchanged. After this translation, the input view-angle range is narrowed from [0°, 180°] to [36°, 144°]. Similarly, the second autoencoder further narrows the view-angle range from [36°, 144°] to [72°, 108°]. Finally, the third autoencoder translates all of the input coordinates to the view angle of 90°. Therefore, by stacking our proposed three autoencoders progressively, we can translate any gait view to the side view angle efficiently. Figure 4 shows a schematic view of the proposed stacked autoencoder.

Figure 4. Translation of skeleton joint coordinates by our proposed stacked autoencoder in a three-dimensional Euclidean space. The architecture is composed of three hidden encoder layers to deal with variations in view within the range [0°, 180°]. In the training stage, each autoencoder aims to translate skeleton joints from a given view angle into an angle that is 36° closer to the side view angle. Then, by stacking three autoencoders progressively, we can effectively translate all of the skeleton joints to the side view step by step along the pose variation manifold. Finally, the encoded features from the 30D latent space of the topmost layer are extracted as view-invariant features for gait recognition.
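The view-narrowing schedule itself can be illustrated with a small helper that moves a view angle at most 36° toward the side view per stage; this only illustrates the schedule, not the learned mapping.

```python
def next_view(angle_deg: float, step: float = 36.0, canonical: float = 90.0) -> float:
    """Target view for one translation stage: move at most `step` degrees toward the side view."""
    if abs(angle_deg - canonical) <= step:
        return canonical
    return angle_deg + step if angle_deg < canonical else angle_deg - step

# Three stages map every CASIA B view onto the side view, e.g.
# 0 -> 36 -> 72 -> 90 and 180 -> 144 -> 108 -> 90.
angles = [0.0]
for _ in range(3):
    angles.append(next_view(angles[-1]))
print(angles)   # [0.0, 36.0, 72.0, 90.0]
```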

| Training
In this work, we trained the proposed stacked autoencoder using a greedy layer-wise algorithm; therefore, each autoencoder was trained independently. The first autoencoder was trained with an unsupervised learning algorithm to set the initial parameters of the first layer of the network. Similarly, the second and third autoencoders were trained independently, so that the parameters of all three layers were initialised. After this unsupervised pretraining of the stacked layers, the entire network was fine-tuned. The optimisation employed the least-squares error as the cost function, which we found to be efficient for training the autoencoders. In the training phase, the model showed some overfitting owing to its high capacity, which enables it to fit the noise in the data instead of the underlying relationship. To prevent overfitting, L2 regularisation was employed:

$$L = \sum_i \left\| x_i - \hat{x}_i \right\|_2^2 + \lambda \left\| W \right\|_2^2$$

Here, $x_i$ denotes the $i$th input feature vector, $\hat{x}_i$ denotes its corresponding output, $W$ denotes the network weights, and $\lambda$ is the parameter that controls the strength of the regularisation.
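A minimal sketch of the greedy layer-wise training of one stage, using the least-squares reconstruction error; the optimiser, epoch count, and batch size shown here are assumptions.

```python
from tensorflow.keras import optimizers

def train_stage(autoencoder, x_source, x_target, epochs=50):
    """Train one stage of the stack: reconstruct the 36-degree-closer view from the source view."""
    # The least-squares reconstruction error is used as the cost function; L2 weight
    # regularisation can be attached per layer via kernel_regularizer when the layers
    # are built. Optimiser, epoch count, and batch size here are assumptions.
    autoencoder.compile(optimizer=optimizers.Adam(1e-3), loss="mse")
    autoencoder.fit(x_source, x_target, batch_size=128, epochs=epochs, verbose=0)

# Greedy layer-wise scheme: each of the three autoencoders is trained independently on its
# own (source view, target view) pairs; the full stack is then fine-tuned end to end.
```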

| Constructing discriminative gait feature vector
In this section, we will discuss three gait features that are fused to construct the proposed gait feature vector.

| Encoded features
In our proposed stacked autoencoder, any input gait view is translated into the side view angle at the topmost hidden layer, and its latent representation is extracted as the view-invariant feature for that gait video. Therefore, we obtain a 30D encoded feature vector, $f_{encoded} \in \mathbb{R}^{30}$, which demonstrates strong robustness to various covariate factors, especially the varying view angle.

| Motion features
To preserve the temporal information of gait patterns, we developed a motion feature descriptor that stores the local motion of the gait by keeping the displacement information between two adjacent frames. This descriptor is inherently robust to view variation and contributes greatly to improving gait recognition performance. Let $f$ and $(f + 1)$ be two adjacent frames of a particular gait sequence. The motion information of any skeleton joint coordinate at frame $(f + 1)$ is the difference between the corresponding coordinates at frames $(f + 1)$ and $f$:

$$\left( \Delta x_i, \Delta y_i, \Delta z_i \right)^{(f+1)} = \left( x_i, y_i, z_i \right)^{(f+1)} - \left( x_i, y_i, z_i \right)^{(f)}$$

Here, $(x_i, y_i, z_i)^{(f)}$ are the 3D coordinates of the $i$th body joint in the $f$th frame of the video, and $(\Delta x_1, \Delta y_1)$ is the displacement information of the first joint coordinate in the $(f + 1)$th frame. As shown in Figure 3c, we selected a total of 10 skeleton joints that are effective in representing the gait pattern. Consequently, we obtain a 20D motion feature descriptor, $f_{motion}$. We did not consider the $\Delta z$ information of each joint coordinate to be an effective feature for $f_{motion}$; it probably represents a direction that does not contribute significantly to the gait representation.
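A minimal sketch of the 20D motion descriptor; the indices of the 10 motion joints are placeholders.

```python
import numpy as np

MOTION_JOINTS = list(range(10))   # assumed indices of the 10 joints used for the motion descriptor

def motion_features(prev_frame: np.ndarray, curr_frame: np.ndarray) -> np.ndarray:
    """Frame-to-frame displacement of the selected joints, keeping only dx and dy (20D)."""
    delta = curr_frame[MOTION_JOINTS] - prev_frame[MOTION_JOINTS]   # shape (10, 3)
    return delta[:, :2].reshape(-1)                                 # drop dz -> 20 values
```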

| Limb length features
Static gait parameters, such as the lengths of the limbs calculated from joint information, are also important for gait recognition. They are inherently view-invariant and robust to covariate factors such as carrying and clothing variations. In this work, as shown in Figure 3d, we took a total of four effective body parts to form a 4D spatial feature descriptor, $f_{limb-length}$.

| Features fusion
In this study, we consider feature-level fusion to construct a 54D gait feature vector by fusing the three gait descriptors: the encoded features from the topmost layer of the proposed stacked autoencoder, the motion features, and the limb length features. The proposed gait feature vector is therefore low-dimensional and view-invariant in nature, and also has a discriminative capacity for representing the gait pattern:

$$f = \left[ f_{encoded}, f_{motion}, f_{limb-length} \right] \in \mathbb{R}^{54}$$
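A minimal sketch of the feature-level fusion into the 54D vector; the joint pairs used for the four limb segments are illustrative assumptions.

```python
import numpy as np

LIMB_PAIRS = [(0, 1), (1, 2), (3, 4), (4, 5)]   # assumed joint pairs for the four limb segments

def limb_length_features(frame: np.ndarray) -> np.ndarray:
    """4D descriptor of limb lengths computed from the joint coordinates of one frame."""
    return np.array([np.linalg.norm(frame[a] - frame[b]) for a, b in LIMB_PAIRS])

def fuse_features(f_encoded: np.ndarray, f_motion: np.ndarray, f_limb: np.ndarray) -> np.ndarray:
    """Concatenate the 30D encoded, 20D motion, and 4D limb-length descriptors into a 54D vector."""
    fused = np.concatenate([f_encoded, f_motion, f_limb])
    assert fused.shape == (54,)
    return fused
```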

| Feature preprocessing
The following preprocessing steps are applied to the proposed feature vector before it is fed into the main RNN network for identification.

| Forming feature map
For each frame, we extract a 54D gait feature vector from the 3D skeleton information. We then split a gait video into a number of timesteps, each of which is a 28-frame segment:

$$V = \left\{ T_1, T_2, \ldots, T_N \right\}, \qquad T_t \in \mathbb{R}^{28 \times 54}$$

Here, $T_t \in \mathbb{R}^{28 \times 54}$ is the feature matrix of each timestep, $N$ is the total number of timesteps, and $V$ is the sequence of feature maps for a gait video.

| Data augmentation
The performance of deep learning-based models relies heavily on the amount of training data; thus, we need a large amount of training data for each class to combat overfitting. In this work, we devised several data augmentation techniques to increase the amount of relevant data in our dataset. First, we split the input video into an overlapping sequence of video clips: each timestep overlaps the previous one by 24 frames, giving an overlapping rate of 24/28, approximately 85.7%.
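A minimal sketch of the overlapping split that produces the 28 × 54 feature maps with a 24-frame overlap (i.e., a stride of 4 frames):

```python
import numpy as np

def to_feature_maps(features: np.ndarray, window: int = 28, stride: int = 4) -> np.ndarray:
    """Split a (num_frames, 54) feature sequence into overlapping (28, 54) feature maps."""
    starts = range(0, len(features) - window + 1, stride)
    maps = np.stack([features[s:s + window] for s in starts])
    return maps   # shape (N, 28, 54); 24 of 28 frames are shared with the previous map (85.7%)
```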

| Recurrent neural network for gait recognition
3.5.1 | Network architecture
Figure 5 illustrates the architecture of our proposed discriminative RNN network, which is built by stacking two BiGRU [34] layers followed by an FC layer. In this research, we tried several other architectures, such as LSTM, GRU [16], BiLSTM [33], and BiGRU [34], for our main RNN network. First, we built all of these architectures using a single layer and searched for an optimal hidden unit size between 40 and 150. We then increased the capacity of the network by adding more hidden layers. Among all of the RNN architectures we tested, we obtained the best results with a two-layer BiGRU architecture with 90 GRU cells per layer. This architecture requires fewer parameters than other recognition networks proposed in the literature while successfully retaining long-term temporal information.

Figure 5. Block diagram of the proposed main recurrent neural network. It is composed of two bidirectional gated recurrent unit [34] layers, each of which consists of a total of 90 gated recurrent unit cells. The network is fed with a 54-dimensional gait feature vector. The input and the output of the recurrent layers are each followed by a batch normalisation layer [35], and the result is then fed into an output softmax layer for gait identification.
After the input and after the second recurrent layer, we placed a batch normalisation [35] layer to standardise the activations and reduce overfitting. Finally, an FC layer with a softmax classifier was employed to predict the subject IDs. We employed the Adam [36] optimisation algorithm for training. We experimented with various learning rates and found that 0.001 was the best initial learning rate. The batch size was set to 128, and the network was trained for 450 epochs.
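A minimal Keras sketch of the recognition network described above; the split of the 90 GRU cells across the two directions, and the use of sparse categorical cross-entropy in place of the full multi-loss objective described in the next subsection, are assumptions for illustration.

```python
from tensorflow.keras import layers, models, optimizers

NUM_SUBJECTS = 124   # CASIA B; in practice this matches the number of training subjects

def build_recognition_network():
    """Two BiGRU layers (90 GRU cells per layer in total), batch normalisation, softmax output."""
    inputs = layers.Input(shape=(28, 54))                    # one timestep: 28 frames x 54D features
    x = layers.BatchNormalization()(inputs)                  # batch normalisation after the input
    x = layers.Bidirectional(layers.GRU(45, return_sequences=True))(x)
    x = layers.Bidirectional(layers.GRU(45))(x)
    x = layers.BatchNormalization()(x)                       # and after the second recurrent layer
    outputs = layers.Dense(NUM_SUBJECTS, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# Reported training settings: batch size 128, 450 epochs.
```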

| Loss function
It has been observed that when extracting gait features, the intraclass variance of one subject is sometimes larger than the interclass distance. Thus, for robust gait recognition, we need to effectively reduce intraclass variations and enlarge the interclass distance. To accomplish this, we adopted a multi-loss strategy while optimising the parameters of our network, combining the softmax loss ($L_s$) with the centre loss ($L_c$) [37]. As training progresses, the softmax loss increases the interclass variation, while the centre loss minimises the distances between the features and their corresponding class centres:

$$L_s = - \sum_{i=1}^{m} \log \frac{ e^{ W_{y_i}^{\top} x_i + b_{y_i} } }{ \sum_{j=1}^{n} e^{ W_j^{\top} x_i + b_j } }, \qquad L_c = \frac{1}{2} \sum_{i=1}^{m} \left\| x_i - c_{y_i} \right\|_2^2$$

The total loss is then $L = L_s + \lambda L_c$. Here, $x_i \in \mathbb{R}^d$ denotes the $i$th skeleton feature, which belongs to the $y_i$th class, and $c_{y_i} \in \mathbb{R}^d$ refers to the centre of the gait features of the $y_i$th class. $W \in \mathbb{R}^{d \times n}$ and $b$ are the weights and bias of the last layer of our proposed network. The parameter $\lambda$ balances the two loss functions; we obtained the best result when $\lambda$ was set to 0.01 in our experiment. Finally, $m$ and $n$ denote the batch size and the total number of classes, respectively.
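A minimal sketch of this joint objective with λ = 0.01, following the standard centre-loss formulation of [37]; tensor shapes and the surrounding training loop, including the centre-update step, are omitted.

```python
import tensorflow as tf

def centre_loss(features, labels, centres):
    """Squared distance between each feature vector and the centre of its class."""
    own_centres = tf.gather(centres, labels)   # (batch, d): centre of each sample's class
    return 0.5 * tf.reduce_sum(tf.square(features - own_centres))

def total_loss(logits, features, labels, centres, lam=0.01):
    """Softmax cross-entropy plus lambda-weighted centre loss (lambda = 0.01 as in the text)."""
    softmax = tf.reduce_sum(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    return softmax + lam * centre_loss(features, labels, centres)
```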

| Postprocessing
The predicted output of the proposed model is a sequence of class probabilities, one for each timestep of a given video. To obtain the subject ID, a majority voting scheme is applied to the output: the subject who receives the highest number of votes over all timesteps is taken as the predicted ID.
Let us consider a gait video split into $N$ timesteps in total. For a particular timestep $t$, the gait video has an input feature map $X^t \in \mathbb{R}^{28 \times 54}$ and an output probability vector $o^t \in \mathbb{R}^n$. Here, $o_i^t = P(s_i \mid X^t)$ is the probability that the input feature map $X^t$ belongs to subject ID $s_i$. Applying the majority voting scheme to the predicted output probabilities $o^t$ at every timestep, the final class $s$ of the gait video among the $n$ classes is

$$s = \underset{s_i}{\arg\max} \sum_{t=1}^{N} \mathbb{1}\!\left[ s_i = \underset{s_k}{\arg\max}\; o_k^t \right]$$
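A minimal sketch of the majority-voting scheme over the per-timestep probability vectors:

```python
import numpy as np

def predict_subject(prob_per_timestep: np.ndarray) -> int:
    """Majority vote over timesteps: each timestep votes for its most probable class."""
    votes = prob_per_timestep.argmax(axis=1)    # shape (N,): one vote per timestep
    return int(np.bincount(votes).argmax())     # class index with the most votes
```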

| EXPERIMENTAL EVALUATION
In this section, we will conduct an extensive evaluation of our proposed method in both single-view and cross-view gait recognition on multiple benchmark datasets.

| Dataset
Few existing gait datasets have a large number of subjects walking in a multi-view camera environment. Some publicly available multi-view gait datasets are the CASIA A and CASIA B gait datasets [29], the OU-ISIR multi-view large population (OU-MVLP) dataset [38], and the USF HumanID dataset [39]:
• USF HumanID gait dataset [39]: The USF dataset has a total of 122 subjects walking outside under five different factors, on two different surfaces, and from two different view angles. However, not every subject was filmed under all conditions.
• OU-MVLP [38]: This is the largest dataset available for gait recognition. It contains a total of 10,307 subjects captured from 14 view angles ranging from 0° to 90° and from 180° to 270°. Unfortunately, we cannot employ OpenPose [14] to estimate body poses for this dataset because it was formatted and released only as a set of silhouette sequences. Nevertheless, using methods such as that of Ding et al. [27], we can still estimate the body key points on this dataset.
• CASIA (CASIA A and CASIA B) database [29]: This is the most widely used multi-view database for gait recognition. The CASIA A dataset contains a total of 20 subjects walking in an outdoor environment, whereas the CASIA B dataset contains a total of 124 subjects walking in an indoor environment. Each subject in CASIA A walks along a straight line captured from three different view angles. For each view angle, a total of four gait sequences are available, covering the forward and backward walking directions.
In the CASIA B dataset, each subject has a total of 10 gait videos captured from 11 camera view angles ranging from 0° to 180° under three covariate conditions: normal walking (nm), walking in a coat (cl), and walking with a bag (bg). Figure 6 illustrates some sample video frames of the CASIA A dataset.

Figure 6. Sample video frames of the CASIA A dataset, in which subjects walk along a straight line at three different view angles.

| Experimental setup
To evaluate the performance of gait recognition on the CASIA A dataset, we employed the leave-one-out cross-validation rule: that is, one sequence was set aside for testing and the remaining sequences were used to train the network. We then trained a classification model for each view angle (0°, 45°, and 90°) and compared our method with other state-of-the-art methods. Because a single model was trained for each view angle, we could not employ our proposed autoencoder here. Instead, the normalised 3D skeleton information was employed to construct the gait feature vector.

| Evaluations of single-view gait recognition
We compared our results with other state-of-the-art methods, including those of Goffredo [40], Wang [41], Liu [18], and Kusakunniran [42]. As illustrated in Figure 7, the proposed method achieved the highest average correct class recognition rate (CCR) of 100% at all three angles and outperformed all of them. Among these methods, Kusakunniran et al. achieved 100% CCR at 0° and 45° but failed to achieve a perfect CCR at 90°; our proposed method achieved 100% at 90° as well. Table 1 summarises the CCRs of all of these methods.

| Experimental setup
Various experimental protocols have been designed in the literature to evaluate performance on the CASIA B dataset. For a fair comparison, we strictly followed these protocols to train and evaluate our model against the respective baseline methods. Here, we categorise these experimental protocols into three experimental setups (A, B, and C), as shown in Table 2.
Experimental setup B was designed according to the protocol suggested by Liao et al. [13], Yu et al. [21], and Yu et al. [25] to evaluate the performance of single-view gait recognition. In this experimental setting, we split the dataset into two groups of 62 subjects each, in which the first group was used to train the network and the second group was used for testing. We employed this setup to investigate the robustness to view variation in single-view gait recognition.
Experimental setups A and C were designed according to the protocols suggested by Kusakunniran et al. [7], Chen et al. [11], and Wu et al. [23] to evaluate the performance of gait recognition in a cross-view setup. In experimental setup A, the training set included the videos of the first 24 subjects; the remaining 100 subjects were used for testing. Setup C used the first 74 subjects for training, and the remaining 50 subjects were set aside for testing. In the test split of all of these experimental settings, the first four normal walking sequences of each subject were put into the gallery set, whereas the remaining sequences (two normal, two carrying a bag, and two wearing a coat) composed the three probe sets ProbeNM, ProbeBG, and ProbeCL, respectively.

| Evaluations on single-view gait recognition without view variation
The complete experimental evaluation of single-view gait recognition on all three probe sets of the CASIA B dataset is presented in Tables 3-5. We achieved the highest average class recognition rates of 99.41%, 87.10%, and 72.14% on the probe sets ProbeNM, ProbeBG, and ProbeCL, respectively. The experimental results demonstrate the effectiveness and robustness of the proposed algorithm to covariate variations.
We compared our experimental results with other skeleton-based methods, such as PTSN [12] and PoseGait [13], and appearance-based methods, such as GaitGANv2 [26] and SPAE [21]. The experimental settings for all of these methods followed experimental setup B (Table 2); that is, the first 62 subjects were used for training. Figure 8 illustrates the results of the comparison, which shows that the CCRs achieved by the proposed method under all three covariate variations outperformed all of the previous methods on the CASIA B dataset, with 99.41%, 87.10%, and 72.14% average CCRs on ProbeNM, ProbeBG, and ProbeCL, respectively. Moreover, our method achieved an average CCR of 86.22%, an improvement of approximately 18.99% over GaitGANv2 [26], which was specifically designed to handle cross-view gait recognition, and we outperformed PTSN [12], which was specifically designed for single-view gait recognition, by 3.1%. The average CCRs of all of these methods are listed in Table 6.

Figure 7. Comparison of the proposed method with other gait recognition methods, showing that our method achieved a correct classification rate of 100% at all three view angles of the CASIA A dataset. The result demonstrates the effectiveness of the proposed method.

Table 1. Comparison of different state-of-the-art gait recognition methods using the correct class recognition rate on the CASIA A dataset without considering view variation.

| Evaluations on single-view gait recognition with view variation
To better illustrate the robustness of our gait recognition algorithm to view variation, the proposed method was compared with state-of-the-art methods including GaitGANv2 [26], PoseGait [13], and SPAE [21], as illustrated in Figure 9. Our proposed method outperformed these methods under two covariate conditions and achieved comparable performance on ProbeNM. For instance, we achieved an average CCR of 33.86% on the probe set ProbeCL, which is 5.88% better than the previous best result achieved by PoseGait [13], and an average CCR of 46.31% on the probe set ProbeBG, which is 0.3% better than the previous best result achieved by GaitGANv2 [26]. On ProbeNM, we achieved a 64.22% average CCR, which is better than PoseGait and SPAE and comparable to GaitGANv2. Table 7 lists the results. Furthermore, Figure 10 illustrates the impact of view variation on different state-of-the-art methods at various probe angles such as 54°, 90°, and 126°. Our method clearly outperformed the other methods at all view angles, especially at a view angle of 90°.

Figure 9. Comparison of algorithms while considering view-angle variations for gait recognition. The comparison shows that the proposed method achieved a 5.88% higher average correct class recognition rate (CCR) than the previous best result, achieved by PoseGait [13], on the probe set ProbeCL, and a 0.3% higher average CCR than the previous best result, achieved by GaitGANv2 [26], on the probe set ProbeBG.

Table 7. Comparison with different best-performing methods using the average correct class recognition rate (%) on all three probe sets of the CASIA B dataset while considering view variation. Note: The proposed method achieved 0.3% and 30.68% higher average correct class recognition rates than the previous best method, GaitGANv2 [26], on ProbeBG and ProbeCL, respectively. It also achieved performance comparable to the others in normal walking.

| Evaluations on cross-view gait recognition
To show the effectiveness of our proposed method in cross-view gait recognition, we compared it with other state-of-the-art methods, including DeepCNN [23], CMCC [7], GEI-SVR [43], PoseGait [13], and ViDP [8]. All of the experiments were conducted under the same experimental settings.
In the first experiment, the performance of our proposed method was evaluated under different view variations. Here, we trained our proposed model according to experimental setting A: only the first 24 subjects were used for training. The probe angles were 0°, 54°, 90°, and 126° under the normal walking condition. Table 8 shows that although the proposed method uses only a single model to handle any view-angle variation, it achieved performance comparable to that of previous methods that were specially designed and trained for cross-view gait recognition. Our method performed better than CMCC and GEI-SVR in most cases and outperformed DeepCNN in 3 of 14 scenarios. The comparison in Table 8 also shows that the proposed method performs better when the difference in view angle between the gallery and the probe is large, especially when it exceeds 36°.

Figure 8. Average correct class recognition (CCR) comparisons of the proposed algorithm with other state-of-the-art methods on the CASIA B dataset without considering view variation. The proposed method achieved the highest CCRs (99.41%, 87.10%, and 72.14%) in the normal, carrying-bag, and clothing conditions, respectively. In particular, our method achieved a 5.91% higher average CCR in the clothing condition and a 1.54% higher average CCR in the carrying-bag condition than the previous best result, achieved by the pose-based temporal-spatial network [12], in single-view gait recognition.

Table 6. Comparison of the proposed method with previous state-of-the-art methods, showing that the proposed method outperformed the others by a significant margin on all three probe sets of the CASIA B dataset without view variation: it achieved a 3.1% higher average correct class recognition rate than the pose-based temporal-spatial network [12].

In the second experiment, we evaluated the performance of cross-view recognition under experimental setting C. Table 9 compares the proposed method with the methods that produced the previous best results. However, in this experiment, the gallery set contains the gait sequences of all view angles (0°-180°) except the one identical to the probe view. The results of the comparison are listed in Table 9 for all three covariate variations. The experimental results show that our method achieved higher average recognition rates than the other methods, except for DeepCNN [23], which achieved the highest average recognition rates in both experiments. This was mainly because their deep models were trained in a verification manner, which is different from ours; we trained our proposed model in a classification manner, and it is much smaller and requires less data to train than DeepCNN [23].

| Experiments on the proposed autoencoder
In this research, one of the most important tasks is to find the best architecture for the proposed stacked autoencoder. To accomplish this, we conducted several experiments to evaluate the performance of different RNN-based architectures. The evaluation was based on the average CCR (%) on the CASIA B dataset while considering view variation. Furthermore, it is also important to find the appropriate dimensionality of the latent space representation through experimental analysis, because it is directly related to the effectiveness of the data compression: if the number of latent dimensions is too small or too large, the autoencoder cannot effectively encode the gait pattern.
In our first experiment, we employed several RNN architectures, namely LSTM, BiLSTM [33], GRU [16], and BiGRU [34], with latent space dimensions ranging from 20 to 60. We then compared the average CCR among them to find the best architecture. Table 10 compares the average CCR among the different types of RNN architectures with different numbers of latent dimensions. In most of the experiments, bidirectional RNNs outperformed unidirectional RNNs. The difference in performance between the BiLSTM and the BiGRU was not large enough to pick one over the other, although the BiLSTM autoencoders performed slightly better than the BiGRU autoencoders, specifically under covariate conditions. The highest CCR was achieved when the size of the latent space was set to 30. Furthermore, in our second experiment, as shown in Table 11, we obtained the minimum drifting error when the latent space size was 30. Therefore, based on these results, we chose the BiLSTM architecture with a 30D latent space to build our proposed autoencoder.

Table 10. Average correct class recognition rate (CCR) on the CASIA B dataset with view variation, recorded by altering the number of dimensions of the encoded feature space and the recurrent neural network architecture in the proposed autoencoder.

| Experiments on different training sets
To evaluate the performance of our proposed method on different training sets of the CASIA B dataset, we conducted several experiments. First, we split the dataset into different divisions, and for each split, the gallery set consisted of one to four normal walking sequences. The results of the experiments are presented in Table 12. The performance of the proposed method degrades severely when the number of gallery sequences is less than three.

| Experiments in occlusion handling
In real-world scenarios, the problem of occlusion can drastically affect the performance of the gait recognition algorithm.
In fact, because of the unavailability of complete gait cycle information, most gait recognition algorithms perform poorly in this scenario. To evaluate the robustness of our proposed algorithm to occlusion events, we performed an experiment on a synthetic CASIA B dataset. Because CASIA B lacks occlusion as a covariate factor, we introduced synthetic occlusion by removing the skeleton information from a randomly chosen percentage of the frames in which the subject appears to be walking. Table 13 presents the experimental results. The average CCR of the proposed method decreased by around 15% when the degree of occlusion was less than 30%; therefore, our method demonstrates strong robustness to occlusion events.

| Computational cost analysis
The computational cost of the proposed method includes the time required to estimate the 2D pose from each frame and to convert it into 3D using the 3D-pose baseline algorithm. The total computational time required to run the proposed method was 141.2 ms/frame, which is 156 ms lower than that of PoseGait [13]. Table 15 compares the computational costs of the relevant methods; the proposed method is the fastest in the given hardware setup. The low computational cost of the proposed method is probably because it performs gait recognition by modelling a low-dimensional 54D gait feature vector using a simple and lightweight RNN network.

| DISCUSSION
The main contribution of this work is learning view-invariant features using a single stacked RNN-based autoencoder that can progressively translate 3D skeleton joint coordinates from any arbitrary view to a more common view: the side view. Most existing state-of-the-art methods cannot efficiently convert multi-view data to one specific view as our proposed stacked autoencoder does. Another advantage is that this autoencoder requires only a single uniform model and no prior estimation of the view angle or other types of covariate variation, whereas many state-of-the-art algorithms require such prior knowledge. Furthermore, in this study, we designed a low-dimensional discriminative gait feature vector based on the 3D coordinates of the human pose, which proved effective in representing the dynamics of the gait pattern. Our preprocessing steps are also effective in handling long-term occlusion events.
As mentioned, the success of the proposed method mainly depends on the performance of the 3D pose estimation algorithm. Most state-of-the-art deep learning-based 3D pose estimation algorithms use RGB video sequences as input, and the performance of 3D pose estimation from a single RGB image depends heavily on the amount of training data. Therefore, the proposed method cannot take input other than RGB and performs poorly on small gait datasets; this is one of its most serious shortcomings. Another drawback is that the proposed stacked autoencoder is specifically designed for the 11 view angles of the CASIA B dataset, so some changes must be made before it can be used for other datasets. Finally, the performance of our proposed algorithm is drastically reduced when the view variation in the cross-view setup increases. Employing a more advanced 3D pose estimation algorithm would help us model the body parts accurately in any view, further improving recognition performance in the cross-view setup.

| CONCLUSION
A stacked autoencoder was proposed for extracting view-invariant gait features from 3D skeleton information. By stacking multiple autoencoders, we can effectively and progressively translate the skeleton joint coordinates from any arbitrary view to the side view along the pose variation manifold using a single uniform model. We then extracted the encoded features from the 30D latent space of the proposed stacked autoencoder and fed them into the main discriminative RNN network for robust gait recognition. The experimental evaluation conducted on two challenging datasets clearly confirms the effectiveness of our proposed approach. Employing a larger multi-view dataset containing RGB videos of thousands of subjects would help us develop a more stable network suitable for practical applications such as real-time surveillance.