Personal‐specific gait recognition based on latent orthogonal feature space

University Natural Science Research Project of Anhui Province, Grant/Award Number: KJ2019A0086; National Key Research and Development Program of China, Grant/Award Number: 2017YFB1302302

Abstract

Exoskeletons have been applied in the field of medical rehabilitation and assistance. However, there are still problems in the interaction between human and exoskeleton, such as time delay, constraints imposed on the human body, and difficulty following movement in real time. A human motion pattern recognition model based on long short-term memory (LSTM) is proposed, which can recognise the state of the human body. Meanwhile, an orthogonalisation method is integrated to perform personal-specific disentangling, which effectively improves the generalisation ability across different groups of people and thereby the exoskeleton's ability to follow the wearer. Compared with other traditional methods, this model shows better performance and stronger generalisation ability, which is of significance in the field of exoskeleton algorithms.


| INTRODUCTION
With the ageing of the population and the occurrence of various traffic accidents, more people suffer from mobility impairments. How to help these people walk or carry out activities better has become an urgent problem to be solved. Fang et al. presented a rotation-invariant dynamical-movement-primitive method for learning interaction skills, which can effectively transfer human behaviour skills to a robot [1]. The exoskeleton, as a wearable robot, is a good solution to this problem. In order to improve the comfort and performance of wearing an exoskeleton, many scholars have studied human-exoskeleton interaction. Establishing the mapping relationship between human motion data and behaviour can guide the control of the exoskeleton and improve the interaction between human and exoskeleton.
Gait pattern sensing is commonly performed via inertial measurement units (IMUs), tactile sensors, vertical ground reaction force [2], surface electromyography, electroencephalography, and so on [3]. The majority of popular gait datasets employ computer vision technology to improve efficiency, such as the Carnegie Mellon University Motion of Body dataset [4], the Human Identification at a Distance dataset of the University of Maryland, the Chinese Academy of Sciences Institute of Automation Gait database [5], and the Osaka University dataset. Long short-term memory (LSTM) networks are widely used in the areas of time series analysis and natural language processing, although the cyclic nature of the human gait has previously prevented the use of these networks [9]. Chalvatzaki et al. presented a novel LSTM and non-wearable sensor based human gait stability prediction framework, which can provide robust predictions of the human stability state [10]. These datasets and methods play an important role in analysing human gait patterns. However, for gait analysis algorithms applied to the exoskeleton, video-based data and methods are not applicable.
To achieve a gait recognition algorithm applicable to the exoskeleton, the data used are the acceleration and gyroscope data of IMUs. These data are essentially time series signals. For inertial data, Fang et al. presented a novel data glove for gesture capture and recognition based on inertial and magnetic measurement units (IMMUs), which are made up of three-axis gyroscopes, three-axis accelerometers and three-axis magnetometers. They used an extreme learning machine (ELM) for gesture recognition and achieved an ideal result [11]. Meanwhile, many methods have been proposed for gait analysis and recognition based on inertial data. In the past, many researchers used traditional manual feature extraction methods to extract and analyse gait features. However, most of those approaches rely heavily on heuristic hand-crafted feature extraction, which dramatically hinders their generalisation performance [12]. Shi et al. used an IMU and the random forest method to recognise five kinds of gait and achieved good results [13].
With the rapid development of deep learning, automatic high-level feature extraction has become possible, achieving promising performance in this area. Fang et al. proposed the gait neural network (GNN) method based on the temporal convolutional neural network (TCN), which can effectively recognise and predict human lower limb movement behaviour, and achieved better results than traditional methods [14]. Andrey et al. proposed using convolutional neural networks (CNNs) for local feature extraction together with simple statistical features that preserve information about the global form of the time series; the results show that the proposed model demonstrates state-of-the-art performance while requiring low computational cost and no manual feature engineering [15]. A deep convolutional neural network (convnet) was also proposed to perform efficient and effective human activity recognition using smartphone sensors by exploiting the inherent characteristics of activities and 1D time-series signals [16].
However, there are still some problems with exoskeletons. For example, owing to the lack of human-machine coupling, the exoskeleton cannot give good feedback to human motion, which leads to a certain delay and restricts human movement to some extent. Moreover, the generalisation ability of current exoskeletons needs to be improved. How to effectively improve the matching degree for various groups of people, instead of targeting only specific wearers, has become a problem.
In practical applications, it is difficult to collect the gait features of all groups of people to train the model. Improving the generalisation ability of the model for lower limb motion recognition with a small number of samples is therefore of great significance.
We propose a model based on an LSTM with a CNN-based person-specific feature disentangling module, which can reduce the influence of human specificity on the model, so as to improve the generalisation ability and better realise action recognition for different groups. Compared with other methods, we find that the generalisation performance of the model is significantly improved.

| RELATED WORK
Human gait prediction from IMU, sole pressure, myoelectric and EEG signals has also been studied. The HuGaDB dataset of the University Higher School of Economics contains detailed kinematic data for analysing human gait and activity recognition. It is the first to provide human gait data from inertial sensors with segmented annotations for studying the transitions between different activities. The data were obtained from 18 participants, with about 10 h of recordings in total [17]. The Robotics Laboratory at Seoul National University collected data on six gaits by directly driving the exoskeleton on the body: walking on the flat, walking up a ramp, walking down a ramp, walking upstairs, walking downstairs, and squatting.
The gait data obtained by inertial sensors are in essence a kind of time series data, so the LSTM model can achieve good results. Other researchers' work also supports this view. Romero et al. reported the application of LSTM to the modelling of gait synchronisation of the legs using a basic configuration of off-the-shelf IMUs. The model can be transferred to robotised prostheses and assistive robotic devices in order to achieve quick stabilisation and robust transfer of control algorithms to new users [18].
Recently, in the study of human facial action unit detection, Niu et al. proposed a method based on local relationship learning with person-specific shape regularisation [19], which can effectively extract the pattern features of the same expression, reduce the influence of the specific person, and improve the generalisation ability of the model. Niu et al. also proposed a cross-verified feature disentangling strategy to disentangle physiological features from non-physiological representations, which can improve the robustness of physiological measurement [20].

| Motion capture system
We collect the inertial data of human motion, including acceleration and gyroscope data, by IMU. The whole motion capture system has six inertial sensors located on the lower limbs. The locations of these sensors are shown in Figure 1.
The static accuracy of the system is ±1° for the pitch and roll angles and ±2° for the heading angle. The maximum measurement range of the system is ±2000 DPS for angular velocity and ±16 g for acceleration. All sensors are fixed in the designated places by special bandages, and the data are transmitted through a hub.
In this study, we collected data on five gait patterns, namely walking, walking up and down stairs, and walking up and down a slope, for nine participants. The participants were all students, both male and female.
As shown in Figure 2, we selected the y-axis acceleration data of the left thigh during the walking pattern for four people who participated in the data collection. Although there are some slight differences, the overall gait presents a kind of similarity. If we can extract the useful features, we can achieve better recognition performance.

| Data pre-processing
The data we collected contain acceleration and quaternion values; the value range differs between channels, so we need to normalise them. The normalisation formula is as follows:

$$x_i' = \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)}$$

where $x_i$ is the $i$-th channel of gait data (acceleration or quaternion). After normalising the data, the model can converge better.
Considering the possible similarity of gait data at a single time point under different motion modes, we do not take the data of a single time step as the input of the model, but select a period of data as the input. The sequence length of the gait data is set to 10.
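The pre-processing described above can be sketched as follows. The exact normalisation is not fully specified in the text, so per-channel min-max scaling is assumed here, and the function names are illustrative.

```python
import numpy as np

def normalise_channels(x):
    """Min-max normalise each sensor channel independently.

    x: array of shape (timesteps, channels). Min-max scaling to [0, 1]
    is an assumption; a small epsilon guards against constant channels.
    """
    mins = x.min(axis=0, keepdims=True)
    maxs = x.max(axis=0, keepdims=True)
    return (x - mins) / (maxs - mins + 1e-8)

def make_windows(x, seq_len=10):
    """Slice a (timesteps, channels) stream into overlapping windows of
    length seq_len with stride 1, giving (n_windows, seq_len, channels)."""
    return np.stack([x[i:i + seq_len] for i in range(len(x) - seq_len + 1)])
```

For a stream of 100 time steps, `make_windows(normalise_channels(stream))` yields 91 windows of shape (10, channels), each of which is one model input.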

| Whole model
In this paper, we use an LSTM as the stem module to extract features related to lower limb pattern recognition. At the same time, we use a parallel CNN module to extract features related to personal information. The parallel module mainly includes a convolutional layer, an average pooling layer and a fully connected layer to classify personal information.
The whole model structure is shown in Figure 3. Through this structure, we aim for the features extracted by the CNN module to be more related to personal information than to gait, and for the features extracted by the LSTM module to be more related to the gait action than to personal information. To achieve this goal, we use the idea of cosine similarity: we calculate the inner product of the state vector of the last hidden layer of the LSTM and the vector of the last feature extraction layer of the CNN, and use the result as the separation loss. By strengthening the ability of the two modules to extract different information, the stem module gains stronger feature extraction ability for general gait information, so as to improve the generalisation ability of gait recognition.
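The separation term could be sketched as below. Squaring the cosine similarity is an assumption on top of the inner-product idea described above: it makes the penalty sign-independent, so minimising it pushes the two feature vectors towards orthogonality.

```python
import torch
import torch.nn.functional as F

def separation_loss(lstm_hidden, cnn_feature):
    """Cosine-similarity-based separation loss between the final LSTM
    hidden state and the CNN feature vector, both of shape (batch, dim).

    The loss is minimal (zero) when the two feature vectors are
    orthogonal, encouraging the two branches to encode different
    information.
    """
    cos = F.cosine_similarity(lstm_hidden, cnn_feature, dim=1)
    return (cos ** 2).mean()
```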

| LSTM (baseline)
The gait data acquired by the motion capture system are time series data in essence, so an LSTM model is used for further processing and feature extraction. LSTM is widely used in speech recognition and sequence data processing. Compared with the traditional recurrent neural network (RNN), the LSTM can remember information from earlier in the time series signal through its gating mechanism.
As shown in Figure 4, the structure of LSTM (baseline) is provided.
The LSTM is computed as follows:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $i_t$, $f_t$, $o_t$ are the input gate, forget gate and output gate activations, respectively. Through the LSTM-based stem model, we obtain the information of the gait data over the time series. The final hidden state is output as the general gait feature, and a final fully connected layer completes the classification of the action. In this way, the final hidden state vectors have a strong correlation with the general gait information.
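A minimal PyTorch sketch of the LSTM stem follows. The layer sizes (42 input channels, 64 hidden units) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LSTMStem(nn.Module):
    """Baseline LSTM stem: the final hidden state serves as the general
    gait feature and a fully connected layer classifies the gait pattern
    into the five movement classes."""
    def __init__(self, n_channels=42, hidden=64, n_classes=5, dropout=0.3):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, batch_first=True)
        self.drop = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):            # x: (batch, seq_len=10, channels)
        _, (h_n, _) = self.lstm(x)   # h_n: (1, batch, hidden)
        feat = h_n[-1]               # general gait feature vector
        return self.fc(self.drop(feat)), feat
```

Returning the feature vector alongside the logits lets the training loop compute the separation loss against the CNN branch.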

| Parallel module
As a deep learning model, a CNN can effectively extract the characteristics of data. The data we input into the CNN model are in the form (sequence length × sensor channels). We use the feature extraction capability of the CNN to extract the features containing the personal information of this matrix and obtain the corresponding feature vector. Finally, the feature vector is fed to the fully connected layer to output the classification result for personal information (ID). In this way, the extracted feature vectors have a strong correlation with personal information (ID).
As shown in Figure 5, the structure of the parallel CNN module is provided. We use a convolution layer and a pooling layer to extract the relevant features. In addition, considering the noise in gait data, we use an average pooling layer instead of a maximum pooling layer, which can reduce the abnormal results caused by noise.
Note (Table 1): The best and second-best results are indicated using brackets and bold, and brackets alone, respectively. Abbreviations: BP, back propagation; LSTM, long short-term memory; SVM, support vector machine.

The calculation formula for the convolution layer is:

$$x_j^n = f\left(\sum_{i \in M_j} x_i^{n-1} * k_{ij}^n + b_j^n\right)$$

where $x_j^n$ is the output feature map, $x_i^{n-1}$ is the input feature map, $M_j$ is the selected area in layer $n-1$, $k_{ij}^n$ is the weight parameter, $b_j^n$ is the bias, and $f$ is the activation function. The calculation formula for the fully connected layer is:

$$y_j = f\left(\sum_{i=1}^{N} w_{ij} x_i + b_j\right)$$

where $x$ is the input layer, $N$ is the number of input layer nodes, $w_{ij}$ is the weight between $x_i$ and $y_j$, $b_j$ is the bias, and $f$ is the activation function. Finally, the fully connected layer outputs the personal information (ID) of the different data providers, so that the features extracted by the CNN have a strong correlation with personal information, and we can take the inner product with the feature vector from the LSTM to achieve feature disentangling.
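A minimal sketch of the parallel CNN branch, with assumed kernel and channel sizes; average pooling is used instead of max pooling, as described above, to suppress noise.

```python
import torch
import torch.nn as nn

class PersonCNN(nn.Module):
    """Parallel CNN branch: one convolution layer, average pooling, and a
    fully connected layer that classifies the wearer's ID. The kernel
    size, feature dimension, and default channel count are assumptions."""
    def __init__(self, n_channels=42, n_persons=9, feat_dim=64):
        super().__init__()
        self.conv = nn.Conv1d(n_channels, feat_dim, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool1d(1)   # average over the time axis
        self.fc = nn.Linear(feat_dim, n_persons)

    def forward(self, x):                 # x: (batch, seq_len, channels)
        h = torch.relu(self.conv(x.transpose(1, 2)))
        feat = self.pool(h).squeeze(-1)   # person-specific feature vector
        return self.fc(feat), feat
```

As with the LSTM stem, the feature vector is returned alongside the ID logits so the separation loss can be computed between the two branches.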

| Loss function
In the training process of the model, we obtain the feature state vectors from the intermediate outputs of the model, and after taking the inner product of the two, we obtain the separation loss as one part of the total loss. Through this loss function, we can separate the general gait characteristics from the individual characteristics as far as possible, so as to improve the generalisation ability of the model.
In addition, we use the cross-entropy loss as the recognition loss function.
The total loss is

$$L = L_{cls} + \alpha L_{sep} + \beta L_{id}$$

where $\alpha$ and $\beta$ are the balance hyper-parameters, $L_{cls}$ is the loss of gait pattern recognition, $L_{sep}$ is the inner product of the CNN and LSTM feature vectors, and $L_{id}$ is the loss of person recognition.
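Assuming the weighting L = L_cls + α·L_sep + β·L_id, and squared cosine similarity as the sign-independent form of the inner-product penalty, the total loss could be sketched as:

```python
import torch
import torch.nn.functional as F

def total_loss(gait_logits, gait_labels, id_logits, id_labels,
               lstm_feat, cnn_feat, alpha=1.0, beta=1.0):
    """Combine the three terms: cross-entropy gait classification loss,
    cosine-similarity separation loss, and cross-entropy ID loss."""
    l_cls = F.cross_entropy(gait_logits, gait_labels)
    l_id = F.cross_entropy(id_logits, id_labels)
    cos = F.cosine_similarity(lstm_feat, cnn_feat, dim=1)
    l_sep = (cos ** 2).mean()
    return l_cls + alpha * l_sep + beta * l_id
```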
In the process of model iteration, the total loss continues to decrease, which means that $L_{sep}$ is also decreasing. This indicates that the similarity between the feature vectors extracted by the two modules is decreasing; that is, the feature vectors extracted by the LSTM have a stronger correlation with general gait features, while the feature vectors extracted by the CNN module are more relevant to personal information.
In the validation of generalisation ability, only the recognition performance on the gait pattern is considered, and the classification loss of personal information, $L_{id}$, can be regarded as a regularisation term on the weights of the LSTM model.

| Performance measures
Although our sample size is approximately balanced, for a more comprehensive evaluation of the model it is not enough to use the accuracy rate alone. Therefore, we use a number of evaluation indicators to demonstrate the superiority of our model. The performance measures are as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where true positives (TP) are positive samples correctly recognised by the model, false positives (FP) are negative samples recognised as positive by the model, true negatives (TN) are negative samples correctly recognised by the model, and false negatives (FN) are positive samples recognised as negative by the model.
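These measures can be computed directly from the four counts defined above:

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard per-class metrics from TP/FP/TN/FN counts.

    Divisions guard against empty denominators (e.g. a class the model
    never predicts) by returning 0.0.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```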

| EXPERIMENTS AND RESULTS
In order to verify the generalisation ability of the model, we carried out tests on small-sample and larger-sample datasets, respectively.

| Experimental approach
In this study, the performance of our model was evaluated on a human gait dataset obtained using an inertial-based wearable motion capture device. The model was trained with the Adam optimizer at a learning rate of 0.001, divided by 10 at epoch 50. The maximum number of epochs and the batch size were 100 and 64, respectively. The dropout rate of all dropout layers was set to 0.3. The model was implemented in PyTorch and trained and tested on a computer with an Intel Core i7-8750H, two 8 GB memory chips (DDR4), and a GeForce GTX 1060 6G GPU.
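The optimiser and schedule described above could be configured as follows; the model here is a placeholder standing in for the full gait network.

```python
import torch
import torch.nn as nn

# Adam at lr 0.001, divided by 10 at epoch 50, for 100 epochs total.
model = nn.Linear(10, 5)  # placeholder for the full gait model
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[50], gamma=0.1)

for epoch in range(100):
    # ... one pass over the training set (batch size 64) would go here ...
    optimizer.step()
    scheduler.step()
```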

| Experiment on small-sample
In order to better evaluate the performance of the model, we selected the gait data of four collectors for validation. The gait data of three data providers are used as the training set, and the remaining data, which did not participate in training, are used as the test set to verify the generalisation ability of the model. To demonstrate the effectiveness of the method, we use several other methods to compare the classification performance on the same dataset, including a CNN, a support vector machine (SVM), back propagation (BP) and an LSTM. In addition, the LSTM model structure is the same as the stem module in our personal-specific disentangling model, and in the CNN model, the gait data are transformed into a matrix whose length is the sequence length and whose width is the number of sensor channels as input.
We collected 2574 gait samples that did not participate in the training, covering the five types of movements, and on these data we verified the performance of all the models. As shown in Table 1, we provide the verification performance of the different models on the five types of actions. We can find that our model shows better performance than the other models and achieves better results in gait recognition for different groups of people.
As shown in Figure 6, we provide the confusion matrix of our model, from which we can find that our model can better perform gait pattern recognition. At the same time, we found that walking up and down stairs and walking uphill are easily confused.

| Experiment on larger-sample
Considering the characteristics of the data we collected and the gait data of the whole population in practical applications, we conducted further experiments. We used the data of the five gait movements of five people as the training set, and the data of the five movements of the other four people as the test set; the size of the test set is 47,485.
Similarly, we compared the performance with the other models, and found that when the amount of training data increased, the performance of all methods improved, but our model still has some advantages. With the increase in the number of samples, more gait pattern features are included, so an improvement in the effect of each model is inevitable; however, when the number of samples is small, our method remains very meaningful. It should be added that the parameters α and β may need to be adjusted to obtain the best effect for different data samples.
As shown in Table 2, we provide the comparison results for the larger sample. We can find that when the number of samples increases, the recognition accuracy for the actions that were misclassified in the small-sample case also improves.

Note: The best and second are indicated using brackets and bold, and brackets alone, respectively.

ZHOU ET AL.
As shown in Figure 7, we also provide the confusion matrix for the larger sample based on our model. From the two parts of the experiments, we can see that our model achieves the best performance.

| CONCLUSION
We proposed a human gait recognition model based on an LSTM with a personal-specific disentangling method, which can improve the generalisation ability of the model. The experiments were carried out on small-sample and larger-sample datasets. In the small-sample experiment, we can find that our model better separates the features that are more related to the gait pattern, so as to improve the generalisation ability of the model across different groups of people. In the larger-sample experiment, when the sample size is large enough, the features extracted by the model may be sufficient to cover the gaits of all kinds of people, and the effect may not be so significant. Even so, considering the complexity of data collection in a real environment, our model is still meaningful. From the results of these two experiments, we can find that our model can effectively extract the general gait features of different populations. Thus, when new untrained data are input, the model can recognise the corresponding actions more accurately. At the same time, through the confusion matrix, we can find that the classification results of all models are not very good for the up- and downstairs gaits, which may be caused by the similarity between the upstairs and downstairs gaits. How to correctly classify gaits with such similarity remains a problem to be solved in the future.
In future work, we can consider more, and more complex, action categories for the recognition task. At the same time, how to effectively recognise actions with a certain similarity is also a very meaningful problem.