Densely connected GCN model for motion prediction

Human motion prediction is a fundamental problem in understanding natural human movement. The task is very challenging due to complex human body constraints and the diversity of action types. Because the human body is a natural graph, graph convolutional network (GCN)-based models outperform traditional recurrent neural network (RNN)-based models at modeling the spatial and temporal dependencies in motion data. In this paper, we develop GCN-based models further by adding densely connected links to improve feature utilization and address the oversmoothing problem. More specifically, a GCN block is used to learn the spatial relationships between nodes, and the feature map of each GCN block propagates directly to every following block as input rather than through residual links. In this way, the spatial dependency of human motion data is exploited more fully and features at different scales are fused more efficiently. Extensive experiments demonstrate that our model achieves state-of-the-art results on the CMU dataset.


INTRODUCTION
Forecasting the future movements of human actions is a crucial topic in computer vision and computer graphics because of its various practical applications, such as surveillance,1 pedestrian tracking,[2][3][4] interactive robotics,[5][6][7] and autonomous driving systems.8 Human motion data are usually captured by a Mocap system and represented in the three-dimensional (3D) skeleton format. In this paper, we address the problem of generating human action movements in the 3D skeleton format. Many researchers have proposed approaches for motion prediction.[10][11][12][13][14][15][16] However, different from machine translation, motion data carries special human body constraints and is spatial-temporal rather than purely temporal. LSTMs are designed mainly for temporal data and are insufficient for capturing the dependency between joints of the human body. Several existing works17,18 consider that the special hierarchical structure of the human body can enhance the ability to capture spatial information.
Recently, GCN-based models have been introduced for skeleton data.[19][20][21] The reason is that the human body has a natural graph structure: the joints are the nodes of the graph, and the connectivity is defined by the limbs. The GCN-based work21 significantly outperforms LSTM- and CNN-based models for motion prediction. However, the existing GCN-based models are limited in feature utilization and, as an emerging topic, remain under-explored. When the GCN layers go deeper, the gradient is prone to vanish. Moreover, the features extracted from earlier layers contain graph information at different scales. A GCN layer usually has a receptive field of size 1, that is, it operates only on the 1-nearest nodes, and the impact of the 1-nearest features diminishes as the layers go deeper. Yet the nearest joints on the human body are vitally important for predicting movement.
To address these limitations in a simple but effective way, we propose an advanced GCN-based framework for motion prediction that connects all the GCN blocks directly: the output feature of each GCN block skips the intermediate layers and jumps to all later blocks. In detail, our model formulates the graph as the 3D skeleton of the human body. The trajectory of each joint is encoded and fed as node input features. Each GCN block consists of two GCN layers and LeakyReLU layers; the first layer preserves the feature map size. Similar to Reference 22, we concatenate the features at different scales. Therefore, if our model has N GCN blocks and the first GCN block has an output feature of size C, then the input feature for the last GCN block has size C × (N − 1), and N(N + 1)/2 links between blocks are built in our model.
Compared to the aforementioned GCN model,21 our model requires almost the same number of parameters, but it significantly improves feature-map utilization and increases the impact of earlier layers' feature maps. Moreover, this densely connected structure makes the model easier to train and able to go deeper. Another important factor decreasing motion prediction performance is that models are prone to overfit because motion datasets are small; our dense structure has a regularizing effect, and the LeakyReLU layers reduce overfitting during training.
In conclusion, our contributions in this paper are: we propose a new densely connected GCN-based model for motion prediction, which reuses the multi-scale feature maps from every block to enlarge the receptive field and reduce overfitting; and we conduct extensive comparison experiments on the standard motion prediction benchmarks, Human3.6M and the CMU motion capture dataset. The model is evaluated in both angle space and 3D position space, and it surpasses the state-of-the-art performance on the CMU dataset.

Motion prediction
The mainstream deep learning models for motion prediction can be categorized as RNN-based, fully convolutional network (FCN)-based, and CNN-based. In early research, Fragkiadaki et al.9 proposed a recurrent Encoder-Recurrent-Decoder model to address this problem. Following this trend, researchers put effort into designing various recurrent models for the task. Jain et al.23 investigated the spatial-temporal structure of motion data and introduced the Structural-RNN model, which forms a spatial-temporal graph and trains an RNN on each node. Furthermore, inspired by the success of the Seq2Seq model in machine translation, Martinez et al.12 proposed a sequence-to-sequence model for motion prediction. They surprisingly found that a simple zero-velocity baseline outperforms all existing complex models. All of the aforementioned strategies face discontinuity and mean-pose problems. To address these limitations, Gui et al.11 built an adversarial model to distinguish synthesized sequences from real ones, lifting prediction performance. Aside from these RNN-based models, researchers have also attempted to solve the problem from other angles. Instead of training on motion in angle space, Butepage et al.17 proposed an FCN-based model with three types of bottlenecks that learns movements directly in the format of 3D positions. However, FCN-based models easily overfit and are insufficient to capture the relationships between different body parts. To overcome this, Li et al.16 designed a convolutional sequence-to-sequence model that uses long-term and short-term encoders to extract deep features.

Graph neural network
Recently, research concerning graph neural networks has become increasingly active because of their superior ability to handle irregularly shaped data, and convolutional operations on graphs have been investigated as well. The human body can be regarded as a natural graph, so recent approaches have achieved distinctive success by introducing GCNs for skeleton-related problems. A spatial-temporal graph19 was introduced to address motion classification and gained a considerable improvement. Furthermore, Li et al.20 considered more relationship edges between human joints and enhanced the classification accuracy. In contrast to the undirected graphs commonly used for skeletons, Shi et al.24 introduced a directed graph of the human body to exploit the dependencies between bones and joints. Zhao et al.25 designed a SemGCN operation for 3D human pose regression. For motion prediction, Mao et al.21 first proposed a GCN-based approach, which predicts the future sequences in trajectory space instead of pose space and outperformed all existing RNN-, FCN-, and CNN-based approaches. To investigate the GCN-based model for motion prediction further, we construct a densely connected GCN-based model in this paper. Compared to the work of Reference 21, our model learns feature maps more effectively and is less likely to overfit or suffer gradient vanishing.

METHODOLOGY
In this section, we present the details of our methodology for the motion prediction problem. First, the mathematical formulation of the problem is described. Then the definition of our model's graph network is depicted. After that, we show the structure of our network and how it is densely connected. The whole pipeline is shown in Figure 1.

Problem formulation
In this section, we provide the mathematical formulation of the motion prediction problem. Assume the observed sequence of 3D skeletons is X_{1:T} = {x_1, x_2, …, x_T}, where t is the time step ranging from 1 to T and x_t is the pose at time t. Every pose x_t contains K joints, and each joint can be represented as a 3D position (x, y, z) or as orientations (Euler angles); the data usually has the global rotation and translation removed. Therefore, we denote x_t ∈ R^{3×K}. The task aims to predict the future sequence X_{T+1:T′}, where T′ is the final frame. The ground-truth future sequence is denoted GX_{T+1:T′} and the synthesized sequence from the model is denoted SX_{T+1:T′}. The objective of the problem is to minimize the error ||GX_{T+1:T′} − SX_{T+1:T′}|| while also making SX_{T+1:T′} look as plausible as real human actions. Most traditional methods predict the future sequence by generating poses recursively. However, this type of method suffers from large error accumulation: if the generated pose x_{t′} has an error, the next pose x_{t′+1} is generated from x_{t′} and accumulates that error. Therefore, we follow the recent approach,21 predict the future sequence in trajectory space, and generate the whole trajectory of each joint at once. Moreover, a padding strategy is employed to predict the residual of the sequence rather than its absolute value, because zero-velocity12 is proved to have better performance. Specifically, we obtain a padding sequence P_{T+1:T′} by repeating the last pose x_T of the sequence (T′ − T) times. The input sequence for our model, denoted Input_{1:T′}, is the concatenation of X_{1:T} and P_{T+1:T′}. The target sequence, denoted Target_{1:T′}, is the concatenation of X_{1:T} and GX_{T+1:T′}. Our model therefore takes Input_{1:T′} and produces a synthesized sequence Output_{1:T′}, and the objective function of our model measures the error between Output_{1:T′} and Target_{1:T′}.
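The padding step described above can be sketched in a few lines. This is a minimal numpy illustration (the function name and array layout are our own, not the paper's code): the last observed pose x_T is repeated (T′ − T) times and appended to the observation.

```python
import numpy as np

def build_padded_input(X, T_prime):
    """Pad an observed motion sequence to length T' by repeating the
    last pose (the zero-velocity padding strategy).

    X: observed poses, shape (T, 3, K) -- T frames of K joints in 3D.
    Returns shape (T', 3, K): the first T frames are the observations,
    the remaining T' - T frames all copy x_T.
    """
    T = X.shape[0]
    padding = np.repeat(X[-1:], T_prime - T, axis=0)  # (T'-T, 3, K)
    return np.concatenate([X, padding], axis=0)       # (T', 3, K)
```

The target sequence would then be the concatenation of the same observation with the ground-truth future frames, so input and target share the first T frames by construction.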

Graph neural networks
In our approach, we propose a GCN-based model to predict future movements in trajectory space. Following the notation above, a human pose contains K joints, and every joint is represented as a 3D position or Euler rotation angles.

Graph formulation
The human body has a natural graph structure. Graph neural network-based methods have achieved remarkable success on many human pose-related tasks in recent years because of their ability to exploit the implicit dependencies between joints. Therefore, we form our graph model intuitively. Recalling that the human pose has K joints, the graph is defined as G = (V, E). Here the node set V contains the K joints {J_1, …, J_K}, and the edge set E contains the graph edges, which correspond to the limbs of the human skeleton. The adjacency matrix of our graph G is denoted A; its element a_ij on the ith row and jth column has value 1 if and only if V_i and V_j are connected on the graph or i = j. In our experiments, the model learns the graph connectivity; in other words, the edge set E is obtained from training rather than predefined. Moreover, we treat every joint as three nodes on the graph, since each joint has a 3D position (x, y, z). So, in practice, our node set V is {J_1x, J_1y, J_1z, …, J_Kx, J_Ky, J_Kz}.
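As a concrete illustration of the adjacency definition above, the following numpy sketch builds A from a list of limbs with self-loops on the diagonal (the function and the toy skeleton are illustrative only; in the actual model the connectivity is learned rather than fixed):

```python
import numpy as np

def build_adjacency(num_joints, limbs):
    """Skeleton adjacency matrix A with self-loops.

    limbs: list of (i, j) joint-index pairs connected by a bone.
    a_ij = 1 iff joints i and j share a limb, or i == j.
    """
    A = np.eye(num_joints)
    for i, j in limbs:
        A[i, j] = A[j, i] = 1.0
    return A
```

Such a predefined A would serve only as a prior; the trainable graph matrix used later is initialized and then updated by gradient descent.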
In our model, the input feature F_ix for node J_ix is obtained from the trajectory of J_ix over the time period 1 to T′, which is a one-dimensional continuous function. Following Reference 21, we transform this trajectory into a series of Discrete Cosine Transform (DCT) coefficients, which are compact and benefit training. The DCT uses cosine trajectories as bases to represent the original trajectory: any continuous trajectory can be uniquely represented as a linear combination of these bases, so every trajectory can be represented by its DCT coefficients. In other words, the input feature F_ix of our graph model is a vector of DCT coefficients. The graph model outputs a feature F′_ix, which is also a vector of DCT coefficients; F′_ix is then transformed back into a trajectory by a linear combination of the bases. Therefore, the graph model takes in the trajectory information of every joint and produces an output trajectory for every joint. As evident in the literature and our experiments, predicting residual data rather than absolute data greatly reduces the gradient vanishing and explosion problems and therefore achieves better performance, so we use the coefficients F_ix + F′_ix as the final result and reconstruct the output trajectory from them. Consequently, we not only predict the future movements from time T + 1 to T′ but also reconstruct the movements from time 1 to T; the generated sub-sequence from 1 to T serves as an identity regularizer to guide training. In the next section, we describe the layers used to obtain the output features F′ from the input features F.
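The DCT encoding and its inverse can be sketched with an explicit orthonormal DCT-II basis matrix (a numpy illustration under our own naming; the paper does not specify which DCT variant is used, so the orthonormal DCT-II is an assumption here). Because the basis is orthogonal, the inverse transform is simply the transpose:

```python
import numpy as np

def dct_basis(T):
    """Orthonormal DCT-II basis matrix B of shape (T, T).

    Row k is the kth cosine basis: B @ x gives the DCT coefficients of
    a length-T trajectory x, and B.T @ c reconstructs the trajectory
    from coefficients c (B is orthogonal, so B^{-1} = B^T).
    """
    n = np.arange(T)
    B = np.cos(np.pi / T * (n[None, :] + 0.5) * n[:, None])
    B[0] *= 1.0 / np.sqrt(2)
    return B * np.sqrt(2.0 / T)

# Encode one joint's trajectory, then reconstruct it exactly.
traj = np.sin(np.linspace(0, 3, 10))
B = dct_basis(10)
coeffs = B @ traj     # DCT coefficients fed to the graph model
recon = B.T @ coeffs  # inverse transform back to the trajectory
```

In practice the trajectory may be represented by only the leading coefficients, which is what makes the representation compact.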

Graph convolutional layers
Various kinds of convolutional layers have been introduced for graph data. Here, we adopt the graph convolutional layer26 designed from the spectral perspective.
The graph convolutional layer is based on the idea that the convolution of two signals x and g is the element-wise product of their transforms in the Fourier domain: g ∗ x = F^{−1}(F(g) ⊙ F(x)), (1) where x is the input signal on the graph, g is the convolutional kernel signal, and F is the Fourier transform that projects a graph signal into its Fourier domain. From existing results, the graph Fourier transform F(x) can be written as U^T x, where U is obtained from the Laplacian matrix of the graph: the columns of U are the eigenvectors of the Laplacian L = I − D^{−1/2} A D^{−1/2}, which can be factorized as L = U Λ U^T.
Therefore, Equation (1) can be rewritten as g ∗ x = U ĝ(Λ) U^T x, where ĝ is the convolutional filter operating on the eigenvalues Λ. Chebyshev polynomials are introduced to approximate ĝ, where T_k denotes the kth-order Chebyshev polynomial; considering only the first order, the filter reduces to a single learnable transform applied to a fixed graph matrix. Finally, assume the input feature of the lth GCN layer is F_l ∈ R^{K×C} (C is the number of input channels) and the output feature is F_{l+1} ∈ R^{K×C′}. Denoting the trainable weights as W ∈ R^{C×C′} and the matrix related to the graph Laplacian as Z ∈ R^{K×K}, the convolutional operation can be derived as F_{l+1} = σ(Z F_l W), where σ is an activation function. We use LeakyReLU as σ rather than the Tanh used in previous work.21 In our experiments, Z is set trainable to improve performance, because it reduces overfitting.
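The resulting layer operation F_{l+1} = σ(Z F_l W) is just two matrix products followed by a nonlinearity. A minimal numpy sketch (illustrative only; the actual model trains Z and W with backpropagation in PyTorch):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """LeakyReLU activation: identity for x > 0, small slope otherwise."""
    return np.where(x > 0, x, slope * x)

def gcn_layer(F, Z, W):
    """One spectral graph-convolutional layer: sigma(Z @ F @ W).

    F: node features, shape (K, C).
    Z: graph matrix, shape (K, K) -- related to the normalized
       Laplacian, but kept trainable in this work.
    W: trainable weights, shape (C, C').
    Returns features of shape (K, C').
    """
    return leaky_relu(Z @ F @ W)
```

Note that Z mixes information across nodes (joints) while W mixes information across channels, which is why a stack of such layers enlarges the receptive field over the graph one hop at a time.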

The densely connected network structure
Having explained our problem formulation and layer operations, we now describe the network architecture, which is our main contribution. The input feature of the model is F_0; it passes through N GCN blocks to generate the final output feature F_N. Residual GCN blocks. Each GCN block contains two GCN layers, and each GCN layer is followed by a BatchNorm layer, a LeakyReLU layer, and a Dropout layer. Inside every GCN block we also estimate the residual part of the features, so passing through the lth GCN block can be formulated as F_{l+1} = F_l + B_l(F_l), where B_l denotes the layer operations of the lth block. Dense connection. In the previous work,21 the output of each GCN block is fed directly into the next block. We believe that approach does not exploit the feature maps of each layer sufficiently. For example, the first layer offers feature maps obtained by convolving over the 1-nearest nodes; the next feature map then has a receptive field of size 2, because every node feature contains 2-nearest information. The key idea is to produce a more informative input feature by fusing multiscale feature maps with different receptive field sizes. Instead of feeding only the previous block's output into the lth block, we feed all the feature maps F_0, F_1, …, F_{l−1} into the lth block, enlarging the layers' ability to exploit the hidden dependencies between joints at different levels.
Therefore, we reconstruct the network structure by adding dense links to increase its capacity. First, our GCN blocks no longer have residual links, because the input feature size no longer matches the output feature size.
Assume the input feature map F_0 of the model has C channels; then the output feature map of every GCN block has C channels as well. The input feature map of the lth GCN block, however, is the channel-wise concatenation of the output features of all the previous GCN blocks, so its input size is l × C. Consequently, the lth GCN block can be described as F_l = B_l([F_0, F_1, …, F_{l−1}]), where [·] denotes channel-wise concatenation and B_l is the lth block. Inspired by Huang et al.,22 this is, to our knowledge, the first time a densely connected GCN network has been proposed for the motion prediction task. In this way, the feature maps of each layer contribute more significantly to the final result, since gradients from the final layer backpropagate directly to all the other GCN blocks; the network is therefore also less prone to gradient vanishing and explosion.
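The dense forward pass described above can be sketched as follows: each block receives the channel-wise concatenation of F_0 and every earlier block's output. This is a minimal numpy illustration with the blocks abstracted as callables (names and shapes are our own, not the paper's code):

```python
import numpy as np

def dense_gcn_forward(F0, blocks):
    """Densely connected forward pass (sketch).

    F0: input feature map, shape (K, C).
    blocks: list of N callables; block l receives the channel-wise
    concatenation [F0, F1, ..., F_{l-1}] of shape (K, l*C) and must
    emit a (K, C) feature map. Returns the last block's output F_N.
    """
    features = [F0]
    for block in blocks:
        x = np.concatenate(features, axis=-1)  # fuse all earlier scales
        features.append(block(x))
    return features[-1]

# Toy usage: 3 linear "blocks"; block l maps (l+1)*C channels back to C.
K, C = 4, 3
rng = np.random.default_rng(0)
F0 = rng.standard_normal((K, C))
blocks = [(lambda x, W=rng.standard_normal(((l + 1) * C, C)): x @ W)
          for l in range(3)]
out = dense_gcn_forward(F0, blocks)
```

With N blocks, every pair of blocks is linked either directly or through the concatenation, giving the N(N + 1)/2 links counted in the introduction.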
Another significant advantage of this structure over residual GCN blocks is that a residual block requires size matching while the dense structure does not. For example, we can set the channel size of each output feature map F_l differently. Assuming the channel sizes are C_0, C_1, C_2, …, C_N, the input feature of the lth GCN block has size ∑_{k=0}^{l−1} C_k. The benefit is that we can keep the same number of parameters but design a much deeper network. As a special case, we can set C_1, C_2, …, C_N to the same size but narrower than C_0, for example half of C_0; then the network can be twice as deep while keeping the model size at the same level. Meanwhile, the narrower middle layers also help to reduce overfitting.

EXPERIMENTS
For evaluation, we empirically demonstrate the effectiveness of our proposed model on the widely used benchmarks Human3.6M and CMU-Mocap, with error results reported in both Euler angles and 3D coordinates. Comparison results against the state-of-the-art works, including the RNN-based model RRNN,12 the CNN-based model convSeq2Seq,16 and the GCN-based model LearnTraj,21 are reported. In the end, we conduct ablation evaluations to investigate the impact of the proposed strategy.

Implementation details
For a fair comparison, we follow the same settings as prior work.21 The input feature size of the model is 15, and every GCN layer outputs a hidden feature of size 256. Each GCN block contains two GCN layers, and 12 GCN blocks are employed in total. The dropout rate of each layer is set to 0.5, the learning rate is 0.0005, and the batch size is 16. The Adam optimizer is used, and the model is trained for 50 epochs. The whole framework is implemented in PyTorch and trained on an NVIDIA GeForce GTX 1080 Ti with 11 GB of memory; training takes approximately 50 hours.

Datasets
Human3.6M: Almost all existing motion prediction models are evaluated on the Human3.6M benchmark, since it provides the largest number of human poses. Following the typical settings,12,16,21 15 actions performed by six actors are selected for our experiments from Human3.6M (H3.6M). Three data formats are provided by H3.6M; here we use the 3D skeleton format with 32 joints to represent the human structure. The global rotation and translation are removed, and all sequences are downsampled to 25 Hz. The trials of five actors are used for training, and the trials of the remaining actor are used for testing.
CMU-Mocap: This dataset was first introduced for motion prediction evaluation by Li et al.16 It contains a wider range of action types than H3.6M. As with H3.6M, we remove the global rotations and translations and normalize the data. In total, eight actions (such as basketball, soccer, and jumping) are selected following the prescriptions of Reference 16. Every action set contains more than five trials. For our experiments, we use the same dataset-splitting strategy as Li et al.16

Evaluation baselines and metrics
Baselines. The existing approaches for motion prediction can be broadly categorized as RNN-, CNN-, and GCN-based methods. We select the best-performing models with public code in these three domains, namely RRNN,12 convSeq2Seq,16 and LearnTraj.21 For the errors reported in Euler angle space and 3D coordinates, we quote the results directly from the papers. For the visual comparison, we compare our method only to LearnTraj21 and obtain its results from the public code.2 We trained our model in both 3D position and angle space.
Metrics. Two evaluation protocols are used in the experiments. The first is the traditional Euler angle error,12,16,21 which represents the input and prediction in Euler angle space and measures their Euclidean distance. However, researchers have found that this loss cannot completely reflect visual similarity: the zero-velocity baseline has a smaller error but looks visually different. Therefore, the Mean Per Joint Position Error (MPJPE)27 is also used as an evaluation metric in this work. It calculates the displacement between the ground-truth and predicted sequences in the 3D coordinate representation.
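The MPJPE metric is the per-joint Euclidean distance averaged over all joints and frames. A minimal numpy sketch (our own array layout, assuming sequences of shape frames × joints × xyz):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error.

    pred, gt: shape (T, K, 3) -- T frames, K joints, xyz coordinates.
    Returns the mean over all frames and joints of the Euclidean
    distance between predicted and ground-truth joint positions.
    """
    return np.linalg.norm(pred - gt, axis=-1).mean()
```

For the Euler angle protocol, the same mean-distance computation would be applied to the angle representations instead of the 3D coordinates.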

Results
In this section, we report our performance in the given tables. The short-term (80, 160, 320, 400 ms) prediction results on each dataset are given.

Human3.6M: Typically, the walking, eating, smoking, and discussion actions are the most widely evaluated, as they are basic and ubiquitous in daily life. First, we show the results for these four action types in Table 1, which reports the performance of the four baselines in Euler angle space. However, it has been found that the Euler angle error does not reflect visual similarity,21 so we also train the model in 3D coordinate space and report those comparison results. LearnTraj achieved the smallest 3D error among existing works; however, Table 1 shows that our model surpasses it by a margin on several actions. Moreover, the results for the remaining 12 action types are reported in Table 2. Our method outperforms LearnTraj on the majority of action types in 3D coordinate space, which demonstrates its effectiveness.

CMU-Mocap: Similarly to H3.6M, the short-term results on the CMU-Mocap dataset are shown in Tables 3 and 4, where Table 4 gives the average error over all action types in the CMU dataset. First, the Euler angle error and 3D error of the eight actions are shown in Table 3. More than 80% of the errors are smaller with our method, with the exceptions of Jumping and Soccer. This might be caused by the imbalance of the dataset, that is, some actions reach their best value while others overfit. To investigate further, we report the average values over different time intervals. Table 4 demonstrates that our method achieves state-of-the-art performance on the CMU-Mocap dataset, which validates the effectiveness of our model.

CONCLUSION
In this paper, we introduced a densely connected GCN-based model for the motion prediction task, which enhances feature-map utilization and reduces overfitting. Experiments on heavily benchmarked datasets validate the effectiveness of our model. The performance of the 3D joint representation is better than that of the angle-space representation, and the performance on the CMU dataset is much better than on H3.6M. Our method surpasses the state-of-the-art methodologies, which shows that the dense strategy is useful.

F I G U R E 1
The overview of our model. The dense links show how the feature maps propagate. In each GCN block, the input of each graph node is the Discrete Cosine Transform of that joint's trajectory.