3D human pose and shape recovery by a temporal convolutional transformer network

From a 2D video of a person in action, human mesh recovery aims to infer the 3D human pose and shape frame by frame. Despite progress on video-based human pose and shape estimation, it is still challenging to guarantee high accuracy and smoothness simultaneously. To tackle this problem, we propose Video2mesh, a temporal convolutional transformer (TConvTransformer) based temporal network that recovers accurate and smooth human meshes from 2D video. The temporal convolution block achieves sequence-level smoothness by aggregating image features from adjacent frames. The subsequent multi-head attention transformer improves accuracy because its multiple subspaces yield a better middle-frame feature representation. Meanwhile, we add a TConvTransformer discriminator, which is trained together with our 3D human mesh temporal encoder. This discriminator further improves accuracy and smoothness by restricting the pose and shape to a more reliable space based on the AMASS dataset. We conduct extensive experiments on three standard benchmark datasets and show that our proposed Video2mesh outperforms other state-of-the-art methods in both accuracy and smoothness.

estimation is performed frame by frame, the result may be a discontinuous sequence if the correlation between frames is not considered to ensure smoothness. With the explosive growth of video data, the dynamic learning of human motion between adjacent frames has become an important clue for realistic human pose and shape recovery and behaviour understanding [12][13][14]. Recovering human pose and shape from 2D to 3D over sequences dramatically increases the complexity of feature expression. In recent years, multiple approaches have focussed on improving the accuracy and stability of 3D human mesh recovery from video.
Temporal HMR [15] learnt human motion kinematics by feeding each frame's image feature from ResNet50 to feed-forward convolutional networks [16,17]. Unfortunately, due to the lack of 3D video supervision at that time, the result was regarded as over-smooth with weak accuracy. The subsequent work VIBE [18] adopted a new 3D motion capture dataset called AMASS [19] as a source of 3D supervision. At the same time, VIBE fed the self-improving features extracted from each frame by the SPIN method to Recurrent Neural Networks (RNNs) [20,21] for temporal information learning. RNNs present an intuitive way to deal with sequential video data and can be applied to various inputs such as 3D joints, 2D joints, or image features extracted by ResNet50. However, RNNs suffer from problems such as error accumulation and gradient explosion. Alternatively, feed-forward convolutional networks do not process the input frame by frame but take all the frames at once. As a result, the smoothness of RNN-based methods is inferior to that of feed-forward convolutional networks.
At present, the main problem is that previous methods fail to guarantee high accuracy and smoothness together. Recovering the 3D human mesh from 2D videos still needs to be improved. To tackle this, we take inspiration from SPIN [11], which embraced the strengths of previous methods and made use of them in tight collaboration. We address this by proposing Video2mesh, a temporal convolutional transformer (TConvTransformer) based method to recover the 3D human mesh from 2D video. Unlike the state-of-the-art RNN-based method [18], our Video2mesh method applies a temporal convolutional block on top of the image feature extractor for local smoothness over the sequence. Specifically, Temporal Convolutional Networks (TCNs) outperform RNNs in some video-based applications such as human action recognition and human motion prediction [22][23][24], owing to a parallelisable training process with more stable gradients. This makes them a promising alternative for preserving the integrity and smoothness of video data through efficient temporal feature extraction. However, the aggregated temporal feature will inevitably lose middle-frame accuracy since there is no focus on the output middle-frame feature. We thus apply a multi-head attention based transformer, which makes use of its multiple subspaces to help the network capture the information that is more important to the middle frame. Finally, following VIBE [18], adversarial 3D supervision from the AMASS dataset is applied to the output middle frame for accurate human pose and shape recovery.
Our Video2mesh has the following technical contributions:
• We leverage the temporal convolution block, which obtains sequence-level smoothness by aggregating image features from adjacent frames.
• We employ a multi-head attention based transformer to improve the frame-level accuracy by exploring middle-frame information in multiple subspaces. Our approach balances smoothness and accuracy.
• We adopt the temporal convolution transformer in both the generator and the adversarial discriminator. Extensive quantitative and qualitative experiments show that our method has significant improvements over state-of-the-art methods on three benchmarks.

| 3D human pose and shape estimation from single image
Recent deep learning based works adopt two-stage methods that estimate intermediate results, such as keypoint locations and silhouettes [7] or body part segmentation [8], to recover the 3D mesh. Undisturbed by background and other information, the two-stage methods help to deal with the domain shift between indoor MoCap datasets and in-the-wild datasets. HMR [1] proposed an end-to-end method to infer 3D mesh parameters directly from a single image via iterative error feedback and weak supervision from an adversarial prior [25,26]. However, due to the limited amount of paired 3D human ground truth in the existing datasets, the direct regression network lacks sufficient single-frame supervision. SPIN [27] applied an iterative optimisation-based method that fits the parametric body model to 2D joints to generate strong supervision for the 3D mesh regression network, which in turn provides reliable initialisation for a fast and accurate pixel fit.

| 3D human pose and shape estimation from monocular video
Considering the rich dynamic information and consistency constraints in video, recent works focus on recovering the 3D human mesh from monocular video. Temporal HMR [28] utilises a temporal encoder to aggregate the frame-level features directly extracted by ResNet50 into a sequence-level representation and incorporates a hallucinator, which predicts 3D pose changes in adjacent frames, to maintain temporal consistency. Similar to multi-task learning, the prediction loss helps the temporal encoder to focus on the dynamic changes between nearby frames for better smoothness; however, the accuracy of the 3D mesh regression is limited by the prediction uncertainty. DSD-SATN [29] associates the decoupled single-frame features in a sequence via an adversarial self-attention temporal network [30], where the self-attention mechanism applied to the initial input sequences harms smoothness because it is insensitive to frame order. VIBE [18] applies the GRU (Gated Recurrent Unit), a variant of RNNs, for temporal motion aggregation and adopts unpaired SMPL parameters from the AMASS dataset to obtain a motion prior in an adversarial way, but it suffers from incoherence and error accumulation among adjacent frames, especially in the case of complex backgrounds and partial body occlusion.

| Framework
Our approach is based on a temporal encoding for dynamic scenes that combines the advantages of the temporal convolutional block and the multi-head attention based transformer to achieve high performance in both temporal consistency and per-frame 3D human mesh accuracy. Our task is to regress the triangulated 3D human mesh with 6890 vertices from the input 2D video. The overall framework of our model is shown in Figure 1.
In the context of human mesh recovery from video, given N frames of 2D RGB images, the self-improving feature extractor aims to obtain the image feature by iterative fitting [31]. Our method is the first to adopt a Temporal Convolutional Network (TCN) with a multi-head attention based transformer to aggregate the temporal information across the sequence for better human pose and shape recovery.
The input 2D image sequence $X = \{x_1, x_2, \ldots, x_T\}$ of length $T$ for a single person is first fed into the self-improving feature extractor to obtain the feature $f_i \in \mathbb{R}^{2048}$ for each frame $x_i$. In order to aggregate the information across the sequence, we adopt temporal convolutional layers with a multi-head attention transformer (TConvTransformer) as the temporal encoder. Next, we adopt a parametric statistical 3D human body model, SMPL, to disentangle the 3D human body into pose and shape parameters. In order to unify the pose formats of different datasets, we adopt the 14 common joints of LSP [32]. Specifically, we regress the parameters $\Theta = [\theta, \beta]$ for the middle frame from the aggregated feature. The pose parameter $\theta \in \mathbb{R}^{72}$ describes the global rotation of the human body along with the relative rotations of 23 joints in axis-angle representation, and the shape parameter $\beta \in \mathbb{R}^{10}$ consists of the first 10 linear coefficients of the PCA shape space of the SMPL body model. We regard the TConvTransformer as the generator and employ another TConvTransformer as the discriminator $D$ to generate natural and reliable human poses adversarially, where the real motion sequences $\Theta_{real}$ are from the AMASS dataset.
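The parameter layout described above can be sketched as follows. This is only an illustration: the function name `split_smpl_params` and the ordering of the flat vector (pose first, then shape, following $\Theta = [\theta, \beta]$) are assumptions, not details confirmed by the text.

```python
from typing import List, Tuple

NUM_BODY_JOINTS = 23                      # relative joint rotations in SMPL
POSE_DIM = 3 + 3 * NUM_BODY_JOINTS        # 72: global rotation + 23 axis-angle vectors
SHAPE_DIM = 10                            # first 10 PCA shape coefficients


def split_smpl_params(
    theta_full: List[float],
) -> Tuple[List[float], List[List[float]], List[float]]:
    """Split a flat parameter vector [pose, shape] into its parts.

    Assumed (hypothetical) layout: 72 pose values followed by 10 shape values.
    Returns (global_rotation, joint_rotations, shape).
    """
    if len(theta_full) != POSE_DIM + SHAPE_DIM:
        raise ValueError(f"expected {POSE_DIM + SHAPE_DIM} values, got {len(theta_full)}")
    pose = theta_full[:POSE_DIM]
    shape = theta_full[POSE_DIM:]
    global_rotation = pose[:3]
    # Each body joint owns three consecutive axis-angle components.
    joint_rotations = [pose[3 + 3 * j: 6 + 3 * j] for j in range(NUM_BODY_JOINTS)]
    return global_rotation, joint_rotations, shape
```

Under this layout the regressed vector has 72 + 10 = 82 entries per middle frame.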

| 2D video temporal convolution block
Compared to RNNs, our temporal convolutional blocks aggregate past and future temporal cues to estimate the intermediate feature, and the sequence length has no impact on the gradient's path, which avoids the exploding and vanishing gradients found in RNNs. Moreover, temporal convolutional networks support parallelisation over time and batch processing, which maintains the integrity and smoothness of temporal information instead of processing the input frame by frame like chain-structured RNNs [20].

FIGURE 1
The framework of our Video2mesh method. From left to right, a video sequence is sent to a CNN that extracts a feature for each frame. Then, our method adopts a TCN with a multi-head attention mechanism as the generator to obtain the aggregated temporal middle-frame feature, which is used to regress the parameters of the SMPL human model. A temporal discriminator is trained together with our 3D human mesh generator to constrain the pose and shape to be reliable based on the AMASS dataset. The TConvTransformer module consists of a TCN block and multi-head attention followed by a feed-forward network (FFN).

CHAO ET AL.
In order to make our model generate smooth human mesh output, we adopt a temporal convolutional network (TCN), shown in Figure 2, with kernel size $W$ and dilation factor $D = W^{B}$, where $B$ refers to the ResNet-style blocks. Given the self-improving features [31] of a sequence of 2D RGB images as input, the TCN produces a fused temporal feature that is used to regress the pose and shape parameters for the middle frame. The input layer takes the concatenated feature for each frame. In each block, a one-dimensional convolution with kernel size $W$ and dilation factor $D$ is performed first, and then a convolution with kernel size 1 is applied. Each convolution (except for the last layer) is followed by batch normalisation [33], rectified linear units [34], and dropout [35]. The filter hyperparameters $W$ and $D$ are set so as to form a tree in which the receptive field of any output frame covers all input frames. The receptive field grows exponentially with the number of blocks (by a factor of $W$ per block), while the number of parameters increases only linearly. The temporal feature aggregated from adjacent frames provides more temporal dependencies and motion cues to the 3D human mesh generator.
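The relation between kernel size, dilation and receptive field can be checked with a few lines of arithmetic. The sketch below assumes a particular layer layout (one input convolution with dilation 1, followed by blocks whose first convolution uses dilation $W^b$ for block $b$ and whose second convolution has kernel size 1); the function name and this exact layout are illustrative assumptions rather than the configuration reported in the paper.

```python
def tcn_receptive_field(kernel_size: int, num_blocks: int) -> int:
    """Number of input frames seen by one output frame of a dilated TCN.

    Assumed (hypothetical) layout: an input convolution with kernel W and
    dilation 1, then `num_blocks` residual blocks. Block b (1-based) applies
    a convolution with kernel W and dilation W**b, then a kernel-1 convolution.
    """
    layers = [(kernel_size, 1)]
    for b in range(1, num_blocks + 1):
        layers.append((kernel_size, kernel_size ** b))  # dilated convolution
        layers.append((1, 1))                           # pointwise convolution
    field = 1
    for kernel, dilation in layers:
        # Each layer widens the field by (kernel - 1) * dilation frames.
        field += (kernel - 1) * dilation
    return field
```

With $W = 3$, one block gives a receptive field of 9 frames and two blocks give 27, so every added block multiplies the field by $W$ while adding only a constant number of layers. Under this assumed layout, a single block with $W = 3$ coincides with the sequence length $T = 9$ used in the experiments.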

| Multi-head attention transformer
Considering the inherent hierarchical structure and the kernel size of the TCN, the correlation among adjacent frames is modelled efficiently and information redundancy is reduced by the fusion ability of the temporal convolutional network. However, the aggregated temporal feature will inevitably lose middle-frame accuracy since there is no focus on the output middle-frame feature.
In order to further improve the per-frame accuracy while keeping the smoothness, we adopt a transformer with Multi-head Attention (MHA) following the TCN for better middle-frame feature representation. The MHA transformer makes use of its multiple subspaces to help the network capture the information that is more important to the middle frame, so as to express the short-term current frame while still learning the long-term temporal information.
Multi-head attention (MHA) [36][37][38][39] is the core module of the transformer and is a variation of the typical scaled dot-product attention (SDPA) [36], as shown in Figure 3. In SDPA, the keys $K$, values $V$ and queries $Q$ come from the aggregated feature $f_t$, where $t$ represents the aggregated order of adjacent frames. $f_t$ is first linearly projected into queries, keys and values, and SDPA is then performed on each projection. The outputs are concatenated and linearly projected to $Y_t \in \mathbb{R}^{N \times 512}$ using an FC layer. The SDPA is derived as

$$\mathrm{SDPA}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \qquad (2)$$

where $d_k = 512$ is the dimension of the inputs and sets the scaling factor $\sqrt{d_k}$. In Equation 2, the output is the weighted sum of $f_t$. The weight matrix represents the relation between each pair of frames.
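To make the weighted-sum interpretation concrete, the following is a minimal pure-Python sketch of scaled dot-product attention over a short sequence of frame features. It uses the standard formulation from [36]; the function names are illustrative and the learned projections that produce Q, K and V are omitted.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(Q: Matrix, K: Matrix) -> Matrix:
    """Row i holds the weights that frame i assigns to every frame."""
    d_k = len(K[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        for q in Q
    ]


def sdpa(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Scaled dot-product attention: each output row is a weighted sum of V."""
    weights = attention_weights(Q, K)
    dim_v = len(V[0])
    return [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(dim_v)]
        for row in weights
    ]
```

Each row of the weight matrix sums to one, so every output frame is a convex combination of the value vectors, which is the pairwise frame relation described above.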
In this work, we employ $h = 8$ parallel attention heads for MHA, with parameter matrices $W_i^{f_t} \in \mathbb{R}^{d_m \times d_k}$. For each head, $d_k = d_m / h = 64$. The dimension of each head is reduced so that the total computing cost is equal to that of SDPA with the same model dimensionality. With the assistance of multi-head attention, the temporal encoder can reach the entire sequence, so the receptive field is relatively expanded and short- and long-term temporal coherence can be learnt more efficiently.

FIGURE 2
The detailed configuration of the Temporal Convolution Network (TCN). From right to left, the TCN has 4 convolution layers to capture temporal information, where 2048, 3D1, 2048 denotes 2048 input channels, kernels of size 3 with dilation 1, and 2048 output channels. Convolutional layers except the last layer are followed by batch normalisation, rectified linear units, and dropout. To match the shape of subsequent tensors, we slice the residuals.

FIGURE 3
The architecture of multi-head attention (MHA). From bottom to top, a set of queries (Q), keys (K) and values (V) from the aggregated features are processed through scaled dot-product attention (SDPA) several times in parallel. The independent attention outputs are then concatenated and linearly transformed into the expected dimension.
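For reference, the multi-head attention described in this section can be written in the standard form of [36]. The per-head projection symbols $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ and the output projection $W^{O}$ are the conventional names and are assumptions here; the paper's own notation for these matrices may differ.

```latex
\mathrm{MHA}(f_t) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{SDPA}\left(f_t W_i^{Q},\; f_t W_i^{K},\; f_t W_i^{V}\right),

W_i^{Q}, W_i^{K}, W_i^{V} \in \mathbb{R}^{d_m \times d_k},
\qquad
W^{O} \in \mathbb{R}^{h d_k \times d_m},
\qquad
d_k = \frac{d_m}{h} = \frac{512}{8} = 64 .
```

Because the $h$ heads each work in a 64-dimensional subspace, their concatenation has $8 \times 64 = 512$ dimensions, which is why the total cost stays comparable to single-head attention at the same model dimensionality.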

| 3D human mesh temporal discriminator network
Different from the existing method VIBE [18], which adopted a sequence-to-sequence RNN with a scaled dot-product attention motion discriminator, we apply the TConvTransformer as the discriminator $D$ to achieve full-to-single (sequence-to-frame) supervision, which treats the single-frame results from the temporal generator as fake and the aggregated frame from the AMASS dataset as real. Given a sequence of 3D human meshes from the AMASS dataset, our TConvTransformer network makes use of the temporal convolutional layer to learn the temporal information in the sequence, and the multi-head attention is used to improve the expression of the output middle frame in multiple subspaces, which ensures the smoothness and accuracy of the 3D mesh sequences from the AMASS dataset.

| Loss function
The loss of our model is composed of 2D ($x$), 3D ($X$), pose ($\theta$) and shape ($\beta$) losses, each applied when the corresponding annotation is available, and is combined with an adversarial loss $L_{adv}$. To compute the 2D keypoint loss, we need the SMPL 3D joint locations $X(\Theta) = W\,M(\theta, \beta)$. The 14 joints are derived from the 6890 vertices of the human body mesh using the linear regressor $W$. Because the regressor is a linear combination of vertices, the joint positions are differentiable with respect to the shape and pose parameters. We use a weak-perspective camera model with scale and translation parameters $[s, t]$, $t \in \mathbb{R}^{2}$. With this we compute the 2D projection of the 3D joints $X$ as $x = s\,\Pi(R\,X(\Theta)) + t$, with $x \in \mathbb{R}^{j \times 2}$, where $R \in \mathbb{R}^{3}$ is the global rotation matrix and $\Pi$ represents orthographic projection. Furthermore, $L_{adv}$ is the adversarial loss applied to the generator, and the discriminator is trained with its own objective, where $p_R$ is a real motion sequence from the AMASS dataset and $p_G$ is a generated motion sequence.
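The projection and the adversarial terms can be illustrated with a short sketch. Two assumptions are made and should be read as such: the global rotation is passed as an explicit 3×3 matrix, and the adversarial objectives take the least-squares form used by VIBE [18], which this work follows for its adversarial supervision; the exact form used here is not recoverable from the text. All function names are hypothetical.

```python
from typing import List, Sequence


def project_weak_perspective(
    joints_3d: Sequence[Sequence[float]],
    rotation: Sequence[Sequence[float]],
    scale: float,
    trans: Sequence[float],
) -> List[List[float]]:
    """Compute x = s * Pi(R X) + t, where Pi drops the depth coordinate."""
    projected = []
    for joint in joints_3d:
        rotated = [sum(rotation[r][c] * joint[c] for c in range(3)) for r in range(3)]
        projected.append([scale * rotated[0] + trans[0], scale * rotated[1] + trans[1]])
    return projected


def lsgan_generator_loss(d_fake: Sequence[float]) -> float:
    """Assumed least-squares form: mean of (D(generated) - 1)^2."""
    return sum((d - 1.0) ** 2 for d in d_fake) / len(d_fake)


def lsgan_discriminator_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Assumed least-squares form: mean (D(real) - 1)^2 plus mean D(generated)^2."""
    real_term = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake_term = sum(d ** 2 for d in d_fake) / len(d_fake)
    return real_term + fake_term
```

Here `d_real` and `d_fake` stand for discriminator scores on samples drawn from $p_R$ and $p_G$ respectively. The generator is rewarded when its outputs score close to 1, while the discriminator is rewarded for scoring AMASS samples near 1 and generated samples near 0.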

| 3DPW dataset
The 3D Poses in the Wild dataset (3DPW) [47] is the first outdoor dataset captured by IMU sensors and moving phone cameras, and it includes accurate 3D body scans, 3D people models and video footage. There are 60 video sequences, divided into 24 for training, 12 for validation and 24 for testing. Following the previous method [18], the 3DPW dataset is used for both training and evaluation.
MPI-INF-3DHP [48] is a multi-view, mostly indoor dataset captured using a markerless motion capture system. We use the training set proposed in ref. [48], which consists of 8 subjects and 16 videos per subject, and we evaluate on the official test set.

| Evaluation metrics
We evaluate the accuracy of our method according to mean per joint position error (MPJPE), Procrustes-aligned mean per joint position error (PA-MPJPE), Percentage of Correct Keypoints (PCK) and Per Vertex Error (PVE) on 3DPW [47], MPI-INF-3DHP and Human3.6M [49]. Acceleration error is used to evaluate the smoothness of our method by calculating the average difference between the ground truth 3D acceleration and the predicted 3D acceleration of each joint in mm/s$^2$.

TABLE 1
Evaluation of state-of-the-art models on the 3DPW, MPI-INF-3DHP and Human3.6M datasets. Ours (wo3dpw) is our proposed model trained on video datasets similar to [31,40] and VIBE (wo3dpw) [18], while Ours (w3dpw) is trained with extra data from the 3DPW training set similar to VIBE (w3dpw) [18]. Our method outperforms all state-of-the-art temporal models including VIBE [18] on the challenging in-the-wild datasets (3DPW and MPI-INF-3DHP) and obtains comparable results on Human3.6M. '−' indicates results that are not available.

FIGURE 4
Visual accuracy comparisons between VIBE and our Video2mesh method. From left to right, we show the qualitative comparison. As indicated by the circles, the recent state-of-the-art approach VIBE fails to produce an accurate 3D human mesh compared with our Video2mesh method for challenging 2D image poses in the MPI dataset.
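Two of these metrics are simple enough to sketch directly. The code below computes MPJPE for one frame and an acceleration error from second-order finite differences of joint positions. It is an illustrative sketch, not the evaluation code used in the paper: the function names are hypothetical, and the acceleration is expressed per frame squared, so converting to mm/s$^2$ would require multiplying by the squared frame rate (an assumption about units).

```python
import math
from typing import List, Sequence

Frame = Sequence[Sequence[float]]   # joints x 3 coordinates


def mpjpe(pred: Frame, gt: Frame) -> float:
    """Mean Euclidean distance between predicted and ground-truth joints."""
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)


def acceleration(seq: Sequence[Frame]) -> List[List[List[float]]]:
    """Second-order finite difference x[t-1] - 2*x[t] + x[t+1] for each joint."""
    accel = []
    for t in range(1, len(seq) - 1):
        frame = []
        for j in range(len(seq[t])):
            frame.append(
                [seq[t - 1][j][c] - 2.0 * seq[t][j][c] + seq[t + 1][j][c] for c in range(3)]
            )
        accel.append(frame)
    return accel


def accel_error(pred_seq: Sequence[Frame], gt_seq: Sequence[Frame]) -> float:
    """Mean norm of the difference between predicted and true accelerations."""
    a_pred, a_gt = acceleration(pred_seq), acceleration(gt_seq)
    norms = [math.dist(p, g) for fp, fg in zip(a_pred, a_gt) for p, g in zip(fp, fg)]
    return sum(norms) / len(norms)
```

A prediction that jitters around a smoothly moving ground truth can have a low MPJPE yet a high acceleration error, which is why the two are reported together.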

| Implementation details
Our proposed algorithm is implemented in PyTorch [51] and trained on an NVIDIA Tesla V100 GPU. We adopted the Adam [52] optimiser to train our model for about 100 epochs. We experimented with input sequence lengths T = 3, 9 and 15, and we use T = 9 as it yields the best results. The learning rate was set to 0.0001 and the batch size was 128. Our models use the pretrained model from SPIN [31] as the self-improving feature extractor.
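The settings above, together with the sequence-to-middle-frame design, can be summarised in a small sketch. The dictionary simply restates the reported hyperparameters; the windowing function is an assumption about how length-T clips centred on each frame might be formed, and clamping indices at the sequence boundaries is a hypothetical choice not described in the text.

```python
from typing import Dict, List, Union

# Hyperparameters as reported in the implementation details.
TRAIN_CONFIG: Dict[str, Union[int, float, str]] = {
    "optimiser": "adam",
    "learning_rate": 1e-4,
    "batch_size": 128,
    "epochs": 100,          # "about 100 epochs"
    "sequence_length": 9,   # T
}


def middle_frame_windows(num_frames: int, seq_len: int = 9) -> List[List[int]]:
    """Frame indices of the window centred on each frame.

    Assumption (hypothetical): indices falling outside the video are clamped
    to the first or last frame so that every frame has a full window.
    """
    if seq_len % 2 == 0:
        raise ValueError("seq_len must be odd so that a middle frame exists")
    half = seq_len // 2
    windows = []
    for centre in range(num_frames):
        window = [min(max(i, 0), num_frames - 1) for i in range(centre - half, centre + half + 1)]
        windows.append(window)
    return windows
```

For each window, the entry at position `seq_len // 2` is the frame whose pose and shape would be regressed from the aggregated feature.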

| Comparison to state-of-the-art-results
We compare our model Video2mesh with previous state-of-the-art methods on 3DPW, MPI-INF-3DHP and Human3.6M in Table 1.
We utilise the same setting as VIBE [18]. Our method is trained on video datasets similar to VIBE (wo3dpw) and on extra data from the 3DPW training set similar to VIBE (w3dpw).
The results show that our method performs better not only in accuracy (i.e. PA-MPJPE, MPJPE, PVE) but also in smoothness, as illustrated by the acceleration error, by a significant amount on the challenging in-the-wild 3DPW dataset. These results confirm our hypothesis that the use of a feed-forward temporal convolution network with a multi-head attention based transformer in our video temporal encoder and mesh discriminator is important for dealing with the temporal information in video for competitive pose and shape estimation.

| Visualisation
We visualised the results of our method in comparison with the state-of-the-art VIBE [18] on the MPI dataset [48]. As shown in Figure 4, given continuous video input, the 3D projection of the human body model obtained by our method is more consistent with the input 2D image than that of VIBE. The advantages of our method are even more obvious in the hands and feet, as shown in the circles.
More visualisations on Internet images are shown in Figure 5. We observed that in both the input view and a side view, VIBE produced obvious errors, especially when hands and arms were crossed. The visualisation shows that our method is able to estimate a more accurate 3D human mesh than VIBE on Internet images that are not present in the training datasets. Then, we visualised the smoothness comparison on an image sequence obtained from the Internet, as shown in Figure 6. Given the same start frame and end frame, our method produces smoother intermediate results than VIBE.

FIGURE 5
Visual accuracy comparisons on Internet images between VIBE and our Video2mesh method. From left to right, we show the qualitative comparison for the input view and the side view. Our approach is better than VIBE from the side view, especially when hands and arms are crossed.

FIGURE 6
Visual smoothness comparisons between VIBE and our Video2mesh method on Internet video sequences. Given the same video sequence, we observe that our Video2mesh method produces smoother intermediate frames compared to VIBE.

| Comparison of model complexity
We also compare the computational complexity of VIBE and our network, as shown in Table 2. Our network requires 14.224 GFLOPs, whereas VIBE requires 12.794 GFLOPs. The frame rate of the video is 30 FPS and the resolution is 1080P. Our network has 92.1 M parameters and VIBE has 86.4 M. Although our method increases the FLOPs and the number of parameters, the inference time of our network is 1.01 s, which is only slightly higher than that of VIBE. Here, the batch size is 128 and the GPU is an NVIDIA Tesla V100. Since improving accuracy and smoothness is more important for this human mesh recovery task, the additional time consumption is acceptable.

TABLE 2
Comparison of model complexity. We calculate the FLOPs, the number of parameters and the inference time of VIBE and our method. Despite the higher complexity of our model, the difference in inference time is acceptable considering the accuracy and smoothness.
Note: Values in bold are the best results compared with other methods.

| Different temporal encoders
We conduct ablation studies on different temporal modules, evaluated on 3DPW with MPJPE, as shown in Table 3. The results show that TCN alone is superior to RNN in both smoothness and accuracy. While accuracy is improved by using self-attention in front of the TCN (SATCN) due to its better long-term correlation learning ability, the smoothness is affected by frame-by-frame position embedding, especially in some scenes with rapid changes in

TABLE 3
Ablation study on different temporal encoders. RNN is the temporal model adopted by VIBE [18]. TCN is the baseline model for our method. Then, we add self-attention (SA) in front of the TCN as SATCN and after the TCN as TCNSA. The results show that the improvements are not significant. We also add multi-head attention (MHA) before the TCN as MATCN and after the TCN as TConvTransformer. The results show that our TConvTransformer model has the best performance in both accuracy and smoothness.
3DPW: PA-MPJPE ↓ | MPJPE ↓ | PVE ↓ | Accel ↓
Note: Values in bold are the best results compared with other methods.

TABLE 4
Ablation study on different discriminators in our temporal method. We report the adversarial training comparisons on the 3DPW dataset with only the generator (Only G), a single-frame discriminator (w/Single frame D) and our temporal TConvTransformer discriminator (w/Our D). The results show that our TConvTransformer based discriminator outperforms the other designs.
Methods | 3DPW: PA-MPJPE ↓ | MPJPE ↓ | PVE ↓ | Accel ↓
Note: Values in bold are the best results compared with other methods.

Downloaded from https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/cvi2.12172 by City University Of Hong Kong, Wiley Online Library on [02/07/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License.