Multiscale Feature Fusion Network for Monocular Complex Hand Pose Estimation

Hand pose estimation from a single RGB image suffers from low accuracy due to the complexity of hand poses, the local self-similarity of finger features, and occlusion. A multiscale feature fusion network (MS-FF) for monocular hand pose estimation is proposed to address this problem. The network takes full advantage of the information in different channels to enhance important gesture information, and it simultaneously extracts features from feature maps of different resolutions to obtain as much detailed feature information and deep semantic information as possible. The feature maps are merged to obtain the hand pose estimate. The InterHand2.6M dataset and the Rendered Handpose Dataset (RHD) are used to train the MS-FF. Compared with other methods that can estimate interacting hand poses from a single RGB image, the MS-FF obtains the smallest average hand-joint error on RHD, verifying its effectiveness.

Introduction: Hand pose estimation aims to identify and localize the key points of human hands in images, and it has a wide range of applications in virtual reality (VR) and augmented reality (AR) [1]. Methods based on deep learning have clear advantages over traditional methods in both processing speed and prediction accuracy. However, owing to the complexity and diversity of the photographic environment, such as hand shapes and occlusion, the robustness of hand pose estimation methods is low.
Hand pose estimation methods can be categorized as either depth-based [2][3][4][5][15] or RGB-based [6][7][8][9][10][11][12][13][14][16]. Most methods rely on depth images. Chen et al. [2] extracted effective joint features using an initially estimated hand pose as guiding information, then fused the joint features of the same finger, and finally regressed the hand pose by fusing the finger features. However, connecting the five fingers and the palm at the same time can cause a loss of accuracy. Zhang et al. [4] made full use of the information between adjacent finger joints to estimate depth coordinates: 2D hand joint estimation and the depth estimates of a subset of the hand joints were used as bootstrap information to obtain the depth coordinates of all the hand joints.
Depth images are often limited by the application context, so RGB images have also been used for hand pose estimation. Simon et al. [6] estimated 2D hand poses from multi-view images and extended them to 3D space; however, this method cannot estimate the hand pose from a single RGB image. Spurr et al. [7] used RGB images to train an encoder-decoder model to estimate the complete 3D hand pose from different inputs, but the method does not make full use of the hand structure. Yang et al. [9] learned hand poses and hand images with a disentangled variational autoencoder to achieve image synthesis and hand pose estimation, but the disentangling process may lose useful information. Since most datasets contain only single-hand sequences, estimating complex gestures is relatively difficult. For this reason, Moon et al. [16] constructed a dataset containing both single and interacting hand sequences, and proposed the InterNet model to estimate hand poses from a single RGB image. Owing to occlusion, however, that method cannot estimate complex hand poses well. Moreover, edge information is usually ignored in hand pose estimation, yet in the presence of occlusion this information is especially important for recovering the occluded parts. At the same time, because the fingertip is a small object, the fingertip joints are relatively difficult to recognize. To address these issues, a robust multiscale feature fusion network (MS-FF) is presented in this paper. The main contributions of this method are as follows:
1. The MS-FF estimates hand poses in an RGB image more accurately and copes better with complex application scenarios, handling difficult-to-recognize joints and inaccurate gesture recognition in occlusion scenes.
2. Channels contain different implicit information, and the network should focus on the information that is most important for recognizing gestures. A channel conversion module adjusts the weights of the channels to enhance important information.
3. Fingertips occupy a small fraction of an image and are relatively difficult to identify. A global regression module generates feature maps of different resolutions with rich semantic information, to better exploit image edge details and deep information, which is important for estimating finger poses.
4. The global regression module may not accurately identify occluded joints. A local optimization module extracts deeper information from the feature maps and fuses the feature maps of all levels, correcting joints that are not regressed to the correct position, for better performance in occlusion scenes.

Method:
A. Multiscale Feature Fusion Network
Gesture images usually contain complex detailed features, and strong correlations exist between fingers and joints. Therefore, using a single feature for hand pose estimation tends to ignore this diverse feature information, which makes accurate extraction of gesture information difficult. Fig. 1 shows the proposed MS-FF, whose purpose is to estimate the hand pose from a single RGB image. Feature maps of different resolutions are extracted from the RGB image by the ResNet50 module. The feature maps are fed into the channel conversion module, which explicitly learns the dependencies between channels so as to enhance important information and downplay minor information. Because the level of feature information depends on the resolution of a feature map, the global regression module produces high-resolution feature maps containing more semantic information, and these are separately input to the local optimization module to extract deeper information. The Gaussian heatmap of the hand joints, denoted H, is obtained to improve the spatial generalization ability of the model and thus obtain more accurate joint locations. We take the smallest-resolution feature map from the channel conversion module, from which the handedness h and the relative depth between the wrist joints d are obtained. The above results are combined to estimate the hand pose,

P = Π⁻¹(T⁻¹(H, h, d)), (2)

where P in equation (2) is the result of gesture estimation, and Π⁻¹ and T⁻¹ are the camera inverse projection and inverse affine transformation, respectively.
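As a rough illustration of the camera inverse projection used in equation (2), the following sketch back-projects a single joint to 3D camera coordinates, given its pixel location, its depth recovered from the 2.5D heatmap, and pinhole-camera intrinsics. The function name, parameters, and all numeric values are illustrative assumptions, not taken from the paper.

```python
def back_project(u, v, z, fx, fy, cx, cy):
    """Inverse pinhole projection: pixel (u, v) with absolute depth z
    is mapped to 3D camera coordinates (x, y, z)."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return x, y, z

# A joint at the principal point lies on the optical axis.
print(back_project(128.0, 128.0, 500.0, fx=600.0, fy=600.0, cx=128.0, cy=128.0))
# → (0.0, 0.0, 500.0)
```

In the full pipeline, the inverse affine transformation would first map the heatmap coordinates from the cropped hand image back to the original image before this back-projection is applied.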

B. Channel Conversion Module
Each channel of the feature map contains different feature information. To make better use of this information, the relationship between the channels of the feature map is modeled explicitly. Higher weights are assigned to channels with higher semantic characterization ability, so as to improve the sensitivity of the model to important feature information. The structure of the channel conversion module is shown in Fig. 2.
The channel conversion module has aggregation and excitation stages. Global feature information of spatial dimension H × W is aggregated into a channel descriptor of dimension C × 1 × 1 by average pooling. The c-th element of the vector A is

A_c = (1/(H × W)) Σᵢ₌₁ᴴ Σⱼ₌₁ᵂ x_c(i, j), (3)

where H and W are the height and width, respectively, of the feature map, and x_c(i, j) is the pixel at position (i, j) in the c-th channel.
The average feature of each channel in the feature map is calculated by aggregation.
To fully utilize the aggregated feature information, the excitation operation captures the dependencies between channels. The aggregated information is passed through fully connected layers to learn the inter-channel dependencies, and a weight vector s of dimension C × 1 × 1 is obtained by the sigmoid function; s characterizes the importance of each channel. The weight vector is multiplied by the original feature map to obtain the reassigned feature map, which enhances important information and weakens minor information. The dependency between the channels is

s = F_ex(A) = σ(W₂ δ(W₁ A)), (4)

where F_ex is the calculation of the channel weights, W₁ and W₂ are the weight matrices of the two fully connected layers, σ is the sigmoid function, and δ is the ReLU function. The channel information of the feature map is recalibrated as

X̃_c = s_c · X_c, (5)

where X̃ is the feature map after reassigning the channel weights, and s_c · X_c denotes the channel-wise multiplication between the weight vector s and the feature map X.
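The aggregation-excitation scheme above can be sketched in a few lines of NumPy. This is a minimal sketch of the squeeze-and-excitation-style reweighting; the weight matrices, reduction ratio, and shapes are illustrative stand-ins for the learned parameters, not the paper's actual layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_reweight(x, w1, w2):
    """Channel-conversion sketch: aggregate (global average pool over H, W),
    excite (two FC layers + sigmoid), then rescale each channel.
    x: (C, H, W); w1: (C//r, C); w2: (C, C//r)."""
    a = x.mean(axis=(1, 2))              # aggregation: channel descriptor A, shape (C,)
    h = np.maximum(w1 @ a, 0.0)          # first FC layer + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))  # second FC layer + sigmoid -> weights in (0, 1)
    return s[:, None, None] * x          # channel-wise recalibration

C, H, W, r = 8, 4, 4, 2                  # illustrative sizes; r is the reduction ratio
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
y = channel_reweight(x, w1, w2)
```

Because the sigmoid keeps every weight in (0, 1), the recalibrated map never amplifies a channel; it only attenuates the less important ones.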

C. Global Regression Module
The ResNet50 module produces feature maps with different resolutions. High-resolution, low-level feature maps contain less semantic information but rich spatial detail, while low-resolution, high-level feature maps have rich semantic information but less spatial detail. To fully exploit the feature information of different dimensions, the low- and high-resolution feature maps are combined by vertical and horizontal paths. The vertical path obtains a high-resolution feature map by upsampling the spatially low-resolution feature map. Then, 1 × 1 convolution is used to reduce the number of channels in the low-level feature map, so as to obtain a feature map with the same dimension as the corresponding vertical-path feature map. The horizontal path fuses the two feature maps (Fig. 3). This pyramidal structure allows feature maps of different resolutions to contain more semantic information, enabling the network to learn richer feature information. Feature maps C₁, C₂, C₃, and C₄ are obtained from the channel conversion module. C₁ and C₂ have high spatial resolution but little semantic information, while C₃ and C₄ have more semantic information but low spatial resolution. In addition to obtaining rich hand feature information, the fusion of feature maps can recover detailed information, such as that of fingertips and occluded edges. To fuse the feature information, the feature maps of different dimensions are subjected to dimensionality reduction, so that their channels are unified under the same dimension,

V_k = δ(R₁(C_k)), (6)
U_k = B(V_{k+1}), (7)

where V_k is the feature map obtained by dimensionality reduction, U_k is the feature map obtained by upsampling, R₁ is the convolution operation with a 1 × 1 convolution kernel, δ is the ReLU function, and B is the upsampling operation of bilinear interpolation, which calculates each point of the new image from its four adjacent points as

f(x, y₁) = ((x₂ − x)/(x₂ − x₁)) f(Q₁₁) + ((x − x₁)/(x₂ − x₁)) f(Q₂₁), (8)
f(x, y₂) = ((x₂ − x)/(x₂ − x₁)) f(Q₁₂) + ((x − x₁)/(x₂ − x₁)) f(Q₂₂), (9)
f(x, y) = ((y₂ − y)/(y₂ − y₁)) f(x, y₁) + ((y − y₁)/(y₂ − y₁)) f(x, y₂). (10)

Equations (8) and (9) are linear interpolation operations in the x-direction, and equation (10) is a linear interpolation operation in the y-direction. Q₁₁, Q₂₁, Q₁₂, and Q₂₂ are points in the original image with coordinates (x₁, y₁), (x₂, y₁), (x₁, y₂), and (x₂, y₂), respectively. V_k and U_k are added to fuse feature information of different spatial resolutions. The calculation method is

F_k = V_k + U_k. (11)
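The bilinear interpolation of equations (8)-(10) translates directly into code. The function name and argument order below are illustrative; the three lines mirror the two x-direction interpolations and the final y-direction interpolation.

```python
def bilinear(q11, q21, q12, q22, x1, x2, y1, y2, x, y):
    """Bilinearly interpolate f(x, y) from the four neighbouring values at
    Q11=(x1, y1), Q21=(x2, y1), Q12=(x1, y2), Q22=(x2, y2)."""
    fxy1 = (x2 - x) / (x2 - x1) * q11 + (x - x1) / (x2 - x1) * q21  # eq. (8)
    fxy2 = (x2 - x) / (x2 - x1) * q12 + (x - x1) / (x2 - x1) * q22  # eq. (9)
    return (y2 - y) / (y2 - y1) * fxy1 + (y - y1) / (y2 - y1) * fxy2  # eq. (10)

# The midpoint of the four corners is the average of the corner values.
print(bilinear(0.0, 2.0, 4.0, 6.0, 0, 1, 0, 1, 0.5, 0.5))  # → 3.0
```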

D. Local Optimization Module
To reduce the errors generated by the global regression module, a local optimization module addresses the inaccuracy of predicting joint positions under occlusion. It extracts deeper information from the feature maps obtained by the global regression module. The input information is divided into two branches by the "channel split" operation (Fig. 4). The feature maps are processed separately through two paths: one is left unprocessed, and the other passes through 1 × 1, 3 × 3, and 1 × 1 convolution kernels to extract deep semantic information. The channel conversion module explicitly models the dependencies between the channels, which enhances important information. Residual connectivity solves the problem of network degradation and improves representational capability. The outputs of the two paths are spliced so that the channel dimension remains unchanged. The "channel shuffle" operation disrupts the order of the channels to improve the efficiency of information transmission and promote information fusion. Finally, the upsampling operation of bilinear interpolation is used to obtain a high-resolution feature map.
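The split-transform-concat-shuffle pattern of this module can be sketched as follows. The `branch` callable stands in for the 1 × 1, 3 × 3, 1 × 1 convolution stack, and all names and shapes are illustrative; only the channel bookkeeping is shown.

```python
import numpy as np

def channel_shuffle(x, groups):
    """'Channel shuffle': interleave channels across groups so information
    from the two concatenated branches gets mixed. x: (C, H, W)."""
    c, h, w = x.shape
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

def local_opt_sketch(x, branch):
    """Split the input channels into two branches, transform one, concatenate
    (keeping C unchanged), then shuffle the channel order."""
    c = x.shape[0]
    identity, processed = x[: c // 2], branch(x[c // 2:])  # "channel split"
    out = np.concatenate([identity, processed], axis=0)    # splice the two paths
    return channel_shuffle(out, groups=2)

x = np.arange(4 * 2 * 2, dtype=float).reshape(4, 2, 2)
y = local_opt_sketch(x, branch=lambda t: t + 1.0)  # toy stand-in for the conv stack
```

After the shuffle, untouched and transformed channels alternate, so the next layer sees both kinds of information in every channel group.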
Four feature maps of different resolutions, F₁, F₂, F₃, and F₄, are taken from the global regression module, where F₁, F₂, F₃, and F₄ denote the feature maps at the 1/4, 1/8, 1/16, and 1/32 scales, respectively, of the original image. Feature maps of the same dimension are obtained by the local optimization module,

G_k = U(Lᵏ(F_k)), (13)

where L is the local optimization module and U denotes upsampling. Lᵏ in (13) represents the number of times the local optimization module processes each of the four feature maps, i.e., L¹, L², L³, and L⁴. At this point, the four feature maps have the same dimension, and the "concat" operation is performed as

G = concat(G₁, G₂, G₃, G₄). (14)
The 2.5D Gaussian heatmap of the hand joints is then obtained from G by 1 × 1 convolution, H = R₁(G).
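The multiscale fusion and heatmap head described above can be sketched end to end. This minimal NumPy sketch uses nearest-neighbour upsampling in place of bilinear interpolation for brevity, and the shapes, joint count, and function names are illustrative assumptions.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling stand-in for the bilinear upsampling
    used in the paper. x: (C, H, W)."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def heatmap_head(feats, w):
    """Upsample the four scale feature maps to a common resolution, 'concat'
    them along the channel axis, then apply a 1 x 1 convolution (which acts
    as a per-pixel linear map) to produce J joint heatmaps."""
    base = feats[0].shape[1]
    g = np.concatenate(
        [upsample_nearest(f, base // f.shape[1]) for f in feats], axis=0)
    return np.einsum("jc,chw->jhw", w, g)  # 1x1 conv over channels

rng = np.random.default_rng(1)
# Illustrative maps at 1/4, 1/8, 1/16, 1/32 scales with 2 channels each.
feats = [rng.standard_normal((2, 8 // 2**k, 8 // 2**k)) for k in range(4)]
w = rng.standard_normal((21, 8))  # 21 joints x (4 scales * 2 channels)
h = heatmap_head(feats, w)        # shape (21, 8, 8)
```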
Experimental Results and Analysis: The RHD and InterHand2.6M datasets were used to evaluate the performance of the proposed method. The PyTorch framework was used for training. Each hand image was resized to 256 × 256 and input to the network. In the experiments, the batch size was set to 16. Table 1 compares different methods on RHD, where EPE is the average error of the hand joints, and GT H and GT S indicate ground-truth handedness and hand scale, respectively. It can be seen that Spurr et al. [7] and Yang et al. [9] required additional input at test time to achieve lower joint errors, whereas our method obtains low errors without ground-truth information during testing.

Conclusion:
We proposed the MS-FF for monocular visual hand pose estimation. To effectively process the detailed information of occluded edges and fingertips, the network extracts information of different levels from feature maps of different resolutions to estimate hand poses more accurately. A channel conversion module adjusts the weights of the channels. To make full use of both the edge detail characteristics of the images and the deep semantic information, a global regression module fuses feature maps of different resolutions. A local optimization module corrects joints that are not regressed to the correct position. The proposed method achieves higher accuracy and robustness, and the experiments verified the effectiveness of the MS-FF.

Fig. 4 Structure of local optimization module.