Skeleton‐Guided Action Recognition with Multistream 3D Convolutional Neural Network for Elderly‐Care Robot

With the arrival of a global aging society, elderly-care robots are becoming increasingly attractive and can provide better caring services through action recognition. This article presents a skeleton-guided action recognition framework with a multistream 3D convolutional neural network. Two parallel dual-stream lightweight networks are proposed to enhance the feature extraction ability for human actions while reducing computation. Two different modes of skeleton input video are constructed to improve the recognition accuracy by decision fusion. The backbone networks adopt ResNet-18, a feature fusion layer and a sliding window mechanism are designed, and two cross-entropy losses are used to supervise training. A dataset with different categories of action, named elder care action recognition (EC-AR), is built. The experimental results on the HMDB-51 and EC-AR datasets both demonstrate that the proposed framework outperforms existing methods. The developed method is also applied to a prototype elderly-care robot, and the test results in home scenarios show that it retains high recognition accuracy and good real-time performance.

of action recognition.[14] Since single-frame images cannot represent continuous actions, the accuracy of action recognition is still low. Compared with 2D CNNs, 3D CNNs can extract the spatiotemporal features in videos and thus improve recognition accuracy, so they have been applied to action recognition in recent years. Carreira et al. [15] presented a two-stream 3D CNN named I3D. Besides RGB video data, optical flow data were also used to emphasize dynamic information in the video. Zhou et al. [16] proposed MiCT-Net, which integrated 2D CNNs with a 3D CNN to generate deeper feature maps while reducing training complexity in each round of spatiotemporal fusion. Yang et al. [17] pointed out that 3D CNNs are expensive in computation and storage and difficult to train due to their larger number of parameters, and proposed an efficient asymmetric one-directional 3D CNN. Feichtenhofer et al. [18] developed the SlowFast network based on 3D CNN for video action recognition. The network adopts a slow pathway operating at a low frame rate and a fast pathway operating at a high frame rate, which respectively capture spatial semantics and temporal motion better, thereby achieving better action classification and detection. After this, based on the 2D ResNet structure and the fast pathway, Feichtenhofer [19] found that a network with thin channel dimensions and high spatiotemporal resolution can be more effective for video action recognition by gradually expanding its parameters, such as frame rate and spatial resolution, and named the network X3D. Duan et al. [20] proposed the PoseConv3D model based on 3D heatmap volumes to improve the robustness, interoperability, and scalability of the model. Li et al. [21] presented a spatiotemporal deformable 3D CNN with attention mechanisms to effectively capture long-range temporal and long-distance spatial dependencies. Wu et al. [22] demonstrated that existing methods based on RGB video or optical flow are easily disturbed by cluttered and ambiguous backgrounds, which can lead to object loss or recognition errors. Therefore, based on I3D, a pose-based module was added to guide the network to make more correct judgments based on the spatiotemporal features of the skeleton sequence.
The human skeleton represents the positional structure of the critical nodes of the human body. It can help action recognition networks find and identify the body effectively, and thus better deal with blurred images caused by quick movement or an occluded body. For this kind of action recognition, the coordinates of the human skeleton in the video or image are first extracted by pose estimation,[23-25] and then they are input into different networks, such as temporal convolutional networks (TCNs), recurrent neural networks (RNNs), graph convolutional networks (GCNs), and so on. For TCNs, Kim et al. [26] proposed a Res-TCN model for processing 3D human skeletal data, by which the category of action could be judged. Jia et al. [27] proposed a two-stream temporal convolutional network named TSTCNs, whose input adopts the spatiotemporal vectors among a sequence of skeletal data. Considering the latent relationships between the node pairs and edge pairs of the human skeleton, Zhu et al. [28] proposed a dilation group-specific convolution module to aggregate relation messages of all the unit pairs on the skeletal graphs. For RNNs, most methods mainly adopt long short-term memory networks (LSTMs). Wang et al. [29] pretrained a 3D CNN model on a huge video action recognition dataset, and then introduced an LSTM to model the high-level temporal features. Meng et al. [30] proposed a sample fusion model combined with an LSTM autoencoder. GCNs have also been used to process skeletal data recently. Li et al. [31] combined the relationships between skeletal points into a generalized skeleton graph, so as to learn more feature information. Si et al. [32] presented an attention-enhanced graph convolutional LSTM model, which not only captures discriminative features in spatial configuration and temporal dynamics but also explores the co-occurrence relationship between the spatial and temporal domains. Li et al. [33] proposed a highly efficient GCN, achieved by a parallel structure that gradually fuses motion and spatial information and by reducing the temporal resolution as early as possible. Compared with 3D CNNs, skeleton-based computation is less complex. However, if the skeletal data alone are used as the input of the action recognition network and the environmental information is not taken into account, it is difficult to make accurate judgments on human-object interaction actions, such as "smoke," "answer the phone," "read newspaper," and so on.
In addition, the transformer has also recently been tried for action recognition due to its excellent computational parallelism and self-attention mechanism.[34,35] For example, Li et al. [36] proposed a transformer-based RGB-D egocentric action recognition framework, which adopted a self-attention mechanism to model the temporal structure of the data from different modalities. Shi et al. [37] proposed a skeleton-based action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of the data. The transformer's self-attention block enables it to model long-term dependencies, but it requires a large amount of data and time to train. When the amount of data is insufficient, it generalizes poorly and cannot be put into use. The transformer's ability to capture local features is also weaker, so its accuracy is not as good as that of 3D CNNs.
By considering the advantages of 3D CNNs and the skeleton in action recognition, this article proposes a skeleton-guided multistream action recognition network for elderly-care robots. The human skeletal data are represented in two different modes of video and are used to guide the lightweight multistream 3D CNN to learn feature information more comprehensively, thus improving its recognition accuracy and computational speed. In addition, to verify the proposed network and enable it to be used in elderly-care robots, both an action recognition video dataset that focuses on caring for the elderly and an elderly-care robot platform are developed, and related experiments are carried out.
The main contributions of this article are as follows: 1) A skeleton-guided action recognition framework with a multistream 3D CNN is proposed. The framework consists of two parallel dual-stream Light-SlowFast (Light-SF) networks based on ResNet-18, and it can extract richer features from action videos while maintaining high computational efficiency; 2) The two parallel dual-stream Light-SlowFast networks can process two different modes of skeleton input video (RGB-Skeleton video and skeleton video) simultaneously, and thus better improve the accuracy of action recognition by decision fusion; 3) An elder care action recognition dataset (named EC-AR) is built. Experimental evaluations are carried out on both EC-AR and HMDB-51, and demonstrate the superiority of the proposed network and the rationality of each module design in it; and 4) The proposed method retains high recognition accuracy and good real-time performance when tested in real home elderly-care scenarios.

Methodology
The skeleton-guided multistream action recognition framework for elderly-care robots is shown in Figure 1. The robot captures the original RGB video from a Kinect-V3 (RGB-D camera), and a sequence of skeletal data is extracted from the video by pose estimation. The skeleton video frames and the RGB-Skeleton video frames are reconstructed respectively, and they are used as the input of the subsequent multistream 3D CNN, which is composed of two dual-stream Light-SlowFast (Light-SF) networks. Decision fusion is carried out on the outputs of the two parallel Light-SF networks. The proposed network is mainly based on SlowFast and ResNet-18. The SlowFast network consists of a slow path with a low frame rate and a fast path with a high frame rate, and it can simultaneously learn semantic information and motion details in the video. Therefore, SlowFast is chosen as the backbone of the action recognition framework. ResNet-18 uses a residual structure, which enables the network to be trained deeper while maintaining high computational efficiency. ResNet-18 can achieve good recognition accuracy with fewer layers, so it plays an important role in Light-SF.

Pose Estimation and Video Reconstruction
Currently, most skeleton-based action recognition networks mainly rely on hand-crafted features or traversal rules, so their expressiveness is limited and their generalization is poor.[27,28] To preserve more feature information and provide more reliable clues for the training network, the skeleton should be represented in different modes of video.
To help the proposed network identify the action subject and improve its generalization ability, the skeleton video frame without environmental information is constructed, as shown in Figure 2b. The skeleton video frame contains rich motion details without other redundant information. In addition, to supplement the environmental information and further help the proposed network lock onto the moving human body, the skeletal data are added to the corresponding RGB video for each frame, thereby obtaining the RGB-Skeleton video by data-level feature fusion, as shown in Figure 2c. The RGB-Skeleton video not only contains rich environmental information but also enables the network to extract targeted features.
Since human action needs to be recognized quickly by elderly-care robots and is easily disturbed by environmental light, OpenPose is utilized to predict the human skeleton due to its better real-time performance and anti-interference ability.[38] The running time of OpenPose (CUDA version) on an NVIDIA GeForce GTX-1080Ti GPU is only 36 ms. The process of converting the camera data into skeletal data is as follows: first, the foreground and background images are segmented by the depth information, and the body area in the image is identified. Then, the key points are detected by OpenPose, yielding the 2D coordinates of 25 key points on the main body parts, including the head, hands, and so on. After that, the key points are connected according to the structural model of the human body, thereby constructing the skeletal structure of body parts, such as the head, trunk, and legs. The skeletal data in each frame and the time series data of their motion can be extracted by performing these three steps consecutively.
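The three steps above can be summarized in the following minimal sketch; the depth working range, the limb-pair list, and the `estimate_keypoints` callable (standing in for the OpenPose BODY_25 inference call) are assumptions for illustration rather than the exact implementation.

```python
# Illustrative sketch of the three-step skeleton extraction described above.
import numpy as np

NUM_JOINTS = 25                     # BODY_25 keypoint model
DEPTH_NEAR, DEPTH_FAR = 500, 4000   # assumed working range in millimetres

# Subset of limb connections (parent, child) following the human structural model.
SKELETON_PAIRS = [(0, 1), (1, 2), (2, 3), (3, 4),     # head -> right arm
                  (1, 5), (5, 6), (6, 7),             # left arm
                  (1, 8), (8, 9), (9, 10), (10, 11),  # trunk -> right leg
                  (8, 12), (12, 13), (13, 14)]        # left leg

def segment_foreground(rgb, depth):
    """Step 1: keep only pixels whose depth falls inside the working range."""
    mask = (depth > DEPTH_NEAR) & (depth < DEPTH_FAR)
    return rgb * mask[..., None]

def extract_skeleton(rgb, depth, estimate_keypoints):
    """Steps 2-3: detect the 25 key points and connect them into a skeleton."""
    body_region = segment_foreground(rgb, depth)
    keypoints = estimate_keypoints(body_region)       # (25, 2) array of (x, y)
    bones = [(keypoints[a], keypoints[b]) for a, b in SKELETON_PAIRS]
    return keypoints, bones

def skeleton_sequence(frames, depths, estimate_keypoints):
    """Apply the three steps to every frame to obtain the temporal skeleton data."""
    return [extract_skeleton(rgb, d, estimate_keypoints)[0]
            for rgb, d in zip(frames, depths)]
```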
The two-dimensional coordinates of all the skeletal joints obtained by OpenPose can be expressed as

$$C = \{\, c_{j,f} = (x_{j,f},\, y_{j,f}) \mid j = 1, \ldots, J;\; f = 1, \ldots, F \,\} \tag{1}$$

where J is the total number of skeletal joints in a video frame and F is the total number of frames of the input video. Thus, the coordinates of the set of skeletal joints in the f-th video frame can be defined as a matrix $M_f = [c_{1,f}, c_{2,f}, \ldots, c_{J,f}]$. The value of J should achieve a balance between human posture representation and computational complexity; therefore, J is set to 25. F is determined by the frame rate of the camera adopted by the elderly-care robot, and F is set to 32. For each input video, the spatial feature of the skeletal joints can be expressed as a matrix P:

$$P = \begin{bmatrix} M_1 \\ M_2 \\ \vdots \\ M_F \end{bmatrix} = \begin{bmatrix} c_{1,1} & c_{2,1} & \cdots & c_{J,1} \\ c_{1,2} & c_{2,2} & \cdots & c_{J,2} \\ \vdots & \vdots & \ddots & \vdots \\ c_{1,F} & c_{2,F} & \cdots & c_{J,F} \end{bmatrix} \tag{2}$$

In matrix P, the coordinates of each row provide a strong correlation between skeletal points in the spatial domain, and the coordinates of each column represent the motion of each skeletal joint. Thus, both the skeleton video and the RGB-Skeleton video can be built based on this matrix, as shown in Figure 1. To facilitate subsequent networks in learning the feature information in the skeleton video, adjacent skeletal joints are connected with lines of different colors according to the structure of the human body. Different body parts are distinguished by color, and for the same body part, the proximal and distal ends are characterized by a color gradient. The change in color gradient indicates how far the skeleton endpoints are from the center of the body: the lighter the color, the farther the distance; the darker the color, the closer the distance. For example, the wrists and ankles are represented in the lightest color, the elbows and knees are darker, and the upper arms and thighs are darkest.
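A minimal sketch of this construction is given below; the limb pairs and the gradient color values are illustrative assumptions (darker toward the body center, lighter toward the wrists and ankles).

```python
# Minimal sketch of Equations (1)-(2) and the skeleton-frame rendering described above.
import numpy as np
import cv2

J, F = 25, 32   # joints per frame, frames per input video

def build_P(joint_sequence):
    """Stack the per-frame joint matrices M_f into P with shape (F, J, 2):
    rows index frames (spatial layout), columns index joints (per-joint motion)."""
    return np.stack(joint_sequence, axis=0)

# ((joint_a, joint_b), BGR colour): upper arm darkest, forearm lighter.
COLOURED_PAIRS = [((2, 3), (120, 40, 40)),    # right upper arm
                  ((3, 4), (230, 160, 160))]  # right forearm

def draw_skeleton_frame(joints, size=(720, 1280)):
    """Render one skeleton video frame: gradient-coloured limbs on a black background."""
    canvas = np.zeros((*size, 3), dtype=np.uint8)
    for (a, b), colour in COLOURED_PAIRS:
        pa = tuple(map(int, joints[a]))
        pb = tuple(map(int, joints[b]))
        cv2.line(canvas, pa, pb, colour, thickness=3)
    return canvas

def draw_rgb_skeleton_frame(rgb_frame, joints):
    """Overlay the same skeleton on the original RGB frame (data-level fusion)."""
    canvas = rgb_frame.copy()
    for (a, b), colour in COLOURED_PAIRS:
        cv2.line(canvas, tuple(map(int, joints[a])), tuple(map(int, joints[b])),
                 colour, thickness=3)
    return canvas
```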

Multistream 3D CNN
The accuracy of human action recognition depends on three kinds of feature information: temporal features, spatial features, and environment-object appearance features. It is not easy to extract all of this feature information from a single video mode. For example, the skeleton video does not contain the environment-object appearance feature, and the focus of the network will deviate due to the complex feature information in the RGB-Skeleton video. However, the two modes of video are complementary: the RGB-Skeleton video can complement the environmental data, and the skeleton video can help the network identify the action subject. Therefore, to extract and learn enough feature information, a multistream 3D CNN is proposed, as shown in Figure 1. The RGB-Skeleton video and the skeleton video are respectively used as the input of the two parallel dual-stream Light-SF networks.
As shown in Figure 3, to make the SlowFast network more lightweight and reduce its computation, its backbone adopts ResNet-18,[39] which can also avoid overfitting to a certain extent. In addition to inflating the convolutional and pooling layers to 3D, the network is divided into a slow pathway and a fast pathway, which respectively process sparsely sampled (low frame rate) and densely sampled (high frame rate) video frames. In the slow pathway, the stride of the data layer is (24, 1²), the temporal stride of the convolution kernels before Res4 is set to 1, and the pathway focuses on learning the environmental features in the RGB-Skeleton video or the spatial features in the skeleton video. For Res4 and Res5, the size of the convolution kernel in the first layer is set to 3 × 1², to add some dynamic information to the pathway. In the fast pathway, the stride of the data layer is (3, 1²), and more video frames are retained for capturing motion features. Fewer channels are used in the fast pathway to reduce the number of parameters.
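The data-layer striding above implies the following frame-sampling sketch for a 96-frame raw clip; the tensor shapes are illustrative, and the sampled frames then enter the inflated ResNet-18 pathways.

```python
# Sketch of the data-layer sampling: from a 96-frame clip, the slow pathway
# samples with temporal stride 24 and the fast pathway with stride 3.
import torch

def sample_pathway_inputs(clip):
    """clip: (C, 96, H, W) video tensor."""
    slow = clip[:, ::24]   # 4 frames  -> semantic / spatial features
    fast = clip[:, ::3]    # 32 frames -> motion features
    return slow, fast

clip = torch.randn(3, 96, 224, 224)
slow, fast = sample_pathway_inputs(clip)
print(slow.shape, fast.shape)   # (3, 4, 224, 224) and (3, 32, 224, 224)
```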
The output sizes of the Light-SF network are shown in Table 1, where the temporal and spatial dimensions of a kernel are denoted by T × S². It can be seen that the outputs of the slow pathway differ from those of the fast pathway in the time dimension, so their feature maps cannot be fused directly. To take the clues of static and dynamic features into account as a whole, four fusion modules are designed in the dual-stream 3D CNN, each of which is a 5 × 1 × 1 3D convolution. The output size of such a convolution is

$$O = \frac{k - f}{s} + 1$$

where O is the size of the output feature map, k is the size of the input feature map, f is the size of the convolution kernel, and s is the stride. As shown in Table 1, the stride of the 3D convolution should be set to (8, 1, 1) to enable the data in the fast pathway to be concatenated with the data in the slow pathway. This fusion method avoids the loss of feature information and further improves the feature extraction ability of the network. To learn the long-time dependence between continuous actions, the raw clip is set to 96 frames, so that a video sample can represent an action with a longer duration. Meanwhile, a sliding window mechanism with a sliding stride of 60 frames is designed, so that the model can determine the temporal boundary of an action more accurately and further capture the associated clues in each action stage more adequately.
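The lateral fusion and the sliding window can be sketched as follows; the channel counts and the temporal padding are assumptions for illustration.

```python
# One lateral fusion module: a 5x1x1 3D convolution with stride (8, 1, 1) maps
# the 32-frame fast feature map onto the 4-frame slow feature map so the two
# can be concatenated along the channel dimension.
import torch
import torch.nn as nn

fast_channels, slow_channels = 8, 64

fuse = nn.Conv3d(fast_channels, 2 * fast_channels,
                 kernel_size=(5, 1, 1), stride=(8, 1, 1), padding=(2, 0, 0))

fast_feat = torch.randn(1, fast_channels, 32, 56, 56)   # T = 32
slow_feat = torch.randn(1, slow_channels, 4, 56, 56)    # T = 4

# Temporal output: (32 - 5 + 2*2) // 8 + 1 = 4, matching the slow pathway.
fused = torch.cat([slow_feat, fuse(fast_feat)], dim=1)  # (1, 80, 4, 56, 56)

# Sliding-window sampling of 96-frame clips with a stride of 60 frames.
def clip_windows(num_frames, clip_len=96, stride=60):
    return [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, stride)]
```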

Decision Fusion
To enable the two modes of input video to complement each other's advantages without increasing the network complexity, the outputs of the two Light-SF networks are fused, and the corresponding weights are set according to different actions:

$$P_n = w_n^{\mathrm{RGB\text{-}S}}\, P_n^{\mathrm{RGB\text{-}S}} + w_n^{\mathrm{Skeleton}}\, P_n^{\mathrm{Skeleton}}, \qquad n = 1, \ldots, N$$

where $P^{\mathrm{RGB\text{-}S}}$ and $P^{\mathrm{Skeleton}}$ are, respectively, the output results of the two Light-SF networks, $w_n^{\mathrm{RGB\text{-}S}}$ and $w_n^{\mathrm{Skeleton}}$ are, respectively, the weights of different actions in the two modes of video, and N is the total number of action types. The weights are initialized by a normal distribution with a standard deviation of 0.01 and are updated iteratively by backpropagation.
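A minimal PyTorch sketch of this decision fusion layer, with one learnable weight per action class and per video mode, is given below.

```python
# Decision fusion: per-class weights initialised from N(0, 0.01) and updated by backpropagation.
import torch
import torch.nn as nn

class DecisionFusion(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.w_rgbs = nn.Parameter(torch.empty(num_classes).normal_(std=0.01))
        self.w_skel = nn.Parameter(torch.empty(num_classes).normal_(std=0.01))

    def forward(self, p_rgbs, p_skel):
        # p_rgbs, p_skel: (batch, num_classes) outputs of the two Light-SF networks
        return self.w_rgbs * p_rgbs + self.w_skel * p_skel

fusion = DecisionFusion(num_classes=14)
scores = fusion(torch.rand(8, 14), torch.rand(8, 14))   # fused class scores
```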
In addition, considering that the proposed network needs to handle both skeleton video and RGB-Skeleton video, two cross-entropy losses are used to supervise model training. The total loss $\mathcal{L}$ can be expressed as

$$\mathcal{L} = \mathcal{L}_{\mathrm{RGB\text{-}S}} + \mathcal{L}_{\mathrm{Skeleton}} + \lambda \lVert \omega \rVert_2^2$$

where $\mathcal{L}_{\mathrm{RGB\text{-}S}}$ and $\mathcal{L}_{\mathrm{Skeleton}}$ are, respectively, the cross-entropy losses of the RGB-Skeleton feature and the skeleton feature. The term $\lambda \lVert \omega \rVert_2^2$ is used to avoid overfitting, where $\lambda$ is the attenuation coefficient and $\lVert \omega \rVert_2^2$ represents the weight decay regularization over all parameters. If the total number of video samples is M, then $\mathcal{L}_{\mathrm{RGB\text{-}S}}$ and $\mathcal{L}_{\mathrm{Skeleton}}$ can be expressed as

$$\mathcal{L}_{\mathrm{RGB\text{-}S}} = -\frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N} y_{m,n} \log \hat{y}^{\mathrm{RGB\text{-}S}}_{m,n}, \qquad \mathcal{L}_{\mathrm{Skeleton}} = -\frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N} y_{m,n} \log \hat{y}^{\mathrm{Skeleton}}_{m,n}$$

where $y_{m,n}$ is the true result, and $\hat{y}^{\mathrm{RGB\text{-}S}}_{m,n}$ and $\hat{y}^{\mathrm{Skeleton}}_{m,n}$ are, respectively, the predicted results of the two Light-SF networks.
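A minimal sketch of the objective is given below; the L2 term can be realized through the optimizer's weight decay (see Implementation Details), and the branch outputs are assumed to be logits here.

```python
# Two cross-entropy terms supervising the RGB-Skeleton and skeleton branches.
import torch
import torch.nn.functional as F

def total_loss(logits_rgbs, logits_skel, labels):
    l_rgbs = F.cross_entropy(logits_rgbs, labels)   # L_RGB-S, averaged over the M samples
    l_skel = F.cross_entropy(logits_skel, labels)   # L_Skeleton
    return l_rgbs + l_skel
```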

Action Recognition Dataset
Since there are few action recognition datasets for elderly-care robots, an elder care action recognition dataset is built and named EC-AR, to simulate real-world application scenarios.
The EC-AR dataset contains four modes of video, and they are, respectively, RGB video, depth video, skeleton video, and RGB-skeleton video, as shown in Figure 4.
To diversify the video data in the EC-AR dataset, 14 types of caring actions for monitoring the elderly's daily life are included, classified into five categories: 1) daily actions: "stand," "lie," "sit," "bend"; 2) exercise actions: "walk," "squat," "push-up"; 3) risky actions: "fall," "lean"; 4) human-object interaction actions: "smoke," "answer the phone"; and 5) human-robot interaction actions: "wave one hand," "hand up," "wave two hands." For each type of action, the total duration of video samples is more than 20 min (no less than 400 standard video clips after clipping). Eight volunteers (five males and three females) participated in the filming in home or laboratory scenes, of which six are young adults (20-40 years old) and two are old people (over 60 years old). Each person repeated each action more than three times. Considering that the elderly usually move more slowly, the young adults emulated their actions in speed. The volunteers were generally 1.5-3 m away from the camera. Most of the shooting angles directly face the volunteer, and a small number of samples are shot within 45°. Each mode of video consists of 6165 standard video clips. The length, frame rate, and resolution of each video clip are, respectively, 96 frames, 32 fps, and 1280 × 720. The training set and test set have 17 903 and 6757 video clips, respectively (the volunteers in the training set and test set are different), a ratio of 2.65:1. The sizes of the RGB video, depth video, skeleton video, and RGB-skeleton video are, respectively, 1.9 GB, 905.1 MB, 4.2 GB, and 10.4 GB.

Implementation Details
In the experiments, all the videos are processed at 32 frames per second (T = 32). PyTorch is chosen as the deep learning platform. The initial learning rate and momentum of the network are 0.01 and 0.9, respectively, and its dropout rate is 0.5. The objective loss $\mathcal{L}$ is minimized by stochastic gradient descent (SGD), where the weight of the L2 regularization term is set to 1e-4. Because video data processing requires significant memory, the batch size is set to 8. Network training and testing are carried out on a computer with an RTX3090 GPU and 64 GB RAM.
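The training configuration can be sketched as follows; the model and data below are placeholders for the multistream 3D CNN and the EC-AR clips.

```python
# Training-configuration sketch matching the hyperparameters above
# (SGD, lr = 0.01, momentum = 0.9, weight decay = 1e-4, dropout = 0.5, batch size = 8).
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Dropout(p=0.5),
                      nn.Linear(3 * 32 * 8 * 8, 14))      # placeholder classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

videos = torch.randn(8, 3, 32, 8, 8)      # batch size 8, T = 32 frames
labels = torch.randint(0, 14, (8,))

optimizer.zero_grad()
loss = F.cross_entropy(model(videos), labels)
loss.backward()
optimizer.step()
```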

Ablation Study
To examine the fusion strategy among RGB video, depth video, skeleton video, RGB-skeleton video, and RGB + skeleton video, a related ablation study is conducted. Table 2 presents the recognition results of different fusion strategies. It can be seen that the fusion strategy based on RGB-Skeleton video and skeleton video performs best on the EC-AR dataset, with top-1 and top-3 action recognition accuracy of up to 89.09% and 100%, respectively. The main reason is that RGB-Skeleton + Skeleton enables the network to obtain multimodal visual and skeletal information and thus learn richer features. Second, the fusion of similar data sources better enhances the consistency of features and avoids temporal and coordinate deviations, helping the network focus more accurately on the areas of human motion. Finally, the multistream network is susceptible to skeletal detection errors, which can be compensated by the fusion of RGB and Skeleton.
In addition, an ablation study between the two pathways in the Light-SF networks is used to demonstrate that the performance of the multistream network is better than that of the single-stream network, as shown in Table 3. It can be seen that the recognition accuracy is improved after the fast pathway is introduced, while the computational cost increases only slightly because few parameters are added.

Framework Results
To analyze the learning ability of the proposed network more intuitively and evaluate its prediction accuracy, the softmax outputs of two Light-SF networks and their decision fusion are respectively visualized, and their performances are displayed through confusion matrices, as shown in Figure 5.
By analyzing the confusion matrices, we can draw the following conclusions: 1) Static actions, such as "bend," "hand up," "sit," and "stand," can be recognized more easily because their spatial features in the skeleton video are clearer. However, it can be found (Figure 5b) that the recognition of static actions is easily disturbed by the environment or objects around the human body; 2) For human-object interaction actions, such as "smoke" and "answer the phone," the recognition accuracy of the Light-SF network based on RGB-Skeleton video is higher. This suggests that the environment-object appearance feature is key to the network making correct predictions; 3) The recognition accuracy of "fall" in the Light-SF network based on skeleton video is much higher than that in the Light-SF network based on RGB-Skeleton video. This indicates that the spatiotemporal features in the skeleton video play a major role for actions with a larger range of motion; and 4) For the action "wave two hands," because it resembles "hand up" at some moments during the motion, the recognition results of both Light-SF networks are poor. However, it can be predicted accurately after decision fusion, which indicates that the two Light-SF networks compensate for each other well. Finally, it can be seen that all the other actions are also recognized well.

Comparison with the State-Of-The-Art Methods
To verify the performance of our method on a public dataset, we compare the proposed network with existing action recognition methods based on 3D CNNs and RNNs on HMDB-51. HMDB-51 contains 51 types of action and a total of 6849 video clips, with at least 101 samples for each type of action. Most of the daily actions in the EC-AR dataset are also present in this dataset. Compared with other large-scale action datasets (such as Kinetics), HMDB-51 is more similar to EC-AR.
The video reconstruction method in Section 2.1 is first used to process the HMDB-51 dataset. The results are shown in Table 4, where it can be seen that our method outperforms all the other competitors. The comparison proves that the multistream fusion strategy is more effective and demonstrates that the skeleton-guided action recognition method can achieve higher accuracy.
In addition, to further demonstrate the performance of our method on EC-AR, two representative public action recognition methods based on 3D CNNs (I3D and SlowFast) are chosen to compare with the proposed network. The comparative experiments are carried out on a computer equipped with an RTX2080. The results are shown in Table 5, and our method performs better than both. It is important to note that the average inference time is greatly reduced thanks to the multistream lightweight network (the inference times do not include the 0.38 s for skeleton extraction). This means that our method can better satisfy the real-time requirements of an elderly-care robot for action recognition.

Experimental Platform
To explore the action recognition ability of the elderly-care robot, a prototype (named ZU-SR) has been developed, as shown in Figure 6. The elderly-care robot is mainly composed of a Kinect-V3 camera, a microphone array, a robotic arm, a computer, a lidar, and a mobile chassis. It has been used to take part in competitions such as the IJCAI elderly-care robot challenge [40] and RoboCup@Home.[41] Up to now, the elderly-care robot already has the capabilities of object detection, people detection, people tracking, pose recognition, indoor SLAM, and human-robot interaction. In the experiment, the trained multistream 3D CNN can be directly transplanted to the elderly-care robot to perform action recognition.
To facilitate the statistics of the elderly's daily behaviors, action recognition and analysis software is developed based on PyQt5, as shown in Figure 7. Video samples can be selected and processed by our model or SlowFast. The category and temporal localization of each action can be displayed in the combo box, and the video of each action can be viewed and played as it happens. In addition, the duration of each action can also be accumulated and displayed in a pie chart, reflecting the elderly's daily life.

Test in the Home Scenarios
A 61-year-old elderly volunteer participates in the test in a nonstandardized home environment (living room and bedroom), as shown in Figure 8. To ensure the safety of the volunteer, nine types of living actions are chosen: "stand," "sit," "lie," "bend," "lean," "walk," "wave one hand (wave-1)," "hand up," and "wave two hands (wave-2)," as shown in Figure 9. The number of tests for each action is 100, and the total number of tests is 900. A standard video clip (96 frames, 3 s) is used as input for each test, and the entire test lasted nearly 45 min. The total number of correct recognitions is 872, giving a Top-1 accuracy of 96.89%, which is slightly higher than in the simulation, as shown in Figure 10. This is because the actions of the volunteer may be more standard. As shown in Figure 10, the minimum confidence is 71% for each type of action, so we regard samples with a confidence higher than 71% as reliable. It can be found that the inaccurate recognition rate of 3.11% is mainly due to the input of an incomplete skeleton: the whole-body skeleton of the volunteer may inevitably be blocked by furniture during the test. Therefore, whether the elderly-care robot has a good observation position is also crucial. Since the furniture placement generally does not change at home, the optimal observation position can be chosen according to the confidence during application. For example, if the prediction confidence is lower than 71%, the elderly-care robot can change its observation position in time by navigation and recognize again. The better observation positions can then be found and marked on the home map. Therefore, we believe that the accuracy of action recognition can be further improved by adopting such a strategy, and the minimum confidence can also be raised, so that the elderly-care robot can recognize all the actions in practical application.
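This re-observation strategy can be sketched as follows; `recognise` and `navigate_to` are hypothetical interfaces of the robot software stack, not functions of the presented system.

```python
# Hedged sketch: if the top-1 confidence falls below the 71% threshold, the
# robot navigates to the next marked viewpoint on the home map and tries again.
CONF_THRESHOLD = 0.71

def recognise_with_reobservation(recognise, navigate_to, viewpoints):
    action, confidence = recognise()             # try from the current position first
    for viewpoint in viewpoints:                 # marked positions on the home map
        if confidence >= CONF_THRESHOLD:
            break
        navigate_to(viewpoint)                   # change the observation position
        action, confidence = recognise()
    return action, confidence
```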
Then, to demonstrate the real-time performance of the proposed network, the trained model is transplanted to two computers with different computing power. Testing shows that the average inference time is 0.86 s on the computer equipped with a GTX1060 and 0.53 s on the one equipped with an RTX2080, both less than 1 s. For the elderly, who move slowly, this can fully satisfy the real-time demand of the caring service.

Conclusion and Future Work
This article presents a skeleton-guided action recognition model with a multistream 3D CNN for elderly-care robots. The two parallel dual-stream Light-SF networks can extract abundant features for human action recognition from the RGB-Skeleton and skeleton videos. The built EC-AR dataset collects five categories (a total of 14 types) of caring actions, and it has 24 660 standard video clips with a total size of 17.5 GB. The Top-1 accuracy of action recognition reaches 89.09% on the EC-AR dataset. The proposed model is also successfully applied to our experimental platform, and the test in the home scenario shows that the elderly-care robot can recognize all the actions in a timely manner and thus monitor the elderly's daily life.
In the future, more daily actions will first be added to the EC-AR dataset to enrich the caring services of elderly-care robots. Second, more visual detection tasks (such as human detection, object detection, and so on) will be fused with action recognition, to help the elderly-care robot recognize more complicated or similar actions. Finally, we also plan to transfer the computation of action recognition to an edge or cloud server via 5G communication, to guarantee its real-time speed while sharing the computing cost of the elderly-care robot terminal.

Figure 4. Elderly-care actions and video modes in the EC-AR dataset.

Figure 5. Confusion matrices of different networks. The row and the column of each confusion matrix denote the ground truth and the prediction, respectively. a) Light-SF network based on skeleton video, b) Light-SF network based on RGB-Skeleton video, and c) decision fusion of the framework.

Figure 6. Prototype of the elderly-care robot.

Figure 7. Action recognition and analysis software interface.

Figure 10. Statistical plot of the test data (the number of correct recognitions and the minimum confidence of the correct samples for each action).

Table 1. Output sizes of the Light-SF network.

Table 2. Recognition results of different fusion strategies.

Table 3. Recognition results of the Light-SF network.

Table 4. Comparison with state-of-the-art methods on the HMDB-51 dataset.

Method | Input | Streams | Accuracy
[42] | RGB | One | 61.3%
Pan et al. [43] | RGB | One | 63.8%
Liu et al. [44] | RGB, Flow, Saliency Map | Three | 64.4%
Patravali et al. [45] (3D CNN) | RGB | Two | 47.6%
Khorasgani et al. [46] (3D CNN) | RGB, Flow | Three | 56.2%
Yang et al. [17] (3D CNN) | RGBF | One | 65.4%
Carreira et al. [15] (3D CNN) | RGB, Flow | Two | 66.4%
Zhou et al. [16] (3D CNN) | RGB, Flow | Two | 70.5%
Li et al. [21] (3D CNN) | RGB | Two | 72.7%
Ours | RGB-S, Skeleton | Four | 72.9%

Table 5. Recognition results of the Light-SF network.