Infusing a Convolutional Neural Network with Encoded Joint Node Image Data to Recognize 25 Daily Human Activities

Human activity recognition (HAR) has gained popularity in the field of computer vision such as video surveillance, security, and virtual reality. However, traditional methods are limited in terms of computations and holistic learning of human skeletal sequences. In this article, a new time‐series skeleton joint data imaging method is infused into an improved convolutional neural network to handle these problems. First, the raw time‐series data of 33 body nodes are transformed to red–green–blue images by encoding the 3D positional information to one pixel. Second, the LeNet‐5 network is enhanced by expanding the receptive field, introducing coordinate attention and the smooth maximum unit to improve smoothness and feature extraction. Third, the ability of coded images to express human activities is studied in various environments. It is shown in the experimental results that the method achieves an impressive accuracy of 98.02% in recognizing 25 daily human activities, such as running, writing, and walking. In addition, it is shown that the number of floating point operations, parameters, and inference time of the method are 0.08%, 0.47%, and 3.05%, respectively, of the average values for six other networks (including AlexNet, GoogLeNet, and MobileNet). The proposed method is thus a novel, lightweight, and high‐precision solution for HAR.


Introduction
With the rapid advance of motion tracking and deep-learning technologies, human activity recognition (HAR) has received considerable attention in the field of computer vision and has found a wide range of applications in film production, [1] video surveillance, [2] psychological and emotional assessment, [3] and intelligent health. [4] However, most HAR methods [5,6] suffer from problems such as the large scale of the equipment used, the limited types of recognized activity, and an inability to thoroughly learn human characteristics. It is thus essential to design methods that are lightweight, highly precise, and capable of recognizing multiple complex activities.

Background
In recent years, researchers have proposed various methods of recognizing human activities. Currently, widely used HAR methods fall into two categories: methods based on wearable sensors and methods based on computer vision.

Wearable-Sensor-Based HAR
Wearable-sensor-based HAR first collects posture data through portable sensors or sensing systems, such as inertial measurement units (IMUs), smart rings, [7,8] and smart bracelets, [9] and then identifies human activities through deep learning. [12–16] M. O. Mario [10] proposed a novel mechanism to detect specific activities performed by users using data from a single triaxial accelerometer. When applying p-fold cross-validation, their F1 score outperformed the dataset authors' F1 score by approximately 8%. V. Mikos et al. [11] integrated a freezing-of-gait detection system into a single-sensor node for the first time and classified the freezing of gait using a real-time learning neural network, achieving a sensitivity of 95.6% and specificity of 90.2%. A. Atrsaei et al. [12] proposed a machine-learning-based method that uses data from lower-back sensors to estimate gait speed in clinical settings and homes. They used the proposed motion detection method to detect walking in the home with 96.4% accuracy.
Other studies have improved recognition performance by fusing data from multiple sensors. [19–25] Y. Dong et al. [18] developed a sensor fusion strategy adopting Dezert–Smarandache theory and achieved 95.15% accuracy on a sports activity dataset. M. Webber et al. [19] compared three fusion levels of activity data and determined the optimal fusion level under multiple sensors, demonstrating that Kalman filtering has both good accuracy (0.7536 ± 0.1566) and a short processing time (61.71 ± 63.85 ms). J. Wu et al. [20] proposed a wearable system for the real-time recognition of American Sign Language by fusing information from inertial sensors and surface electromyography sensors. The final mean accuracies of the within-subject and between-subject cross-session evaluations were 96.16% and 85.24%, respectively.

Computer-Vision-Based HAR
There are two mainstream categories of computer-vision-based HAR methods. The first category extracts features from color or depth images. [27–32] M. F. Aslan et al. [27] applied the Speeded Up Robust Features algorithm enhanced with the bag-of-visual-words technique to binary and grayscale images, achieving accuracies of 96.52% and 92.71%, respectively. S. P. Sahoo et al. [28] proposed a new HAR concept combining sequential and shape learning with deep history imagery, and constructed a deep bidirectional long short-term memory to model the temporal relationship between action frames, finally achieving a highest accuracy of 97.67% on four datasets. C. Chen et al. [29] used a depth motion map, generated by accumulating the motion energy of a projected depth map in three projected views (front, side, and top), as a feature descriptor, enabling computationally efficient HAR. They achieved an average recognition rate of 90.5% on their dataset.
The second category uses human skeleton data. [33–43] Owing to their high computational efficiency and good recognition performance, these methods are increasingly attracting the attention of researchers. B. Su et al. [33] proposed a biologically based hierarchical model using Kinect skeleton data. By selecting different features at different levels and combining them with appropriate classifiers, they achieved a recognition rate of 97.64% on their dataset. H. Wang et al. [34] used the geometric relationships between joints for HAR and experimentally demonstrated that joints, edges, and surfaces can be used complementarily to recognize different activities, achieving a highest accuracy of 95.3% on multiple datasets. M. Li et al. [37] used graphs to capture the relationships between body joints and body parts and proposed a new symbiotic graph neural network for activity recognition and motion prediction. The network achieved good performance on four datasets and obtained a highest accuracy rate of 98.8%. F. Angelini et al. [41] used ActionXPose to extract features from the skeleton data of different activities for the neural network and achieved an accuracy rate of 96.44% in a cross-dataset setting.

Limits of Prior Research
The aforementioned sensor-based methods have limitations. On the one hand, using a single-node sensor on the body for HAR offers good portability but limited accuracy and only recognizes a few human activities. On the other hand, using sensor networks requires users to wear multiple sensors. In applications such as geriatric medicine and patient condition diagnosis, these devices are susceptible to damage from user-generated static electricity, and wearing multiple devices can be inconvenient and burdensome for users. There are also drawbacks to computer-vision-based methods of HAR. First, traditional approaches fail to capture the complete sequence of skeleton information for recognition. Second, human activities typically involve smooth and continuous motions, and the strong temporal correlation between video frames depicting human activities is often overlooked.
Figure 1 compares existing human motion recognition research results in terms of the numbers of nodes and cameras, number of activities, and accuracy. Table 1 lists HAR methods and compares their differences in terms of device type, feature type, number and location of nodes or cameras, number of activities recognized, and accuracy. Figure 1 and Table 1 indicate that this work involved no physical contact with the participants, which reduced the possibility of discomfort for the participants. In addition, this work used minimal equipment, recognized the greatest number of actions, and had the highest recognition accuracy among the HAR methods.

Research Motivation
The widespread application of deep learning methods has benefitted various fields. [46–48] K. Zhao et al. [44] proposed a federated multisource domain adaptation approach that combines transfer learning and federated learning. This approach leverages all user data to achieve accurate recognition of the target data. Furthermore, K. Zhao et al. [45] developed a multisource domain transfer learning approach called the conditional weighting transfer Wasserstein auto-encoder to tackle the challenges of cross-domain fault diagnosis. Additionally, they designed an ingenious conditional weighting strategy to quantify the similarity of different source domains and the target domain, which further assists the proposed model in minimizing discrepancies in the conditional distribution. Such ideas have also been applied to HAR. [51,52] As an example, D. Cheng et al. [50] proposed a prototype-guided federated learning framework for HAR, aiming to effectively decouple the representation and the classifier in heterogeneous federated learning settings. They achieved the best performance and a faster convergence speed on an HAR dataset. J. Liang et al.
[51] proposed a collaborative compression scheme that combines channel pruning and tensor decomposition. Their scheme effectively addresses sparsity and low-rankness while considering mutual interference within a network comprising efficient 1D convolutional kernels. The scheme thus reduces the runtime of HAR with an acceptable level of performance degradation. Currently, image-based HAR methods primarily rely on deep learning techniques to enhance accuracy. Many well-known convolutional neural networks (CNNs), such as LeNet [53] and AlexNet, [54] were originally designed for general image classification tasks. Therefore, to recognize complex human activities, it is necessary to extract complete human pose features and convert them into images as inputs for CNNs. This approach enables deep learning models to better learn and understand the features of human activities, thereby improving the accuracy of HAR.
Inspired by the characteristics of red-green-blue (RGB) images, we propose in this article a new time-series data imaging method that features spatial multidimensional encoding and data folding, as shown in Figure 2. We collected node information on the human body in the time dimension and normalized the 3D spatial coordinates of the single-frame node to R, G, and B values to generate an RGB image.We then input the image into the improved CNN for training, reaching a recognition accuracy of 98.02%, which was 2.33% higher than the accuracy of LeNet-5.Experimental results showed that our encoding method provides a lightweight and highly precise means of fully capturing the features of activity throughout the human body while retaining strong temporal correlations in continuous activity, which is important to monitoring and recognizing human movements.

Main Contributions
The main contributions of the article are summarized as follows: 1) We constructed a camera platform for collecting human activity data. This platform was used to build a dataset for 25 different activities in various scenarios. 2) A novel method for encoding human activities is proposed. By encoding the 3D spatial information of human activity nodes into RGB images, we not only reduced the computational complexity of the activity features but also visualized different human activity features in different scenarios. Moreover, this method preserves strong temporal correlations in continuous activity, providing a more comprehensive and accurate representation of activity features. 3) An improved lightweight network model based on LeNet-5 is introduced. Compared with six other mainstream neural networks, this model has a higher accuracy in HAR tasks but a lower computational complexity (i.e., a smaller number of floating-point operations [FLOPs]), a lower parameter count, and a shorter inference time. The clear advantages and competitiveness of the proposed method in the field of HAR are thus demonstrated.

Hardware System
We built a shooting platform that mainly comprised a mobile phone with a SONY IMX586 sensor and a 500 GB hard drive fixed on a 1.2 m high tripod. The platform had a shooting angle of 120° and captured video at a rate of 25 frames per second, with a resolution of 1920 × 1080 for each frame. The platform was designed to run continuously for approximately 4 h on a single battery charge and was intended for recording people's daily and physical activities.
We collected video data for 25 human activities performed by 161 experimental participants under natural and unsupervised conditions. These data include human activity videos collected using our equipment under real environmental conditions and human activity videos obtained from online materials with the permission of the authors. Further details are given in Table S1, Supporting Information. The experimental procedures were reviewed and approved by the Ethics Committee of Northeastern University (No. NEU-EC-2021B023S).

Most of the videos were shot indoors under standard lighting conditions. In such cases, most of the videos were of daily activities in the home. As the motions were within a relatively small area, the experimenters could largely stay put while shooting the videos. A minority of the videos were shot in open outdoor areas during the day, under natural lighting conditions. In such cases, most of the activities shot were physical exercises with motion over a large area. In total, we captured approximately 500 000 frames to create a total of 5869 collected video samples, which provided sufficient data for algorithm testing and performance evaluation. Notably, our database is still growing and will include more motion video samples in the future.

Encoding Method
We developed an efficient data encoding method that converts the original activity signal in the time domain into an RGB image, thus visualizing activity information.
Figure 3 shows the spatial coordinates x, y, and z of 33 nodes of the human body captured through MediaPipe Pose. [55] As a machine-learning scheme for tracking body poses, MediaPipe Pose infers 33 3D landmarks and background segmentation masks for the whole body from RGB video frames. After data processing, the coordinates x, y, and z were used as the inputs of the R, G, and B channels of the RGB image to visualize the activity information. For ease of recording, we put the name of each activity in the upper-right corner of the RGB image and the designator letter in the upper-left corner.

The pixel intensity of the R channel is expressed as

R(i, j) = Round(255 × (x(i, j) − x_min)/(x_max − x_min))

where x(i, j) is the x coordinate of node j in frame i, x_min and x_max are the minimum and maximum x values, and Round(⋅) is a rounding function, so that pixel values are normalized from 0 to 255. Similarly, by extending this encoding method to channels G and B, we obtained an RGB image with a size of M × 33.
Adopting the aforementioned encoding method, we generated a dataset containing 5869 images. Unlike other methods that extract depth and color information directly from images, our method obtains human activity features through nodes and converts them into RGB images, thus describing activity information more directly and completely. In addition, our method does not require the preprocessing of signals such as curves of human body joint movement, which greatly reduces the data-processing effort.
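As a rough sketch of this encoding, the transformation from raw joint coordinates to an RGB image can be written as follows. Note that the per-channel min–max normalization and the (frame × node) layout are our assumptions about details not fully specified in the text; only the mapping x → R, y → G, z → B and the 0–255 rounding come from the description above.

```python
import numpy as np

def encode_activity(frames):
    """Encode time-series joint data as an RGB image.

    frames: array of shape (M, 33, 3) holding the (x, y, z)
    coordinates of 33 body nodes over M video frames.
    Returns a uint8 image of shape (M, 33, 3): x -> R, y -> G, z -> B.
    """
    frames = np.asarray(frames, dtype=np.float64)
    lo = frames.min(axis=(0, 1), keepdims=True)   # per-channel minimum
    hi = frames.max(axis=(0, 1), keepdims=True)   # per-channel maximum
    span = np.where(hi > lo, hi - lo, 1.0)        # guard against flat channels
    norm = (frames - lo) / span                   # scale each channel to [0, 1]
    return np.round(255.0 * norm).astype(np.uint8)
```

Each row of the resulting image corresponds to one frame and each column to one body node, so periodic motion appears as repeating vertical structure in the image.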

Proposed CNN Structure
Our goal was to realize lightweight and high-accuracy HAR. To this end, we improved on the LeNet-5 CNN. The basic LeNet-5 model has seven layers. The first, third, and fifth layers respectively have 6, 16, and 120 convolution kernels of size 5 × 5 and extract features from the input image, with the rectified linear unit (ReLU) used as the activation function. The second and fourth layers are max-pooling layers that downsample the feature maps to reduce the computational effort. The sixth and seventh layers are fully connected layers responsible for reducing dimensionality. We propose an enhanced version of the LeNet-5 CNN that has a lightweight design, high accuracy, and improved prediction and response speed. This network is an expansion of the original LeNet-5 model and has eight layers, as shown in Figure 4a.

Expanding the Receptive Field
We used 16 convolution kernels with a size of 3 × 3 in the first layer. By increasing the number of convolution kernels to 32 in the third layer, we expanded the receptive field (RF) to address the poor recognition performance of the original model caused by its small RF. The details are shown in Figure 4b.
The RF of the CNN is calculated as

RF_l = RF_(l−1) + (f_l − 1) × ∏_{i=1}^{l−1} s_i

where RF_l is the receptive field of layer l, f_l is the filter size of layer l, and s_i is the stride of layer i.
Applying this formula, the RF after the first convolution stage of the improved network was 32% greater than that of the original LeNet-5, which improved the recognition performance of the network model. The number of first-layer weights in the improved network was 16 × 3 × 3 = 144, which was 96% of the number in the original network (6 × 5 × 5 = 150). This reduction improved the recognition performance in terms of accuracy and speed.
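The layer-by-layer RF recursion above can be checked numerically. The helper below is a generic sketch; the layer configurations passed to it are illustrative examples, not the exact configuration of Figure 4b.

```python
def receptive_field(filter_sizes, strides):
    """Iteratively apply RF_l = RF_(l-1) + (f_l - 1) * prod(s_1 .. s_(l-1))."""
    rf = 1       # RF_0: a single input pixel
    jump = 1     # running product of the strides of earlier layers
    for f, s in zip(filter_sizes, strides):
        rf += (f - 1) * jump
        jump *= s
    return rf
```

For instance, a single 5 × 5 convolution yields an RF of 5, while two stacked 3 × 3 convolutions reach the same RF of 5 with fewer weights per output channel, which is the usual motivation for replacing large kernels with stacked small ones.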

Coordinate Attention
The purpose of an attention mechanism is to select better features by assigning weights to different parts of the feature maps and suppressing useless information. We introduced the coordinate attention (CA) mechanism (proposed by Q. Hou et al. [56]) in front of the fully connected layer of the network, as shown in Figure 4c.
CA addresses the limitations of global pooling in channel attention by incorporating positional information. To mitigate the loss of positional information due to 2D global pooling, CA decomposes channel attention into two parallel 1D feature-encoding operations along the vertical and horizontal directions. This allows the model to capture long-range dependencies along one spatial axis while preserving positional information along the other axis. The resulting feature maps are encoded as orientation-aware and position-sensitive attention maps, which are complementarily applied to the input feature maps to enhance the representation of objects of interest. As the structural diagram of CA shows, CA can be used as a computing unit to enhance the feature representation ability. CA can take any intermediate feature tensor X = [x_1, x_2, …, x_C] ∈ ℝ^(C×H×W) as input and, through transformation, output an augmented feature tensor Y = [y_1, y_2, …, y_C] of the same size.
On the one hand, CA offers the advantage of capturing both cross-channel information and direction-aware, position-sensitive information. This enables more precise localization and identification of objects of interest. On the other hand, CA is highly flexible and lightweight, allowing easy integration into the network to enhance feature representation.
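To make the two 1D poolings concrete, the following NumPy sketch mimics the data flow of CA on a (C, H, W) tensor. The learned 1 × 1 convolutions and the reduction factor of the real module are omitted (replaced by identity transforms), so this illustrates only the shape bookkeeping and the attention re-weighting, not the full trained mechanism.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def coordinate_attention(x):
    """Simplified coordinate attention on x of shape (C, H, W).

    Pools along each spatial axis separately, turns the two pooled
    signals into attention maps, and re-weights the input with both.
    """
    pool_h = x.mean(axis=2, keepdims=True)  # (C, H, 1): pool along width
    pool_w = x.mean(axis=1, keepdims=True)  # (C, 1, W): pool along height
    a_h = sigmoid(pool_h)                   # height-direction attention
    a_w = sigmoid(pool_w)                   # width-direction attention
    return x * a_h * a_w                    # broadcasts back to (C, H, W)
```

Because the two attention maps keep one spatial axis each, position along that axis survives the pooling, which is the key difference from plain channel attention with a single global pool.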

Improved Activation Function
LeNet-5 adopts the ReLU [57] as the activation function. ReLU has a level of sparsity that helps to solve the gradient disappearance problem seen in functions such as the sigmoid function, a good prediction ability, and the ability to minimize overfitting. However, it suffers from the dying-ReLU problem; i.e., up to 50% of neurons can die during network training. We thus adopted the smooth maximum unit (SMU), instead of ReLU, as the activation function in our network model.
With a smooth approximation of the |x| function, a general smooth approximation of the maximum function, and hence of ReLU, can be obtained. One way to express the maximum function is

max(x_1, x_2) = (x_1 + x_2 + |x_1 − x_2|)/2

We approximated |x| by x·erf(μx); when μ → ∞, x·erf(μx) approaches |x|, where erf is the Gaussian error function, defined as

erf(x) = (2/√π) ∫_0^x e^(−t²) dt

By replacing |x_1 − x_2| with (x_1 − x_2)·erf(μ(x_1 − x_2)), the smooth approximation of the maximum function is obtained as

f(x_1, x_2; μ) = (x_1 + x_2 + (x_1 − x_2)·erf(μ(x_1 − x_2)))/2

Letting x_1 = x and x_2 = αx, we have

SMU(x; α, μ) = ((1 + α)x + (1 − α)x·erf(μ(1 − α)x))/2

This formula is the SMU. When α = 0 and μ → ∞, the SMU smoothly approximates ReLU. Figure 4e shows plots of the SMU and ReLU. The ReLU activation function is expressed as

ReLU(x) = max(0, x)

It can be seen that the SMU provides better smoothness than ReLU, which improves the feature extraction ability.
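The final SMU formula can be checked numerically as follows; the parameter values (α = 0, large μ) are illustrative.

```python
import math

def smu(x, alpha=0.0, mu=1e6):
    """Smooth maximum unit: smooth approximation of max(x, alpha * x)."""
    return ((1 + alpha) * x
            + (1 - alpha) * x * math.erf(mu * (1 - alpha) * x)) / 2

def relu(x):
    """Reference ReLU(x) = max(0, x)."""
    return max(0.0, x)
```

With α = 0 and a large μ, smu is numerically indistinguishable from ReLU away from the origin but, unlike ReLU, remains differentiable everywhere, which is the property exploited here to avoid dying neurons.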

Experimental Results
We collected approximately 500 000 frames of video and divided the resulting dataset of 5869 video samples into training, validation, and testing sets at an 8:1:1 ratio for CNN learning. We used a personal computer equipped with an NVIDIA RTX 3060 GPU and an Intel i7-12700KF 3.60 GHz CPU, running the Windows 10 operating system. We used 64-bit PyCharm as our development environment and programmed using the Python 3.9 interpreter. The selection of hyperparameters was guided primarily by their experimental performance. We tried different combinations of hyperparameters and observed the model's performance on the training set, as shown in Figure 5. From the results of these experiments, we obtained a set of optimal hyperparameters, namely, a learning rate of 0.003, weight decay of 0.00001, batch size of 32, and CA reduction factor of 8. Regarding the loss function, we compared four loss functions and found that the network model performed best when using CrossEntropyLoss.
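The 8:1:1 split can be reproduced with a few lines; the fixed seed and the exact rounding of the split sizes below are our assumptions, as the text does not specify them.

```python
import random

def split_indices(n, seed=0):
    """Shuffle n sample indices and split them 8:1:1 into train/val/test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # reproducible shuffle
    n_train = int(0.8 * n)                 # 80% for training
    n_val = int(0.1 * n)                   # 10% for validation
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])         # remainder for testing
```

For n = 5869 this yields 4695 training, 586 validation, and 588 test samples, with the last split absorbing the rounding remainder.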
To verify the performance of our network, we compared it with other CNNs, namely, LeNet1D, AlexNet, GoogLeNet, MobileNet, ShuffleNet, MobileViT-XXS (extra-extra-small), VGGNet16, and LeNet-5. Figure 6 shows the accuracy and loss curves for the training of the different networks. Our network had the best performance among the networks.
For the same dataset, we conducted five experiments on each network and took the average test accuracy ± standard deviation as the final test accuracy. The results are given in Table 2.
Table 2 shows that the accuracy of LeNet1D was lower than that of LeNet-5 and that of our proposed enhanced network. This result emphasizes the clear advantage of encoding action node information as images rather than directly inputting the sorted 3D coordinates of the action nodes as 1D data into the neural network. We improved the accuracy at the cost of a certain number of FLOPs. Encoding action node information as images thus better captures the features of actions and enhances the accuracy of action recognition.
In Table 2, C1 refers to the changes to the neural network structure, C2 refers to the adoption of CA, and C3 refers to the adoption of the SMU. For the network with C1, combining the convolutional layer with more small kernels allowed the network to generate a larger RF in the convolutional layer and better perceive activities across the human body. For the network with C2, adding CA allowed the network to effectively capture the relationship between channels and make full use of directional and positional information to capture target features. For the network with C3, we used the SMU to solve the dying-ReLU problem and thus obtain more features and improve the network performance. The networks with C1, C2, and C3 had a 0.98%, 0.30%, and 0.39% higher test accuracy than LeNet-5, respectively. The networks with C1 + C2, C2 + C3, and C1 + C3 had a higher test accuracy than those with C1, C2, and C3 alone. Our network outperformed the other networks in terms of test accuracy, reaching 98.02%, which was 2.33% higher than that of LeNet-5.
Another goal of our improved network was to be lightweight. Our proposed network had only 2.44 M FLOPs and 0.16 M parameters. Relative to LeNet-5, the structural changes in our network occur mainly in C1. Although the change in convolution kernels allows the convolutional layer to produce a larger RF and capture more information, it also introduces more network parameters and increases the computational complexity.
AlexNet, GoogLeNet, MobileNet, ShuffleNet, MobileViT-XXS, and VGGNet16 had lower accuracy, more FLOPs, more parameters, and a longer inference time than our network. In particular, the highly anticipated MobileViT-XXS had a test accuracy of 96.77%, 350.35 M FLOPs, 1.02 M parameters, and an inference time of 645.15 ms. The advanced networks thus did not produce satisfactory results on our dataset. Meanwhile, the experimental results reveal that the difference in accuracy between shallow and deep networks on our dataset was not as great as initially anticipated. By avoiding the additional parameters and high computational complexity associated with advanced network modules such as residual blocks, we reduced the demands on computational resources and time. Our proposed enhanced CNN thus has a simplified network structure and training process while maintaining high accuracy, making it easier to optimize and deploy. Through these improvements, we achieved the goals of a lightweight design and high accuracy. Confusion matrices for the different neural networks on the same dataset are shown in Figure 7.

Advantages of Our Method
According to the order in which the 33 nodes are marked on the human body in Figure 3, the head, hands, and legs respectively correspond to the upper, middle, and lower parts of the RGB images. As the colors and textures of the RGB images of different activities vary greatly, we compared the node data to study the rich activity information that they contain. For the sake of illustration, we selected three contrasting activities (jumping jacks, wiping a window, and blow-drying hair) for analysis. As the data of the head nodes change little, we ignored these nodes and selected the 22 nodes from the right shoulder onward, nodes 11–32, for comparison, as shown in Figure 8.

Graphical Analysis of Nodes on Different Body Parts
We compared changes in the x-axis data for the three continuous activities: jumping jacks, wiping a window, and blow-drying hair. For jumping jacks, as shown in Figure 8a2, the x-axis data of nodes 13–22 on the arms changed greatly, whereas those of nodes 25–32 on the legs and feet changed only slightly. These changes correspond to the way the activity is performed. For wiping a window, as shown in Figure 8b2, the data of nodes 14, 16, 18, 20, and 22 on the right arm were scattered, whereas those of other nodes changed slightly. These changes correspond to the fact that the activity involves larger motions of the right hand and smaller motions of other body parts. For blow-drying hair, as shown in Figure 8c2, the node data hardly changed. In summary, the data distribution of the nodes clearly presents the characteristics and changes of different activities, and the RGB images obtained by encoding on this basis retain the complete information of the activities.

Intensity Analysis of RGB Images
The intensity information of activities is shown in Figure 8a1-c1 and corresponding node box plots are presented in Figure 8a2-c2.
Figure 8a2 shows that the data of nodes 13–22 on the arms were scattered, those of nodes 25–32 on the legs and feet were relatively concentrated, and those of nodes 11 and 12 on the shoulders and nodes 23 and 24 on the waist were only slightly scattered. Therefore, the motion of the arms changed within a large range, that of the legs and feet changed within a small range, and that of the shoulders and waist changed little. A horizontal examination of Figure 8a1 reveals rich and bright colors and high contrast in the RGB image, especially in the middle part of the image corresponding to the arms, where the color and its range vary greatly. This suggests that the intensity of motion of this body part was greater than that of other body parts. Moreover, the feature information reflected in the RGB image is consistent with the box plot. The intensity information on each body part in the jumping jacks activity is thus clearly observed in Figure 8a1. Figure 8b2 shows that only the data of nodes 14, 16, 18, 20, and 22 on the right arm were scattered, whereas those of other nodes were concentrated. This shows that in the case of wiping a window, the right arm moves over a large range, whereas the other body parts move little. A horizontal examination of Figure 8b1 reveals no great change in color on the whole. Although there is a partial color change in the middle part of the image corresponding to the arm, it is not easily distinguishable from the surrounding area, as expected. That is to say, in this activity, the motion of one arm has a high intensity whereas that of the other body parts has low intensity, and only inconspicuous color changes are observed in the RGB image. Figure 8c2 shows a concentrated data distribution of the nodes on all body parts, suggesting that there was little body motion in the action of blow-drying hair. A horizontal examination of Figure 8c1 reveals almost no change in the color of the image, suggesting a very low motion intensity of all body parts in the activity, which is consistent with the box plot.
A comparison of Figure 8a1–c1 reveals a sharp contrast between the RGB images, especially in the middle and lower parts of the images corresponding to the nodes on the limbs. Figure 8a1 shows the largest color change, Figure 8b1 a moderate color change, and Figure 8c1 the smallest color change, which indicates that the RGB images contain rich intensity information. By organizing and testing the RGB images of physical exercise, we achieved an accuracy of 99.61% ± 0.05%. Doing the same for the RGB images of housework activities and other simple daily activities, we achieved accuracies of 91.33% ± 1.06% and 98.98% ± 0.28%, respectively. In summary, we can obtain intensity information for different movements and nodes from the RGB images.

Periodic Analysis of RGB Images
The RGB images clearly show periodic information of motion, as shown in Figure 8a1–c1. We selected four nodes (i.e., nodes 15, 16, 27, and 28) to show the periodic characteristics of activities and observe their data changes. Nodes 15, 16, 27, and 28 correspond to the left wrist, right wrist, left ankle, and right ankle, respectively. The change curves are shown in Figure 8a3–c3.
Figure 8a3 shows that the data of all four nodes have obvious periodic changes. Figure 8a1 shows clear periodic lines in the middle part and less clear periodic lines in the lower part, which correspond to the curved paths of the wrists and ankles and the periodic motion of the arms, legs, and feet during jumping jacks. The characteristic information in the RGB images is consistent with the plotted curves, and we can thus directly obtain periodic information on the jumping jacks from Figure 8a1. Figure 8b3 shows obvious periodic changes only for the data of the right wrist. Figure 8b1 shows periodic lines only in the middle part of the image. This is consistent with the right arm swinging and moving periodically while the other body parts remain almost static during the window-wiping activity. Figure 8c3 shows no periodic change in the data of the four nodes in the hair-drying activity, which aligns with there being no periodic pattern in Figure 8c1. The periodic information in the RGB image is thus consistent with the plotted curve; i.e., this activity does not involve periodic motion.
A comparison of Figure 8a1–c1 reveals periodic lines in Figure 8a1,b1 but none in Figure 8c1. The RGB images thus contain rich periodic information for judging the motion characteristics of different activities. Testing on the RGB images with periodic lines achieved an accuracy of 97.17% ± 0.32%; for the RGB images without periodic lines, the accuracy was 99.39% ± 0.38%. In summary, the results obtained using the proposed encoding method correspond well to the characteristics of real-life activities and retain the main change points of each activity. We can judge whether an activity has periodic characteristics by directly observing whether there are periodic lines in the RGB image.

Conclusion
We propose an encoding method in which the 3D coordinate information of human body nodes is extracted and encoded into color pixels, on the basis of which a color-coded image is generated to fully describe human activities and support their recognition. The coded image is input to an improved network based on LeNet-5 for learning. Our results showed that our network model outperformed other advanced neural networks on the dataset constructed in this work, while balancing a lightweight design with high precision. By explaining the proposed encoding method and analyzing the intensity and periodicity of activities based on the characteristics of the coded image, we verified the intuitiveness and effectiveness of the encoded image. In future work, we will continue to explore image encoding methods that cover node features and skeleton data, so as to learn more deep features and recognize human activities more effectively.

Figure 1.
Figure 1. Human activity recognition (HAR) studies from the literature. Compared with other studies on human daily activity recognition, the present study involved no physical contact with the participants, had the fewest cameras, recognized the most activities, and had the highest activity recognition accuracy. The reciprocal of the number of nodes and cameras ranges between 0 and 1, the activity count ranges from 0 to 25, and the accuracy ranges from 80% to 100%.

Figure 2.
Figure 2. Schematic of an HAR system. a) The activity video is captured by the camera, and the human body node data are captured and encoded. b) The RGB images generated through encoding are input into the convolutional neural network (CNN) to recognize 25 daily and physical activities, and the advantages of the encoding method are analyzed.

Figure 3.
Figure 3. Encoding method and 25 types of generated RGB image.

Figure 5. Figure 6.
Figure 5. Performance comparison of the neural network under different hyperparameters. a) Accuracy curve for training with different learning rates, b) accuracy curve for training with different weight decay values, c) accuracy curve for training with different batch sizes, d) per-epoch time curve for training with different batch sizes, e) accuracy curve for training with different CA reduction values, and f) accuracy curve for training with different loss functions.

Figure 7.
Figure 7. Confusion matrices of different neural networks for the same dataset. a) AlexNet, b) GoogLeNet, c) MobileNet, d) ShuffleNet, e) MobileViT-XXS, f) VGGNet16, g) LeNet-5, and h) this work.

Figure 8.
Figure 8. a1-c1) Human activities and RGB images; a2-c2) box plots of some nodes on the x-axis; and a3-c3) change curves of node data on the x-axis.

Table 1.
Comparison of previously proposed HAR methods and our method.

Table 2.
Test results of various models and ablation experiments (C1-making changes in the neural network structure, C2-adopting coordinate attention, and C3-adopting the smooth maximum unit).