Video augmentation technique for human action recognition using genetic algorithm

Classification models for human action recognition require robust features and large training sets to generalize well. Data augmentation is commonly employed to compensate for imbalanced training sets and achieve higher accuracy. However, samples generated by standard augmentation merely reflect existing samples within the training set; their feature representations are less diverse and therefore yield less precise classification. This paper presents new data augmentation and action representation approaches for growing training sets. The proposed approach rests on two fundamental concepts: virtual video generation for augmentation and representation of action videos through robust features. Virtual videos are generated from the motion history templates of action videos, which are convolved by a convolutional neural network to produce deep features. Guided by the objective function of a genetic algorithm, the spatiotemporal features of different samples are combined to generate representations of the virtual videos, which are then classified with an extreme learning machine classifier on the MuHAVi-Uncut, iXMAS, and IAVID-1 datasets.


| INTRODUCTION
Human action recognition (HAR) models predict actions from visual and motion cues, either using traditional handcrafted features with prior information (that is, dense trajectories [1], BoF [2], and holistic representations [3,4]) or deep learning models, such as two-stream networks [5], 3D convolutional neural networks (CNNs) [6], and recurrent neural networks [7,8]. Despite their successes, deep learning models require large training sets. Both categories of HAR techniques improve the accuracy of the learning model in proportion to the quality and quantity of the training data. However, data skewness and irregularity degrade performance through missing feature representations and class imbalance.
Synthetic data generation and data sampling are effective measures for mitigating data irregularity.
Computer vision research has focused on synthetic data generation by applying data augmentation techniques to grow the training sets for stable HAR models.
Tiny videos [20] were generated using GANs from two convolved input streams: a foreground motion stream and a background-frame contextual stream. However, this requires a stable input video, with Random Sample Consensus (RANSAC) and Scale-Invariant Feature Transform (SIFT) features computed at each video frame. Similarly, small videos were generated by extending recurrent neural networks (RNNs) with temporal recurrence and spatial convolutions [22]. GAN-based video generation methods carry high computational overhead and suffer from non-converging gradients and mode collapse; consequently, they produce limited sample variation and a loss imbalance between the generator and discriminator networks [24].
Because standard data augmentation techniques generate translated replications of existing examples, they may not yield significant learning improvements. Moreover, existing data augmentation approaches lack the semantic diversity that improves classification results [18]. Robust data augmentation approaches should therefore ensure dynamic feature generation for more accurate HAR prediction models.
To overcome these limitations, this paper presents virtual video generation with a genetic algorithm (GA) that evolves deep spatiotemporal feature representations. The initial training set contains the deep spatiotemporal feature representation of each action class; the GA increases the number of instances per class to improve action classification and resolve data irregularity. Our contribution centers on building stronger predictive models rather than creating physical video data. Accordingly, we evaluate diverse virtual video representations for HAR and compare our technique with state-of-the-art video augmentation techniques. The GA introduces semantic diversity by combining spatiotemporal video representations of the same class; these representations carry the semantic and temporal contextual information of each action video. Combining the spatiotemporal information of action videos yields a more diverse training set, extending the training data, improving HAR models, and reducing data irregularity. The GA's capacity to balance data irregularity, previously demonstrated by Haque et al. [25] and Yang et al. [26], is effective in our approach. Our augmented video combines the contextual information of two action videos to produce dynamic virtual videos, as shown in Figure 1.
Our contributions are as follows: First, we propose an end-to-end feature augmentation technique integrating deep features and a GA for HAR. Second, the proposed approach extends the training set for HAR using virtual videos, inducing semantic diversity in the training data to stimulate wider variations in scene contextual information (Figure 2) and improve the learning accuracy of HAR models. Third, our approach overcomes data irregularity to generate stronger recognition models.

| THE PROPOSED APPROACH
Consider the training data (V^i_{x,y,t}, l^i), where V^i are the action videos, l^i are the corresponding action labels, and i = (1, 2, 3, …, m) indexes the m training samples, such that each video sample V^i_{x,y,t} consists of t video frames.

F I G U R E 1 Graphical representation of the proposed video augmentation technique for human action recognition using a genetic algorithm. Images (A) and (B) are original frames from two training videos, and (C) is a visualization of a frame generated through the proposed approach.

First, each video sequence V^i_{x,y,t} is processed to obtain the actor's silhouette f(x, y, t) using graph-cut segmentation [27] (Figure 2). The parameters x, y represent the pixel frame information, and t denotes the temporal shift of the video frames. The actor's silhouettes f(x, y, t) of each video frame are aggregated into a single spatiotemporal motion template Mt_f(x, y, t) using motion history images [3]. Deep spatiotemporal features D are then computed for the motion template Mt_f(x, y, t) using a CNN. To extend the training set for HAR, we generate optimal deep spatiotemporal feature representations D of virtual videos over ψ generations of dimension η using the GA and select the elite representations D+ that minimize the classification loss J(y_p) between the predicted classes y^i_p and the actual action classes l^i. The extended training set is evaluated with an extreme learning machine (ELM) classifier to obtain optimal decision boundaries for HAR. Our technique is limited neither to a specific type of CNN model nor to the input video type (RGB video, 2D frames, or silhouette data); rather, it extends the training set irrespective of the input data. The complete architecture of the proposed method is presented in Figure 2.

| The deep spatiotemporal feature extraction
To create the initial population for virtual video generation, action videos are processed to segment the actor's silhouettes f(x, y, t) [27]. The silhouettes of the entire video then form a single motion template Mt_f(x, y, t) [3], encoding the spatiotemporal movement information of the t video frames, as given by (1).
where τ is the total number of silhouette frames aggregated to form Mt_f for each action video. The motion templates Mt_f are then normalized using min-max normalization.
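As an illustration of the motion template in (1), the following sketch assumes the standard motion history image recurrence (recent motion stamped with τ, older motion decayed by one per frame); it is not the authors' exact implementation, only a minimal stand-in, followed by the min-max normalization step:

```python
import numpy as np

def motion_history_image(silhouettes, tau):
    """Aggregate binary silhouette frames f(x, y, t) into one motion
    template Mt_f: where the actor is present, stamp tau; elsewhere,
    decay the previous history by one step (standard MHI recurrence)."""
    h, w = silhouettes[0].shape
    mhi = np.zeros((h, w))
    for frame in silhouettes:
        mhi = np.where(frame > 0, float(tau), np.maximum(mhi - 1.0, 0.0))
    return mhi

def min_max_normalize(template):
    lo, hi = template.min(), template.max()
    return (template - lo) / (hi - lo) if hi > lo else np.zeros_like(template)

# toy example: a 4x4 "actor" moving one column to the right per frame
frames = [np.zeros((4, 4)) for _ in range(3)]
for t, f in enumerate(frames):
    f[:, t] = 1
mhi = min_max_normalize(motion_history_image(frames, tau=3))
```

Because more recent motion receives higher intensity, a single 2D template encodes the temporal order of the actor's movement.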
We further rescale the templates to N × N dimensions for input to pretrained deep CNN models. The CNN model then generates the deep spatiotemporal representation D by convolving the 2D template with a kernel w of dimension z × z and summing the results over each receptive field, as in (2).
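The convolve-and-sum operation of (2) can be sketched as a plain 2D valid convolution; the template and averaging kernel below are toy stand-ins for the pretrained CNN's learned filters:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a z x z kernel over the template; at each receptive field,
    multiply elementwise and sum the result, as in (2)."""
    z = kernel.shape[0]
    h, w = image.shape
    out = np.zeros((h - z + 1, w - z + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + z, j:j + z] * kernel)
    return out

template = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2)) / 4.0   # simple averaging kernel for illustration
features = conv2d_valid(template, kernel)
```

In the actual pipeline, many such learned kernels are stacked across CNN layers, and the activations of a late layer serve as the feature vector D.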

| GA for generating virtual action videos
For virtual video generation using the GA, the feature set D serves as the initial population ϕ_{m,k} of dimension η, where each row of ϕ_{m,k} represents a chromosome and each feature a gene θ in that chromosome; m × k is the feature dimension of D, with m samples of k features, and k denotes the total number of genes θ in one chromosome. The initial population ϕ_{m,k} is evolved for ψ generations using the GA's selection and crossover operations to produce new pairs of offspring that fulfill the fitness criterion J of the evolution. In our modified two-point discrete crossover, the two crossover points [c1, c2] are randomly chosen from the real interval {0 < ℜ < k} to interchange sections of two parents {P1, P2} = {ϕ_{α,k}, ϕ_{α+1,k}}, as expressed in (3) and (4). The crossover operation breeds new offspring that inherit the characteristics of the parent chromosomes along with some evolved discriminative characteristics. We ensure that each new pair of offspring is stronger and satisfies the fitness criterion.
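A minimal sketch of the modified two-point crossover in (3) and (4), assuming for simplicity that the crossover points are drawn as distinct integer gene indices (an illustrative simplification of the real-valued interval described above):

```python
import numpy as np

def two_point_crossover(p1, p2, rng):
    """Swap the gene segment between two randomly chosen crossover
    points c1 < c2 of two parent chromosomes (feature vectors)."""
    k = len(p1)
    c1, c2 = sorted(rng.choice(np.arange(1, k), size=2, replace=False))
    o1, o2 = p1.copy(), p2.copy()
    o1[c1:c2], o2[c1:c2] = p2[c1:c2], p1[c1:c2]
    return o1, o2

# toy parents: all-zero and all-one chromosomes make the swap visible
rng = np.random.default_rng(0)
p1 = np.zeros(8)
p2 = np.ones(8)
o1, o2 = two_point_crossover(p1, p2, rng)
```

Each offspring keeps the genes of one parent outside [c1, c2) and inherits the other parent's genes inside it, so the total gene content of the pair is conserved.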
The fitness function J is evaluated over the entire population in each generation ψ to produce the elite deep spatiotemporal representations D+ of the virtual videos. The terminating condition for the GA's chromosome evolution is the selection, across the generations of feature representations, of the elite deep spatiotemporal feature representations attaining the minimum fitness measure.

F I G U R E 2 Architecture diagram for the production of virtually generated human action videos using deep features and a genetic algorithm.

The generated deep spatiotemporal representations D+ are an evolved, elite version of the initial chromosome population, differing from the initial features D computed from the human action videos. The evolved representation extends the training set, an important step for model optimization. Here, we emphasize video analysis to recognize different human action classes and achieve better action prediction performance.
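Putting selection, crossover, and the fitness-based termination together, a toy version of the evolution loop might look as follows. In the paper, the fitness J is the ELM classification loss; this sketch substitutes a simple distance-based fitness purely for illustration:

```python
import numpy as np

def evolve(pop, fitness, generations, rng):
    """Toy GA loop: pair individuals at random, apply two-point
    crossover, and keep an offspring only when it lowers (improves)
    the fitness measure. Returns the elite population."""
    pop = pop.copy()
    m, k = pop.shape
    for _ in range(generations):
        order = rng.permutation(m)
        for a, b in zip(order[::2], order[1::2]):
            c1, c2 = sorted(rng.choice(np.arange(1, k), 2, replace=False))
            child = pop[a].copy()
            child[c1:c2] = pop[b, c1:c2]
            # elitist selection: replace the parent only on improvement
            if fitness(child) < fitness(pop[a]):
                pop[a] = child
    return pop

# stand-in fitness: distance to an all-ones "ideal" representation
rng = np.random.default_rng(1)
target = np.ones(10)
fitness = lambda c: np.linalg.norm(c - target)
pop0 = rng.random((6, 10))
elite = evolve(pop0, fitness, generations=50, rng=rng)
```

Because an individual is replaced only when its fitness strictly decreases, every chromosome's fitness is non-increasing across generations, mirroring the elite-selection behaviour described above.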

| Classification using ELM
After computing the evolved representations D+, they are passed to a feedforward ELM classifier for model generation. The ELM classifier predicts y_p(D+) as a single output unit using h hidden nodes, as described in (5), where γ_p = {γ_1, γ_2, γ_3, …, γ_h} are the weights between the ELM's h hidden nodes and the ELM's output vectors H_p. The ELM output vectors are y_p(D+) = {y_1, y_2, y_3, …, y_h}, and the ELM decision function for action recognition is a sigmoid activation function. The classification loss J(y^i_p) of the ELM between the predicted and actual action classes is given in (6).
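A compact sketch of ELM training and prediction, assuming the standard ELM recipe (random input weights, sigmoid hidden layer, output weights γ solved in closed form via the Moore-Penrose pseudoinverse); node counts and data here are toy values, not the paper's configuration:

```python
import numpy as np

def train_elm(X, Y, h, rng):
    """Single-pass ELM training: random input weights W and biases b,
    sigmoid hidden activations H, and output weights gamma obtained
    by least squares (no backpropagation)."""
    n, d = X.shape
    W = rng.standard_normal((d, h))
    b = rng.standard_normal(h)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # hidden-layer outputs
    gamma = np.linalg.pinv(H) @ Y            # closed-form solution
    return W, b, gamma

def predict_elm(X, W, b, gamma):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.argmax(H @ gamma, axis=1)

# toy two-class problem with one-hot targets
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.3, (20, 4)), rng.normal(2, 0.3, (20, 4))])
y = np.array([0] * 20 + [1] * 20)
Y = np.eye(2)[y]
W, b, gamma = train_elm(X, Y, h=30, rng=rng)
acc = np.mean(predict_elm(X, W, b, gamma) == y)
```

The single least-squares solve is what makes ELM training fast relative to backpropagation-based classifiers, which is the speed advantage reported later in the comparison section.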
| EXPERIMENTS AND FINDINGS

| Datasets
The performance evaluation was conducted on three activity recognition datasets: INRIA Xmas motion acquisition sequences (iXMAS) [28], MuHAVi-Uncut [29], and IAVID-1 [30]. iXMAS is a silhouette-based, multiview activity dataset comprising 12 activities performed by 12 actors and captured from five viewpoints. MuHAVi-Uncut is a multiview activity video dataset containing 17 activities performed by 14 actors, comprising 2898 single-actor silhouette videos.
The IAVID-1 dataset was developed for instructor activity recognition and contains 100 videos across 8 action classes performed by 12 actors. The action video samples in MuHAVi-Uncut and IAVID-1 are unbalanced, as shown in Figure 3.
Generating strong decision boundaries for learning models trained on imbalanced datasets is difficult. We therefore selected imbalanced datasets to evaluate the strength of the proposed augmentation approach against the data irregularity caused by class imbalance. Our augmentation approach introduces semantic diversity and enlarges the training data to overcome this irregularity; in particular, the MuHAVi-Uncut and iXMAS datasets constitute class-imbalanced samples. By overcoming data irregularity, the proposed video augmentation approach supports the generation of stronger predictive models.

| Experimental setup
We employed the AlexNet [22], VGG19 [18], ResNet [31], GoogLeNet [32], and DenseNet201 [33] CNN models for feature extraction, reporting the features extracted from the best-performing layers. For brevity, the deep spatiotemporal features extracted from the FC6 layers of AlexNet [22] and VGG19 [18], the FC1000 layers of ResNet [31] and DenseNet201 [33], and the AVPool5 layer of GoogLeNet [32] are denoted D_A, D_V, D_R, D_D, and D_G, respectively. These features were augmented with the proposed approach to introduce feature diversity. The ELM classifier uses 129 060 hidden nodes for the MuHAVi-Uncut dataset and 2100 hidden nodes for the iXMAS and IAVID-1 datasets [30].
F I G U R E 3 Distribution of imbalanced class samples in the MuHAVi-Uncut and IAVID-1 datasets, represented as bar charts. The charts show that the video samples in MuHAVi-Uncut and IAVID-1 are imbalanced, causing skewness in the data distribution. This data irregularity must be addressed to generate a stronger predictive model.

| Evaluation criteria
We evaluated our method using the leave-one-actor-out (LOAO) and leave-one-camera-out (LOCO) validation schemes [30]. Performance is measured by computing the accuracy (Ac) and the F-score (F*) over the feature set with and without the proposed augmentation.
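The evaluation measures and LOAO fold construction can be sketched as follows; the macro-averaged F-score variant is an assumption, as the paper does not specify the averaging scheme:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of correctly predicted action labels (Ac)."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def macro_f_score(y_true, y_pred):
    """Macro-averaged F-score (F*) over action classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * p * r / (p + r) if p + r else 0.0)
    return float(np.mean(scores))

def loao_splits(actor_ids):
    """Leave-one-actor-out: each actor's samples form one test fold;
    the remaining actors' samples form the training fold."""
    actor_ids = np.asarray(actor_ids)
    for actor in np.unique(actor_ids):
        yield np.where(actor_ids != actor)[0], np.where(actor_ids == actor)[0]
```

LOCO folds would be built the same way, grouping by camera viewpoint instead of actor identity.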

| Evaluation of actor invariance (LOAO validation)
The LOAO validation scheme holds out each actor in turn to evaluate HAR performance with the proposed augmentation technique on the MuHAVi-Uncut, iXMAS, and IAVID-1 datasets. The averaged action prediction accuracy is reported in Table 1. To assess the impact of the semantic diversity introduced by the proposed augmentation, we extracted deep spatiotemporal features from AlexNet, VGG, ResNet18, ResNet101, GoogLeNet, and DenseNet, denoted D_A, D_V, D_R18, D_R, D_G, and D_D, and trained the feedforward ELM classifier with and without the proposed augmentation for HAR. The confusion matrices in Figures 4 and 5 show the per-class prediction accuracies for the MuHAVi-Uncut and iXMAS datasets: those on the left show the per-class prediction rates without the proposed augmentation technique, and those on the right show the rates with it. As the confusion matrices illustrate, our augmentation approach improves the action prediction rate in all classes.
From Table 1, although the augmented deep spatiotemporal features extracted from AlexNet showed the highest accuracy on all three datasets, the most significant improvements were observed for the augmented features extracted from VGG19, DenseNet, and GoogLeNet, where HAR performance increased by 5% to 7%, whereas the performance gain for the augmented features extracted from AlexNet was 2%. Overall, LOAO performance improved by 2% to 7%, showing that the proposed augmentation approach increases the semantic diversity of the spatiotemporal feature representation of human actions. Furthermore, our feature augmentation approach enhances the discriminative ability of features extracted from any CNN, whether of sequential or directed-acyclic-graph architecture. Moreover, the MuHAVi-Uncut, iXMAS, and IAVID-1 datasets consist of imbalanced action class samples that disturb the decision boundaries of the recognition model, as illustrated in Figure 3. To overcome this data irregularity, the GA increases the number of spatiotemporal feature instances for each class, balancing the class samples and removing the skewness from the data. Our augmentation approach evolves and combines deep spatiotemporal features to generate new per-class deep spatiotemporal representations of actions while keeping the same class labels. This implies that our feature augmentation approach improves action classification and resolves data irregularity by balancing the class samples; the augmented features are therefore effective in establishing strong decision boundaries and improved predictive models for HAR.

F I G U R E 4 Confusion matrix of the MuHAVi-Uncut dataset for the action recognition scheme evaluated without and with the proposed augmentation approach for person-invariant human action recognition (HAR) (leave one actor out [LOAO]). The impact of the proposed augmentation approach is shown using the per-class action recognition rate: the matrix on the left represents HAR without the proposed augmentation approach, and that on the right represents HAR with it.

| Evaluation of multiview HAR (LOCO validation)
To understand the impact of the semantic diversity introduced by the proposed augmentation approach in a multiview setting, we conducted a LOCO validation scheme for HAR. The LOCO scheme examines the view-invariance attribute of the proposed augmentation approach using eight viewpoints for MuHAVi-Uncut and five viewpoints for iXMAS. The deep spatiotemporal features D_A, D_V, D_R18, D_R, D_G, and D_D were used to train the feedforward ELM classifier with and without the proposed augmentation for the LOCO scheme. From Table 2, accuracy and F-score improved by 4% to 5% on the MuHAVi-Uncut dataset (accuracy improved from 77.78% to 81.76%) and by 3% to 7% on the iXMAS dataset (from 71.9% to 77.4%). The per-class prediction accuracies are shown in Figures 6 and 7. Our augmentation approach improves the action prediction rate in all classes for view-invariant HAR, as shown by the confusion matrices.
As with the LOAO validation scheme, our proposed augmentation approach improves prediction accuracy for multiview HAR under the LOCO validation scheme. The feature space augmented with the proposed approach thus proved more robust for view-invariant HAR.

| Comparison against video augmentation techniques
The extended training set was generated and compared against state-of-the-art video augmentation methods for model learning (Table 3). The input training videos of IAVID-1 were augmented with upsampling [13], perturbations of video frames (Gaussian and salt-and-pepper noise) [13], rotate and crop [34], inverted frame order [12], frame skipping [15], frame mirroring [14], and videos generated using adversarial networks [20]. The HAR method was kept identical across all augmentation techniques for a fair comparison. We augmented the human action videos with these state-of-the-art augmentation methods, generated motion templates encoding the spatiotemporal representation of the actions, and extracted deep spatiotemporal features D_A to train the ELM classifier. Our semantically diverse video augmentation technique improves prediction accuracy by approximately 15% over frame skipping [15] and frame perturbation [13] on a 70/30 training-testing split, and by 58.22% over videos augmented using the adversarial network [20]. Compared with Vondrick et al. [20], augmented videos generated using GANs have lower resolution and higher computational overhead, whereas frame skipping [15], mirroring [14], and inverted frame order [12] merely reflect existing training data; these methods therefore contribute little new data to improve the classification model. Table 3 shows the performance gain of our method from augmenting the feature space, inducing genetic diversity in the feature set computed from AlexNet, and relearning the model for performance optimization. From Table 3, our proposed method is also computationally faster than the state-of-the-art methods owing to the single-pass feedforward learning of the ELM, which requires no backpropagation hyperparameter tuning [35].

F I G U R E 5 Confusion matrix of the iXMAS dataset for the action recognition scheme evaluated without and with the proposed augmentation approach for person-invariant human action recognition (HAR) (leave one actor out [LOAO]). The impact of the proposed augmentation approach is shown using the per-class action recognition rate: the matrix on the left represents HAR without the proposed augmentation approach, and that on the right represents HAR with it.

F I G U R E 6 Confusion matrix of the MuHAVi-Uncut dataset for the action recognition scheme evaluated without and with the proposed augmentation approach for view-invariant human action recognition (HAR) (LOCO). The impact of the proposed augmentation approach is shown using the per-class action recognition rate: the matrix on the left represents HAR without the proposed augmentation approach, and that on the right represents HAR with it.

F I G U R E 7 Confusion matrix of the iXMAS dataset for the action recognition scheme evaluated without and with the proposed augmentation approach for view-invariant human action recognition (HAR) (LOCO). The impact of the proposed augmentation approach is shown using the per-class action recognition rate: the matrix on the left represents HAR without the proposed augmentation approach, and that on the right represents HAR with it.

T A B L E 2 Evaluation of the leave one camera out (LOCO) validation scheme on the MuHAVi and IAVID-1 datasets for multiview human action recognition (HAR).
The state-of-the-art video augmentation approaches were not evaluated on the MuHAVi-Uncut dataset because it already consists of approximately 3000 action videos across 17 action classes, so the samples in each class cover almost all actions. IAVID-1 consists of 100 action samples across 8 action classes and requires video augmentation to increase the training samples. We therefore evaluated all video augmentation approaches on IAVID-1 to compare the strength of the proposed technique with standard video augmentation techniques.
Using virtual video generation, our method improves prediction accuracy by approximately 30% over the HOG representation of MHIs [41] under LOCO and by approximately 10% under LOAO on the MuHAVi dataset. Compared with the C3D-features-with-SVM approach [6], an improvement of 32.66% was observed on the IAVID-1 dataset.
Similarly, we extracted the spatiotemporal representation of action videos using I3D [42] and I3D with attention [43,44] for action recognition on the iXMAS dataset. RGB and optical-flow frames were provided as input to the I3D model, which convolves them with asymmetric cubic filters to encode the feature representation; for I3D with attention, an empirical weight factor over the I3D features was assigned to each RGB and optical-flow frame, which proved effective for action representation. However, the proposed video augmentation approach outperformed both I3D and I3D with attention owing to the genetic diversity introduced within the spatiotemporal features. Our technique thus outperforms the state-of-the-art approaches. The semantic diversity of the evolved virtual video representation is beneficial for HAR. The benefits and performance of the proposed augmentation approach can be attributed to the stronger decision boundaries established by evolving deep spatiotemporal features and retaining the elite virtual video representations. The proposed augmentation approach improves the baseline action prediction accuracy on the MuHAVi-Uncut and IAVID-1 datasets.

| Significance of virtual videos for HAR
For an efficient HAR model, a robust and discriminative feature representation of the training data plays a significant role in learning because it ensures stronger decision boundaries and fast, cost-effective predictions, as validated in Tables 1 and 2. Robust feature representations are therefore more beneficial to accurate predictive models than the physical existence of the data, because they simplify interpretation by defining reliable predictive boundaries and enhance model generalization. Our proposed video augmentation technique significantly improves model learning by inducing semantic diversity within the visual data: it augments human action video features to improve the decision boundaries of learning models for HAR. Contrary to standard augmentation techniques, our approach augments the features despite data irregularities.
Moreover, our feature augmentation approach extends the training samples in each class to balance the representation of each action and hence improves model learning. The proposed evolved video-set representation thus enables efficient HAR through diverse deep spatiotemporal feature representations, regardless of the physical existence of video data.

| CONCLUSION
In this paper, we proposed a video augmentation technique to improve deep learning models for HAR. We showed that model optimization stems from inducing semantic diversity in the feature sets, which common data augmentation methods lack, and we achieved this semantic diversity through virtual video generation using a GA. Our proposed video augmentation scheme improved HAR accuracy by 2% to 7% on the MuHAVi-Uncut, iXMAS, and IAVID-1 datasets. The robust performance of the proposed approach is attributed to the following factors: first, augmenting the feature space is better than augmenting the data space for countering data irregularities; second, inducing semantic diversity overcomes classifier instability; and third, relearning the model with diverse feature sets generates stronger decision boundaries. Moreover, our augmentation technique can be applied to HAR using RGB data and to other domains.

T A B L E 4 Performance comparison of the proposed approach with state-of-the-art human action recognition (HAR) techniques.