Gender discrimination, age group classification and carried object recognition from gait energy image using fusion of parallel convolutional neural networks

Age and gender are two key attributes for healthy social interaction, access control, intelligent marketing etc. Likewise, carried object recognition helps in identifying the owner of abandoned baggage or a person littering in public places. The above-mentioned surveillance tasks display discriminative characteristics in gait. Primates accomplish scene-context understanding and react to different circumstances with varying reflexes with ease. Human beings achieve this by recollecting prior experiences and adapting to new situations quickly. Modelling this behaviour, this work combines customized and learnable filters so that the knowledge base can always be kept up to date while providing flexibility in learning new contexts. Thus, a specialized parallel deep convolutional neural network architecture is proposed, in which customized filters that extract intrinsic characteristics and data-driven learnable filters are fused to enhance the performance of a single convolutional neural network. Experiments show that learning is augmented when customized filters and learnable filters are fused together. Results show that the proposed system achieves better performance on the CASIA B database and on the OU-ISIR gait database large population datasets with age and with real-life carried objects.


INTRODUCTION
Understanding human behaviour helps in predicting imminent abnormal activities and consequently assists security agencies in reacting immediately. Analysing far-field video sequences for human abnormality detection is a challenging problem because of low-resolution imagery. Thus, an alternative to facial features is required for analysing human behaviour at a distance. Gait recognition has emerged as an attractive biometric technology for identifying people by analysing the way they walk, even at a distance and in uncontrolled scenarios. However, the main challenge in a real, uncooperative environment is the effect of inherent intra-class variations caused by covariate factors such as changes in view angle, walking speed variation, carried object condition and type of footwear. The gait energy image (GEI), the average of the silhouettes over one gait period, integrates both static and dynamic components of human walking, including motion frequency and the temporal and spatial changes reflected in the appearance of the human body.
The state-of-the-art methodology reported in the literature can be categorized into model-based and model-free approaches. Model-based approaches [1,2] rely on stride parameters that describe the gait via the human body structure. Model-based approaches are computationally expensive and require high-resolution images. Computationally effective and simple model-free methods emphasize motion parameters extracted from silhouettes. Here, success lies in extracting discriminative features.
Deep learning methods learn discriminative feature representations or metrics directly from raw data and have reported promising results for gait recognition. Wu et al. [3] learned pairwise similarities with a 3D convolutional neural network (CNN) with temporal input. Takemura et al. [4] further improved pairwise similarity learning by optimizing the input, output and loss functions of CNN-based methods on large-scale cross-view datasets. GEINet [5] is an end-to-end network that discovers discriminative information for gait recognition by integrating segmentation and recognition modules. Yu et al. [6,7] constructed generative adversarial networks that transform gait images obtained from any viewpoint to the side view, irrespective of view point, carried condition and clothing. He et al. [8] proposed multi-task generative adversarial networks for learning view-specific feature representations, with a new gait template called the period energy image, an extension of the GEI that enriches spatial and temporal information for cross-view gait recognition.
In this paper, the problem of gender discrimination, age group classification and carried object recognition from gait images is resolved using an ensemble model with parallel multiscale CNNs.
The contributions of this paper are summarized as follows:
• Pyramidal representation of the input GEI facilitates the CNN in learning the multi-scale signal effectively. It mimics the behaviour of the human brain, where different areas respond to stimuli at different scales.
• Parallel networks with varying numbers of convolution and pooling layers automatically learn and recognize the most discriminative changes in gait features.
• An extensive empirical evaluation of gender discrimination, age group classification and carried object recognition with the deep network architecture; score-level fusion of customized and learnable filters at layer 1 boosts classification accuracy to a great extent.
The paper is structured as follows: Section 2 summarizes related work on gender discrimination, age group classification and carried object recognition. Section 3 describes the proposed method in detail. A comprehensive experimental study on large population gait datasets is detailed in Section 4. The concluding remarks of this paper are given in Section 5.

RELATED WORK
This section elaborates a few state-of-the-art methods for gait-based gender classification, gait-based age estimation and carried object recognition.

Gait-based gender classification
Gender classification is certainly a simple task for humans. However, it remains an active research problem due to its interesting applications in surveillance systems, such as gender-specific restricted access to areas (train coaches, lavatories), content-based image retrieval, targeted marketing etc. Although the human face contains important visual information for gender perception, it requires the subject's cooperation or attention and is very difficult to perceive in long shots. Gender identification from gait signatures has the added advantage of being non-obtrusive and capturable at a distance [9], without requiring the cooperation or even awareness of the subject [10]. Yu et al. [11] learned local features by partitioning the GEI into body parts such as the head, chest and legs, and classified the features using a support vector machine (SVM). Lu and Tan [12] presented a view-invariant gait-based gender classification method with a subspace learning approach. Hu et al. [13] proposed a mixed conditional random field method for gait-based gender classification. As humans tend to assume certain poses at each part of the gait cycle, Isaac et al. [14] proposed to delineate a gait instance as a sequence of poses or frames. Castro et al. [15] proposed a CNN-based method with optical flow features for human identification, in which human gender is classified as supporting evidence for decision-making. Do et al. [16] used an average gait image rather than a GEI, claimed to be computationally efficient and robust against view changes. Additionally, a viewpoint model and a distance signal model are constructed to remove carried objects and worn coats from a silhouette, reducing interference in the resulting classification. Sakata et al. [17] proposed sequential multi-task learning, where the CNN for each non-target task is trained one by one in sequence and the CNN for the target task is trained last.
Since gender classification is easier than age estimation, gender classification is performed first.

Gait based age estimation
Most of the studies on age estimation have focused on face images, which tend to become more wrinkled and sag with age [18,19]. Besides wrinkles and skin, swiftness, head-to-body ratio, gait and the pitch or heaviness of the speech define baselines for age estimation. Davis [20] showed the gait difference between an adult and a child in terms of leg length, stride width and stride frequency. Ince et al. [21] showed that the head-to-body ratio of a child is different from that of an adult. Similar findings were also reported in [22] by analysing the most widely used appearance-based gait representation, the GEI, or averaged silhouette [23], which contains both gait and shape information. There are obvious changes in head-to-body ratio as children grow; in addition, as people get older, a middle-aged spread and stoop appear. Hence, varying appearances in the gait representation serve as clues for gait-based human age estimation. Makihara et al. [24] learned an age regression model for each age group using support vector regression with a Gaussian kernel in conjunction with a manifold learning technique. Lu et al. [25] proposed a multi-label-guided subspace to better characterize and correlate age and gender information, because gait features vary depending on a subject's attributes. Sakata et al. [17] proposed three CNN stages: a CNN for gender estimation, a CNN for age-group estimation and a CNN for age regression. The OU-ISIR gait database comprising the large population dataset with age and gender details was created, and its performance was evaluated using benchmarks such as Gaussian process regression.

Gait based carried object recognition
Several prior approaches have been proposed in the literature for detecting carried objects, as it is a key component in varied security applications. Haritaoglu et al. [27] proposed a backpack detection framework under the presumption that the shape of the human body is symmetric and that individuals display periodic motion while moving unhampered. Hydra, a real-time framework for detecting and tracking multiple people, was proposed by Haritaoglu et al. [28]; it combines head detection by projection histograms with correlation-based matching to segment multiple people in a crowd and track them. W4 [29] learns and creates appearance models of people by locating their parts (head, hands, feet and torso); these body parts can even be tracked through interactions and occlusions. It can also determine whether a person is carrying an object, and segments the object so it can be tracked during exchanges. View- and carried-object-category-invariant detection is achieved by detecting protrusions and comparing them with several exemplars corresponding to different views of binary silhouettes [30]. Along with temporal templates, colour information is fused to reduce false positive detections [31]. Carried object detection has also been formulated as a superpixel classification problem using a codebook [32]. A human-baggage detector has been modelled as a mixture of body parts: head, torso, legs and baggage [33]. BenAbdelkader [34] described a technique to recognize a person carrying an object using a motion-based recognition approach that integrates shape and periodicity cues of the human silhouette; deviations in periodicity and amplitude indicate a person carrying an object.

PROPOSED METHODOLOGY
The parallel architecture proposed for gender discrimination, age estimation and carried condition classification using GEIs is shown in Figure 1. The input GEIs are weighted using a Gaussian average and scaled down by pyramid representation. Apart from the input image of size p × q, the parallel architecture receives the down-sampled versions of the input image, that is, (p/2 × q/2) and (p/4 × q/4). The parallel architecture is designed by reducing the input image size and the associated number of convolutional layers together, so that the number of connections between the last convolutional layer and the fully connected layer stays fixed. A convolutional neural network consists of a stack of layers that takes an input image of size (height (h), width (w), number of channels (c)) and predicts the class or label probabilities at the output.
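The Gaussian-weighted pyramid can be sketched as follows; this is a minimal numpy illustration, where the 3 × 3 binomial kernel is an assumption standing in for whatever Gaussian window the authors used:

```python
import numpy as np

def gaussian_blur3x3(img):
    """Blur with a separable 3x3 binomial kernel [1, 2, 1] / 4 (a common
    Gaussian approximation); edges are handled by reflection padding."""
    k = np.array([1.0, 2.0, 1.0]) / 4.0
    p = np.pad(img, 1, mode="reflect")
    # horizontal pass, then vertical pass
    h = k[0] * p[:, :-2] + k[1] * p[:, 1:-1] + k[2] * p[:, 2:]
    return k[0] * h[:-2, :] + k[1] * h[1:-1, :] + k[2] * h[2:, :]

def gei_pyramid(gei, levels=3):
    """Return [full, half, quarter] resolution versions of a GEI:
    blur with the Gaussian kernel, then drop every other row/column."""
    pyramid = [gei]
    for _ in range(levels - 1):
        blurred = gaussian_blur3x3(pyramid[-1])
        pyramid.append(blurred[::2, ::2])
    return pyramid
```

Applied to a 128 × 88 GEI, this yields the 64 × 44 and 32 × 22 inputs consumed by the smaller streams.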
The input GEI used for analysis is normalized to size (128, 88, 1). Here, parallel CNNs are trained from scratch with the input size progressively reduced by a factor of 2: CNN A receives an input of 128 × 88 pixels, the second network CNN B 64 × 44 and the third network CNN C 32 × 22. The number of convolution and pooling layers of each CNN is engineered to keep the input to the fully connected layer fixed (128 feature maps of size 4 × 2).
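As a quick consistency check, the number of pooling stages needed to bring each stream's input down to the shared 4 × 2 map can be computed with floor-division arithmetic. This is a sketch under the assumption of stride-2, 2 × 2 pooling with floor semantics; the paper does not list the exact layer counts:

```python
def pool_stages(h, w, target=(4, 2)):
    """Count 2x2, stride-2 pooling stages (floor semantics) needed to
    reduce an (h, w) input to the shared target feature-map size."""
    stages = 0
    while (h, w) != target:
        if h < target[0] or w < target[1]:
            raise ValueError("target not reachable by halving")
        h, w = h // 2, w // 2
        stages += 1
    return stages
```

Under this assumption CNN A needs five stages (128 × 88 → 4 × 2), CNN B four and CNN C three, which is why the fully connected fan-in can stay fixed across streams.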
For gender discrimination, the indicators of a walker's gender are postural sway, waist-hip ratio and shoulder-hip ratio [35]. Males tend to swing their shoulders, whereas females tend to swing their hips more than their shoulders while walking. Further, males have broader shoulders, while females usually have thinner waists and wider hips. For age estimation, gait characteristics such as regularity, stability and pace distinguish young and old adults. Moreover, head-to-body ratio and the presence of a stoop are gait cues for predicting a person's age. The appearance and dynamics of a person vary when carrying objects, and this is well visualized in the GEI.
In all the mentioned applications, the high-frequency components of the GEI play a vital role. Thus, the weights of the initial convolutional layer are handcrafted to detect specific features for gender discrimination, age group classification and carried object recognition.
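For illustration, a hand-crafted first-layer filter responding to high-frequency (edge) content can be sketched as below. The Sobel-style kernels are an assumption for exposition, not the actual 8 filters of Figure 2:

```python
import numpy as np

# Illustrative hand-crafted 3x3 edge kernels; the paper's actual filters
# are shown in its Figure 2, so Sobel-style kernels are an assumption here.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def conv2d_valid(img, kernel):
    """Direct 'valid'-mode 2-D correlation with a 3x3 kernel."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * kernel)
    return out
```

Correlating a GEI with such kernels responds strongly at silhouette boundaries, which is the high-frequency content the customized stream targets.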
The 8 filters used at the initial layer of the customized stream are shown in Figure 2. Deep learning aims at learning features automatically from data at multiple levels of abstraction, mapping input to output directly without depending on human-crafted features. Thus, the fusion of the stream with customized weights and the stream with learnable weights at the first convolutional layer is aimed at boosting performance.
The training of all CNNs in this work has been carried out by optimizing the cross-entropy objective function using mini-batch Nesterov's accelerated gradient descent. Backpropagation of the gradient has been performed with an initial learning rate of 0.001 and a momentum of 0.9. The mini-batch size has been set to 28, with 30 epochs. The outputs from the six instances are combined into a single ensemble by score-level fusion of the corresponding softmax layers.
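The update rule behind Nesterov's accelerated gradient can be sketched on a toy quadratic; this is a pure-numpy illustration using the learning rate and momentum quoted above, not the authors' training code:

```python
import numpy as np

def nesterov_step(w, v, grad_fn, lr=0.001, momentum=0.9):
    """One Nesterov accelerated gradient step: evaluate the gradient at
    the look-ahead point w + momentum * v, then update velocity and
    weights (lr and momentum match the values quoted in the text)."""
    g = grad_fn(w + momentum * v)
    v = momentum * v - lr * g
    return w + v, v

# Toy check: minimise f(w) = ||w||^2 / 2, whose gradient is w itself.
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(5000):
    w, v = nesterov_step(w, v, lambda x: x)
```

Deep learning frameworks apply the same rule per mini-batch; the look-ahead gradient is what distinguishes Nesterov momentum from classical momentum.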
The number of learnable parameters defines the complexity of the CNN model. To calculate the learnable parameters in a convolutional layer, the kernel size is multiplied by the number of kernels and the bias terms are added. For example, the first layer has 8 filters with kernel size 3 × 3, so the number of learnable parameters is 80 ((3 × 3 × 8) + 8). Since backpropagation is not involved in the pooling layer, it has no learnable parameters. Batch normalization standardizes the mean and variance of each unit and hence adds two trainable parameters per depth. The fully connected layer has the highest number of learnable parameters, since every neuron in the previous layer must be connected to every neuron in the current layer. The number of learnable parameters for every layer of CNN A, CNN B and CNN C is summarized in Table 1.
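The counting rules above can be written out directly; a small sketch, where the fully connected fan-in of 128 × 4 × 2 = 1024 follows from the architecture described earlier:

```python
def conv_params(kh, kw, c_in, n_filters):
    """Conv layer: one weight per kernel element per input channel per
    filter, plus one bias per filter."""
    return kh * kw * c_in * n_filters + n_filters

def bn_params(depth):
    """Batch normalization: a scale (gamma) and a shift (beta) per channel."""
    return 2 * depth

def dense_params(n_in, n_out):
    """Fully connected layer: a full weight matrix plus biases."""
    return n_in * n_out + n_out

# First layer of the text: 8 filters of size 3x3 on a 1-channel GEI.
assert conv_params(3, 3, 1, 8) == 80
```

The flattened input to the fully connected layer is 128 × 4 × 2 = 1024 values per stream, which dominates the parameter count exactly as the text notes.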

RESULTS AND DISCUSSION
In this section, the robustness of the parallel CNN is evaluated for gender discrimination, age group classification and carried object classification with the CASIA B database, the OU-ISIR gait database large population dataset with age (OULP-Age) and the OU-ISIR large population gait database with real-life carried objects (OU-LP-Bag).

OU-ISIR gait database, large population dataset with age (OULP-Age)
The "OU-ISIR gait database, large population dataset with age (OULP-Age)" [26] was used to assess the performance of the proposed gender discrimination and age estimation methodology. OULP-Age is a collection of 63,846 gait images (31,093 males and 32,753 females) with age and gender information. The dataset comprises GEIs of size 88 × 128 pixels with ages ranging from 2 to 90 years. Figure 4 shows some sample gait energy images from the OULP-Age dataset with varying ages and genders. The age group classes for age group estimation by the ensemble of CNNs are divided based on human height. Kids and teens have a rapid growing phase compared with adults, so a finer scale of age intervals is allocated for people below 15 years. People in early adulthood and midlife show stabilized height and are thus grouped under a single category. People in late adulthood experience a loss of height due to compression of the spine and softening of muscle and bone tissues, and hence form another category.
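The height-motivated grouping described above (finer bins below 15 years, one bin for the stable-height years, one for late adulthood) can be sketched as follows; the exact bin edges are taken from the age-group labels used in the results section:

```python
def age_group(age):
    """Map an age in years to one of the five age-group labels: finer
    bins below 15 (rapid growth), one bin for 16-60 (stable height) and
    one for over 60 (height loss)."""
    if age <= 5:
        return "0-5"
    if age <= 10:
        return "6-10"
    if age <= 15:
        return "11-15"
    if age <= 60:
        return "16-60"
    return ">60"
```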

OU-ISIR large population gait database with real-life carried object (OU-LP-Bag)
The performance of the proposed carried object recognition method is assessed using the OU-ISIR large population gait database with real-life carried objects (OU-LP-Bag) [37]. The OU-LP-Bag dataset is meant for vision-based gait recognition with carried objects (COs) and for estimating the position of the carried object. The dataset includes size-normalized (i.e. 128 × 88) GEIs annotated with seven class labels based upon the position of the carried object with respect to the human body.
Some sample human blobs and GEIs with varying carried states are shown in Figure 5. The categories are: subjects without any carried object (NoCO), subjects with COs on multiple body parts (MuCO), subjects changing the position of the CO within a GEI gait period (CpCO), and subjects with COs in the front (FrCO), back (BaCO), side bottom (SbCO) and side middle (SmCO) regions.

Gender discrimination
In the CASIA B database, there are 124 subjects: 93 males and 31 females. Since gender classification is a two-class problem, it is better to use an equal number of subjects of each gender to avoid bias. Following the state-of-the-art works, all 31 females and 31 males are used for the experimental analysis. Table 3 shows the performance measures (precision, specificity, recall, false positive rate and accuracy) for gender discrimination on the CASIA B database. The discriminating features are learned simultaneously by six parallel CNNs with progressively reduced input sizes. The proposed method with multiple CNN stages is trained intensively and efficiently for the gender discrimination task. From the results, it is observed that the decision-level fusion of the scores from the characteristic features trained by CNN A, CNN B and CNN C yields good results. The CNN with learnable filters achieves a marginal increase in performance compared with customized filters, given its flexibility to adapt to changing scenarios.
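The decision-level fusion of the streams' softmax scores can be sketched as below; averaging the class-probability vectors is an assumption for the combination rule, which the paper does not specify beyond "score-level fusion":

```python
import numpy as np

def fuse_scores(softmax_outputs):
    """Score-level fusion: average the per-class softmax probabilities
    across the parallel streams, then take the arg-max class."""
    mean_scores = np.mean(np.stack(softmax_outputs), axis=0)
    return int(np.argmax(mean_scores)), mean_scores
```

For example, two streams voting [0.6, 0.4] and [0.2, 0.8] fuse to [0.4, 0.6], so the confident second stream flips the decision to class 1.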
In Yu et al. [11], the GEI is parsed into body parts such as the head, chest and legs and the features are classified using an SVM, reporting 95.97% accuracy. Zhang et al. [38] represented the GEI as a tensor with multi-linear principal component analysis, reporting 98.1% accuracy. Do et al. [39] used an aggregate GEI with an SVM, reporting 98.8%. Isaac et al. [40] obtained 100% accuracy with elliptic Fourier descriptors and linear discriminant analysis with Bayes' rule.
For the OU-ISIR gait database, large population dataset with age (OULP-Age), a collection of 63,846 gait images (31,093 males and 32,753 females), Table 4 displays the comparison of the performance measures of the proposed method with the existing method [17], where a CNN is trained in sequential order for gender, age group and age estimation and reports 97.74% accuracy.

Gender specific age estimation
This subsection focuses on the performance of age group classification after gender discrimination, dividing the subjects into five age groups. Six parallel CNNs are trained from scratch with varying image resolutions, and it is observed that specific networks favour certain categories. For gender-specific age group classification of the female class, the CNN A customized stream outperforms the CNN A learnable stream, whereas in the other cases the learnable stream performs better than the customized one. Additionally, the average recognition rate for the male gender is larger than that for the female gender in gender-specific age group classification. This is because females show a larger range of appearance variations than males, owing to variations in dress type, hair style and footwear.
Further, while classifying people in the (0-5) age group, greater confusion occurs between the (0-5) and (6-10) age groups, since some samples share characteristics with the adjacent age group. Significant confusion is reported between the (0-5), (6-10) and (11-15) age groups while processing the (6-10) age group category. Additionally, confusion between the (>60) and (16-60) groups pulls down the accuracy of the proposed architecture to some extent. Table 5 exhibits the comparison of the proposed architecture with Sakata et al. [17] and Li et al. [24] for gender-specific age group classification; it shows that the proposed method outperforms the existing methods by a large margin.

Carried object recognition
In this section, classification of the carried condition based on gait features is analysed. There are numerous applications, such as detecting the intrusion of a person with baggage into a prohibited area, tracing a person with a knapsack etc. In the CASIA B database, each of the 124 persons has 2 sequences with a carried condition, viewed at 11 different orientations. Here, the carried state is classified as a two-class problem: carrying an object or not. The performance measures for carried object detection on the CASIA B database are shown in Table 6.
In the case of the OU-LP-Bag database, for evaluating the performance of the proposed architecture, the number of subjects under each label is divided into a training set and a test set. Tables 7-12 display the confusion matrices obtained by training on 75% of the samples and testing on the remaining 25% using CNN A, CNN B and CNN C with learnable and customized filters at the initial stage. Table 13 shows the decision-level fusion of the outputs from all CNNs. From the confusion matrices, it is observed that learnable filters and customized filters alternately give better performance for some classes; thus, fusion complements each stream and gives good overall performance. It is also observed that successive pyramidal decomposition of the GEIs fed to the parallel CNNs enables discrimination of features at various scales, contributing to better performance.
As for the classification accuracy of each label, NoCO, FrCO and BaCO worked well: there is no carried object in NoCO, the shape and position of the carried object are stable in BaCO, and in FrCO the appearance change is well captured and easily distinguished from other labels. However, SbCO was considerably confused with NoCO, because of the shape similarity in the upper part of the GEIs, and with SmCO, due to the slight variation in carried position. For the classes SmCO, MuCO and CpCO, the GEI features are unstable; hence samples of these labels were sometimes misclassified as other labels. Because COs are occluded by the subject's body in SmCO, its GEI feature was confused with that of SbCO, NoCO, MuCO and BaCO, depending on the part of the CO that was occluded. Similarly, MuCO was confused with BaCO, since subjects in MuCO typically carried, for example, a backpack in the back region together with a small object in other regions. In CpCO, a person usually changed the CO's position from one region to another through the front using the hands; therefore, the GEI feature of CpCO was slightly confused with that of FrCO and considerably with MuCO and SmCO.
Considering the static shape and dynamic motion of the gait feature, in SbCO and SmCO subjects frequently carried small, lightweight COs, which were very often occluded by the subject's body. Thus, SbCO and SmCO are better discriminated by the scaled-down input versions, that is, by CNN C. For BaCO, subjects typically carried a large CO, such as a backpack, whose position was fixed and stable within a gait period; however, the large CO heavily affected shape and posture, which is better captured by CNN A. Similarly, for MuCO, a person carries a big backpack together with other objects carried in other regions of the body; the position of the object in the back region is fixed, while the other locations are arbitrary. The GEI samples for MuCO were largely affected not only by shape but also by motion, which is well captured by CNN B. Regarding FrCO, subjects typically carried a lightweight object by hand in the front region; therefore, the GEI samples of FrCO were affected slightly by shape and fairly by motion. Regarding CpCO, the carried location was arbitrary in any region of the body within a gait period; due to the randomly changing position of the object from one region to another, the GEI samples for CpCO were strongly affected by the motion feature along with shape features and are better discriminated by CNN A. Table 14 shows the comparison of the proposed carried object detector with MSVM [37] and Siamese [37]. MSVM is a benchmark with a third-degree polynomial kernel for classification of carried objects; Siamese is a CNN-based architecture in which two input GEI features are used to train two parallel CNN networks with shared parameters for gait recognition.

CONCLUSION
In this paper, gender discrimination, age group classification and carried object recognition are performed by analysing the traits displayed in GEIs. A unified model with six parallel CNNs learning discriminative features from a pyramidal decomposition of the input GEI is proposed. The individual learning by each separate component, combined with decision-level score fusion, leads to an obvious performance enhancement over separate learning. In the comprehensive experiments conducted for gender discrimination, age group classification and carried object recognition, the parallel network architecture significantly outperforms the state-of-the-art methods. In future, more efficient and effective network structures and metric learning strategies can be explored for finer division of the age group labels and for age estimation.