recognition using COTS WiFi device

This letter proposes a fully domain-independent WiFi-based gesture recognition system based on a multi-label adversarial network. Unlike prior machine learning works, our system does not require retraining in the target domain, which benefits from our key idea of fully eliminating domain information such as orientation, location, and environment. The proposed system comprises three parts: a feature extractor, a domain discriminator, and a gesture recogniser. The feature extractor attempts to deceive the domain discriminator through the adversarial network structure, making it difficult to judge the domain label of the input, thereby obtaining domain-independent features and realising fully domain-independent gesture recognition. Extensive experiments are conducted on the Widar 3.0 dataset and our own dataset to evaluate system performance. The results show that our system can achieve an in-domain recognition accuracy of 93.2% and a cross-domain recognition accuracy of 87.1%, which is superior to other state-of-the-art works.

Introduction: Gesture recognition is a current research hotspot in the field of human-computer interaction, with a wide range of applications such as remote control, virtual vision, and security monitoring. Characterised as contact-free, low-cost and privacy-unintrusive, commercial WiFi is widely used for gesture recognition systems, such as E-eyes [1], the CSI-based human activity recognition and monitoring system (CARM) [2], WiGest [3], and CrossSense [4]. These approaches generally extract statistical or physical features from wireless signals to classify gestures. However, these features contain domain information such as orientation, location, and environment. Recently, Widar 3.0 [5] proposed a zero-effort cross-domain gesture recognition solution, a breakthrough work with excellent performance. It also provides a large-scale cross-domain gesture recognition dataset to promote research on gesture recognition. However, the computational complexity of Widar 3.0 is very high, so the solution is difficult to apply in a real-time system.
To address this challenge, we propose a fully domain-independent gesture recognition system based on a multi-label adversarial network. We aim to remove all domain information, such as orientation, location, and environment, so that our system does not require retraining in the target domain. Our work is inspired by the EI framework proposed in [6], which is designed to identify cross-environment human activities such as walking, standing up, and sitting down. However, the EI framework only removes coarse-grained domain information such as environment diversity and does not address other factors, whereas gesture recognition must identify complex fine-grained gestures that involve multiple body parts. Depending on the environment, location, and orientation of the person, the same gesture produces different signal patterns. Therefore, we design a novel multi-label classification model for the domain discriminator to further remove multiple kinds of domain information, using sigmoid as the activation of the last layer and binary cross-entropy as the loss function. As a result, the domain discriminator helps the feature extractor eliminate more domain information and makes the gesture recogniser perform better. Besides, a domain-independent feature extractor is the key part of our system: it extracts domain-independent features from the Doppler frequency shift (DFS) profile. Meanwhile, because DFS profiles cannot provide information as rich as images, deep networks are not suitable for channel state information (CSI) feature extraction. Thus, we design a novel multi-channel shallow network based on a convolutional neural network (CNN) for feature extraction. The main contributions of this work can be summarised as follows: (1) We design a novel fully domain-independent gesture recognition system based on the multi-label adversarial network, which can remove the effects of orientation, location, and environment without any retraining in the target area. (2) We design a novel multi-channel shallow network based on CNN for CSI feature extraction, which can extract domain-independent features from the DFS profile. (3) We conduct extensive experiments on the Widar 3.0 and our own datasets to evaluate system performance. The results show that our system can achieve an in-domain recognition accuracy of 93.2% and a cross-domain recognition accuracy of 87.1%. Furthermore, we implement a prototype system, which demonstrates both the effectiveness and the real-time capability of our approach.

System design:
Overview: With different locations and orientations of the person and different multipath environments, features of the same gesture may vary significantly and fail to support successful recognition. For example, a person is asked to make the same gesture (push-pull) at the four corners of a 2 × 2 m square located at the centre of a room, and the resulting signal features are shown in Figure 1. Likewise, a person is asked to make the same gesture (push-pull) at the centre of a room facing different orientations (rear, left, right and front), and the signal features are shown in Figure 2. We can see that both location and orientation affect the signal features. Thus, our system is designed to remove multiple kinds of domain information.
Our system consists of a data preprocessing module and a multi-label adversarial learning module, as shown in Figure 3. (1) Data preprocessing module: To remove noise, we use a Butterworth filter and principal component analysis (PCA) for data preprocessing. Then, the short-time Fourier transform (STFT) is used to build DFS profiles from the CSI data.
(2) Multi-label adversarial learning module: We construct a multi-label adversarial learning framework comprising a feature extractor, a domain discriminator, and a gesture recogniser to remove all domain information. These three parts are trained cooperatively through adversarial training. First, we use a batch of data to train the domain discriminator. Since our final goal is to extract features containing no domain information and to recognise the gesture, the feature extractor and gesture recogniser are trained to deceive the domain discriminator so that it becomes difficult to judge the domain to which a feature belongs; in this way the feature extractor learns to extract domain-independent features. To remove multiple kinds of domain information, we design a multi-label method in the domain discriminator. Unlike the EI framework, we can thus remove multiple domain factors such as orientation, location, and environment.
Data preprocessing module: To obtain DFS profiles, we select two antennas and conjugate-multiply their CSI data. Then, a Butterworth filter and the PCA method are used to filter out the static and high-frequency components, where the filter order is 6 and the cutoff frequency is 0.25 Hz. Next, we use the STFT to transform the CSI data from the time domain to the frequency domain, where the sampling rate equals the CSI packet rate (500 packets per second in our setup).
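The preprocessing chain can be sketched as follows with numpy/scipy. The letter specifies only the filter order (6) and cutoff (0.25 Hz); the choice of a high-pass design for removing the static component, the STFT window sizes, and the synthetic test signal are all illustrative assumptions:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, stft

def preprocess_csi(csi, fs=500.0, order=6, cutoff=0.25, n_components=1):
    """csi: (n_samples, n_subcarriers) conjugate-multiplied CSI (real part)."""
    # 6th-order Butterworth high-pass at 0.25 Hz: removes the static component
    # (assumed design; the letter does not state the filter type).
    sos = butter(order, cutoff / (fs / 2), btype="highpass", output="sos")
    filtered = sosfiltfilt(sos, np.real(csi), axis=0)
    # PCA via SVD: keep the principal component(s) carrying the motion energy.
    centered = filtered - filtered.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pcs = centered @ vt[:n_components].T        # (n_samples, n_components)
    # STFT of the first principal component -> DFS (Doppler) profile.
    f, t, zxx = stft(pcs[:, 0], fs=fs, nperseg=256, noverlap=224)
    return f, t, np.abs(zxx)

fs = 500.0
tgrid = np.arange(int(2 * fs)) / fs
# Synthetic CSI: a static path plus a 40 Hz "Doppler" tone on 30 subcarriers.
csi = 1.0 + 0.1 * np.sin(2 * np.pi * 40 * tgrid)[:, None] * np.ones((1, 30))
f, t, dfs = preprocess_csi(csi)
peak = f[np.argmax(dfs.mean(axis=1))]
print(float(peak))  # dominant Doppler bin near 40 Hz
```

On the synthetic input, the static (DC) path is suppressed by the filter and the DFS profile peaks near the injected 40 Hz tone, which is the behaviour the preprocessing module relies on.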

Multi-label adversarial learning module:
Feature extractor: A DFS profile can be regarded as one channel of an image, and DFS profiles constructed by different transceiver devices can be regarded as different colour channels of an image. Therefore, we can reuse image feature extraction methods based on deep learning to construct a feature extractor. We design a novel multi-channel shallow network based on CNN to overcome overfitting, as the DFS profile carries less information than an image. Specifically, we use the convolution-normalisation-pooling structure twice and then three consecutive convolutional layers to compose the feature extractor. Through the feature extractor, we obtain the feature representation Q of different gestures.
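As a toy illustration of one convolution-normalisation-pooling stage, the following numpy/scipy sketch processes a single-channel DFS profile. The kernel size (3 × 3), filter count (8), and per-map normalisation are illustrative assumptions, not the letter's actual hyperparameters:

```python
import numpy as np
from scipy.signal import convolve2d

def conv_norm_pool(x, kernels):
    """One convolution-normalisation-pooling stage over a single-channel
    DFS profile x (freq x time); kernels: list of 2-D filters."""
    maps = []
    for k in kernels:
        m = convolve2d(x, k, mode="valid")       # convolution
        m = (m - m.mean()) / (m.std() + 1e-6)    # normalisation (per map)
        m = np.maximum(m, 0.0)                   # ReLU activation
        # 2x2 max pooling (crop to even dimensions first)
        h, w = m.shape[0] // 2 * 2, m.shape[1] // 2 * 2
        m = m[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
        maps.append(m)
    return np.stack(maps)                        # (n_filters, h', w')

rng = np.random.default_rng(0)
profile = rng.standard_normal((64, 128))         # toy DFS profile (freq x time)
kernels = [rng.standard_normal((3, 3)) for _ in range(8)]
stage1 = conv_norm_pool(profile, kernels)
print(stage1.shape)  # (8, 31, 63)
```

In the actual system this stage is applied twice, followed by three consecutive convolutional layers; keeping the network shallow is what limits overfitting on the information-sparse DFS input.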
Gesture recogniser: A classifier is built for gesture recognition based on Q. Our classifier is a neural network with two fully connected layers that produces the prediction result ŷ_i for gestures. Besides, we add a dropout layer between the two layers to mitigate overfitting. To train this model, we use categorical cross-entropy as the loss function to compute the loss between the prediction result ŷ_i and the true label y_i:

$$\mathrm{Loss}_R = -\frac{1}{|X|}\sum_{i=1}^{|X|}\sum_{j=1}^{|Y|} y_{ij}\,\log \hat{y}_{ij}$$

where |X| is the batch size, i.e. the number of training examples used in one iteration, and |Y| is the number of gestures.
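A minimal numpy sketch of the categorical cross-entropy objective, batch-averaged over |X| and summed over the |Y| gesture classes; the toy logits and one-hot labels below are invented for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # numerically stable
    return e / e.sum(axis=1, keepdims=True)

def gesture_loss(logits, y):
    """Categorical cross-entropy between predictions and one-hot labels y,
    averaged over the batch |X| and summed over the |Y| classes."""
    p = softmax(logits)
    return -np.mean(np.sum(y * np.log(p + 1e-12), axis=1))

logits = np.array([[4.0, 0.5, 0.1],   # confident, correct prediction
                   [0.2, 0.1, 0.3]])  # uncertain prediction
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
print(gesture_loss(logits, y))        # dominated by the uncertain sample
```

The confident, correct sample contributes a near-zero loss while the uncertain one dominates, which is exactly the gradient signal the recogniser trains on.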
Domain discriminator: The domain discriminator is the key to cross-domain gesture recognition in our system, and its framework is inspired by the EI framework [6]. EI proposes environment-independent activity recognition, which can only remove environmental information without considering other domain factors. Therefore, we design a novel domain discriminator that removes all domain information for gesture recognition. Cross-domain gesture recognition is more difficult than recognition in a fixed domain, because data in a new domain must be recognised without additional human intervention. As mentioned above, in a different domain, the CSI or DFS profiles of the same gesture have very different patterns, which makes it difficult for traditional classification methods to accurately recognise a gesture in a new domain. To achieve cross-domain recognition, the feature extractor and gesture recogniser must be prevented from learning domain-specific patterns. To this end, the domain discriminator connects the output of the feature extractor Q and the output of the gesture recogniser ŷ by the concatenation operation ⊕:

$$F_i = Q_i \oplus \hat{y}_i$$

Then, we feed F into three fully connected layers, as in the gesture recogniser. Unlike the gesture recogniser, however, there are multiple domain factors such as the environment, location, and orientation of the gesture performer. To handle this multi-label problem, we use sigmoid as the activation function of the last layer, mapping each neuron's output to a value between 0 and 1 without requiring the outputs to sum to 1. We then use the binary cross-entropy loss function to optimise the domain discriminator:

$$\mathrm{Loss}_D = -\frac{1}{|X|}\sum_{i=1}^{|X|}\sum_{d=1}^{|D|}\left[s_{id}\log \hat{s}_{id} + (1 - s_{id})\log(1 - \hat{s}_{id})\right]$$

where |D| is the number of domain categories and s_i is the domain probability vector of the ith input data. The domain discriminator is trained to minimise Loss_D, i.e. to maximise the accuracy of its domain predictions.
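The concatenation F = Q ⊕ ŷ and the multi-label binary cross-entropy can be sketched in numpy as follows; the feature width, batch size, and the three binary domain labels are illustrative assumptions (the real |D| covers all location, orientation and environment categories):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def domain_loss(logits, s):
    """Binary cross-entropy over |D| independent domain labels,
    summed per sample and averaged over the batch |X|."""
    p = sigmoid(logits)  # each output in (0, 1); outputs need not sum to 1
    bce = -(s * np.log(p + 1e-12) + (1 - s) * np.log(1 - p + 1e-12))
    return bce.sum(axis=1).mean()

# Discriminator input: features Q concatenated with gesture prediction y_hat.
Q = np.random.default_rng(1).standard_normal((4, 16))
y_hat = np.full((4, 6), 1.0 / 6)                 # uniform gesture posterior
F = np.concatenate([Q, y_hat], axis=1)           # F = Q (+) y_hat
print(F.shape)                                   # (4, 22)

# Multi-hot domain labels for a toy |D| = 3 (e.g. location/orientation/env flags).
s = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
good = domain_loss(4.0 * (2 * s - 1), s)         # logits agree with labels
bad = domain_loss(-4.0 * (2 * s - 1), s)         # logits contradict labels
print(good < bad)                                # True
```

Using independent sigmoids rather than a softmax is what lets one sample carry several domain labels at once, which is the core difference from a single-label discriminator such as EI's.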
However, this conflicts with our final purpose, which is to extract domain-independent features and improve gesture recognition performance in a new domain. We therefore design the loss function of the whole adversarial network as:

$$\mathrm{Loss} = \mathrm{Loss}_R - \gamma\,\mathrm{Loss}_D$$

where γ is a weight parameter. Minimising this loss maximises Loss_D while minimising Loss_R, so that the feature extractor fools the domain discriminator into producing domain-independent features while the performance of the gesture recogniser improves.
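The combined adversarial objective fits in a few lines; the value of γ below is an arbitrary placeholder, as the letter does not report its setting:

```python
def total_loss(loss_recogniser, loss_discriminator, gamma=0.2):
    """Adversarial objective for the extractor + recogniser update:
    minimise the gesture loss while MAXIMISING the discriminator's loss,
    i.e. minimise Loss_R - gamma * Loss_D (gamma is a placeholder value)."""
    return loss_recogniser - gamma * loss_discriminator

# The discriminator itself is trained separately to minimise its own loss;
# alternating the two updates drives the features toward domain independence.
print(total_loss(0.5, 2.0))  # ~0.1: a large Loss_D reduces the total loss
```

Note the sign: a well-performing discriminator (small Loss_D) increases the total loss, pushing the feature extractor to strip whatever domain cues the discriminator was exploiting.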

Experiments:
Experiment setting: We utilise the Widar 3.0 dataset [5] to evaluate our system's performance. The Widar 3.0 dataset consists of the CSI, DFS and BVP data of 260,000 gestures under 75 different domains, collected at five locations and five orientations by six receivers. In addition, our prototype system consists of one WiFi transceiver pair, as shown in Figure 4. The transmitter is a mini-PC equipped with a cheap Intel 5300 WiFi card working on the 5.24 GHz band, and the receiver is equipped with three antennas. The transmission rate is 500 packets per second. We also build our own gesture dataset, named the crossdomain dataset, by collecting gesture data at five positions and four orientations in two scenes (living room and bedroom), for a total of 6000 gesture samples.
Overall performance: In this section, we evaluate our system's overall performance without cross-domain testing. We split the data randomly into training and testing sets at a ratio of 9:1. On the Widar 3.0 dataset and the crossdomain dataset, the average accuracy of our system reaches 93.2% and 90.3%, respectively. The accuracy on the crossdomain dataset is slightly lower than on the Widar 3.0 dataset because our dataset is collected using only one pair of transceiver devices, whereas the Widar 3.0 dataset uses six transceiver devices. Figures 5 and 6 show the per-gesture accuracy confusion matrices on the Widar 3.0 and crossdomain datasets, respectively. Overall, our system performs very well on different gesture datasets.

Cross-domain evaluation:
In this section, we evaluate the cross-domain performance of our method. First, we conduct leave-one-out cross-validation on the Widar 3.0 dataset with six DFS profiles for each gesture to evaluate the performance of our method under different kinds of domain factors. As shown in Figure 7, among the average accuracies of the six gestures, the 'drawing N' gesture achieves the best performance of 94.31% when performed at a new location, and the overall cross-location accuracy reaches 89.7%. For the other domain factors, our method achieves an average cross-orientation accuracy of 81.3%. We also train our model on the crossdomain dataset and test its cross-domain performance using data from a new room, location and orientation, all different from the training set; our model achieves an average accuracy of 87.1%.

Method comparison:
In this section, we compare our system with other state-of-the-art wireless sensing works (Widar 3.0 [5], CARM [2], EI [6] and CrossSense [4]) on both in-domain and cross-domain recognition using the Widar 3.0 dataset. CARM uses the DFS profile as the feature and a hidden Markov model as the classifier. EI uses adversarial networks, but it does not consider multiple domain labels and only removes environmental information. CrossSense proposes an artificial neural network-based model to achieve signal feature translation. The comparison results are shown in Figure 8, and our work achieves the best overall accuracy. The accuracy of Widar 3.0 is relatively close to ours, but computing its velocity spectrum requires a long calculation time even with parallel computation: it takes about 150 s to calculate the BVP from the Doppler spectrum. Our system runs about 45 times faster than Widar 3.0, so it offers better real-time performance.
Conclusion: This letter proposes a fully domain-independent WiFi-based gesture recognition system based on the multi-label adversarial network. The system can fully remove domain information such as orientation, location, and environment. We conduct extensive experiments on the Widar 3.0 dataset and a designed prototype system. The results show that our system performs better than current state-of-the-art gesture recognition systems using COTS WiFi devices.