Restore DeepFakes frames via identifying individual motion styles

Recent advances in highly realistic AI-synthesised video make it hard to distinguish whether the speaker in a video is real. Many existing DeepFakes detection approaches can be invalidated by using them as supervision in adversarial training, which in turn improves the performance of DeepFakes; this makes the problem a dilemma. However, humans can recognise the identity of familiar persons by observing their motion styles. Inspired by this, the paper proposes a novel method to recognise the original speaker's identity clues in DeepFakes videos by learning individual motion styles. As a biological signature, motion styles can neither be used in the training process of DeepFakes nor be modified to deceive detection methods without reducing realism. This paper also proposes a novel pipeline to continuously restore the original frames from DeepFakes videos without knowing the DeepFakes approach in advance. Based on the above ideas, our scheme makes it possible to restore DeepFakes videos. By training the cross-modal transfer module, the appearance embedding can be inferred from the identity code. Extensive experiments are conducted to reveal the effectiveness of the method; they show, for the first time, that the original persons can be identified automatically and that the original video can be generated from DeepFakes videos.

✉ Email: zheminglu@zju.edu.cn
Introduction: Recent advances in generative adversarial networks have made it significantly easier to create AI-synthesised media (popularly referred to as DeepFakes). The competition between DeepFakes methods and their adversarial counterparts will remain keen for a long period, leaving social media with periodic floods of fake news and slander. However, current DeepFakes detection methods focus on finding artificial clues in specific forensic videos, with less attention paid to biological information. In other words, on many occasions they are only effective against existing methods whose defects have been analysed, rather than offering a common solution to unseen DeepFakes methods.
Motivated by this, this paper models biological motion clues in order to provide reliable identity verification results. Thus, our scheme not only provides a strong and consistent clue for DeepFakes detection but also, for the first time, makes it possible to restore the original media from DeepFakes. An illustration of this task is shown in Figure 1. Given a video modified by DeepFakes, the goal is to identify the original speaker, then infer the face appearance of the original speaker according to the identity code extracted by a specific model, and finally generate the original video. Specifically, as the appearance information in the input DeepFakes video cannot be trusted, all appearance information and other properties (such as poses) must be inferred from previous training results.

Fig. 1 A simple illustration of our approach
Our contributions can be listed as follows: (I) we propose two new tasks, DeepFakes identification and DeepFakes video restoration; (II) we design a spatial convolutional neural network (CNN) to identify speakers from motion clues alone in DeepFakes identification, which overcomes the limitation of previous methods that are unsuitable for unseen DeepFakes methods; (III) for the first time, we restore the original appearance from DeepFakes, and design a novel transfer net.
Comparison to related works: Several studies on DeepFakes detection have been released in the past two years [1][2][3][4]. However, most of these approaches are based on manually detecting defects of current DeepFakes methods (or detecting defects with neural networks (NN)). Such approaches are popular but have several issues: (1) finding defects and designing a special detection method is labour-consuming and incurs a time delay, and researchers need to repeat this process for each DeepFakes method while new ones keep coming out. (2) They can only perform defect detection on existing methods; neither humans nor NNs can find defects in unseen methods, including those that have not yet come out and those that are not publicly available. (3) Even existing methods may easily be improved once their defects are reported. (4) These defect-based approaches cannot identify the original speaker, because they focus on the defects of each method. On the other hand, Agarwal et al. [5] adopted a two-class Support Vector Machine (SVM) to show that facial expressions and movements can represent an individual's speaking pattern, because it is hard for actors to mimic. However, they did not provide a well-designed pipeline and architecture to automatically model and identify multiple persons. Therefore, we propose using an NN to model spatial motions and identify speakers in DeepFakes videos. Our approach is also hard to attack, because modifying the motion styles in a fake video without losing realism is difficult. Furthermore, we utilise the identity clues to perform manipulated-video restoration successfully for the first time.
Problem definition and our pipeline: Given a DeepFakes video, our goal is to identify the original speaker and infer the original face appearance so as to restore the original video, while keeping emotion, pose, gesture and other attributes unchanged. To tackle this challenging problem, we propose a novel pipeline and factorise it into three stages, as shown in Figure 2. During the training stage, the pose sequence is first extracted from the input ground-truth video. We then train the motion encoder to learn individual motion styles from the pose sequence. Next, we use the identity-to-appearance transfer network to bridge the appearance embedding and the motion embedding. Finally, we alternately feed the appearance embedding, taken either from the appearance encoder or from the TransNet, into the generator. During testing, we directly input a video manipulated by unknown DeepFakes methods. Specifically, since we only input pose sequences and pose images to synthesise the restored frame, we do not use the appearance encoder during the testing stage.
Stage I: We train an appearance encoder E_a and a motion-style-based DeepFakes identification detector E_id on the original video sequences x_s and key-point sequences p_s, respectively. Stage II: We train a pair of cross-modal transfer nets, denoted T_β→α and T_α→β, to infer the appearance embedding from the identity code extracted by E_id; thus, features from the two modalities are projected onto the latent space under supervision. Stage III: Given a DeepFakes video x_s, we use E_id to get the identity code β and transfer it into the predicted appearance embedding α̂ with T_β→α. Then, we use E_a to retrieve the closest appearance α from the real-video database. Finally, we disentangle the appearance α̂ and the pose p_n to generate a restored frame. During training, we alternately feed α and α̂ into the generator through the AdaIN layer [6]. During testing, we use E_id and T_β→α to get the identity code β and the inferred appearance α̂.
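The retrieval step in Stage III, where the closest stored appearance is selected for the inferred embedding, can be sketched as a nearest-neighbour lookup (a minimal sketch; the function name, toy embeddings and the use of plain Euclidean distance are our assumptions, not the paper's code):

```python
import numpy as np

def nearest_appearance(alpha_hat, database):
    """Return the index and value of the appearance embedding in
    `database` (an (N, d) array built from real videos) that is closest
    to the inferred embedding alpha_hat, by Euclidean distance."""
    dists = np.linalg.norm(database - alpha_hat, axis=1)
    idx = int(np.argmin(dists))
    return idx, database[idx]

# Toy database of three stored appearance embeddings (hypothetical values).
db = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
idx, emb = nearest_appearance(np.array([0.92, 0.08]), db)
```

In practice the database entries would be the embeddings E_a produces for each known speaker's real footage, so the lookup amounts to identifying which stored identity the transferred code lands nearest to.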
Method: In Stage I, a novel motion-style-based DeepFakes identity detector is proposed. We utilise a CNN-based network to learn overall motion styles, including gesture motion style, facial muscle motion style and subconscious head movement. As forms of body language, people use gestures, emotions and even head movements to express themselves. In many situations this body language is subconscious and manifests differently in different persons according to their individual habits; even rest poses differ between people, according to Ginosar et al. [7].
DeepFakes is a kind of face-swap or puppet-master process. In many kinds of DeepFakes, face landmarks and other key points must stay consistent with the original videos to increase the realism of the results, so the motion style is naturally suited to the DeepFakes identification problem. However, identifying a person from skeleton sequences without appearance information is difficult, so we design a new architecture to learn motion-style patterns. The architecture of the DeepFakes identity detector is shown in Figure 3. First, we use OpenPose [8] to estimate the key-point sequence p_s from the input sequence x_s. Then, we transform p_s into a key-point image I_p ∈ R^(f×k×c), where f and k are the dimensions of frames and key points of p_s, respectively, and the dimension c consists of the x-axis, the y-axis and the confidence value of each key point. Next, we obtain a motion representation I_m by temporal differencing on the frame dimension of I_p. We feed I_p and I_m into a two-stream CNN-based network to extract the motion style embedding γ from the whole frame sequence. Here γ can be calculated as γ = E_ms(I_p, I_m), where E_ms denotes the convolution operation of the two-stream CNN layers. We then use a fully-connected net with a softmax function after E_ms to classify the identification embedding. Specifically, we denote part of the fully-connected layers and the convolutional layers as the identity encoder E_id, so the calculation of the identity embedding β can be simplified to β = E_id(p_s).
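The construction of I_p and I_m described above can be sketched as follows (a minimal NumPy sketch; the function names and the toy sequence are illustrative, not the paper's code):

```python
import numpy as np

def build_keypoint_image(p_s):
    """Stack a key-point sequence into the key-point image I_p of shape
    (f, k, c): f frames, k key points, c = (x, y, confidence) channels,
    e.g. from per-frame OpenPose output."""
    return np.stack(p_s, axis=0)

def motion_representation(I_p):
    """Temporal differencing along the frame dimension: entry t of I_m
    holds I_p[t + 1] - I_p[t]."""
    return np.diff(I_p, axis=0)

# Toy sequence: 3 frames, 2 key points, channels (x, y, confidence).
frames = [np.full((2, 3), t, dtype=float) for t in range(3)]
I_p = build_keypoint_image(frames)   # shape (3, 2, 3)
I_m = motion_representation(I_p)     # shape (2, 2, 3); all entries 1.0
```

I_p carries the static pose layout while I_m isolates frame-to-frame movement, which is what lets the two-stream CNN focus on motion style rather than absolute key-point positions.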
Our generator starts training with the appearance encoder E_a to obtain initial parameters. With the K input frames f_n (1 ≤ n ≤ K) of x_s, the appearance embedding α can be formulated as α = E_a(x_s).
Because the appearance information in DeepFakes videos is not trustworthy, we design a cross-modal transfer net (TransNet) to transfer the motion-style identity embedding β into the appearance code α, and vice versa. We point out that both the appearance code and the motion-style code share the same identity information. We argue that there is a mapping between α and β, so we transfer them into each other through a latent identity variable.
As shown in Figure 2, we use a dual encoder and a decoder to simulate this mapping. The mapping of our transfer network can be defined as β̂ = T_α→β(E_a(x_s)) and α̂ = T_β→α(E_id(p_s)), where the subscript of T is the transfer direction and α̂ is the inferred appearance embedding. The objective function of our TransNet can be formulated as L_Trans = λ_1 L_β→α + λ_2 L_α→α + λ_3 L_α→β, where λ_1, λ_2 and λ_3 are hyper-parameters. The feature projection losses L_β→α = L_l2(α̂, α) and L_α→β = L_l2(β̂, β) and the appearance reconstruction loss L_α→α = L_l2(T_β→α(T_α→β(α)), α) are computed with the L2 loss function L_l2. As shown in Figure 2, the generator continuously synthesises the original frame with the pose by taking an image drawn according to the key points estimated by OpenPose. The AdaIN [6] layer aligns the content information of the appearance code α or α̂ and transfers it into the auto-encoder as adaptive parameters. Meanwhile, to train a generator capable of generating the frame from motion styles directly, the appearance code is alternately extracted by the appearance encoder or by the motion-style-based identity encoder through the TransNet. We use two discriminators to provide the adversarial loss and to supervise the consistency of poses between the synthesised frame and the key-point image.
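Under our reading of the objective above, the TransNet loss can be sketched as follows (a minimal sketch; the cycle form of the reconstruction term and all names are our assumptions, not the paper's code):

```python
import numpy as np

def l2_loss(a, b):
    # Mean squared error, standing in for the paper's L_l2 loss.
    return float(np.mean((a - b) ** 2))

def transnet_loss(alpha, beta, T_b2a, T_a2b, lambdas=(1.0, 1.0, 1.0)):
    """L_Trans = λ1·L_{β→α} + λ2·L_{α→α} + λ3·L_{α→β}, with T_b2a and
    T_a2b as the two transfer directions."""
    l_b2a = l2_loss(T_b2a(beta), alpha)          # feature projection β→α
    l_a2b = l2_loss(T_a2b(alpha), beta)          # feature projection α→β
    l_a2a = l2_loss(T_b2a(T_a2b(alpha)), alpha)  # appearance reconstruction
    lam1, lam2, lam3 = lambdas
    return lam1 * l_b2a + lam2 * l_a2a + lam3 * l_a2b

# With identity transfers and matching embeddings, every term vanishes.
ident = lambda x: x
loss = transnet_loss(np.ones(4), np.ones(4), ident, ident)
```

The two projection terms tie each transfer direction to its target modality, while the reconstruction term keeps a round trip through both directions close to the original appearance code.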
During the alternating training process, our generator takes the drawn key-point image of a single frame to get the pose information, and takes α or α̂ through AdaIN layers. Our generator can be formulated as f̂ = G(p_n, α) (or G(p_n, α̂)), where f̂ is the synthesised image and p_n is the pose of the single frame f.
During the training process of the generator, we disentangle the appearance embedding α and the key-point image p_s to synthesise video frames with the appearance of the original identity, in which the appearance is alternately taken from E_a or from E_id through T. So, our alternating adversarial training loss L_al is calculated as L_al = L(α) + L(α̂), where L is formulated as L = λ_adv L_adv + λ_face L_face. Here λ_adv and λ_face are hyper-parameters, and L_adv and L_face are the adversarial loss and the facial perceptual loss, respectively. When our generator G takes the appearance embedding from α, the adversarial loss L_adv for G is calculated as L_adv(p_n, x_s) = −E_{x_s∼p_data(x_s)}(log D(G(p_n, x_s))), where D is our discriminator and p_data(x_s) is the distribution of x_s; the same form applies when the appearance embedding is taken from α̂.
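The generator-side objective can be sketched as follows (a minimal sketch; the batch-mean estimate of the expectation and the helper names are our assumptions, not the paper's code):

```python
import numpy as np

def adv_loss_for_G(d_out):
    """L_adv(p_n, x_s) = -E[log D(G(p_n, x_s))], estimated over a batch
    of discriminator outputs `d_out` in (0, 1]."""
    return float(-np.mean(np.log(d_out)))

def branch_loss(l_adv, l_face, lam_adv=1.0, lam_face=1.0):
    """L = λ_adv·L_adv + λ_face·L_face for one appearance branch; the
    alternating loss L_al sums this over the α and α̂ branches."""
    return lam_adv * l_adv + lam_face * l_face

# When D is fully fooled (outputs 1), the generator's adversarial loss is 0.
fooled = adv_loss_for_G(np.array([1.0, 1.0]))
```

Alternating the branch whose appearance code feeds G is what forces the generator to work equally well from the trusted encoder output α and from the TransNet-inferred code α̂.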
Following [9], we add a facial perceptual loss using a pre-trained VGG-19 face recognition network to calculate the facial appearance distance between the synthesised frame and the ground truth, keeping the identity consistent with the appearance code α.
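The facial perceptual loss can be sketched as an L2 distance summed over paired feature maps (the feature extractor, VGG-19 in the paper, is abstracted away here; all names are illustrative):

```python
import numpy as np

def facial_perceptual_loss(feats_fake, feats_real):
    """Sum of per-layer mean squared distances between features of the
    synthesised frame and of the ground truth, as extracted by a frozen
    pre-trained face network."""
    return sum(float(np.mean((a - b) ** 2))
               for a, b in zip(feats_fake, feats_real))

# Identical feature maps (perfect restoration) give zero loss.
feats = [np.ones((2, 2)), np.zeros(3)]
zero = facial_perceptual_loss(feats, [x.copy() for x in feats])
```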
Experimental results and discussions: In our simulation, we use the speaker-specific gesture dataset [7], which is the only large person-specific 'in-the-wild' speech video dataset with full gestures, faces and upper-body motions. We adopt 50% of the dataset as the training set, 20% for validation and 30% for testing. For the DeepFakes detection task, we use a pre-trained face recognition model to calculate the facial identity distance between the person in the input DeepFakes video and each identity in the database. If the ground-truth identity differs from our result, we conclude that the input video was modified by DeepFakes methods. Because our model identifies the speaker in DeepFakes videos for the first time, we propose using Top-1 accuracy to measure identification ability. Table 1 shows the DeepFakes detection and identification performance; 'Not-working' in the table means that the baseline cannot perform this task.

Fig. 4 Qualitative results

Our motion-style-based identification model achieves higher performance, and it is the only model among the above baselines capable of DeepFakes identification. FF++ [4] and FWA [10] are existing methods based on detecting the traces left by DeepFakes [11]; they cannot identify the original speakers in DeepFakes videos. Our method outperforms the baselines in the DeepFakes detection task, and it is the only method capable of DeepFakes identification. Figure 4 shows the qualitative generation results. This experiment is conducted under unequal conditions: our model only takes the input DeepFakes video, while the other baselines have extra labels or manual help. The face part in Figure 4(c) is synthesised by Faceswap with pre-trained weights manually selected according to identity labels each time. The information inferred from previous training gives our results a higher degree of restoration, despite the unequal conditions. As there is no existing work exactly comparable with our model on the DeepFakes restoration task, we implement three baselines with extra labels or manual operations to make them comparable. During testing, our model only takes the pose of the input DeepFakes video, which means there is no DeepFakes video in our training dataset. This proves that our model is capable of accurately and automatically restoring the original video frames from DeepFakes videos. It also indicates that our model can identify the speaker in videos processed by unseen DeepFakes methods and infer the appearance from previous training to restore the original frame. Table 2 shows the quantitative analysis of our generation task; our method achieves higher performance than the StarGAN [12] baseline under the same unequal conditions.
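The detection rule described above, flagging a video when the face-based identity and the motion-style identity disagree, can be sketched as follows (the distances, names and exact decision form are illustrative assumptions, not the paper's code):

```python
import numpy as np

def detect_deepfake(face_dists, motion_dists):
    """face_dists[i]: facial identity distance of the video's face to
    database identity i (from a pre-trained face recogniser).
    motion_dists[i]: distance of the motion-style identity code to
    identity i. If the two nearest identities disagree, the face was
    swapped, and the motion identity names the original speaker."""
    face_id = int(np.argmin(face_dists))
    motion_id = int(np.argmin(motion_dists))
    return face_id != motion_id, motion_id

# Hypothetical distances over a three-identity database: the face looks
# like identity 0, but the motion style belongs to identity 1.
is_fake, speaker = detect_deepfake([0.1, 0.8, 0.9], [0.7, 0.2, 0.9])
```

This is why the approach survives unseen DeepFakes methods: a face swap changes which identity the face recogniser picks, but not which identity the motion-style code points to.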
Our model achieves better scores in structural similarity and Fréchet inception distance (FID), even though all appearance and RGB information is inferred by our motion-style-based encoder. We do not use the DeepFake baseline because it only generates a small part of the image. This indicates that the TransNet and generator proposed in this work enable robust learning for the generation of restored DeepFakes frames.

Conclusion: Methods capable of identifying the person in DeepFakes videos are urgently needed. Restoring the appearance of the original speakers from DeepFakes videos is thus an interesting and difficult topic, and can be a supplement to current forensic technology. Our approach overcomes the shortcoming of previous DeepFakes detection works, which need to know which kind of DeepFakes scheme was used to produce the videos. Also, we utilise the link between motion styles and identities to restore the original appearance from pose sequences, preventing fake information from polluting the restored results.
We present a pipeline that models the movement styles of speakers and identifies the identities in DeepFakes videos based on those styles. We also infer the appearance embedding of the original speaker from the identity embedding to restore the speaker in the original video. We conduct experiments on the most popular appearance-changing DeepFakes, which show that our method can automatically synthesise the original video based on inferred appearance and accurate identification.