Unknown Presentation Attack Detection against Rational Attackers

Despite the impressive progress in the field of presentation attack detection and multimedia forensics over the last decade, these systems are still vulnerable to attacks in real-life settings. Some of the challenges for existing solutions are the detection of unknown attacks, the ability to perform in adversarial settings, few-shot learning, and explainability. In this study, these limitations are approached by reliance on a game-theoretic view for modeling the interactions between the attacker and the detector. Consequently, a new optimization criterion is proposed and a set of requirements are defined for improving the performance of these systems in real-life settings. Furthermore, a novel detection technique is proposed using generator-based feature sets that are not biased towards any specific attack species. To further optimize the performance on known attacks, a new loss function coined categorical margin maximization loss (C-marmax) is proposed which gradually improves the performance against the most powerful attack. The proposed approach provides a more balanced performance across known and unknown attacks and achieves state-of-the-art performance in known and unknown attack detection cases against rational attackers. Lastly, the few-shot learning potential of the proposed approach is studied as well as its ability to provide pixel-level explainability.

Consequently, biometric and forensic systems face new challenges every day as they have to become secure against a wider range of attacks happening at a higher frequency. Making matters worse, the existing detection solutions are often designed against a specific attack (or set of attacks) in controlled environments and lack the capacity to face the challenges of real-life deployment. This is evident from the results of the recent Deepfake Detection Challenge organized by Facebook, where the best-performing algorithm had a detection rate of only 65% when faced with unknown generation techniques. As such, addressing the vulnerabilities of existing solutions and introducing methods to mitigate them is of utmost importance for the deployment of these systems in practice.
One rarely studied deployment challenge is the attack selection process of a rational attacker. An attacker with an ever-growing menu of attack options can be expected to behave rationally and choose the most powerful attack available to him to maximize the chance of infiltration. Furthermore, as the defender does not have knowledge of, or access to, massive amounts of data for all possible attacks available to attackers, his detector will probably be tasked with the detection of unknown attacks, or of attacks for which only a few training examples are available. Additionally, a lack of explainability limits the use of a system in high-stakes applications, where explainability increases its utility when it is operated by a human supervisor.
In this article, to address these challenges, a game-theoretic approach is considered for the formulation of the interactions between the attacker and the detector. Building on this, an optimization criterion is formulated and a set of requirements is defined for designing the detector accordingly. To tackle the problems of unknown attack detection and few-shot learning, the use of unbiased compressed feature sets is proposed, and to target the optimal performance, a new loss function is defined that is faithful to the formulated optimization criterion. Finally, the explainability of the proposed method is demonstrated with a few examples. The rest of this article is organized as follows: in section 2, the related literature is reviewed, and a theoretic basis for the proposed approach is established in section 3. Afterward, the proposed method is introduced in section 4 and the case study experiment setup is explained in section 5. Finally, the results of the experiments are reported and analyzed in sections 6 and 7, and the article is concluded in section 8.

Literature Review
Considering the task of forgery detection or presentation attack detection on the face modality, there exist three relevant threads of research. First is the field of multimedia forensics, and more specifically, anti-counter-forensics (anti-CF). This thread of research takes an adversarial view of the problem and tries to optimize the performance of the detection system facing an adversary who is actively working to undermine the performance of the detector. Second is the field of presentation attack detection (PAD), in which the objective of the detector is to secure a biometric system against attacks from different presentation attack species (PAS). Lastly, the newly established thread of Deepfake detection is considered, which was initiated to address the new phenomenon of automated, open-source, photo-realistic digital video manipulation techniques becoming available on the internet. In this article, the terminology proposed for the field of presentation attack detection is adopted. Consequently, the act of forgery is called attacking the detector, and the generation techniques used by the forger are called attack species.

Anti Counter Forensics
The majority of solutions in the literature are designed neglecting the fact that an attacker works actively to undermine the performance of the detection system [1]. To address the vulnerability to CF attacks, many anti-CF techniques have been developed, with a focus on detecting the traces left by CF techniques. Anti-CF techniques often target a specific CF technique, and as a result, an obvious problem occurs when the attacker anticipates the use of the anti-CF technique and adjusts accordingly. In turn, the defender needs to resort to the introduction of a new detection system to detect the anti-CF attacks, resulting in a never-ending iterative loop with unforeseeable outcomes [2]. A possible solution to this problem is to design techniques that are intrinsically more resistant to CF attempts [3] [4]. For example, in [5] the authors proposed the use of second-order statistics derived from co-occurrence matrices and showed robustness against CF attacks. Zhang et al. [6] used a reduced feature set based on assumptions about the attacker's data manipulation strategy. A combination of one-class and two-class classifiers is proposed in [7]. Another interesting approach is the randomization of the feature selection process [8]. In [9] and [10], the authors propose the reuse of the original feature space for the detection of CF attacks by retraining for the task of double JPEG compression detection. A third group of solutions relies on game theory to model the interactions between the detector and the attacker and improve the performance of the detector at the final equilibrium [11] [12] [13]. All aforementioned methods address the case where the attacker has a limited choice of CF attacks and do not consider the attack selection process in the optimization of the detector.

Presentation Attack Detection
Similar to anti-CF techniques, the existing PAD research can be categorized into three branches: (1) PAD systems that address specific PASs, (2) PAD systems that increase or optimize the feature set to detect a higher variety of attacks, and finally, (3) PAD systems that rely on game theory to model the interactions between attacker and defender and optimize the PAD performance accordingly.
The early PAD methods addressed PAD for specific PAS, examples of which are methods relying on features such as blinking, head movement, and textures [14,15,16]. Different features have been used [17], [18], [19], [20], such as the 2D Fourier spectrum [21] and local binary patterns (LBP) [22]. The authors in [23], [24] and [25] presented central difference convolutional networks, layer-by-layer progressive compact space generation, and style transfer techniques, respectively. Many PAD methods rely on a feature set augmented using additional hardware. Examples include a 3D depth camera [26], a multi-spectral camera [27], and microphones [28]. However, these techniques require the addition of often expensive hardware to the pipeline, which may not be feasible in all applications. A few studies tried to use generalizable feature sets for PAD. In [21], the authors propose the use of image distortion analysis. The use of 25 general image quality features for PAD is investigated in [29]. In [30], a regression function is learned to map the image quality assessment scores. The use of pixel-level supervision for improving features is investigated in [31] and of regional self-supervision in [32]. A limited number of studies tried to address the generalizability of PAD systems [33] [34] using a one-class classification approach [35] [36], a deep metric learning model [37], and zero-shot learning [38]. To the best of our knowledge, no game-theoretic approach has been proposed to model the interactions between attacker and defender.

DeepFakes Detection
Several approaches have been proposed for detecting DeepFakes, such as lack of asymmetry in computer-generated imagery [39], spatio-temporal deformations of a 3D face model [40], use of periodic blood flow [41], generation flaws [42], blinking [43], and blood flow [44], face warping artifacts [45], use of face landmark locations [46], head pose consistencies [47], mesoscopic features [48], architecture-specific GAN fingerprints [49] [50] [51], convolutional neural networks (CNNs) [52], attention mechanism [53] and capsule networks [54], long short-term memory (LSTM) networks [55], recurrent CNNs [56], and optical fields [57]. However, such detectors tend to overfit to the known attacks and show limited generalizability [58]. The problem of generalization has been studied in a few articles using, e.g., auto-encoder in [59] and [60], incremental learning in [61], pre-processing artifacts in [62], transferability of the network in [63], time dimension with attention mechanism in [64]. Other works are [65], [66], [67] and [68]. Most studies have a heavy focus on DNNs/GAN generated artifacts and do not consider other types of manipulations. Also, none of the aforementioned studies take into account the rationality of the attacker nor the case in which the attacker has multiple choices of attack species. A summary of the most representative works in anti counter forensics, presentation attack detection and deepfakes detection is presented in Table 1.

Theory
In this section, I introduce the definition of a rational attacker and formulate such an attacker's pay-off equation and decision-making process. Furthermore, I discuss the detection strategy facing such an attacker and define the requirements for a PAD system accordingly. Lastly, I justify the use of one-class detection techniques based on generative models for unknown attack detection.

Rational Attacker
In most of the existing literature, the attacker's process for selecting which attack species to use is neglected and assumed to be random, leaving the proposed detectors with fundamental weaknesses. A rational attacker is defined as an attacker who, knowing the pay-offs of his possible choices, selects the one with the highest pay-off. From a game-theoretic perspective, the interactions between an attacker x and the defender can be modeled by a sequential asymmetric game in which the defender chooses a detector, after which the attacker administers their attack of choice. The attacker has to choose among a set of attack species A_x which represents all his options. The pay-off u_i for the attacker for an attack a_i ∈ A_x can be formulated as

u_i = (1 − p_i) r − p_i c_f − c_i

where r > 0 is the reward for a successful attack, p_i is the probability of detection (detection rate) for the attack species a_i, c_f > 0 is the cost of failure for the attacker, and c_i > 0 is the cost of the attack. To account for the budget of the attacker, only the attack species he can afford are included in A_x. The attacker can, with the help of trial and error as well as consultation of the experience of other attackers, obtain an accurate estimate of p_i for each a_i ∈ A_x. The attacker's goal is to choose the attack species that maximizes the pay-off function, provided that the highest pay-off is higher than the pay-off of not attacking the system. As r + c_f is constant for every individual attacker, the optimization corresponds to the selection of the attack species with the lowest weighted sum p_i (r + c_f) + c_i. In practice, it is fruitful for the defender to take c_i into account, as low-cost attack species are expected to occur more frequently than high-cost ones.
However, because measuring c_i for individual attack species falls outside the scope of this study, I assume the worst-case scenario in which the costs of all possible attack species are zero, enabling all attackers to use the more effective attacks regardless of cost, as long as their budget allows the attack to be included in A_x. Consequently, the pay-off formula boils down to u_i ≅ −p_i (up to constants that do not affect the choice), and the choice of the attacker is the attack with the lowest p_i, referred to as the most powerful attack (MPA). The values of the p_i depend solely on the choice of the detector by the defender.
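As a concrete illustration, the attacker's decision rule above can be sketched in a few lines; the attack names, detection rates, and default reward/cost values below are hypothetical, not values from this study.

```python
# Sketch of the rational attacker's selection (hypothetical numbers):
# u_i = (1 - p_i) * r - p_i * c_f - c_i, with r, c_f, c_i, p_i as in the text.

def payoff(p_i, c_i, r=1.0, c_f=1.0):
    """Expected pay-off of attack species a_i with detection rate p_i and cost c_i."""
    return (1.0 - p_i) * r - p_i * c_f - c_i

def most_powerful_attack(attacks):
    """The rational attacker's choice; with all costs set to zero this is
    simply the attack species with the lowest detection rate p_i."""
    return max(attacks, key=lambda a: payoff(a["p"], a["c"]))

attacks = [
    {"name": "print", "p": 0.95, "c": 0.0},
    {"name": "replay", "p": 0.80, "c": 0.0},
    {"name": "deepfake", "p": 0.60, "c": 0.0},  # lowest p_i
]
print(most_powerful_attack(attacks)["name"])  # -> deepfake
```

With all costs zeroed, the maximization of u_i degenerates to the argmin over p_i, which is exactly the MPA definition above.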

Multiple Attackers
A detection system faces not just one attacker but many different attackers with different sets A_x. Gathering statistics about the availability of attack species to the attackers would provide further knowledge about the probability of observing a specific MPA during the detection scenario. However, as such statistics are often not available for individual attackers, a conservative approach is to construct a union set A_X_k of all possible attack species for a group of attackers and assume that all attack species in A_X_k are available to all attackers of category k. By doing so, the PAD scenario is further simplified, as the distinction between individual attackers collapses and all attackers in each category become identical.
For example, using the budget as a categorizing factor, the attackers can be categorized into low-budget and high-budget, and the attack sets for low-budget attackers A_X_l and high-budget attackers A_X_h can be constructed. Next, using the probability of an attacker belonging to each category p(X_k) and the performance of the detector D on the MPA from that category perf(A_X_k | D), the expected overall performance of the system can be estimated as

perf = Σ_k p(X_k) · perf(A_X_k | D)

Other examples of categorizing factors are expertise, time budget, and access to unknown attacks or anti-forensic attacks. As the categorization of the attackers and the calculation of the probability of attackers belonging to each category fall outside the scope of this study, I assume a single category A_X for all attackers. From here on, I use the term attacker to refer to the hypothetical attacker that can administer all attacks in A_X.
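The expected-performance estimate above can be sketched as follows; the category probabilities and per-category MPA detection rates are hypothetical placeholders.

```python
def expected_performance(categories):
    """perf = sum_k p(X_k) * perf(A_{X_k} | D); `categories` holds
    (probability, MPA detection rate) pairs whose probabilities sum to 1."""
    assert abs(sum(p for p, _ in categories) - 1.0) < 1e-9
    return sum(p * perf for p, perf in categories)

# hypothetical numbers: 70% low-budget attackers whose MPA is detected at 0.9,
# 30% high-budget attackers whose MPA is detected at 0.6
print(round(expected_performance([(0.7, 0.9), (0.3, 0.6)]), 2))  # -> 0.81
```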

Detection Strategy
For deciding the best detection strategy, the accurate estimate of detection rate for individual attack species by the attacker can be interpreted as equivalent to having full knowledge over the detection performance over all a i ∈ A X . Due to the sequential nature of the game, the defender needs to choose p i s for individual attack species before the attacker decides which attack to choose. Subsequently, the rational attacker will choose the MPA which has the lowest detection rate depending on the defender's choice of detector.
Let us assume the set A denotes all possible attack species. In A, two attack species are considered different if they have different manufacturing/generation processes, including generation parameters such as manufacturer expertise, quality, and obfuscation. From the perspective of an attack detection system, an attack species can be categorized into one of three subsets: (1) known attack species (A_k), to which the detector is exposed during training and on which its performance is optimized, (2) unknown attack species (A_u), to which the detector is not exposed and on which its performance is unknown, and (3) anti-forensic attack species (A_a), signifying the attack species that are designed with knowledge of the weaknesses of the detector in mind and render the detector useless. These three subsets cover the whole set A. It is important to mention that these subsets can be expanded as new attacks are invented (become possible) and added to A.
To the extent of the knowledge available to the defender, A_k constitutes the set of all possible attack species, all while the attacker may be able to administer attacks falling outside A_k. The defender can know the detection rates for attack species in A_k and optimize them accordingly; however, he cannot know the detection rates for attack species in A_u. The best the defender can do in this case is to make an educated guess of what the minimum detection rate can be for attack species in A_u. To achieve this, every individual attack species in A_k can be left out as an imaginary unknown attack species during training, and the minimum detection rate across all leave-one-out (LOO) trials can be used as a rough estimate of the detection rate on the MPA in A_u.
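The leave-one-out estimate can be sketched as below; `train_and_eval` stands in for the actual training/evaluation procedure, and the detection rates in the toy table are hypothetical.

```python
def loo_unknown_estimate(train_and_eval, known_attacks):
    """Hold each known species out in turn, train on the rest, and take the
    minimum detection rate on the held-out species as a rough lower bound
    on the detection rate for the MPA among unknown attacks."""
    rates = []
    for held_out in known_attacks:
        training_set = [a for a in known_attacks if a != held_out]
        rates.append(train_and_eval(training_set, held_out))
    return min(rates)

# toy stand-in: pretend detection transfers well between "print" and "replay"
# but poorly to a held-out "deepfake" species
toy_rates = {"print": 0.92, "replay": 0.88, "deepfake": 0.55}
estimate = loo_unknown_estimate(lambda train, held: toy_rates[held],
                                ["print", "replay", "deepfake"])
print(estimate)  # -> 0.55
```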
The pay-off for the defender can be formulated as

v_i = −c_d − c_m (1 − p_i)

where c_d is a constant cost of detection, c_m is the constant cost of a missed detection, and p_i is the probability of detection of attack a_i, which matches the definition of p_i for the attacker. Knowing that the attacker will choose the MPA, i.e. the attack species with the lowest p_i, the defender's best strategy is to maximize the minimum p_i across both A_k and A_u in order to maximize v_i. There is a further objective of reducing the detection cost such that c_d is not prohibitively large, i.e. c_d << c_m (1 − p_i). The defender needs to choose to maximize p_i either for a_i ∈ A_k or a_i ∈ A_u, while limiting c_d according to the application-dependent c_m. As mentioned in Section 3.2, it is also possible to categorize the attackers into those with access to attack species from A_u and those without, and define an objective function that takes into account the minimum detection rate over both A_k and A_u. Yet, as the defender does not possess any knowledge of A_u, it logically follows that he does not have any knowledge about the probability of the attackers being able to use attacks that belong to A_u either, and would need to resort to an educated guess of that probability instead. In this study, I try to maximize the detection rate for the MPA from A_k and from A_u independently, corresponding to the cases where A_X ⊂ A_k and ∃ a_i ∈ A_X : a_i ∈ A_u respectively, and propose a fusion scheme that can be used to combine the resulting detectors without a significant loss of performance in either case.

Requirements
Following the aforementioned explanations, it is evident that the common approach of improving the average detection performance across known attacks is not viable when the detectors are deployed and face rational attackers. Consequently, a more sophisticated approach needs to be taken based on these analyses, in which the performance of a system is optimized considering the MPAs, unknown attacks, and adversarial attackers. To this end, the following set of requirements can be defined as guidance for the development of a robust detection system:
• It should have an optimal minimum detection rate across known attack species.
• It should have an acceptable minimum expected detection rate across unknown attack species.
• It should be able to learn to detect an unknown attack species optimally once it becomes known, even from only a few examples.
• The cost of detection should not outweigh the cost of missed detection.
• It should be robust against adversarial attacks.
The first two requirements can be directly justified according to the formulation of the problem provided in Section 3.3. The third requirement follows directly from the first two for the case when an unknown attack species becomes known. In this case, the newly known attack species qualifies as a known attack species and should follow the first requirement, even though there might exist only a limited number of available examples of it. Consequently, the detector should be able to learn to increase the detection rate of the previously unknown attack species to match that of known ones.
There are certain solutions in the literature that attempt to address the last requirement [8], however, to the best of our knowledge, there exists no method to prove the robustness mathematically, and empirical proofs would be limited to the specific anti-CF attacks that are considered. Consequently, for a detector to achieve robustness against adversarial attacks, it needs to survive the test of time. As such, fulfilling this requirement falls outside the scope of this study.

Generation-based Feature Sets
It is common practice to rely on discriminative models for the detection of attacks. However, the objective of a discriminative model requires it to focus on the discriminative features between bona fide (BF) data and known attack species. Consequently, these models do not learn discriminative features that are not directly useful for the detection of the presented known attacks. As such, these models often fail to infer information on unknown attacks for which the discriminative feature set is different from the learned one. In contrast, the objective of a generative model trained on BF data requires it to model all the variability in the BF data to the capacity of the model, and because of this, it does not over-represent some features while under-representing others. Using feature sets extracted by a generative model, a detector is expected to be more robust to unknown attack species as it has access to more informative feature sets [69], limited only by the capacity of the generator in learning the feature set corresponding to BF data [70]. Namely, GANs have been shown to be more effective for open-set recognition [71]. Hence, generative models can be used for anomaly extraction more effectively in unknown attack detection scenarios. Even though the features extracted using the generative model are not optimized for detection and might not outperform the discriminative features used by a discriminative model on known attacks, it can be demonstrated that they generalize better to unknown attack species as they have no bias regarding what the attack should look like [69] [70] [71].

Minimax Objective Function
Considering the known attack detection scenario, another limitation of most existing discriminative detectors is the reliance on the average loss for optimizing the parameters. However, as argued in Section 3.3, the performance of a detector against a rational attacker is not determined by the average detection rate, but the detection rate on the MPA. Accordingly, optimizing the average detection rate does not necessarily translate to the optimization of the detection rate against the MPA all while posing challenges for the detection of the under-represented attack species. In response to this limitation, objective functions that rely on minimizing the maximum loss (or maximizing the minimum gain) are proposed as a reliable alternative, for which the GAN loss [72] is a famous example.

Proposed Method
According to the requirements defined in Section 3.4, two separate detection methods are proposed for the two scenarios of known and unknown attack detection. Furthermore, a fusion mechanism is introduced to combine the decisions of the two detectors into a unified solution with few-shot learning capabilities. Both proposed methods rely on pixel-level generator-based anomaly features and a compact representation extracted from them to achieve better performance across unknown attack species. For the purpose of known attack detection, a new loss function is introduced which follows the defined objective of maximizing the minimum detection rate. For the purpose of unknown attack detection, I construct a generator-based one-class detector that relies on attack-unspecific anomaly-sensitive information extracted from the detection pipeline.

Pixel-Level Probability Distribution Modelling
A distribution model for BF images would provide an ideal model for presentation attack detection, as it would be a generative model that contains the complete feature set and can also provide a single detection score in the form of the likelihood of an observation under the BF distribution. However, due to the complexity of the distribution of BF images, the large amounts of data needed to train such a distribution properly, and finally the curse of dimensionality, training such a model is deemed impractical. Nevertheless, by breaking down the problem into modeling segments of an image rather than the whole image, there exist practical solutions.
PixelRNN [73] is a generative model that models the probability distribution of pixel intensity values conditioned on previous pixel values in raster order. This approach can be used to calculate log-likelihood values for observing individual pixels in an image, and once these values are aggregated, they can be used to estimate the log-likelihood of observing the input image as a whole. The pixel-level log-likelihood values can further be used for the localization of low-likelihood pixels (anomalies) in the input. In the proposed approach, the aggregated log-likelihood value is used as the first anomaly measurement for the one-class classifier, and a dimensionality reduction scheme is proposed to simplify the description of the localization information for extracting the second anomaly measurement, which is also used for training the proposed discriminative detector for known attack detection (Fig. 1).
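The aggregation and localization steps can be sketched as follows, using a synthetic stand-in for the per-pixel log-likelihood map a PixelRNN-style model would produce; the map size and injected anomaly are illustrative.

```python
import numpy as np

# `loglik` stands in for the H x W per-pixel log-likelihood map of a
# PixelRNN-style model; the values here are synthetic.
rng = np.random.default_rng(1)
loglik = rng.normal(loc=-2.0, scale=0.3, size=(64, 64))
loglik[20:28, 20:28] -= 5.0          # inject a low-likelihood (anomalous) patch

image_loglik = loglik.sum()          # first anomaly measure: whole-image score
anomaly_map = -loglik                # higher value = less likely pixel

# localize the strongest anomaly
y, x = np.unravel_index(anomaly_map.argmax(), anomaly_map.shape)
print(20 <= y < 28 and 20 <= x < 28)  # -> True
```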

Dimensionality Reduction
The pixel-level log-likelihood values provide valuable information about the severity of the anomalies at each location in the image. However, dealing with features the same size as the input video proves challenging, especially when the amount of training data is limited. To tackle this problem, the following dimensionality reduction scheme is proposed: as the location of anomalies is expected to remain roughly constant in a video, one can average the pixel-level log-likelihood values over the cropped face frames of the whole input video. This step serves two purposes: firstly, it collapses the data in the time dimension, and secondly, it reduces the noise in the frame-level representations. Next, I use a principal component analysis (PCA) transformation learned on BF data to reduce the dimensionality further (Fig. 1).
The PCA transformation extracts the directions in which the variability of the BF data is best explained. It can also be used to extract the directions in which the input data shows little variability. The components for which the BF data shows little variability fit well with the definition of anomaly features, and they are a good representation of the similarities between the BF samples. Additionally, the unexplained variability of the input after transformation to the PCA space can provide further anomaly clues. This unexplained variability can be measured as the distance between the input and its projection on the PCA hyper-plane. Thus, I augment the PCA-transformed features with the measurement of unexplained variability. The resulting compact representation manages to conserve the discriminative information in the input video effectively while reducing the dimensionality by a further factor of ≈ 1000. The amount of shift along the PCA dimensions in which BF samples show little variability, along with the unexplained variance measurement, can directly be used for one-class detection. To reduce it to a single score, the energy of the input across these dimensions can be measured by calculating the norm of the signal across them. However, as the unexplained variability is on a different scale compared to the PCA transformation values, a normalization step is required. Normalization can be done by making the distribution of the BF samples across these dimensions zero-mean and unit-variance.
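A minimal numerical sketch of this scheme is given below, with synthetic stand-in data; the retained dimension m, tail size k, and the normalization of the residual by its BF standard deviation are illustrative choices, not values from the article.

```python
import numpy as np

# Synthetic stand-in: in the article, rows of `bf` would be flattened,
# time-averaged pixel-level log-likelihood maps of bona fide videos.
rng = np.random.default_rng(0)
bf = rng.normal(size=(200, 64))               # bona fide training vectors
mean = bf.mean(axis=0)
_, _, vt = np.linalg.svd(bf - mean, full_matrices=False)

m, k = 32, 8                                  # illustrative dimension choices
plane = vt[:m]                                # retained PCA hyper-plane
tail = plane[-k:]                             # retained directions with least BF variability
tail_std = ((bf - mean) @ tail.T).std(axis=0) # BF scale along those directions
resid = (bf - mean) - (bf - mean) @ plane.T @ plane
res_std = np.linalg.norm(resid, axis=1).std()

def anomaly_score(x):
    """Energy along the low-variance directions plus the normalized
    unexplained residual off the PCA hyper-plane."""
    c = x - mean
    coords = (tail @ c) / tail_std                        # normalized shifts
    residual = np.linalg.norm(c - plane.T @ (plane @ c)) / res_std
    return float(np.sqrt(np.sum(coords ** 2) + residual ** 2))
```

A sample lying at the BF mean scores near zero, while a sample shifted along a low-variance direction scores high, which is the behavior the one-class detector relies on.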

Categorical Margin Maximization Loss
As the performance of a system in deployment is measured according to its performance on the MPA, a new minimax loss function needs to be introduced that optimizes the detector towards achieving the highest MPA detection rate possible. In this approach, motivated by the success of the triplet loss [74], I introduce the categorical margin maximization loss (C-marmax), which weighs attacks exponentially according to the difficulty of classification, and thus focuses on reducing the loss from the most difficult samples (MPAs) in each batch during training. Using C-marmax, the network transforms the aforementioned compact representations into embeddings on a unit hyper-sphere where the distance between the BF data and attacks is maximized while the distance between attacks from the same species, as well as between BF samples, is minimized. In this loss, the distances between attacks from one species and other species are ignored, as we do not have any information about the similarity or dissimilarity between the distributions of any two attack species. Hence the loss is categorical, as it only considers distances between observations from different categories (i.e. BF vs. attacks) for calculating the loss value. Finally, to exaggerate the loss from samples belonging to the MPA and suppress the loss from other attack species, the loss attributed to the anchors is exaggerated according to the distances such that the network pays more attention to marginal anchors to fulfill the objective of maximizing the minimum detection rate.
In attack detection scenarios, there are only a few classes, and it is possible to rely on the distance to the center of a distribution in a batch rather than the distances between individual samples. To this end, in each batch, I compute the location of the center of the distribution for each attack species as well as for the BF data on the unit hyper-sphere, and according to the label of the inputs, I use these centers to measure the distance of the anchor to the positive distribution d_p and to the negative distribution d_n. To achieve the maximum margin possible between the distributions of BF samples and PA samples in the embedding space, a fixed margin is not defined. Instead, the ratio d_p/d_n is used for the maximum d_p and minimum d_n in a batch from each class, requiring the numerator to be minimized to zero while the denominator is maximized to the maximum possible value of 2 on the unit hyper-sphere. To prevent the loss value from becoming infinite when d_n is zero, the ratio is modified to d_p/(d_p + d_n), which is equivalent to d_p/d_n when d_p << d_n. Furthermore, to exaggerate the loss for marginal observations (where d_p is high) in comparison to non-marginal observations (where d_p is low), exponentiation is used, and the resulting formula becomes (d_p/(d_p + d_n))^g.

As the defined loss does not maximize the distance between the centers of the distributions directly, to assure that the centers are far from each other, the minimum distance between two centers is floored at √2, corresponding to 90 degrees on the unit hyper-sphere, with a second loss term. The final loss function is summarized as follows:

loss_m = ( d(a, C_p) / (d(a, C_p) + d(a, C_n)) )^g
loss_c = max(0, √2 − d(C_p, C_n))
loss = loss_m + loss_c

where d stands for the Euclidean distance, a signifies the anchor, C_p is the center of the positive class, C_n is the center of the negative class, g is the exaggeration factor, loss_m is the margin loss, and loss_c is the center loss. During decision making, the Euclidean distance to the center of the BF distribution can be used for scoring. This distance can further be converted to an attack detection probability value by dividing it by 2.
In comparison to the triplet loss, the proposed modifications result in a tunable exaggeration of the loss on misclassified samples and a suppression of the loss on correctly classified ones, and relax the need for a fixed margin constant. Having no constant margin, the network can continue training even after a specific margin is achieved between the classes, until the maximum margin on the hyper-sphere is reached. Furthermore, by using the centers of the distributions instead of the distances between individual anchors, the loss becomes less stochastic, allowing faster convergence. To the same effect, the categorical nature of the proposed loss relaxes the untrue assumption that all attacks come from the same distribution regardless of their corresponding attack species.
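A simplified NumPy sketch of C-marmax for a single batch is given below. Where the article leaves details open, the code makes flagged assumptions: the negative center for attack anchors is taken as the BF center and, for BF anchors, as the nearest attack center (the article only states that distances between different attack species are ignored), and g = 4 is an arbitrary exaggeration factor.

```python
import numpy as np

def _unit(v):
    return v / np.linalg.norm(v)

def c_marmax_loss(emb, labels, g=4):
    """Sketch of C-marmax for one batch of near-unit-norm embeddings `emb`
    with integer labels (0 = bona fide, 1..K = attack species)."""
    classes = sorted(set(labels.tolist()))
    centers = {c: _unit(emb[labels == c].mean(axis=0)) for c in classes}
    loss_m = 0.0
    for c in classes:
        anchors = emb[labels == c]
        # maximum d_p and minimum d_n in the batch for this class
        d_p = np.linalg.norm(anchors - centers[c], axis=1).max()
        if c == 0:   # assumption: BF anchors vs. the nearest attack center
            d_n = min(np.linalg.norm(anchors - centers[o], axis=1).min()
                      for o in classes if o != 0)
        else:        # assumption: attack anchors vs. the BF center
            d_n = np.linalg.norm(anchors - centers[0], axis=1).min()
        loss_m += (d_p / (d_p + d_n + 1e-12)) ** g
    # center loss: floor BF-to-attack center distances at sqrt(2) (90 degrees)
    loss_c = sum(max(0.0, np.sqrt(2.0) - np.linalg.norm(centers[0] - centers[c]))
                 for c in classes if c != 0)
    return loss_m + loss_c
```

In a quick check, a batch of well-separated BF/attack embeddings yields a near-zero loss, while a collapsed batch incurs both a large margin loss and a center loss, which is the gradient signal the loss is designed to provide.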

Unknown Attack Detection
As argued in Section 3.5, a discriminative model may overfit to certain discriminative features that correspond to the bias in the known attack species used in training. This also holds true for the presented C-marmax loss: even though it tries to achieve a balanced attack detection performance across known attack species, it may exclude discriminative features that are important for the detection of unknown attack species. As such, to detect unknown attacks, a one-class detector is proposed which has no bias towards any specific attack species; in other words, to it, all attacks are unknown. As explained in Section 4.1, the log-likelihood value of observing an image serves as a good general-purpose anomaly detection measure. However, this metric does not include the other important discriminative feature available in the pixel-level log-likelihood data, namely the location information. As explained in Section 4.2, the location-relevant anomalies can be represented by the components of a PCA transform trained on BF data along which the BF data shows the least variability. Furthermore, this representation can be augmented with the unexplained variance in the form of the distance of an observation to the PCA hyper-plane. Finally, the energy of the signal across the resulting representation, after normalization, can be used as an anomaly score. Following these steps, a second, location-sensitive anomaly measure is derived. Assuming a Gaussian distribution of the BF scores for both anomaly measures, one can calculate the likelihood of an observation belonging to the BF distribution as the final probability score. For the final score of the one-class detection scheme, I simply average the two resulting likelihood scores from the log-likelihood measure and the PCA-based measure (Fig. 1).
To fuse the probability scores from the discriminative detector and the one-class detector when they are employed together, I use the following logic: if the discriminative detector decides that a sample is an attack, it most certainly is one. However, if the discriminative detector decides that the sample is a BF, the defender cannot be sure, as the sample might come from an unknown attack, so the one-class detector is consulted for a decision. This two-step decision logic can be interpreted as applying an OR gate to the decisions of the discriminative and one-class detectors. However, as both systems provide probability scores rather than hard decisions, and considering that A ∨ B = A + B − AB, or equivalently ¬(A ∨ B) = ¬A × ¬B, the following fusion formula is proposed to mirror the logic-level decision making:

p_PA = p_PA^D + p_PA^O − p_PA^D × p_PA^O, or equivalently p_BF = p_BF^D × p_BF^O

where p_PA corresponds to the probability of belonging to the attack category, p_BF corresponds to the probability of belonging to the BF category, and the superscripts O and D correspond to the one-class and discriminative detector models.
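The OR-gate fusion logic described above can be expressed in a few lines (an illustrative sketch; the function and argument names are assumptions):

```python
def fuse_scores(p_pa_disc, p_pa_oneclass):
    """OR-style fusion of attack probabilities from the discriminative (D)
    and one-class (O) detectors. The sample is a BF only if BOTH detectors
    consider it a BF, so the BF probabilities multiply and the attack
    probability is the complement."""
    p_bf = (1.0 - p_pa_disc) * (1.0 - p_pa_oneclass)
    return 1.0 - p_bf
```

Note that a confident attack decision from either detector dominates the fused score, mirroring the two-step decision logic.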

Experiment Setup
For measuring the effectiveness of the proposed method, its application to both the presentation attack detection and the Deepfake detection tasks is considered. In this section, a description of the datasets used is provided, followed by the parameters used in training. Lastly, the measures used for the evaluation of the method are described.

Datasets
To show the performance of the proposed method for presentation attack detection, the SiW-M dataset 4 [38] is selected due to its large collection of presentation attack species. Similarly, the FaceForensics++ dataset 5 [52] is chosen for the task of Deepfake detection as it contains the widest choice of species among the available datasets.

SiW-M
This dataset consists of 660 BF videos from 493 subjects of diverse ethnicities and ages. Furthermore, it includes 966 PA videos from 13 different PAS, collected under various environmental conditions, extreme face pose angles, and lighting conditions. The videos are around six seconds in length. This dataset is specifically designed for the evaluation of generalization performance across unknown PAS. The attack species in this dataset are categorized into replay, print, mask, makeup, and partial attacks. The PAS form a diverse set of attacks including print and display attacks as well as transparent masks and impersonation makeup. This dataset also includes PAS corresponding to partial attacks.
For training the models, 530 randomly chosen BF videos are used, while 65 randomly chosen BF videos are kept for development purposes, leaving 65 videos for testing. For training the classifier in the unknown case, a LOO setup is used: for each attack species, all the videos from the other attack species are used for training, along with the training and development BF data. For few-shot learning, an additional one or five randomly chosen videos from the targeted attack species are included in the training, while in the known case 50% of the videos are included.

FaceForensics++
The FaceForensics++ dataset contains four PAS corresponding to Deepfakes 6, Face2Face [75], FaceSwap 7, and Neural Textures [76]. The dataset contains 1,000 BF videos and 1,000 videos from each PAS, each split into three sets, reserving 72% for training, 14% for validation, and 14% for evaluation. The videos are collected from YouTube and, after manipulation, recompressed at three video qualities for the evaluation of performance under various compression levels. For the purpose of analyzing performance on unknown attacks, only the non-compressed version of the data is used. As with the SiW-M dataset, both known and LOO unknown attack detection experiments are considered.

Parameters
The proposed method has a number of parameters corresponding to face detection, the pixel-level log-likelihood extraction model, the PCA model, and finally the classifier. In this study, the videos are considered as a set of frame images. The face region is extracted in each frame after face detection using the Dlib toolkit [77], and the cropped faces are resized to 128 × 128.
The overall pipeline of the proposed detection mechanism is visualized in Fig. 1, along with information about where the training data, development data, and known attack data are used. The input image is first processed by the PixelCNN++ model trained on the training data, resulting in an aggregated observation log-likelihood and pixel-wise log-likelihood matrices. The aggregated observation log-likelihood is compared to the distribution of BF values learned from the development data to obtain the first generator-based anomaly measure. The pixel-wise log-likelihood matrices are further normalized to zero-mean unit-variance using the distribution of pixel values in the training data before the PCA transform is applied. The PCA transform is learned on the training data, and the PCA-transformed representation is augmented with the unexplained variance measure and normalized to zero-mean unit-variance across all dimensions using the development data. Then, after sorting the components by the explained variance of the training data in descending order, the last components are used for calculating the norm. This value is then compared to the distribution of BF scores learned on the development data for calculating the second generator-based anomaly measure. The first and second probability scores are combined by averaging, resulting in a single one-class classification score. The augmented and normalized PCA representations are then passed to the discriminative classifier trained on BF data from the training and development sets along with attack data from known attacks.

4 http://cvlab.cse.msu.edu/siw-m-spoof-in-the-wild-with-multiple-attacks-database.html
5 https://github.com/ondyari/FaceForensics
6 https://github.com/deepfakes/faceswap
7 https://github.com/MarekKowalski/FaceSwap/

PixelCNN++
For pixel-level log-likelihood matrix extraction, a PixelCNN++ 8 [78] model is trained on the resized cropped face images extracted from the BF training data. The model consists of three hierarchies with five ResNet layers each, with 160 filters with a receptive field of 3 × 3 in each layer, resulting in 95 million parameters. Concatenated ELU [79] is used for activation, and pixel intensity values are modeled using 10 logistic distributions. For regularization, dropout with a probability of 50% is used. The model is trained with a batch size of one using the ADAM [80] optimizer with a learning rate of 10^-5 for 500 epochs, with a single randomly chosen frame per training video in each epoch.
The log-likelihood matrix is then generated by concatenating the pixel log-likelihood values of each of the 10 logistic distributions for each color channel, resulting in a matrix of size 128 × 128 × 30. For calculating the log-likelihood of observing the video, the likelihood of observing each individual frame is calculated using the weighted sum of the individual logistic distributions across the whole cropped face image. These values are then averaged across time to measure the average log-likelihood of the observed input video, which is used for one-class detection. For extracting location-sensitive features, after averaging the pixel-level log-likelihood matrices across the whole input video, the distribution of log-likelihoods at each pixel location is normalized such that the BF training data has a zero-mean unit-variance distribution, resulting in a matrix of size 128 × 128 × 30 per video.
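The time-averaging and per-location normalization step can be sketched as follows (shapes follow the text; the function name and the epsilon guard against zero variance are illustrative assumptions):

```python
import numpy as np

def location_features(video_ll, bf_mean, bf_std):
    """Average the per-frame pixel-level log-likelihood matrices over time,
    then z-score each pixel/channel location so that BF training data is
    zero-mean unit-variance there.
    video_ll: (T, 128, 128, 30); bf_mean, bf_std: (128, 128, 30)."""
    avg = video_ll.mean(axis=0)
    return (avg - bf_mean) / (bf_std + 1e-12)
```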

Principal Component Analysis
In the next step, these matrices are extracted from the BF training data to train a PCA model, with components sorted according to their explained variance in descending order. The unexplained variance is measured by calculating the Euclidean distance between each input and its projection on the PCA hyper-plane and is appended to the end of the PCA representation. The PCA representation is normalized to have zero-mean unit-variance on the BF data from the validation set. For one-class detection, the energy of the input video across the last 10% of the PCA representation is measured as the norm after normalization. Using the distribution of the norm values across the validation data, a single Gaussian model is trained for calculating the likelihood of a given input belonging to the BF distribution. The same approach is taken for the video log-likelihood values collected directly from the output of the PixelCNN++ model. These two likelihood values are averaged to calculate the final score of the generator-based one-class detector.
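The PCA-based anomaly scoring can be sketched as follows. This is a simplified NumPy sketch: the validation-set normalization and the final Gaussian likelihood step are omitted, and all names are illustrative assumptions rather than the author's implementation.

```python
import numpy as np

def pca_anomaly_score(x, components, mean, keep_frac=0.10):
    """Project a flattened log-likelihood matrix onto the PCA basis,
    append the unexplained-variance term (Euclidean distance of the input
    to the PCA hyper-plane), and return the norm over the last `keep_frac`
    of the representation. `components` rows are assumed sorted by
    explained variance in descending order."""
    centered = x - mean
    proj = components @ centered                 # PCA coefficients
    recon = components.T @ proj                  # projection onto the hyper-plane
    residual = np.linalg.norm(centered - recon)  # unexplained variance
    rep = np.concatenate([proj, [residual]])
    k = max(1, int(len(rep) * keep_frac))
    return np.linalg.norm(rep[-k:])
```

An input lying exactly on the PCA hyper-plane with no energy in the trailing components scores zero; off-plane energy raises the score.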

Classifier
The PCA representation is also used for training the discriminative classifier with the aforementioned loss function. A DNN model with four hidden layers, each with 512 ReLU-activated units, is trained to map its input to an L2-normalized embedding space of six dimensions (Fig. 2). Due to the limited amount of training data available for the classifier, dropout regularization with a rate of 50% is used on the output of each hidden layer, along with L2 regularization with a factor of 10^-6. Oversampling is done by using random segments of the training videos and their vertically flipped copies, while testing is done on the whole test videos. The training data is balanced by repetition so that 50% of the samples are BF and the remaining 50% are divided equally among the known attack species (50%/#PAS each). The loss function has only one tunable parameter, g, which was set to two to achieve fast convergence. Training is done with a batch size of 128 for 100 epochs with a fixed learning rate of 10^-3 using the ADAM optimizer. Finally, the detection probability score is calculated by measuring the Euclidean distance of the embedding to the average of the validation data embeddings, divided by two. The fusion of the probability scores calculated by the generator-based one-class detector and the discriminative detector is done using the formula in Section 4.4.
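The distance-based scoring at decision time can be sketched as follows (illustrative; names are assumptions):

```python
import numpy as np

def detection_probability(embedding, bf_center):
    """Attack probability: Euclidean distance of the L2-normalized embedding
    to the BF center, divided by 2 (the maximum possible distance between
    two points on the unit hyper-sphere)."""
    return np.linalg.norm(np.asarray(embedding) - np.asarray(bf_center)) / 2.0
```

An embedding at the BF center scores 0, while an antipodal embedding scores 1.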

Metrics
To evaluate the performance of the proposed system, the threshold-less equal-error-rate (EER) metric is used. The EER measures the error rate at the operating point where the missed detection percentage equals the false alarm percentage. For the evaluation of performance across all attack species, the EER value for the MPA is chosen by measuring the maximum EER across all species, following the arguments presented in Section 3.3. Furthermore, the detection error trade-off (DET) curve is used for showing the missed detection rate at each false alarm value. Missed detection corresponds to the bona fide presentation classification error rate (BPCER) and false alarm corresponds to the attack presentation classification error rate (APCER) in ISO/IEC 30107 terminology 9. In the experimental result tables, ACER@APCER=5% is reported. The BPCER@APCER=5% can be calculated as BPCER = (ACER × 2) − 5%.
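A threshold-sweep computation of the EER can be sketched as follows (a simple illustrative implementation, not the evaluation code used in the experiments):

```python
import numpy as np

def eer(bf_scores, attack_scores):
    """Equal error rate: sweep thresholds over the attack-probability scores.
    Higher score = more likely attack. BPCER counts BF samples classified as
    attacks; APCER counts attack samples classified as BF. The EER is taken
    at the threshold where the two rates are closest."""
    bf = np.asarray(bf_scores, dtype=float)
    pa = np.asarray(attack_scores, dtype=float)
    best_gap, best_eer = np.inf, 1.0
    for t in np.sort(np.concatenate([bf, pa])):
        bpcer = np.mean(bf >= t)   # bona fide rejected as attack
        apcer = np.mean(pa < t)    # attack accepted as bona fide
        gap = abs(bpcer - apcer)
        if gap < best_gap:
            best_gap, best_eer = gap, (bpcer + apcer) / 2.0
    return best_eer
```

Perfectly separated score distributions yield an EER of 0, while fully interleaved ones approach 0.5 (chance level).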

Presentation Attack Detection
In this section, the adequacy of the proposed generator-based anomaly representations is first explained. Then, the performance of the proposed method based on these representations is evaluated and compared to the existing solutions in both known and unknown attack detection scenarios. Lastly, the few-shot learning capacity of the proposed method is investigated and the computational cost of the pipeline is reported.

Fig. 3 shows examples of the log-likelihood matrices extracted by the PixelCNN++ model for sample frames from the BF data as well as from each attack species. It can be seen that the BF data shows a few isolated anomalous pixels corresponding to the natural variations in the BF frame, as well as anomalies around the location of the glasses. In contrast, each attack species shows its own pattern of anomalies corresponding to the locations where it is observed. For example, for the obfuscation makeup attack, the anomalies correspond to where the eyebrow and beard lines are drawn; for the mannequin attack, to the skin regions; for the paper mask, to the fold locations; and for the replay attack, to the overexposed regions of the face. These examples show the capacity of the representation to provide explainability at the pixel level. To further analyze the unique patterns of each attack species, the average log-likelihood matrix for each species is presented in Fig. 4. The average and standard deviation of the log-likelihood values for the training BF data are shown in the first column. From these two images, it can be seen that most of the natural variability in the training data corresponds to the eyes and the nasal dorsum as well as the background, while the periocular region of the face contains a lower natural log-likelihood.
After normalization of the average log-likelihood matrices for the test data using these two matrices, it can be seen that the test BF data matches the training BF data average, while each attack species shows a different pattern of low-likelihood and high-likelihood regions. Attacks with unusually high likelihood over the skin region are cosmetic makeup, impersonation makeup, half mask, mannequin, silicone mask, print, and partial cut attacks. This effect can be interpreted as the over-smoothness of the skin texture in these attacks. Attacks with unusually low likelihood over the skin are obfuscation makeup, paper mask, and to some extent transparent mask, which can, in turn, be interpreted as severe anomalies in the skin texture. As expected, partial attacks show anomalies in the region of the image where the attack is applied. Fig. 5 shows the t-SNE embeddings [81] of the normalized average pixel log-likelihood matrices from each video. From this figure, it is evident that the representation manages to cluster attacks from the same species together with few exceptions. Furthermore, it shows a good separability between BF data and presentation attack data, while the training BF data distribution overlaps with the test BF data. These are remarkable characteristics for features generated by the proposed anomaly extraction, which was trained in an unsupervised manner on only BF examples. This separation is, however, not perfect, as a cluster of BF samples is located inside the attack distribution with high overlap with the partial funny eye and partial paper glass attacks. In addition, clusters of presentation attacks exist inside the BF distribution; the majority of these samples are from the transparent mask, obfuscation makeup, and partial paper glass attacks. By looking at Fig. 4, it can be seen that all these attacks share a characteristic where the average log-likelihood matrix has lower values in the skin region, in contrast to the other attacks, where the skin region shows higher log-likelihood values.

Figure 5: The t-SNE graph of the average log-likelihood matrices for all the data available in the SiW-M dataset. Each point represents a video, and each attack species is visualized with a different shape and color. The training BF data is shown with gray dots while the test BF data is shown with pink pluses. A clear separation is visible between BF data and attack data.

One-class classification
The performance of both anomaly measures in the proposed one-class classification scheme, along with the combined one-class detection score for each species, is presented in Table 2. Even though the EER values for the detection of individual attacks, with the exception of the impersonation makeup attack, are far from acceptable, these anomaly measures show a balanced performance across all attack species. It can be seen that the fusion of the two anomaly measures successfully reduces the EER close to the smaller of the two values for all attacks, and consequently the MPA EER is reduced by 10%. The method performs significantly better on impersonation makeup attacks than on the other attacks, while transparent mask and paper glasses attacks are the most challenging for the system.
To see the effect of the number of PCA components on the detection rate, Fig. 6 shows the average as well as the maximum EER over all species after filtering out the first n components of the PCA representation. It can be seen that, as hypothesized, the last PCA dimensions contain a significant amount of attack-unspecific discriminative information. The correlation between the aggregated log-likelihood measure and the anomaly norm measure is 0.15, signifying the complementary potential of these measures. The combined scores reflect this complementary nature and result in a detector with an MPA attack detection EER of 27.1%. The DET curve of the resulting one-class detector is shown in Fig. 7 for all attack species. This plot reveals three clusters of curves: transparent mask, silicone mask, partial paper glass, paper mask, partial funny eye, obfuscation makeup, and cosmetic makeup attacks with EERs above 20%; partial paper cut, replay, half mask, print, and mannequin attacks with EERs between 10% and 20%; and finally impersonation makeup with an EER below 5%. The attacks with EERs above 20% reflect the overlaps observed in Fig. 5. The attacks with an EER below 20% show similarities in their average log-likelihood images in Fig. 4, while the other attacks each have their own dissimilar patterns.

Detection Performance
In the following, the detection performance in terms of MPA EER is presented and analyzed for the detection of known attacks, unknown attacks, and few-shot learning.

Known attacks
The performance of the proposed methods in comparison to the existing detection methods applied to the SiW-M dataset is reported in Table 3. It can be seen that even though the proposed method is outperformed on most individual attacks, the focus of the loss function on the MPA results in a lower EER on the most difficult attack species, namely cosmetic makeup. As a result, the proposed discriminative detector achieves 9.7% EER on the MPA, reducing the MPA EER by 37% compared to the best existing detector. The proposed fusion mechanism further reduces the MPA EER to 8.5%. The DET curve of the proposed discriminative detector is shown in Fig. 8. It is worth noting that the clusters visible in Fig. 7 are merged together and the curves follow a similar course, representing a more balanced detection performance. Furthermore, the curve for impersonation makeup is almost identical to the one-class classification curve, showing that the proposed C-marmax loss successfully avoided optimizing the performance on this attack, which was the easiest to detect using its input. A similar pattern is observable in Table 3, where print and impersonation makeup attacks receive the smallest boost in performance after the application of the discriminative classifier. Due to the small number of test samples, the DET curve shows abrupt changes, indicating that more data is needed for a more precise measurement of the EERs.

Unknown Attacks
The results for the proposed method along with the performance of the existing detectors under unknown attack conditions are presented in Table 4. It can be seen that, as expected, the one-class detector performs better than all discriminative detection methods in terms of MPA EER, including the proposed one. However, it is worth mentioning that the discriminative detectors gain an advantage on certain PASs where the discriminative features of the unknown PAS are similar to those of the known ones used in training. This distinction is visible in cases where a significant difference exists between the performance of the one-class classifier and that of the discriminative classifier, such as for the silicone mask and mannequin attacks. A close observation of Fig. 5 reveals that samples from these two attacks are not clustered together in the anomaly feature space. The proposed fusion method manages to cap the EER for the partial paper glasses and partial funny eye attacks, where the proposed discriminative detection method fails, without hindering the performance in cases where the discriminative detection method performs well. Considering the existing solutions, only one approach, namely LLIG [32], has a better-than-chance detection rate for the MPA. This is concerning, as it shows that all other existing methods would be ineffective against a rational attacker and, in the case of [31], would actually increase the efficacy of attacks. The proposed method achieves an MPA EER of 27.8% and outperforms all baseline methods.

OULU-NPU dataset
In Table 6, the results of the proposed framework as well as of previously presented strategies from the literature are reported on the OULU-NPU dataset [82]. Fig. 10 shows the average and standard deviation of the log-likelihood matrices for the training data along with the average matrices for the test data. The OULU-NPU face presentation attack detection database is composed of 4,950 real-access and attack videos. The videos were captured using the front cameras of six mobile devices in three sessions with different background scenes and illumination conditions. There are two attacks, i.e., print and video-replay, which were generated via two printers and two display devices. In this study, OULU-NPU Protocol II is adopted because it presents an unknown attack detection scenario: the effect of attack variation is assessed by introducing previously unseen print and video-replay attacks in the test set. It can be observed in Table 6 that the performance obtained using the proposed framework is better than that of prior methods. For instance, the presented method with C-marmax achieves 2.4% EER, whereas the scheme proposed in [84] obtains 6.0% EER. It is also worth noticing that the proposed system with the one-class classifier alone did not perform well, but the proposed fusion scheme avoids a major loss and attains notable accuracy.

Few-shot learning
In Table 5, the performance of the proposed method on the task of few-shot learning is presented for one or five examples and compared to the unknown and known cases. It can be seen that by observing even one example of an unknown PAS, the performance of the system improves, and the MPA EER is reduced by 45%, from 33.5% to 18.3%, with the observation of five examples. As such, the proposed system shows the capacity to significantly reduce the EER after the presentation of a few examples of a new PAS. It can also be seen that, specifically in the case of the impersonation makeup attack, the observation of new samples does not reduce the EER. This can be explained by the fact that the EER is already low in the zero-shot case, and the proposed C-marmax loss does not reward further improvement in the EER of this attack as it does not improve the overall MPA EER.

Detection Cost
Due to the large size of the PixelCNN++ model, the extraction of the pixel log-likelihood matrix for each frame is the bottleneck and takes roughly 75 milliseconds in our setup. Considering the average length of six seconds for the 24 FPS videos in the dataset, processing each video takes about 9 seconds, i.e., 1.5 times the video duration, so the pipeline runs slower than real time. This may amount to a prohibitively high detection cost in certain applications such as smartphone-based detection or social media monitoring. However, according to Eq. 2, the proposed method can still find applications where the cost of a missed detection is high, such as border control and authenticity verification in journalism.

Deepfake Detection

Fig. 9 shows the average and standard deviation of the log-likelihood matrices for the training data along with the average matrices for the test data. It can be seen that most variations in the data come from the background, forehead, and cheeks, while the eye and mouth regions show little variability with a low log-likelihood average. The BF test data average matches that of the training BF data. However, there are distinct patterns corresponding to each attack species. In the case of Deepfakes and NeuralTextures, there is a high log-likelihood region on the lower half of the face, corresponding to the possible over-smoothness of the texture. In the case of Deepfakes, there is a low-likelihood region around the eyebrows and the chin line, which corresponds to the locations where the artifacts characteristic of Deepfakes often occur. For the Face2Face technique, the pattern corresponds to points with low log-likelihood around the nose and chin line, while for the FaceSwap technique, the pattern corresponds to the eyes, nose, and mouth regions. Table 7 shows the performance of the one-class detector and the proposed discriminative detector as well as their fusion.
It can be seen that the one-class detector manages to achieve an acceptable MPA EER of 8.21%, while the discriminative detector achieves near-perfect video-level detection. The fusion does not significantly degrade the performance of the discriminative detector. It is important to mention that known attack detection on the raw subset of the dataset is a solved problem, with near-perfect frame-level detection rates reported in the baseline [52]. Table 8 reports the detection performance in the LOO unknown attack detection scenario. The low EERs of the discriminative detector show that there are mutually discriminative features across the known and unknown attacks, especially for the Face2Face and NeuralTextures methods. However, the FaceSwap method shows less similarity to the other methods, and this results in an increase in the EER of the discriminative detector compared to the one-class one. Furthermore, the fusion mechanism manages to lower the MPA EER significantly, and an MPA EER of 2.5% is achieved for unknown attack detection. Due to the ease of spotting digital manipulation traces in raw videos, the overall EERs in terms of MPA are much lower than in the PAD experiments.

Conclusion
The choice of attack by a rational attacker can have a significant negative impact on the performance of detection systems in real-life scenarios. In response, after relying on game theory to build a theoretical basis and formulate the interactions between the attacker and the defender, a new detection method is proposed to optimize the performance against attacks from such attackers. Experiments on the tasks of presentation attack detection and Deepfake detection show the effectiveness of the proposed method in improving the detection rate on the most powerful attacks, both in known attack cases and when the detector faces unknown attacks. Furthermore, the proposed feature set enables few-shot learning and explainability at the pixel level. The proposed method shows generalizability across widely different types of attacks, ranging from Deepfakes and replay attacks to 3D masks and makeup attacks, and is able to show where the artifacts commonly occur for each specific attack species. Also, the unsupervised anomaly detection method used is able to produce representations that cluster attacks from the same species together and separate BF samples from attacks, with limited training data and in unconstrained recording conditions. However, this method has two specific shortcomings. First, the extraction of the anomaly representations is computationally expensive, and thus the system cannot be deployed in applications where processing an input video must be faster than real time, such as automated content monitoring on social media. Second, despite the proposed method outperforming the state of the art in the task of presentation attack detection, its expected 27.8% EER against the most powerful unknown attack is still far from acceptable for real-life applications, showing the need for further research in this direction. The availability of more training data from a more diverse set of attacks may alleviate this limitation.