Statistical inference for generative adversarial networks and other minimax problems

This paper studies generative adversarial networks (GANs) from the perspective of statistical inference. A GAN is a popular machine learning method in which the parameters of two neural networks, a generator and a discriminator, are estimated to solve a particular minimax problem. This minimax problem typically has a multitude of solutions, and the focus of this paper is the statistical properties of these solutions. We address two key statistical issues for the generator and discriminator network parameters: consistent estimation and confidence sets. We first show that the set of solutions to the sample GAN problem is a (Hausdorff) consistent estimator of the set of solutions to the corresponding population GAN problem. We then devise a computationally intensive procedure to form confidence sets and show that these sets contain the population GAN solutions with the desired coverage probability. Small numerical experiments and a Monte Carlo study illustrate our results and verify our theoretical findings. We also show that our results apply to general minimax problems that may be non-convex, non-concave, and have multiple solutions.


Introduction
A generative adversarial network (GAN) is a machine learning method introduced in Goodfellow et al. (2014). The basic purpose of a GAN is to learn how to generate synthetic data based on a training set of real-world examples. While a traditional use of GANs has been to generate authentic-looking photographs based on real example images (for a recent example, see Karras et al., 2021), they are now in use across various scientific disciplines. To give brief examples, variants of GANs have been used in biology to generate synthetic protein-coding DNA sequences (Gupta and Zou, 2019), in physics to simulate subatomic particle collisions at the Large Hadron Collider (Paganini et al., 2018), in astronomy to de-noise images of galaxies (Schawinski et al., 2017), and in medicine to improve near-infrared fluorescence imaging (Ma et al., 2021). GANs have also been used in popular culture and the arts to re-create video games (Kim et al., 2020), to make computer-generated art (Miller, 2019), and to compose music (Briot et al., 2020). To put the recent popularity of GANs into perspective, the original Goodfellow et al. (2014) article received over 10,000 Google Scholar citations in the year 2020 alone. For recent surveys of GANs and for further references, see Creswell et al. (2018), Pan et al. (2019), or Goodfellow et al. (2020).
A GAN typically comprises two neural networks called a generator and a discriminator. (For background material on neural networks, see the books of Bishop (2006) or Goodfellow et al. (2016).) The generator network produces synthetic data whose distribution aims to mimic that of real data, while the discriminator network evaluates whether the data produced by the generator is real or fake. Suppose the observed vector x mathematically represents an observed image, DNA sequence, or some other object of interest. This x is viewed as a realization from an underlying random vector X whose distribution remains unknown to us. Obtaining a large number of realizations from X may be difficult or costly, and the researcher desires to produce synthetic replicas in an easy manner. These replicas are the output of a generating mechanism G(z, γ) that in practice is a complicated neural network. This generator takes as input noise variables z that are drawn from some underlying random vector Z and depends on parameters γ that are tuned so that the output G(z, γ) is a close replica of x. To assess the quality of the synthetic data produced, a discriminating mechanism D(•, δ) indicates how likely it is that an input, whether original data x or a replica G(z, γ), is real data from the underlying distribution X. In practice, the discriminator D(•, δ) is again a complicated neural network with parameters δ to be estimated. The 'adversarial principle' of generative adversarial networks works as follows. Given an original x and any synthetic G(z, γ), the discriminator aims to give a high rating D(x, δ) to the real x and a low rating D(G(z, γ), δ) to the artificial G(z, γ) by choosing δ to maximize the objective function for any fixed γ. In contrast, the generator aims to make G(z, γ) as hard to distinguish from x as possible by choosing γ to minimize the discriminator's maximized objective function.
To formalize the above discussion, the GAN problem can be expressed as the minimax problem

(1)  inf_{γ∈Γ} sup_{δ∈∆} f(γ, δ),   f(γ, δ) = E[ln D(X, δ) + ln(1 − D(G(Z, γ), δ))],

where the infimum and supremum are taken over sets Γ and ∆ denoting the ranges of permissible values for γ and δ, respectively, and E denotes expectation with respect to the joint distribution of X and Z. The formalities will be discussed in detail in Section 2. For now it suffices to remark that X and Z are typically of rather large dimension, that the neural networks G(•, •) and D(•, •) are in practice quite complicated and parameterized via very high-dimensional γ and δ, and that in GAN applications f is as a rule non-convex and non-concave (in contrast to the traditional convex-concave minimax setting in which f(•, δ) is convex for all fixed δ ∈ ∆ and f(γ, •) is concave for all fixed γ ∈ Γ). Given solutions, say γ_0 and δ_0, to problem (1), γ_0 in particular is of interest, as G(Z, γ_0) gives the researcher a mechanism to produce the desired synthetic data. This description of GANs corresponds to the original formulation in Goodfellow et al. (2014).
The original GAN and its numerous extensions and variants have in recent years attracted remarkable interest in applications in which having access to large quantities of synthetic data is beneficial.Examples of such applications were mentioned above, and the surveys listed above contain further references as well as details of the many extensions of GANs.
The main object of interest in this paper is the GAN minimax problem (1) and its solutions.On a more general level, recent machine learning literature has experienced a notable surge of interest in general minimax problems.As GANs have arguably been one of the major reasons for this, we focus on the GAN case as a prominent representative example of minimax problems; other minimax problems will be discussed in Section 6.
In practical GAN applications, one would use a large training sample of observations to estimate the GAN. Let the n observations correspond to independent and identically distributed (IID) random vectors X_1, . . ., X_n, all distributed as the variable X, and suppose the n IID random vectors Z_1, . . ., Z_n have the same distribution as the noise variable Z. Then the objective is to solve the sample minimax problem (cf. Goodfellow et al., 2014; Biau et al., 2020)

(2)  inf_{γ∈Γ} sup_{δ∈∆} f̂_n(γ, δ),   f̂_n(γ, δ) = (1/n) Σ_{i=1}^{n} [ln D(X_i, δ) + ln(1 − D(G(Z_i, γ), δ))].

The primary objective in GAN applications in the machine learning literature is to find solutions to this sample problem, or to 'train' the GAN. In statistical terminology, this corresponds to estimating the GAN parameters γ and δ. (Of course, choosing the parametric forms of the neural networks G(•, •) and D(•, •), or specifying the network architecture, is done before this.) Devising algorithms that solve the sample minimax problem (2) is very challenging and a large body of machine learning literature on GANs focuses on this. Discussions of convergence and stability of various GAN training algorithms can be found for instance in Arjovsky and Bottou (2017), Nagarajan and Kolter (2017), and Mescheder et al. (2018); more recent contributions include Diakonikolas et al. (2021), Fiez and Ratliff (2021), and Mangoubi and Vishnoi (2021).
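For concreteness, a minimal sketch of how the sample objective in (2) could be evaluated at a candidate parameter pair; the generator G, discriminator D, and the data arrays are placeholders standing in for whatever networks and training sample one actually uses, so this is an illustration rather than the paper's implementation.

```python
import numpy as np

def sample_gan_objective(X, Z, G, D, gamma, delta):
    """Empirical objective of the sample minimax problem (2):
    the average of ln D(X_i, delta) + ln(1 - D(G(Z_i, gamma), delta))."""
    d_real = D(X, delta)            # discriminator output for real observations, in (0, 1)
    d_fake = D(G(Z, gamma), delta)  # discriminator output for generated observations
    return np.mean(np.log(d_real) + np.log(1.0 - d_fake))
```

Training a GAN then amounts to (approximately) minimizing over γ the maximum of this quantity over δ.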
Another important theoretical aspect of GANs studied in the machine learning literature is how well the distribution of G(Z, γ) can approximate the target distribution of X; contributions to this line of research include Arora et al. (2017), Liu et al. (2017), Singh et al. (2018), Lu and Lu (2020), Biau et al. (2021), Liang (2021), and Schreuder et al. (2021). This paper studies the GAN minimax problem from the perspective of statistical inference. We do not address algorithmic issues and simply assume a method is available for solving the sample GAN problem (2). The focus of this paper is the statistical properties of these solutions, say γ̂_n and δ̂_n, as estimators of the solutions γ_0 and δ_0 to the population problem (1). Such questions of statistical inference are quite orthogonal to much of the machine learning literature on GANs: in most GAN applications the ability to produce synthetic data is the ultimate goal and, as discussed above, theoretical works on GANs often focus on convergence of algorithms or the ability of GANs to mimic the target distribution. Our study can be seen as complementary to these existing works. Although the parameters γ and δ have no particular interpretation in GANs, studying the properties of their estimators nevertheless contributes to a more complete understanding of the statistical properties of GANs. In the broader context of general neural network models, it is also typically the case that the network parameters (weights) are not of major interest. Nevertheless, statistical questions regarding these parameters have been explored in the literature. To give a few examples, research in this direction includes early frequentist works on neural network parameter estimation (White, 1989a; White, 1989b), as well as more recent works on the Bayesian posterior distribution of neural network parameters (Blundell et al., 2015; Izmailov et al., 2021) and tests to assess the statistical significance of neural network variables (Horel and Giesecke, 2020; Fallahgoul et al., 2024). The present paper relates to this earlier literature and views the GAN minimax solutions as objects of statistical interest.
The study of statistical inference for the GAN solutions was initiated in the recent important paper of Biau et al. (2020). These authors were the first to consider the consistency and asymptotic normality of the sample GAN solutions γ̂_n and δ̂_n as estimators of the population GAN solutions γ_0 and δ_0. A key assumption these authors make is that the population GAN problem (1) has a single, unique solution (γ_0, δ_0). However, as Biau et al. (2020, pp. 1560-1561) acknowledge, this assumption is unrealistic in practical GAN applications. On the contrary, as the objective function f(γ, δ) in (1) is parameterized using complicated neural networks G and D, it is essentially guaranteed to have an extremely large number of solutions (for related discussion, see Goodfellow et al., 2016, Sec 8.2). One reason for this prevalence of multiple solutions is the inherent non-identifiability of heavily parameterized neural networks.
In this paper, we consider statistical inference for GANs in the empirically relevant case of multiple solutions. The theoretical framework required for this is rather different from the one employed in Biau et al. (2020). We focus on two key issues, consistent estimation and confidence sets, and the assumptions we make are weak and hold in many real GAN applications. To briefly describe our results, let Θ_0 and Θ̂_n denote the sets of solutions to the population minimax problem (1) and the sample GAN problem (2), respectively. We first consider an appropriately defined notion of consistency for the set-valued estimator Θ̂_n, namely Hausdorff consistency (a precise definition will be given in Section 3). Without any restrictions on the (potentially infinite) number of solutions, we show that Θ̂_n is a Hausdorff consistent estimator of Θ_0. We then consider confidence sets, random sets that contain Θ_0 with a prespecified coverage probability. In the traditional point-identified setting in which Θ_0 is a singleton, confidence sets would often be formed based on the asymptotic distribution of an estimator of the parameters of interest. In the present set-identified case, we follow an alternative approach. We devise a computationally intensive resampling procedure based on appropriate lower contour sets of a particular criterion function to form the confidence sets, and show that these sets contain the population GAN solutions with (at least) the desired coverage probability. This turns out to be technically challenging, and the details will be given in Section 4 and Appendix B.
The theoretical developments in this paper build on two strands of literature. On the one hand, our results for consistent estimation and for constructing confidence sets are based on recent developments in estimation and inference for general set-identified parameters in the partial identification literature in econometrics; our approach relies in particular on the pioneering work of Chernozhukov et al. (2007) and Romano and Shaikh (2010) (for further related references, see the recent survey of this literature in Molinari, 2020). Note that the conventional theory of point-identified extremum estimators (see, e.g., van der Vaart, 1998, Ch 5) is not applicable in our setting. The research in the partial identification literature does not, however, consider minimax problems such as (1) or GANs, and adjusting to the minimax setting requires some work. On the other hand, general minimax problems (but not GANs) have previously been considered in the stochastic programming literature in operations research. Our results are particularly closely related to those in Shapiro (2008) and Shapiro et al. (2009, Sec 5.1.4), who consider consistent estimation (but not confidence sets) in minimax problems, as well as to the foundational results for minimization problems in Shapiro (1991). Further comparison to all these previous works will be given later in the paper.
The present paper contributes to the statistical literature in several ways. Although other theoretical aspects of GANs have been considered previously, the present paper provides the first results for statistical inference in GAN minimax problems in the practically relevant case of multiple solutions. Our results also apply to other general minimax problems, not just to GANs. The confidence sets we provide for the solutions of minimax problems are novel and, in contrast to previous literature, the minimax problems may be non-convex, non-concave, and have multiple solutions (see Section 6 for details). Furthermore, as a technical device in our proofs, we establish new Hadamard directional differentiability results for certain mappings related to GANs and general minimax problems, and these results may be of independent interest (see Lemma 3 in Appendix B).
Finally, it should be noted that the results in this paper are asymptotic, taking the sample size n to infinity while keeping the dimensions of the parameters γ and δ fixed. This traditional asymptotic framework is appropriate in minimax problems with a moderate number of parameters but is somewhat idealized in GAN applications where the number of parameters may be extremely large or may even substantially exceed n. Our asymptotic results aim to contribute towards a better understanding of statistical inference for set-identified parameters in GANs and other minimax problems; exploring non-asymptotic results would also be very interesting but would require a different mathematical framework and is beyond the scope of this paper (we note that, to the best of our knowledge, a non-asymptotic theory of estimation for set-identified parameters has not yet been developed in the literature).
The rest of the paper is organized as follows. The next section sets the stage by considering the GAN minimax problem more formally. Consistent estimation of Θ_0 is the topic of Section 3, while Section 4 discusses confidence sets for Θ_0. These results are illustrated in small numerical experiments in Section 5. Other general minimax problems are discussed in Section 6, and Section 7 concludes. All technical derivations and proofs are relegated to Appendices A-C (with Appendix B containing the most interesting results).

The GAN minimax problem
We now consider the GAN problem outlined in the Introduction more formally. In what follows, all the random quantities are defined on some appropriate underlying probability space, but typically there is no need to emphasize this. The available data corresponds to a random vector X taking values in some Euclidean subset X ⊆ R^{d_X} and whose distribution remains unknown. The noise variables Z with values in Z ⊆ R^{d_Z} come from a distribution chosen by the researcher, often the multivariate uniform or Gaussian distribution. The generator function G(•, •) is a neural network that transforms the noise variables Z into synthetic data G(Z, γ). The discriminator function D(•, •) is another neural network indicating how likely it is that a real observation x or a replica G(z, γ) comes from the distribution of X. The following assumption summarizes these concepts from a technical point of view.

Assumption GAN.
(a) Suppose X_1, . . ., X_n and Z_1, . . ., Z_n are independent and identically distributed random vectors with the same distributions as X and Z and taking values in X ⊆ R^{d_X} and Z ⊆ R^{d_Z}, respectively.
This assumption contains some minimal requirements for the original GAN framework of Goodfellow et al. (2014). The IID assumption in part (a) is standard in the GAN setting, and the same holds for the compactness assumption for the permissible parameter space in part (b). Part (c) is a minimal continuity assumption. Similar assumptions have been used for instance by Biau et al. (2020).
Descriptions of neural network architectures that can be used to specify the generator G and the discriminator D can be found in the book of Goodfellow et al. (2016) and in the many references therein. Often the generator and discriminator networks would satisfy additional differentiability and moment conditions. Although such extra assumptions are not necessarily required for the results that follow, such a 'smooth GAN' setting serves as a convenient example that will be used to illustrate our results. In the following assumption and in what follows, we use either the notation θ or the notation (γ, δ) for the elements of Θ = Γ × ∆; for instance, F(x, z, θ) and F(x, z, γ, δ) are used interchangeably. Also, |•| denotes the Euclidean norm.
These conditions require the GAN problem to be somewhat more well-behaved. Differentiability of the generator and discriminator networks is a common requirement for the training methods employed in GAN applications, such as variants of the gradient descent-ascent algorithm (the set Θ* is introduced to ensure derivatives are well-defined throughout the proofs). The mild integrability condition facilitates checking later high-level assumptions and holds, for example, if the discriminator is bounded away from zero and one and the derivatives of the generator and discriminator are bounded (this can be seen by straightforward differentiation). The last condition rules out degenerate cases; a prime example of such a degenerate case is the discriminator D(x, δ) being constant in x for some δ. Overall, these conditions are quite mild and would hold in many practical GAN applications.
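Purely as an illustration of why such differentiability matters (the paper itself does not study training algorithms), one simultaneous gradient descent-ascent update on the sample objective might be sketched as follows; grad_gamma and grad_delta are hypothetical routines returning the gradients of f̂_n with respect to γ and δ.

```python
def gda_step(gamma, delta, grad_gamma, grad_delta, lr_g=1e-3, lr_d=1e-3):
    """One simultaneous gradient descent-ascent step: the generator parameters
    descend and the discriminator parameters ascend the sample objective."""
    gamma_new = gamma - lr_g * grad_gamma(gamma, delta)  # generator: minimize over gamma
    delta_new = delta + lr_d * grad_delta(gamma, delta)  # discriminator: maximize over delta
    return gamma_new, delta_new
```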
Using the notation in Assumption GAN, we can now restate the GAN minimax problem (1) from the Introduction as

(3)  inf_{γ∈Γ} sup_{δ∈∆} f(γ, δ),   f(γ, δ) = E[F(X, Z, γ, δ)] = E[ln D(X, δ) + ln(1 − D(G(Z, γ), δ))].

In the traditional convex-concave setting (f(•, δ) convex for all fixed δ ∈ ∆ and f(γ, •) concave for all fixed γ ∈ Γ), the classical von Neumann minimax theorem implies that inf_{γ∈Γ} sup_{δ∈∆} f(γ, δ) = sup_{δ∈∆} inf_{γ∈Γ} f(γ, δ) under mild conditions. In contrast to this, Jin et al. (2020) among others have emphasized that in GAN applications f is as a rule non-convex and non-concave and the order in which minimization and maximization are performed matters. Another point to note is that in GAN applications the object of interest is not the optimal value of problem (3), that is,

(4)  V_0 = inf_{γ∈Γ} sup_{δ∈∆} f(γ, δ),

but rather the optimal solutions, the parameter values (γ_0, δ_0) that solve problem (3). Of main interest are the γ-parameters appearing in the generator network, as these facilitate producing synthetic data according to G(Z, γ_0); for completeness we consider also the δ-parameters in the discriminator network. Let Θ_0 ⊆ Θ denote the set of optimal solutions to (3). To describe Θ_0, we introduce the max-function

(5)  φ(γ) = sup_{δ∈∆} f(γ, δ).

A point (γ_0, δ_0) ∈ Θ solves the GAN problem (3) when it is a solution both to the inner maximization problem (hence satisfying f(γ_0, δ_0) = sup_{δ∈∆} f(γ_0, δ) = φ(γ_0)) and to the outer minimization problem (hence satisfying φ(γ_0) = inf_{γ∈Γ} φ(γ) = V_0). Therefore the set of solutions Θ_0 can be expressed as

(6)  Θ_0 = {(γ_0, δ_0) ∈ Θ : f(γ_0, δ_0) = φ(γ_0) and φ(γ_0) = V_0}.

This motivates us to define the (population) criterion function

(7)  Q(θ) = Q(γ, δ) = max{φ(γ) − f(γ, δ), φ(γ) − V_0} ≥ 0.

Therefore the set of solutions (6) can alternatively and concisely be characterized as

(8)  Θ_0 = {θ ∈ Θ : Q(θ) = 0}.

Now consider the corresponding sample GAN minimax problem (2), which can be written as

(9)  inf_{γ∈Γ} sup_{δ∈∆} f̂_n(γ, δ),   f̂_n(γ, δ) = (1/n) Σ_{i=1}^{n} F(X_i, Z_i, γ, δ),

and define the sample analogues of the quantities in (4)-(7) as

(10)  V̂_n = inf_{γ∈Γ} sup_{δ∈∆} f̂_n(γ, δ),
(11)  φ̂_n(γ) = sup_{δ∈∆} f̂_n(γ, δ),
(12)  Θ̂_n = {(γ, δ) ∈ Θ : f̂_n(γ, δ) = φ̂_n(γ) and φ̂_n(γ) = V̂_n},
(13)  Q̂_n(θ) = Q̂_n(γ, δ) = max{φ̂_n(γ) − f̂_n(γ, δ), φ̂_n(γ) − V̂_n}.

These quantities have interpretations similar to their population counterparts: V̂_n is the optimal value of the sample GAN problem (9), φ̂_n(γ) is a sample max-function, Θ̂_n denotes the set of (exact) solutions to the sample GAN problem (9), and Q̂_n(θ) is a (non-negative) sample criterion function that allows us to express the set Θ̂_n as

(14)  Θ̂_n = {θ ∈ Θ : Q̂_n(θ) = 0}.

Finding a solution to the sample GAN problem (9), let alone the entire set of optimal solutions Θ̂_n, is a challenging task. From a practical machine learning perspective, this 'training of the GAN' is of principal interest and extensive research efforts have been made to devise algorithms to do this (see the references listed in the Introduction). In this paper, we do not consider these algorithms but rather just assume that the sample GAN problem is solved using some available method. These algorithms typically search for approximate rather than exact solutions: for some small non-negative constant τ, approximate solutions to the sample GAN problem (9) are points (γ̂_n, δ̂_n) satisfying

(15)  f̂_n(γ̂_n, δ̂_n) ≥ φ̂_n(γ̂_n) − τ   and   φ̂_n(γ̂_n) ≤ V̂_n + τ.

Such points (γ̂_n, δ̂_n) approximately solve both the inner maximization problem and the outer minimization problem, with τ determining the slackness allowed in these maximization and minimization problems. Somewhat more generally, let τ_n be a sequence of non-negative random variables such that τ_n →_p 0 (where →_p denotes convergence in probability). Define, in analogy with the above,

Θ̂_n(τ_n) = {(γ, δ) ∈ Θ : f̂_n(γ, δ) ≥ φ̂_n(γ) − τ_n and φ̂_n(γ) ≤ V̂_n + τ_n}

as the set of approximate solutions to the sample GAN problem (9). Noting that the two inequalities in (15) can be expressed as φ̂_n(γ̂_n) − f̂_n(γ̂_n, δ̂_n) ≤ τ_n and φ̂_n(γ̂_n) − V̂_n ≤ τ_n, respectively, and recalling the definition of the sample criterion function Q̂_n(θ) in (13) allows us to characterize the set Θ̂_n(τ_n) as

(16)  Θ̂_n(τ_n) = {θ ∈ Θ : Q̂_n(θ) ≤ τ_n}.

Setting τ_n = 0 one obtains as a special case the set of exact solutions, Θ̂_n = Θ̂_n(0).
In the next section we consider the properties of Θ̂_n(τ_n) as an estimator of Θ_0. Before proceeding, a remark about the interpretation of Θ_0 is in order. In GAN applications, it is typically not realistic to assume that the generated synthetic data would perfectly mimic the real observations. In line with this, the GAN formulation allows for misspecification: it is not assumed that G(Z, γ) would for some γ have the same distribution as X does. In this sense, the elements of Θ_0 do not correspond to any 'true' parameter values. Nevertheless, an interpretation can be given. To do so, momentarily consider a somewhat idealized version of the GAN problem where in (1) the supremum over δ and the parametric class of discriminator functions D(•, δ) (from X to (0, 1)) is replaced with the supremum over all measurable functions D(•) from X to (0, 1). It turns out that in this case the GAN problem in (1) reduces to the minimization problem inf_{γ∈Γ} 2[JSD(P_X, P_γ) − ln 2], where P_X denotes the probability distribution of X, P_γ the probability distribution of G(Z, γ), and JSD(P_X, P_γ) the so-called Jensen-Shannon divergence between these two distributions (see Biau et al., 2020, Sec 2, and Belomestny et al., 2021, Secs 1-2, for the technical details and the definition of the Jensen-Shannon divergence). This yields an idealized interpretation of the GAN problem: heuristically, one can interpret the elements of Θ_0 as corresponding to γ_0 that minimize the Jensen-Shannon divergence between the true target distribution P_X and the generated distribution of G(Z, γ_0) (of course, this interpretation is not entirely accurate as in practice the discriminators employed are parametric neural networks). Note also that this interpretation is akin to conventional maximum likelihood (ML) estimation in misspecified models, where (under appropriate assumptions) the ML estimator converges to a parameter, say again θ_0, that minimizes the Kullback-Leibler divergence between the true distribution and the distribution corresponding to the parameter θ_0 (see, e.g., van der Vaart, 1998, Ex 5.25).
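Before moving on, a small sketch of how the sample quantities in (10)-(16) can be computed in a low-dimensional illustration (such as the toy example of Section 5); f_hat is assumed to be a precomputed matrix of sample objective values on a finite grid of (γ, δ) values, so this is only feasible for illustrative, low-dimensional problems.

```python
import numpy as np

def approximate_solution_set(f_hat, gammas, deltas, tau_n):
    """Grid versions of the sample quantities in (10)-(16).
    f_hat[i, j] holds the sample objective evaluated at (gammas[i], deltas[j])."""
    phi_hat = f_hat.max(axis=1)                 # sample max-function, (11)
    V_hat = phi_hat.min()                       # sample optimal value, (10)
    # sample criterion function (13):
    # Q_hat(gamma, delta) = max{phi_hat(gamma) - f_hat(gamma, delta), phi_hat(gamma) - V_hat}
    Q_hat = np.maximum(phi_hat[:, None] - f_hat, (phi_hat - V_hat)[:, None])
    # approximate solution set (16): points whose criterion value is at most tau_n
    members = [(gammas[i], deltas[j]) for i, j in np.argwhere(Q_hat <= tau_n)]
    return members, Q_hat
```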

Consistent estimation
In GAN minimax problems both Θ̂_n(τ_n) and Θ_0 are typically set-valued and not singletons, and an appropriate notion of distance between sets is required. One commonly used generalization of the Euclidean distance |•| is the Hausdorff distance (see, e.g., Rockafellar and Wets, 2009, Sec 4.C, or Molchanov, 2017, Appendix D). For any two non-empty bounded subsets A and B of some Euclidean space, the Hausdorff distance between A and B is defined as

d_H(A, B) = max{ sup_{a∈A} d(a, B), sup_{b∈B} d(b, A) },

where d(a, B) = inf_{b∈B} |a − b| is the shortest distance from the point a to the set B. That is, the Hausdorff distance is the greatest distance from an arbitrary point in one of the sets to the closest neighboring point in the other set. The Hausdorff distance d_H is a metric for the family of non-empty compact sets, and for such sets d_H(A, B) = 0 if and only if A = B. The consistency result we aim to prove is Hausdorff consistency in the sense that d_H(Θ̂_n(τ_n), Θ_0) →_p 0. Establishing this requires us to show that both of the 'one-sided Hausdorff consistency' conditions

(17a)  sup_{θ∈Θ̂_n(τ_n)} d(θ, Θ_0) →_p 0   and   (17b)  sup_{θ∈Θ_0} d(θ, Θ̂_n(τ_n)) →_p 0

hold. Heuristically, the former condition in (17) guarantees that Θ̂_n(τ_n) is not too large compared to Θ_0, whereas the latter condition ensures Θ̂_n(τ_n) is large enough to cover all of Θ_0. Establishing (17a) follows the pattern of a standard consistency proof and relies on a suitable uniform law of large numbers combined with an appropriate set-identification condition for Θ_0. Proving (17b) relies on also knowing the rate of this uniform convergence. The following assumption formally states the required high-level conditions; here O_p(1) stands for a sequence of random variables that is bounded in probability.
The high-level conditions in this assumption can be verified using various sets of sufficient conditions. For instance, Assumption 1(a) holds when Assumption GAN is combined with the mild moment condition E[sup_{θ∈Θ} |F(X, Z, θ)|] < ∞, and 1(b) holds under Assumption Smooth GAN (justifications for these statements are given in Appendix C). Assumption 1 thus holds in most practical GAN applications.
Our consistency results also require certain conditions for the slackness sequence τ n .
Assumption 2. (a) τ_n is a sequence of non-negative random variables such that τ_n →_p 0. (b) τ_n is a sequence of positive random variables such that τ_n →_p 0 and n^{-1/2}/τ_n →_p 0.

Part (a) of Assumption 2 allows for the possibility that τ_n is identically zero, while this is ruled out in part (b). In (b) it is additionally assumed that the convergence of τ_n to zero is slower than that of n^{-1/2} in the sense that n^{-1/2}/τ_n →_p 0; for instance, τ_n = n^{-0.49} is a possibility. Of course, (b) implies (a).
We can now state our consistency theorem for the solutions of the GAN minimax problem (the proof is given in Appendix A).
Theorem 1. (a) Assumptions 1(a) and 2(a) imply that (17a) holds (and also that V̂_n →_p V_0). (b) Assumptions 1 and 2(b) imply that (17a) and (17b) hold, that is, d_H(Θ̂_n(τ_n), Θ_0) →_p 0.

Part (a) of Theorem 1 establishes the former one-sided Hausdorff consistency condition in (17); for completeness, consistency of the optimal value V̂_n is also given. Under the stronger conditions in part (b), the latter condition in (17) is also obtained, thus establishing the desired Hausdorff consistency result.
Theorem 1 is based on the Hausdorff consistency results of Chernozhukov et al. (2007, Thm 3.1) for general set-identified situations. Shapiro et al. (2009, Thm 5.9) give an almost sure version of part (a) of Theorem 1 in a general minimax setting. In the GAN setting, a consistency result in the case of a unique solution is given in Biau et al. (2020, Thm 4.2). (Note that when Θ_0 is a singleton set consisting of one point θ_0, Theorem 1(a) shows that any element of Θ̂_n(τ_n) is consistent for θ_0.) We are not aware of consistency results in the GAN setting in the case of multiple solutions, and the result of Theorem 1 is reassuring in that estimation in the GAN setting will be consistent regardless of the (potentially infinite) number of solutions. The one-sided consistency result of Theorem 1(a) covers the case of exact solutions (τ_n = 0), whereas the two-sided case in part (b) requires us to (somewhat arbitrarily) choose a strictly positive slackness sequence τ_n. Such a choice is not needed in the procedure for forming confidence sets for Θ_0 that we consider next.
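In practice, when Θ̂_n(τ_n) and Θ_0 are approximated by finite point sets (for instance on a grid, as in the illustration of Section 5), the Hausdorff distance between them can be computed directly; a small sketch:

```python
import numpy as np

def hausdorff_distance(A, B):
    """Hausdorff distance between two non-empty finite point sets,
    given as arrays A (m x d) and B (k x d)."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # |a_i - b_j| for all pairs
    sup_A = dists.min(axis=1).max()   # sup_{a in A} d(a, B)
    sup_B = dists.min(axis=0).max()   # sup_{b in B} d(b, A)
    return max(sup_A, sup_B)
```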

Confidence sets
A confidence set for the GAN minimax solutions Θ_0 is a random set that covers the entire Θ_0 with a prespecified probability. (As in GANs the individual elements θ_0 of Θ_0 do not have any special interpretation attached to them, a confidence set covering all of Θ_0, and not just some particular element of it, is appropriate.) Let 1 − α denote the desired coverage probability (such as 95%), where α ∈ (0, 1). We aim to construct confidence sets ĈS_{n,1−α} that contain Θ_0 with at least probability 1 − α, that is, sets satisfying

(18)  lim inf_{n→∞} P(Θ_0 ⊆ ĈS_{n,1−α}) ≥ 1 − α;

such sets are called conservatively asymptotically consistent at level 1 − α. (Here P refers to the probability measure of (X, Z), which we consider fixed.) In the traditional point-identified setting in which Θ_0 consists of a single point θ_0, a conventional route to forming confidence sets is to consider a Taylor approximation of a certain function around θ_0; this leads to the (often Gaussian) distribution of an appropriately rescaled estimator from which confidence sets for θ_0 are then obtained in a straightforward manner. Biau et al. (2020) consider this point-identified case and, focusing on the generator network γ-parameters, in their Theorem 4.3 state that n^{1/2}(γ̂_n − γ_0) converges in distribution to a Gaussian random variable. When Θ_0 is not a singleton set, alternative approaches are called for. Given the results of Biau et al. (2020), we focus only on the case of multiple (more than one) solutions. This is the typical case in GAN applications. Our approach follows the recent developments in the partial identification literature in econometrics; we rely in particular on the results in Chernozhukov et al. (2007) and Romano and Shaikh (2010) (for further related references, see the survey of this literature in Molinari, 2020).
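For comparison, in the point-identified case just described, the textbook Wald-type construction (a standard form, not a result of this paper) would use the limiting normal distribution directly: if n^{1/2}(γ̂_n − γ_0) ⇝ N(0, Σ) with Σ positive definite and Σ̂ is a consistent estimator of Σ, a 1 − α confidence set for γ_0 is

```latex
\widehat{CS}^{\mathrm{Wald}}_{n,1-\alpha}
  = \bigl\{ \gamma :
      n\,(\hat{\gamma}_n - \gamma)^{\top} \hat{\Sigma}^{-1} (\hat{\gamma}_n - \gamma)
      \le \chi^2_{d_{\gamma},\,1-\alpha} \bigr\},
```

where χ²_{d_γ,1−α} is the 1 − α quantile of the chi-squared distribution with d_γ = dim(γ) degrees of freedom. Such a construction is unavailable when Θ_0 contains multiple points, which is what motivates the contour-set approach below.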
The confidence sets we consider are based on appropriate lower contour sets of the rescaled criterion function n^{1/2} Q̂_n(θ). To motivate the subsequent technical developments, we begin with some informal remarks. First, if the 1 − α quantile c_{n,1−α} of the distribution of sup_{θ∈Θ_0} n^{1/2} Q̂_n(θ) were known, one could form the (infeasible) confidence set {θ ∈ Θ : n^{1/2} Q̂_n(θ) ≤ c_{n,1−α}}, which would contain Θ_0 with probability at least 1 − α. Second, consider the statistic sup_{θ∈S} n^{1/2} Q̂_n(θ), where S is some non-empty subset of Θ; under appropriate conditions, it can be shown that as n → ∞, sup_{θ∈S} n^{1/2} Q̂_n(θ) remains stochastically bounded for S ⊆ Θ_0 and diverges to infinity for S ⊈ Θ_0. Now, to form our confidence sets we approximate the 1 − α quantile of sup_{θ∈S} n^{1/2} Q̂_n(θ) for various sets S using a suitable resampling method, and then locate a confidence set for Θ_0 based on the different behavior of sup_{θ∈S} n^{1/2} Q̂_n(θ) in the two cases S ⊆ Θ_0 and S ⊈ Θ_0. As our situation involves non-standard features, resampling based on the standard bootstrap is not appropriate and instead we use a procedure based on subsampling (see Politis et al., 1999).
To formalize this discussion, we next introduce some notation. For any non-empty subset S of Θ, let L_n(S, x) denote the cumulative distribution function (cdf) of the statistic sup_{θ∈S} n^{1/2} Q̂_n(θ) and c_{n,1−α}(S) the corresponding (smallest) 1 − α quantile, that is,

(19)  c_{n,1−α}(S) = inf{x ∈ R : L_n(S, x) ≥ 1 − α}.

We can now state the iterative procedure that we use to construct the desired confidence sets. This procedure is akin to step-down procedures used in multiple hypothesis testing problems and follows Lehmann and Romano (2005, Sec 9.1) and Romano and Shaikh (2010).
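In code form, the step-down logic can be sketched as follows; this is an illustration only, with theta_grid a finite grid approximating Θ, sup_stat(S, d) a user-supplied routine returning sup_{θ∈S} m^{1/2} Q̂_m(θ) computed from a data set d of size m, and subsamples a collection of size-b subsamples of the observed data.

```python
import numpy as np

def procedure_1(theta_grid, sup_stat, data, subsamples, alpha=0.05):
    """Step-down subsampling construction of a confidence set for Theta_0."""
    S = list(theta_grid)                                # S_1 = Theta (grid approximation)
    while True:
        # subsampling estimate of the 1 - alpha quantile of sup_{theta in S} n^{1/2} Q_hat_n(theta)
        stats = np.sort([sup_stat(S, d) for d in subsamples])
        c_hat = stats[int(np.ceil((1 - alpha) * len(stats))) - 1]
        if sup_stat(S, data) <= c_hat:                  # stop: S is the confidence set
            return S
        S_new = [theta for theta in S if sup_stat([theta], data) <= c_hat]
        if not S_new or len(S_new) == len(S):           # guard against degenerate cases in this sketch
            return S
        S = S_new                                       # S_{j+1}: discard theta's and iterate
```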
This procedure starts from the full parameter space S_1 = Θ and iteratively discards θ's from the S_j-sets until a suitable confidence set is formed. Overall, the procedure is certainly computer-intensive yet feasible (to reduce computational costs, one can use just a subset of the N_n subsamples without affecting the validity of our results; cf. Politis et al., 1999, Cor 2.4.1). Implementation of the procedure is illustrated in a small numerical example in the next section. Theorem 2 below shows that, under appropriate assumptions, the confidence sets ĈS_{n,1−α} formed using Procedure 1 are valid confidence sets in the sense that (18) holds. The key requirement for this subsampling-based procedure to work is the following high-level assumption.
The convergence requirement in Assumption 3 is a standard high-level condition needed for the validity of subsampling procedures (see Politis et al., 1999, Sec 2.2). Ensuring that Assumption 3 holds in GAN minimax problems turns out to be particularly challenging and we resort to empirical process theory for this. (A thorough account of the needed empirical process theory can be found in the monograph of van der Vaart and Wellner (1996).) With the set Θ compact as before, let l^∞(Θ) denote the space of all bounded real-valued functions on Θ equipped with the supremum norm, and let C(Θ) stand for the subspace of those functions that are continuous. Random functions such as f̂_n(•) are viewed as maps from appropriate underlying probability spaces to l^∞(Θ), and ⇝ denotes weak convergence as defined in van der Vaart and Wellner (1996, Sec 1.3).
A convenient and broadly applicable high-level assumption that we make is that a suitable functional central limit theorem (Donsker property) holds.
and G a tight mean-zero Gaussian process taking values in C(Θ) with probability one and such that inf

Assumption 4 is very general and rather technical. Importantly, we note that Assumption Smooth GAN implies the validity of Assumption 4 (justification of this is given in Appendix C). Alternatively, Assumption 4 also holds under much weaker conditions and can be verified using a variety of different methods in empirical process theory (for details, see van der Vaart and Wellner, 1996).
Showing that Assumption 3 follows from Assumption 4 (and other additional conditions) requires rather long and technical details. In order to not distract from the main issue, we outline the key points of the argument here and relegate the technical details to Appendix B. First, it is shown that for a suitably defined map ϕ : l^∞(Θ) → R, the statistic sup_{θ∈Θ_0} n^{1/2} Q̂_n(θ) in Assumption 3 can be expressed as sup_{θ∈Θ_0} n^{1/2} Q̂_n(θ) = n^{1/2}(ϕ(f̂_n) − ϕ(f)). Second, the map ϕ is shown to be (directionally) differentiable in an appropriate sense; this step is the technically most delicate one (and Lemma 3 in Appendix B, containing the key novel results, may be of independent interest). Third, the previous facts enable us to apply a particular version of the functional delta method to deduce that n^{1/2}(ϕ(f̂_n) − ϕ(f)) ⇝ ϕ'_f(G), where ϕ'_f denotes a certain (directional) derivative of the mapping ϕ at f (and G is the Gaussian process in Assumption 4). Thus Assumption 3 follows. To prove the differentiability result mentioned, additional assumptions are needed and we focus on the following leading case.

Assumption 5. The set Γ_0 = {γ_0 ∈ Γ : (γ_0, δ_0) ∈ Θ_0 for some δ_0 ∈ ∆} is finite.
Assumption 5 requires that the outer minimization problem in the population minimax problem (3) is solved at a finite number of γ-values only (the inner maximization problem is allowed to have infinitely many solutions). This assumption is quite reasonable in practical GAN applications and is needed to prove the mentioned differentiability result.
All the details of the preceding discussion are available in Appendix B, where we prove the following lemma.
Lemma 1 offers one convenient way to verify that Assumption 3 holds in the GAN setting. Even with the finiteness requirement of Assumption 5, the technical details in Appendix B are rather long and the limiting distribution in (26) quite complicated. Assumption 3 could potentially be verified using weaker conditions, but we do not pursue this.
After these preparations, we are now ready to state our main result regarding the confidence sets formed by Procedure 1. Let L(•) denote the cdf of the limiting random variable L in Assumption 3, c_{1−α} = inf{x ∈ R : L(x) ≥ 1 − α} the corresponding 1 − α quantile, and lim_{c↑c_{1−α}} L(c) the limit from the left of L(•) at c_{1−α}. The following theorem is proved in Appendix B.
Theorem 2. Suppose Assumptions GAN and 3 hold, the population GAN minimax problem (3) has multiple solutions, the confidence sets ĈS_{n,1−α} (n = 1, 2, . . .) are formed using Procedure 1, and the subsampling size b is such that b → ∞ and b/n → 0 as n → ∞. Then the confidence sets ĈS_{n,1−α} satisfy

lim inf_{n→∞} P(Θ_0 ⊆ ĈS_{n,1−α}) ≥ lim_{c↑c_{1−α}} L(c);

in particular, (18) holds whenever L(•) is continuous at c_{1−α}.

Theorem 2 shows that the confidence sets ĈS_{n,1−α} have an asymptotic coverage probability of at least 1 − α when the cdf L(•) is continuous at its 1 − α quantile c_{1−α}. In general the limiting distribution L is rather complicated and verifying the continuity of L(•) requires some more concrete assumptions. In Appendix B we show that this continuity holds for instance when Θ_0 is a finite set of the form Θ_0 = {(γ_01, δ_01), . . ., (γ_0K, δ_0K)} for some finite K > 1 and with distinct γ's and δ's (i.e., γ_0i ≠ γ_0j and δ_0i ≠ δ_0j for all i ≠ j). This case with a multiple but finite number of distinct solutions covers many practical GAN applications; in particular, in this case Assumption Smooth GAN suffices for the validity of Theorem 2.
In previous work, confidence regions for general set-identified parameters were considered by Chernozhukov et al. (2007) and Romano and Shaikh (2010). Our Theorem 2 is based on their results, especially on Theorem 2.2 of Romano and Shaikh (2010). However, these previous works do not consider minimax problems. Shapiro (2008, Thm 3.1) and Shapiro et al. (2009, Thm 5.10) study the asymptotic distribution of the optimal value in convex-concave minimax problems but do not consider inference for the solutions of minimax problems. The confidence sets we provide are novel in the context of general minimax problems. Biau et al. (2020, Thm 4.3) consider the GAN problem assuming a single unique solution exists and give a result on the asymptotic normality of the generator network parameters; however, they remark (pp. 1560-1561) that the assumed uniqueness is "hardly satisfied in the high-dimensional context of (deep) neural networks" and call for generalizations. Theorem 2 builds on these previous results and provides the first inference procedure for GAN parameters in the presence of multiple solutions. Our construction of the confidence sets is specifically tailored for the case of multiple solutions; in the case of a single solution, confidence sets can be formed based on the results of Biau et al. (2020) (for details, see Appendix B).
The confidence sets of this section facilitate statistical inference for the solutions of GAN minimax problems. Regarding the potential practical use of these confidence sets in GAN applications, recall from the Introduction that when (γ_0, δ_0) ∈ Θ_0, the generator G(Z, γ_0) gives the researcher a mechanism to produce synthetic data. In recent work, Karras et al. (2021, Sec 3.2, Fig 4) have considered the effect of small variations in the input noise on the synthetic images produced. Similarly, one could consider how the produced images (or, more generally, the distribution of G(Z, γ_0)) change when the generator parameters vary, say, within a certain confidence set. Related to this, the effect of varying the generator parameters (through parameter interpolation) has recently been considered in Wang et al. (2019) and Pan et al. (2022, Secs 4.2 and 5). Exploring the effect of such variations on the synthetic data produced in other GAN applications (such as DNA sequences, art, or music) would also be interesting. We leave these issues for future research.

Numerical illustration
We next illustrate our results in small numerical experiments. We consider a toy example that is as simple as possible, thus allowing us both to analytically solve the population GAN problem and to graphically illustrate consistent estimation and confidence sets for Θ_0 in two dimensions. It should be emphasized that real GAN applications used in practice are of course remarkably more complicated than our toy illustration.
In our example, we use a GAN to mimic data from a more complicated univariate distribution by simple Gaussian noise. The set-up is quite similar to one of the examples Biau et al. (2020, p. 1552) use to illustrate their results. The real data X is assumed to follow the so-called claw distribution (Marron and Wand, 1992): letting N(x; µ, σ²) denote the probability density function (pdf) of a normal random variable with mean µ and variance σ², the claw density is defined as

p_claw(x) = (1/2) N(x; 0, 1) + Σ_{k=0}^{4} (1/10) N(x; k/2 − 1, 0.01);

this pdf together with the standard normal one is illustrated in the top left graph of Figure 1. The random noise Z is assumed to be standard normally distributed, and the generator function is a simple shift formulated as

G(z, γ) = z + cos(γπ).

With this very simple formulation, the distribution of G(Z, γ) never matches that of X but is close to it when γ = 0.5 or γ = 1.5 (corresponding to no shift at all). The cosine function is used to incorporate multiple solutions in a transparent manner. The discriminator we employ is a simple one-parameter function (motivated by Goodfellow et al., 2014, eqn (2), and Biau et al., 2020, eqn (2.1)) given by

D(x, δ) = p_claw(x) / [p_claw(x) + N(x; cos(δπ), 1)].

Intuitively, the two densities p_claw(x) and N(x; cos(δπ), 1) are close when δ = 0.5 or δ = 1.5. More formally, in this toy example the population GAN problem (3) can be easily solved (see Biau et al., 2020, Sec 2, and Belomestny et al., 2021, Secs 1-2, for further helpful details): for any fixed γ, the inner maximization problem in (3) is solved for those δ that satisfy cos(δπ) = cos(γπ), as in these cases the discriminator D(x, δ) coincides with the so-called optimal discriminator; therefore problem (3) reduces to minimizing the Jensen-Shannon divergence between the probability distributions of X and G(Z, γ) (cf. the last paragraph of Section 2), which happens for γ = 0.5 or γ = 1.5. Therefore the population GAN problem has four solutions and Θ_0 = {(0.5, 0.5), (0.5, 1.5), (1.5, 0.5), (1.5, 1.5)}. These are the four points that solve inf_{γ∈Γ} sup_{δ∈∆} f(γ, δ) in (3) or, alternatively, that solve Q(γ, δ) = 0 (see (7) and (8)). The second and third plots in the first row of Figure 1 illustrate the functions f(γ, δ) (second plot) and Q(γ, δ) (third plot) as well as the four solutions in Θ_0 (the four red dots in these plots). Note that already in this toy example, the landscapes of f and Q are somewhat non-trivial. Now consider the sample GAN problem (9) and consistency of the estimator Θ̂_n(τ_n). Taking n IID draws X_1, . . ., X_n and Z_1, . . ., Z_n from X and Z as above, the second row of Figure 1 illustrates Θ̂_n(τ_n) for four different choices of n and τ_n: n = 100 with τ_n = 0.1n^{-0.25} and τ_n = 0.1n^{-0.49} (first two plots) and n = 10,000 with τ_n = 0.1n^{-0.49} and τ_n = 0 (last two plots).
Comparing the first two estimators demonstrates the effect of a slower vs. faster convergence of τ_n to zero, while the effect of an increasing sample size is seen by comparing the second estimator with the third. The fourth plot illustrates that the exact estimator Θ̂_n(0) is not necessarily Hausdorff consistent, as it may not contain all four elements of Θ_0. We next illustrate forming confidence sets using Procedure 1: one starts from the full parameter space S_1 = Θ and iteratively discards θ's from the sets S_j based on the appropriate quantile of suitable subsample statistics. The third and fourth rows of Figure 1 illustrate the working of Procedure 1 in one simulated data set, where we used sample size n = 1000, subsample size b = 501, α = 0.20, and 200 randomly chosen subsamples (as mentioned after Procedure 1, one can use only a subset of the possible subsamples and not all of them; note that the total number of possible subsamples in our exercise is astronomical, as (1000 choose 501) ≈ 10^299). The third row of Figure 1 plots the sets S_2, S_3, S_4, and the final confidence set ĈS_{n,1−α} = S_6. Below these are shown (for j = 2, 3, 4, 6) the empirical distribution function L̂_{n,b}(S_j, x) together with the 1 − α sample quantile ĉ_{n,b,1−α}(S_j) (red dotted line) and the quantity sup_{θ∈S_j} n^{1/2} Q̂_n(θ) (blue dotted line). In this particular data set, on round 6 of Procedure 1, sup_{θ∈S_6} n^{1/2} Q̂_n(θ) < ĉ_{n,b,1−α}(S_6) and thus the procedure stops and the confidence set is formed as ĈS_{n,1−α} = S_6. We note that all the quantities required in Procedure 1 can be calculated based on the expressions in (9)-(13) and (19).
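All the ingredients of this illustration can be coded up directly; a sketch of the basic components (claw sampler, generator, likelihood-ratio discriminator as given above, and the sample objective), with illustrative choices for the seed and sample size:

```python
import numpy as np
from scipy.stats import norm

def sample_claw(n, rng):
    """Draw n observations from the claw density of Marron and Wand (1992)."""
    comp = rng.integers(0, 10, size=n)        # components 0-4: N(0,1); 5-9: the five 'claws'
    x = rng.normal(0.0, 1.0, size=n)
    claw = comp >= 5
    x[claw] = rng.normal((comp[claw] - 5) / 2.0 - 1.0, 0.1)
    return x

def claw_pdf(x):
    return 0.5 * norm.pdf(x, 0.0, 1.0) + sum(0.1 * norm.pdf(x, k / 2.0 - 1.0, 0.1) for k in range(5))

def discriminator(x, delta):
    """One-parameter likelihood-ratio discriminator D(x, delta)."""
    return claw_pdf(x) / (claw_pdf(x) + norm.pdf(x, np.cos(delta * np.pi), 1.0))

def f_hat(X, Z, gamma, delta):
    """Sample objective for the toy generator G(z, gamma) = z + cos(gamma * pi)."""
    G = Z + np.cos(gamma * np.pi)
    return np.mean(np.log(discriminator(X, delta)) + np.log(1.0 - discriminator(G, delta)))

rng = np.random.default_rng(0)
X, Z = sample_claw(1000, rng), rng.normal(size=1000)
print(f_hat(X, Z, 0.5, 0.5))   # sample objective evaluated near a population solution
```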
We also check whether Assumption 3 (required in Theorem 2) seems reasonable, that is, whether the statistic sup_{θ∈Θ_0} n^{1/2} Q̂_n(θ) converges in distribution to some random variable. The top-right graph of Figure 1 plots the empirical cdf of sup_{θ∈Θ_0} n^{1/2} Q̂_n(θ) based on 200 separate IID samples of X_1, . . ., X_n and Z_1, . . ., Z_n of size n, for three different sample sizes (n = 100 in blue, n = 1000 in red, n = 10,000 in black), and for the set Θ_0 solved above. These empirical distributions appear, even for the largest sample size considered, non-degenerate (and even continuous), suggesting Assumption 3 holds. Finally, to empirically verify the result of Theorem 2 in a small Monte Carlo exercise, we inspect whether the confidence sets ĈS_{n,1−α} have correct coverage. The set-up of our Monte Carlo exercise is as follows. We consider three different nominal coverage probabilities 1 − α (80%, 90%, 95%) and three sample sizes n (100, 500, 1000). For each sample size n, we consider three subsample sizes b: 25, 40, and 63 for n = 100; 77, 144, and 269 for n = 500; and 126, 251, and 501 for n = 1000 (these correspond to b ≈ n^0.7, b ≈ n^0.8, and b ≈ n^0.9, which conform with the requirements b → ∞ and b/n → 0 as n → ∞ in Theorem 2). As for the number of subsamples, we always use 200 randomly chosen subsamples. On each simulation round we form the confidence set ĈS_{n,1−α} according to Procedure 1 and check whether or not it covers the entire set Θ_0. This exercise is repeated 1000 times and Table 1 presents the resulting empirical coverage rates (in %). Inspection of the results indicates that the choice of subsample size b greatly affects the results; this is commonly the case with subsampling, see Politis et al. (1999, Ch 9). It can also be seen that both the sample size n and the subsample size b need to be large enough for the coverage rates to be close to the desired levels. Nevertheless, for the largest n and b considered, the empirical coverage rates 83.4%, 92.1%, and 95.4% are reasonably close to the desired 80%, 90%, and 95% nominal levels, suggesting that the confidence sets formed using Procedure 1 have the appropriate coverage property stated in Theorem 2.

Figure 1. First row: (i) the claw density together with the standard normal density; (ii)-(iii) the functions f(γ, δ) (second plot) and Q(γ, δ) (third plot), with the four solutions in Θ_0 = {(0.5, 0.5), (0.5, 1.5), (1.5, 0.5), (1.5, 1.5)} marked with red dots; (iv) the empirical cdf of sup_{θ∈Θ_0} n^{1/2} Q̂_n(θ) for n = 100 (blue), n = 1000 (red), and n = 10,000 (black). Second row: the estimator Θ̂_n(τ_n) for four different choices of n and τ_n: n = 100 with τ_n = 0.1n^{-0.25} and τ_n = 0.1n^{-0.49} (first two plots) and n = 10,000 with τ_n = 0.1n^{-0.49} and τ_n = 0 (third and fourth plots). Third and fourth rows: illustration of Procedure 1, with the sets S_2, S_3, S_4, and the final confidence set S_6 = ĈS_{n,1−α} on row 3, and the corresponding empirical distribution functions L̂_{n,b}(S_j, x) for j = 2, 3, 4, 6 (together with the 1 − α sample quantile ĉ_{n,b,1−α}(S_j), red dotted line, and the quantity sup_{θ∈S_j} n^{1/2} Q̂_n(θ), blue dotted line) on row 4.

Other minimax problems
This paper has so far focused on the GAN minimax problem (1), as GANs have arguably been one of the key reasons for the recent surge of interest in minimax problems in the machine learning literature. In this section we widen our scope and briefly discuss the application of our results to other minimax problems. We note that the existing literature on general minimax problems is vast, originating nearly a century ago with the seminal contribution of von Neumann (1928), and it is beyond the scope of this article to review this literature properly; one could, for instance, see von Neumann and Morgenstern (2004) for minimax problems in game theory, and Bertsekas et al. (2003) for minimax problems in optimization theory.
One standard formulation for general minimax problems appearing in the literature is the one used in Shapiro (2008); this formulation is also used in the textbooks of Shapiro et al. (2009, Sec 5.1.4) and Lan (2020, Sec 4.3). The theory developed in the present paper applies, with minor modifications, also to this general minimax problem. Let us outline the required changes. Instead of the population and sample GAN problems (3) and (9), consider the general population and sample minimax problems

inf_{γ∈Γ} sup_{δ∈∆} f(γ, δ),  f(γ, δ) = E[F(Z, γ, δ)],   and   inf_{γ∈Γ} sup_{δ∈∆} f̂_n(γ, δ),  f̂_n(γ, δ) = (1/n) Σ_{i=1}^{n} F(Z_i, γ, δ),

respectively. No particular interpretation is attached to the random vector Z or to the parameters γ and δ. Furthermore, replace Assumptions GAN and Smooth GAN with the following two conditions.
Assumption Minimax. (a) Suppose Z_1, . . ., Z_n are IID random vectors with the same distribution as Z and taking values in Z ⊆ R^{d_Z}.

With these modifications, all the results of this paper continue to hold also in this general minimax setting. In particular, it is straightforward to see that Theorems 1 and 2 on consistent estimation and confidence sets remain valid. In previous literature, Shapiro et al. (2009, Thm 5.9) have given a counterpart of Theorem 1(a) in general minimax problems, and Shapiro (2008, Thm 3.1) and Shapiro et al. (2009, Thm 5.10) have studied the asymptotic distribution of the optimal value in convex-concave minimax problems. Our results complement these earlier works, in particular by providing confidence sets for the solutions of the minimax problem. A novelty is that all our results hold also in the non-convex and non-concave setting (previous literature typically assumes that f(•, δ) is convex for all fixed δ ∈ ∆ and f(γ, •) is concave for all fixed γ ∈ Γ) and in the presence of multiple solutions.
The general minimax formulation considered above is very common and so similar to the GAN problem that only cosmetic adjustments to our theory were needed. However, in many examples in the literature the precise formulation of the minimax problem is quite different from the one above. For instance, in the machine learning literature somewhat different variants of minimax-type problems have been considered at least in adversarial learning (Madry et al., 2018; Javanmard et al., 2020), multi-agent reinforcement learning (Wai et al., 2018; Zhang et al., 2021), distributionally robust optimization (Delage and Ye, 2010; Duchi and Namkoong, 2021), federated learning (Mohri et al., 2019), signal processing (Lu et al., 2020), and AUC (area under the ROC curve) maximization (Ying et al., 2016). Adapting the theory of the present paper to such variations of the minimax problem would require more substantial changes and is left for future research.

Conclusion
In this paper, we have considered statistical inference for GANs and other minimax problems in the empirically relevant case of multiple solutions. We first considered the consistency of (approximate) solutions to the sample GAN problem, and showed that the set of these solutions is a Hausdorff consistent estimator of the corresponding set of solutions to the population GAN problem, Θ_0. We then presented a subsampling-based iterative procedure for forming confidence sets for Θ_0, and showed that these confidence sets are conservatively asymptotically consistent. The consistency result was shown to hold without any restrictions on the number of solutions to the GAN problem, whereas our results for confidence sets were given for the common case of a multiple but finite number of solutions. These results extend the results of Biau et al. (2020), who considered the case of a single unique solution. For other general minimax problems, our results complement previous works by providing confidence sets for the solutions of the minimax problem; our assumptions allow for the non-convex and non-concave setting and the presence of multiple solutions.
The present paper considered only the original Goodfellow et al. (2014) formulation of the GAN minimax problem and the standard formulation of the general minimax problem in Section 6. Extensions to other existing GAN variants (such as the popular Wasserstein GAN of Arjovsky et al., 2017) and to other minimax problems in machine learning and elsewhere would be useful. Furthermore, the focus in this paper has been theoretical, and exploring the use of our results in practical applications would be interesting.

Appendix A: Proof of Theorem 1
Proof of Theorem 1. For completeness, first note that measurability issues and non-emptiness of Θ_0 and Θ̂_n (with probability one) are discussed in Shapiro et al. (2009, pp. 170-171). For ease of reference, also note that for any two bounded real-valued functions u and v defined on a domain D (a subset of some Euclidean space),

(20)  |sup u − sup v| ≤ sup |u − v|   and   |inf u − inf v| ≤ sup |u − v|,

where all the suprema and infima are taken over x ∈ D.

Appendix B: Proofs of Lemma 1 and Theorem 2
We begin with some preparatory discussion. As was outlined in Section 4, the statistic sup_{θ∈Θ_0} n^{1/2} Q̂_n(θ) can be expressed as n^{1/2}(ϕ(f̂_n) − ϕ(f)) for a suitably defined map ϕ : l^∞(Θ) → R (to be given below in the proof of Lemma 1), this map will be shown to be differentiable in an appropriate sense, and these facts enable us to apply a suitable functional delta method to obtain the result of Lemma 1. To derive the necessary results, it is convenient to define the relevant differentiability concepts for maps between abstract Banach spaces; the specific spaces used below include l^∞(Θ), C(Θ), and the Euclidean space R.
Let X and Y be real Banach spaces with norms ∥•∥_X and ∥•∥_Y, respectively. Let the domain X_D ⊆ X be some subset of X, and consider an arbitrary map ϕ : X_D → Y. Let X_0 ⊆ X be another subset of X. The map ϕ is said to be Hadamard directionally differentiable at x ∈ X_D tangentially to X_0 with a derivative ϕ'_x : X_0 → Y if

(25)  lim_{n→∞} ∥ [ϕ(x + t_n h_n) − ϕ(x)] / t_n − ϕ'_x(h) ∥_Y = 0

for all sequences {h_n} ⊂ X and {t_n} ⊂ R_+ such that t_n ↓ 0 and h_n → h ∈ X_0 as n → ∞ and x + t_n h_n ∈ X_D for all n. Two related differentiability concepts are defined as follows. If the convergence (25) is required to hold only for all fixed h_n ≡ h ∈ X_0, the map ϕ is said to be Gateaux directionally differentiable (at x ∈ X_D tangentially to X_0). Alternatively, if in the above definition the requirement "{t_n} ⊂ R_+ such that t_n ↓ 0" is replaced with "{t_n} ⊂ R such that t_n → 0", the map is said to be (fully) Hadamard differentiable (at x ∈ X_D tangentially to X_0). In the statistical literature, these differentiability concepts are discussed and developed in Reeds (1976), Gill (1989), and Shapiro (1990; 1991), among others.
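A simple finite-dimensional illustration (standard, and included only to fix ideas): take X = R², Y = R, and ϕ(x) = max(x_1, x_2). At a point x with x_1 = x_2, for any t_n ↓ 0 and h_n → h,

```latex
\frac{\phi(x + t_n h_n) - \phi(x)}{t_n}
  = \max(h_{n,1},\, h_{n,2}) \;\to\; \max(h_1, h_2) =: \phi'_x(h),
```

so ϕ is Hadamard directionally differentiable at such x, but the derivative h ↦ max(h_1, h_2) is positively homogeneous rather than linear, and the two-sided limit required for full Hadamard differentiability fails. The sup- and minimax-type maps considered below behave analogously on function spaces.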
Our main interest in these differentiability concepts comes from the fact that the functional delta method may become applicable. For (fully) Hadamard differentiable maps this is well known; see, e.g., van der Vaart and Wellner (1996, Thm 3.9.4). For Hadamard directionally differentiable maps the functional delta method is given in Shapiro (1991, Thm 2.1); attention has recently been drawn to this result for instance by Fang and Santos (2019, Thm 2.1) and Cárcamo et al. (2020, Propn 2.1). For convenience, we reproduce this result as the following lemma.
Lemma 2. Suppose the map ϕ : X_D → Y is Hadamard directionally differentiable at x ∈ X_D tangentially to X_0 with a derivative ϕ'_x : X_0 → Y. Let x_1, x_2, . . . be X_D-valued random elements such that r_n(x_n − x) ⇝ x_0 in X for some random element x_0 taking its values in X_0 with probability one and for some constants r_n → ∞. Then r_n(ϕ(x_n) − ϕ(x)) ⇝ ϕ'_x(x_0) in Y.

Gateaux directional differentiability is in general too weak a differentiability concept to ensure the validity of the functional delta method. However, when the map ϕ is also locally Lipschitz at x ∈ X_D (in the sense that there exists a C > 0 such that for all x', x'' ∈ X_D in some neighborhood of x, ∥ϕ(x') − ϕ(x'')∥_Y ≤ C∥x' − x''∥_X), Gateaux and Hadamard directional differentiability become equivalent (Shapiro, 1990, Propn 3.5). This is convenient and holds for the maps considered in the next lemma. In this lemma, we let Γ and ∆ denote any compact and non-empty Euclidean sets (which need not equal the sets Γ and ∆ elsewhere in the paper; the same notation is used for convenience). Furthermore, for any function x ∈ C(∆) we denote ∆_0(x) = {δ_0 ∈ ∆ : x(δ_0) = sup_{δ∈∆} x(δ)}, and for any function x ∈ C(Γ × ∆) we let x̄(γ) = sup_{δ∈∆} x(γ, δ) denote the max-function and define the sets Γ_0(x) and Θ_0(x) as the corresponding sets of outer minimizers and of minimax solutions of x (defined as Γ_0 and Θ_0 are in Section 2, with f replaced by x). For any functions x, y ∈ l^∞(Γ × ∆), we also allow ourselves to write x̄(γ) = sup_{δ∈∆} x(γ, δ) and ȳ(γ) = sup_{δ∈∆} y(γ, δ).
Finally, for notational ease the three different maps in parts (a)-(c) of the following lemma are all denoted by ϕ as confusion should not arise.
(a) The map ϕ : l^∞(∆) → R given by ϕ(x) = sup_{δ∈∆} x(δ) is Hadamard directionally differentiable at any x_0 ∈ C(∆) tangentially to C(∆), with derivative ϕ'_{x_0}(h) = sup_{δ∈∆_0(x_0)} h(δ), h ∈ C(∆).

The proof of Lemma 3 is given at the end of this appendix. The result in part (a) is well known and versions of it can be found in Shapiro (1991, Thm 3.1), Fang and Santos (2019, Lemma S.4.9), and Cárcamo et al. (2020, Corollary 2.3), among others. Result (b) in a convex-concave special case (x and h convex in γ and concave in δ) has been obtained by Shapiro (2008, Propn 2.1), but the extension to the present non-convex non-concave case appears to be new and facilitates our analysis of minimax problems and GANs. Our main interest is in part (c), the proof of which relies on parts (a) and (b). The finiteness of Γ_0(x_0) is assumed in part (c) because the mapping taking functions x in l^∞(Γ × ∆) to functions sup_{δ∈∆} x(γ, δ), γ ∈ Γ, in l^∞(Γ) is in general not Hadamard directionally differentiable without some additional assumptions.
With the general Lemmas 2 and 3 available, we now return to the GAN problem and first give the proof of Lemma 1.
We close this appendix with the proof of Lemma 3.
Taking an infimum over γ ∈ Γ and then the limes inferior as n → ∞ establishes the latter inequality in (31), hence completing the proof of part (b).

Table 1: Results of a Monte Carlo study, empirical coverage rates (in %).