Predictive Inference Based on Markov Chain Monte Carlo Output

In Bayesian inference, predictive distributions are typically in the form of samples generated via Markov chain Monte Carlo or related algorithms. In this paper, we conduct a systematic analysis of how to make and evaluate probabilistic forecasts from such simulation output. Based on proper scoring rules, we develop a notion of consistency that allows us to assess the adequacy of methods for estimating the stationary distribution underlying the simulation output. We then provide asymptotic results that account for the salient features of Bayesian posterior simulators and derive conditions under which choices from the literature satisfy our notion of consistency. Importantly, these conditions depend on the scoring rule being used, such that the choices of approximation method and scoring rule are intertwined. While the logarithmic rule requires fairly stringent conditions, the continuous ranked probability score yields consistent approximations under minimal assumptions. These results are illustrated in a simulation study and an economic data example. Overall, mixture-of-parameters approximations that exploit the parametric structure of Bayesian models perform particularly well. Under the continuous ranked probability score, the empirical distribution function is a simple and appealing alternative option.


Introduction
Probabilistic forecasts are predictive probability distributions over quantities or events of interest. They implement an idea that was eloquently expressed as early as the beginning of the 20th century in the context of meteorological prediction. Today, probabilistic forecasts take the form of predictive distributions in a wide range of applications, including economic, ecological and meteorological problems, among many others. Bayesian posterior predictive distributions naturally account for sources of uncertainty, such as unknown model parameters or latent variables in state space models, that are not easily captured using frequentist methods; see, for example, Clark (2005) for an ecological perspective.
Formally, posterior predictive distributions arise as mixture distributions with respect to the posterior distribution of the parameter vector. In the following, we assume that the parameter vector contains all quantities that are subject to Bayesian inference, including, for example, latent state variables. For a real-valued continuous quantity of interest, the posterior predictive distribution, $F_0$, can be represented by its cumulative distribution function (CDF) or the respective density. The posterior predictive CDF is then of the generic form

$$F_0(x) = \int_\Theta F_c(x\,|\,\theta)\,\mathrm{d}P_{\mathrm{post}}(\theta) \qquad (1)$$

for $x \in \mathbb{R}$, where $P_{\mathrm{post}}$ is the posterior distribution of the parameter, $\theta$, over some parameter space, $\Theta$, and $F_c(\cdot\,|\,\theta)$ is the conditional predictive CDF when $\theta \in \Theta$ is the true parameter. Harris (1989) argues that predictive distributions of this form have appeal in frequentist settings as well. Often, the integral in (1) does not admit a closed-form solution, and so the posterior predictive CDF must be approximated or estimated in some way, typically using some form of Markov chain Monte Carlo (MCMC); see, for example, Gelfand & Smith (1990) and Gilks et al. (1996). Given a simulated sequence $(\theta_i)_{i=1}^m$ of parameter values from $P_{\mathrm{post}}$, one approach, which we call the mixture-of-parameters (MP) technique, is to approximate $F_0$ by

$$\hat F_m^{\mathrm{MP}}(x) = \frac{1}{m} \sum_{i=1}^m F_c(x\,|\,\theta_i). \qquad (2)$$

However, this method can be used only when the conditional distributions $F_c(\cdot\,|\,\theta)$ are available in closed form. An alternative route is to simulate a sequence $(X_i)_{i=1}^m$, where $X_i \sim F_c(\cdot\,|\,\theta_i)$, and to approximate $F_0$ based on this sample, using either nonparametric or parametric techniques.
The most straightforward option is to estimate $F_0$ by the empirical CDF (ECDF),

$$\hat F_m^{\mathrm{ECDF}}(x) = \frac{1}{m} \sum_{i=1}^m \mathbb{1}\{X_i \le x\}. \qquad (3)$$

Alternatively, one might employ a kernel density (KD) estimate of the posterior predictive density, namely,

$$\hat f_m^{\mathrm{KD}}(x) = \frac{1}{m h_m} \sum_{i=1}^m K\!\left(\frac{x - X_i}{h_m}\right), \qquad (4)$$

where $K$ is a kernel function, that is, a symmetric, bounded and square-integrable probability density, such as the Gaussian or the Epanechnikov kernel, and $h_m$ is a suitable bandwidth (Rosenblatt, 1956; Silverman, 1986). Finally, much extant work employs a Gaussian approximation (GA) to $F_0$, namely,

$$\hat F_m^{\mathrm{GA}}(x) = \Phi\!\left(\frac{x - \hat\mu_m}{\hat\sigma_m}\right), \qquad (5)$$

where $\Phi$ is the CDF of the standard normal distribution and $\hat\mu_m$ and $\hat\sigma_m$ are the empirical mean and standard deviation of the sample $(X_i)_{i=1}^m$.

Following Rubin (1984) and Little (2006), it is now widely accepted that posterior predictive inference should be evaluated using frequentist principles, without prior information entering at the model evaluation stage. For the comparison and ranking of probabilistic forecasting methods, one typically uses a proper scoring rule (Gneiting & Raftery, 2007) that assigns a numerical score or penalty based on the predictive CDF, $F$, or its density, $f$, and the corresponding realisation, $y$, such as the logarithmic score (LogS; Good, 1952),

$$\mathrm{LogS}(F, y) = -\log f(y), \qquad (6)$$

or the continuous ranked probability score (CRPS; Matheson & Winkler, 1976),

$$\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left( F(x) - \mathbb{1}\{x \ge y\} \right)^2 \mathrm{d}x. \qquad (7)$$

While the LogS and CRPS are the two most popular scoring rules in applications, they feature interesting conceptual differences, which we discuss in Section 2.2. In practice, one finds and compares the mean score over an out-of-sample test set, and the forecasting method with the smaller mean score is preferred. Formal tests of the null hypothesis of equal predictive performance can be employed as well (Diebold & Mariano, 1995; Giacomini & White, 2006; Clark & McCracken, 2013; DelSole & Tippett, 2014). Table 1 of the supporting information summarises the use of evaluation techniques in recently published comparative studies of probabilistic forecasting methods that use Bayesian inference via MCMC.
As shown in the table, the MP technique has mainly been applied in concert with the LogS, whereas the ECDF method can be used in conjunction with the CRPS only. However, to date, there are few, if any, guidelines to support the choices in the table, and it is not clear how they affect practical model comparisons. The present paper provides a systematic analysis of this topic. We focus on the following questions. First, what defines reasonable choices of the approximation method and scoring rule? Second, under what conditions do extant choices from the literature satisfy this definition? Third, for a given scoring rule, how accurate are alternative approximation methods in practically relevant scenarios?
In studying these questions, our work is complementary to Gneiting & Raftery (2007) who develop the broader theory of scoring rules and portray their rich mathematical and decision theoretic structure. While Gneiting & Raftery (2007) mention simulated predictive distributions (see in particular their Section 4.2), the empirical literature surveyed in the supporting information has largely evolved after 2007, giving rise to the applied techniques that motivate the present paper.
We emphasise that the present study, and the use of scoring rules in general, concerns the comparative assessment of two or more predictive models: the model with the smallest mean score is considered the most appropriate. Comparative assessment is essential in order to choose among the large number of specifications typically available in practice. This task is different from absolute assessment, which amounts to diagnosing possible misspecification, using the probability integral transform (Dawid, 1984; Diebold et al., 1998), posterior predictive checks (Gelman et al., 1996; Held et al., 2010; Gelman et al., 2014a, Chapter 6) and related methods.
The remainder of this paper is organised as follows. Section 2 introduces the notion of a consistent approximation to F 0 . This formalises the idea that, as the size of the simulated sample becomes larger and larger, and in terms of a given scoring rule, the approximation ought to perform as well as the unknown true forecast distribution. In Section 3, we provide theoretical justifications of approximation methods encountered in the literature. Sections 4 and 5 present simulation and empirical evidence on the performance of these methods, and Section 6 concludes with a discussion. Overall, our findings support the use of the MP estimator at (2) in order to approximate the posterior predictive distribution of interest. If this estimator is unavailable, the ECDF estimator at (3) is a simple and appealing alternative. Technical material and supplementary analyses are deferred to Appendices A-E. The supporting information contains a bibliography of the pertinent applied literature and additional figures.

Formal Setting
In this section, we discuss the posterior predictive distribution in Bayesian forecasting, give a brief review of proper scoring rules and score divergences and introduce the concept of a consistent approximation method based on MCMC output.
As discussed earlier, the posterior predictive CDF of a Bayesian forecasting model is given by the mixture representation in (1), where $\theta \in \Theta$ is the parameter, $P_{\mathrm{post}}$ is the posterior distribution of the parameter and $F_c(\cdot\,|\,\theta)$ is the predictive distribution conditional on a parameter value $\theta$; see, for example, Greenberg (2013, p. 33) or Gelman et al. (2014a, p. 7). A generic MCMC algorithm designed to sample from $F_0$ iterates the following two steps: draw a parameter value $\theta_i$ from (an approximation to) $P_{\mathrm{post}}$ and, optionally, draw $X_i \sim F_c(\cdot\,|\,\theta_i)$.
This generic MCMC algorithm allows for two general options for estimating the posterior predictive distribution $F_0$ in (1): Option A retains the parameter draws $(\theta_i)_{i=1}^m$, whereas Option B retains a sample $(X_i)_{i=1}^m$ with $X_i \sim F_c(\cdot\,|\,\theta_i)$, where $m$ typically is on the order of a few thousand or ten thousand. Alternatively, some authors, such as Krüger et al. (2017), generate, for each $i = 1, \ldots, m$, independent draws $X_{ij} \sim F_c(\cdot\,|\,\theta_i)$, where $j = 1, \ldots, J$; see also Waggoner & Zha (1999, Section III.B). The considerations in the succeeding text apply in this more general setting as well.
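As a concrete toy illustration of the two options, suppose the posterior for a location parameter is $\mathrm{N}(0, 1)$ and $F_c(\cdot\,|\,\theta) = \mathrm{N}(\theta, 1)$, so that $F_0 = \mathrm{N}(0, 2)$. The sketch below (our own, in Python; it uses independent draws in place of a Markov chain for simplicity) generates the parameter draws of Option A and the predictive sample of Option B:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100_000

# Option A: retain the parameter draws (theta_i), i = 1, ..., m,
# from the (here: standard normal) posterior.
theta = rng.normal(loc=0.0, scale=1.0, size=m)

# Option B: additionally simulate X_i ~ F_c(. | theta_i) = N(theta_i, 1).
x = rng.normal(loc=theta, scale=1.0)

# Under F_0 = N(0, 2), the predictive sample should have variance close to 2.
print(round(float(x.var()), 2))
```

The parameter draws of Option A feed the MP estimator (2), while the sample of Option B feeds the ECDF, KD and GA approximations (3) to (5).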

Approximation Methods
In the case of Option A, the sequence $(\theta_i)_{i=1}^m$ of parameter draws is used to approximate the posterior predictive distribution, $F_0$, by the MP estimator $\hat F_m^{\mathrm{MP}}$ in (2). Under the assumption of ergodicity, $\hat F_m^{\mathrm{MP}}(x) \to F_0(x)$ almost surely as $m \to \infty$, for every $x \in \mathbb{R}$. This estimator was popularised by Gelfand & Smith (1990, Section 2.2), based on earlier work by Tanner & Wong (1987), and is often called a conditional or Rao-Blackwellised estimator. The latter term hints at the variance reduction that may result from conditioning on the parameter draws (see Theorem 4). In the case of Option B, the sample $(X_i)_{i=1}^m$ is employed to approximate the posterior predictive distribution $F_0$. Various methods for doing this have been proposed and used, including the ECDF-based approximation at (3), the KD estimator at (4) and the GA at (5). Approaches of this type incur 'more randomness than necessary', in that the simulation step to draw $(X_i)_{i=1}^m$ can be avoided if Option A is used. That said, Option A requires full knowledge of the model specification, as the conditional distributions must be known in closed form in order to compute $\hat F_m^{\mathrm{MP}}$. There are situations where this is restrictive, for example, when the task is to predict a non-linear transformation of the original, possibly vector-valued predictand (see the set-up in Feldmann et al. 2015, Section 6d, for an example from meteorology). We emphasise, however, that the MP estimator is readily available in the clear majority of applied examples that we encounter in our work.
The implementation of the approximation methods (based on either Option A or B) is typically straightforward, except for the case of KD estimation, for which we discuss implementation choices in Section 3.3.

Proper Scoring Rules and Score Divergences
Let $\Omega \subseteq \mathbb{R}$ denote the set of possible values of the quantity of interest, and let $\mathcal{F}$ denote a convex class of probability distributions on $\Omega$. A scoring rule is a function that assigns numerical values to pairs of forecasts $F \in \mathcal{F}$ and observations $y \in \Omega$. We typically set $\Omega = \mathbb{R}$ but will occasionally restrict attention to compact subsets. Throughout this paper, we define scoring rules to be negatively oriented; that is, a lower score indicates a better forecast. Writing $\mathrm{S}(F, G)$ for the expected score of the forecast $F$ when the observation is drawn from $G$, a scoring rule is proper relative to $\mathcal{F}$ if $\mathrm{S}(G, G) \le \mathrm{S}(F, G)$ for all probability distributions $F, G \in \mathcal{F}$. It is strictly proper relative to the class $\mathcal{F}$ if, furthermore, equality implies that $F = G$. The score divergence associated with the scoring rule $\mathrm{S}$ is given by

$$d_{\mathrm{S}}(F, G) = \mathrm{S}(F, G) - \mathrm{S}(G, G).$$

For a probability distribution with CDF $F$, we write $\mu_F$ for its mean, $\sigma_F$ for its standard deviation and $f$ for its density.
Clearly, $d_{\mathrm{S}}(F, G) \ge 0$ for all $F, G \in \mathcal{F}$ is equivalent to propriety of the scoring rule $\mathrm{S}$, which is a critically important property in practice.¹ Table 1 shows frequently used proper scoring rules, along with the associated score divergences and the natural domain. For any given scoring rule $\mathrm{S}$, the associated natural domain is the largest convex class of probability distributions $\mathcal{F}$ such that $\mathrm{S}(F, y)$ is well defined and finite almost surely under $F$. Specifically, the natural domain for the popular LogS [Equation (6)] is the class $\mathcal{L}_1$ of the probability distributions with densities, and the respective score divergence is the Kullback-Leibler divergence. The LogS is local (Bernardo, 1979); that is, it evaluates a predictive model based only on the density value at the realising outcome. Conceptually, this means that the LogS ignores the model's predicted probabilities of events that could have happened but did not. For the CRPS [Equation (7)], the natural domain is the class $\mathcal{M}_1$ of the probability distributions with finite mean. The LogS and CRPS are both strictly proper relative to their respective natural domains. In contrast to the LogS, the CRPS rewards predictive distributions that place mass close to the realising outcome, a feature that is often called 'sensitivity to distance' (e.g. Matheson & Winkler, 1976, Section 2). While various authors have argued in favour of either locality or sensitivity to distance, the choice between these two contrasting features appears ultimately subjective. Finally, the natural domain for the Dawid-Sebastiani score (DSS; Dawid & Sebastiani, 1999) is the class $\mathcal{M}_2$ of the probability distributions with strictly positive, finite variance. This score is proper, but not strictly proper, relative to $\mathcal{M}_2$.
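To make the contrast concrete, all three scores are available in closed form for a Gaussian predictive distribution. The sketch below (our own, in Python) uses $\mathrm{LogS}(F, y) = -\log f(y)$, the standard form of the DSS, and the well-known closed-form CRPS of a Gaussian (see Gneiting & Raftery, 2007):

```python
import math

def logs_normal(mu, sigma, y):
    # Negatively oriented logarithmic score: -log f(y) for N(mu, sigma^2).
    return 0.5 * math.log(2 * math.pi * sigma**2) + (y - mu)**2 / (2 * sigma**2)

def dss(mu, sigma, y):
    # Dawid-Sebastiani score; depends on F only through mean and variance.
    return 2 * math.log(sigma) + ((y - mu) / sigma)**2

def crps_normal(mu, sigma, y):
    # Closed form of the CRPS for a Gaussian forecast.
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))

# A standard normal forecast scored at the centre of the distribution:
print(round(crps_normal(0.0, 1.0, 0.0), 4))  # 0.2337
```

Note that the DSS is unchanged by any modification of the forecast that preserves its first two moments, which is why it is proper but not strictly proper on $\mathcal{M}_2$.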

Consistent Approximations
To study the combined effects of the choices of approximation method and scoring rule in the evaluation of Bayesian predictive distributions, we introduce the notion of a consistent approximation procedure.
(A) The process $(\theta_i)_{i=1,2,\ldots}$ is stationary and ergodic with invariant distribution $P_{\mathrm{post}}$.

As noted, assumption (A) implies that $(X_i)_{i=1,2,\ldots}$ is stationary and ergodic with invariant distribution $F_0$. Consider an approximation method that produces, for all sufficiently large positive integers $m$, an estimate $\hat F_m$ that is based on $(\theta_i)_{i=1}^m$ or $(X_i)_{i=1}^m$, respectively. Let $\mathrm{S}$ be a proper scoring rule, and let $\mathcal{F}$ be the associated natural domain. Then the approximation method is consistent relative to the scoring rule $\mathrm{S}$ at the distribution $F_0 \in \mathcal{F}$ if $\hat F_m \in \mathcal{F}$ for all sufficiently large $m$, and $d_{\mathrm{S}}(\hat F_m, F_0) \to 0$ almost surely as $m \to \infty$. This formalises the idea that under continued MCMC sampling, the approximation ought to perform as well as the unknown true posterior predictive distribution. We contend that this is a highly desirable property in practical work.
Note that $\hat F_m$ is a random quantity that depends on the sample $(\theta_i)_{i=1}^m$ or $(X_i)_{i=1}^m$. The specific form of the divergence stems from the scoring rule, which mandates convergence of a certain functional of the estimator or approximation, $\hat F_m$, and the theoretical posterior predictive distribution, $F_0$. As we will argue, this aspect has important implications for the choice of scoring rule and approximation method.
Our concept of a consistent approximation procedure is independent of the question of how well a forecast model represents the 'true' uncertainty. The definition thus allows us to separate the problem of interest, namely, to find a good approximation $\hat F_m$ to $F_0$, from the distinct task of finding a good probabilistic forecast $F_0$.² We further emphasise that we study convergence in the sample size, $m$, of MCMC output, given a fixed number of observations, say, $T$, used to fit the model. Our analysis is thus distinct from traditional Bayesian asymptotic analyses that study convergence of the posterior distribution as $T$ becomes larger and larger (see, e.g. Gelman et al., 2014a, Section 4), thereby calling for markedly different technical tools.

Relation to Total Variation and Wasserstein Distances
Our focus on score divergences (in particular, on $d_{\mathrm{LogS}}$ and $d_{\mathrm{CRPS}}$) is motivated by their natural relation to scoring rules, which in turn are popular tools in the applied literature on probabilistic forecasting. As reviewed by Gibbs & Su (2002), many other distance metrics for comparing two probability distributions have been proposed in the literature. Among these metrics, the total variation distance ($d_{\mathrm{TV}}$) has received much attention in theoretical work on MCMC (e.g. Tierney, 1994; Rosenthal, 1995) and is thus particularly relevant in our context. The total variation distance between two absolutely continuous probability measures with densities $f$ and $g$ is defined as

$$d_{\mathrm{TV}}(F, G) = \frac{1}{2} \int \left| f(x) - g(x) \right| \mathrm{d}x.$$

By Pinsker's inequality (see, e.g. Barron et al., 1992), convergence in terms of $d_{\mathrm{LogS}}$ implies convergence in terms of $d_{\mathrm{TV}}$.
The Wasserstein distance is a divergence function motivated by optimal transport problems (Villani, 2009) and has received much attention in statistics and machine learning (Panaretos & Zemel, 2019). Here, we limit our discussion to the Wasserstein distance of order 1, which is most common in practice, and denote the corresponding metric by

$$d_{\mathrm{W}}(F, G) = \int_0^1 \left| F^{-1}(u) - G^{-1}(u) \right| \mathrm{d}u,$$

where $F^{-1}$ and $G^{-1}$ are the quantile functions of $F$ and $G$, respectively. Bellemare et al. (2017) discuss shortcomings of Wasserstein distances in estimation with stochastic gradient descent methods and suggest $d_{\mathrm{CRPS}}$ as a superior alternative. This recommendation relates to the observation that there is no proper scoring rule with $d_{\mathrm{W}}$ as score divergence (Thorarinsdottir et al., 2013, Theorem 2).
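For two samples of equal size, $d_{\mathrm{W}}$ reduces to the mean absolute difference of the order statistics, which makes the metric easy to compute. A minimal sketch (our own, in Python):

```python
import numpy as np

def wasserstein1(sample_f, sample_g):
    # d_W(F, G) = integral of |F^{-1}(u) - G^{-1}(u)| over u in (0, 1);
    # for equal-size samples this is the mean absolute difference of
    # the order statistics.
    xs = np.sort(np.asarray(sample_f, dtype=float))
    ys = np.sort(np.asarray(sample_g, dtype=float))
    return float(np.mean(np.abs(xs - ys)))

# A pure location shift by 1.5 gives d_W = 1.5:
x = np.random.default_rng(0).normal(size=10_000)
print(round(wasserstein1(x, x + 1.5), 4))  # 1.5
```

The same location shift changes $d_{\mathrm{CRPS}}$ and $d_{\mathrm{TV}}$ in a bounded, nonlinear way, which is one reason the three metrics can rank approximation errors differently.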

Table 2. Upper bounds on the computational cost of each approximation method, in terms of the MCMC sample size m, for pre-processing and for the exact computation of the CRPS, DSS and LogS. CRPS, continuous ranked probability score; DSS, Dawid-Sebastiani score; LogS, logarithmic score; ECDF, empirical cumulative distribution function; KD, kernel density; MP, mixture-of-parameters.

Consistency Results and Computational Complexity
We now investigate sufficient conditions for consistency of the aforementioned approximation methods, namely, the MP estimator at (2), the ECDF-based approximation at (3), the KD estimator at (4) and the GA at (5). Table 2 summarises upper bounds on the computational cost of pre-processing and computing the CRPS, DSS and LogS under these methods in terms of the size $m$ of the MCMC sample $(\theta_i)_{i=1}^m$ or $(X_i)_{i=1}^m$, respectively. Consistency requires the convergence of some functional of the approximation, $\hat F_m$, and the true posterior predictive distribution, $F_0$. The conditions to be placed on the Bayesian model $F_0$, the estimator $\hat F_m$ and the dependence structure of the MCMC output depend on the scoring rule at hand.

Mixture-of-Parameters Estimator
We now establish consistency of the MP estimator $\hat F_m^{\mathrm{MP}}$ in (2) relative to the CRPS, DSS and LogS. The proofs are deferred to Appendix B.

Theorem 1. (Consistency of MP approximations relative to the CRPS and DSS).
Under assumption (A), the MP approximation is consistent relative to the CRPS at every distribution $F_0$ with finite mean, and consistent relative to the DSS at every distribution $F_0$ with strictly positive, finite variance. Theorem 1 is the best possible result of its kind: it applies to every distribution in the natural domain and does not invoke any assumptions on the Bayesian model. In contrast, Theorem 2 hinges on rather stringent further conditions on the distribution $F_0$ and the Bayesian model (1), as follows.
(B) The distribution $F_0$ is supported on some bounded interval $\Omega \subset \mathbb{R}$. It admits a density, $f_0$, that is continuous and strictly positive on $\Omega$. Furthermore, the density $f_c(\cdot\,|\,\theta)$ is continuous for every $\theta \in \Theta$.

Theorem 2. (Consistency of MP approximations relative to the LogS). Under assumptions (A) and (B), the MP approximation is consistent relative to the LogS at the distribution $F_0$.
While we believe that the MP technique is consistent under weaker assumptions, this is the strongest result that we have been able to prove. In particular, condition (B) does not allow for the case $\Omega = \mathbb{R}$. However, practical applications often involve a truncation of the support for numerical reasons, as exemplified in Section 4, and in this sense, the assumption may not be overly restrictive.
Computing the LogS and the DSS for a predictive distribution $\hat F_m^{\mathrm{MP}}$ of the form (2) is straightforward. To compute the CRPS, we note from equation (21) of Gneiting & Raftery (2007) that

$$\mathrm{CRPS}(\hat F_m^{\mathrm{MP}}, y) = \frac{1}{m} \sum_{i=1}^m \mathbb{E}\,|Z_i - y| - \frac{1}{2m^2} \sum_{i=1}^m \sum_{j=1}^m \mathbb{E}\,|Z_i - Z_j|, \qquad (8)$$

where $Z_i$ and $Z_j$ are independent random variables with distributions $F_c(\cdot\,|\,\theta_i)$ and $F_c(\cdot\,|\,\theta_j)$, respectively. The expectations on the right-hand side of (8) often admit closed-form expressions that can be derived with techniques described by Jordan (2016) and Taillardat et al. (2016), including but not limited to the ubiquitous case of Gaussian variables. Then the evaluation requires $O(m^2)$ operations, as reported in Table 2. In Appendix A, we provide details and investigate the use of numerical integration in (7), which provides an attractive, computationally efficient alternative.
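For the ubiquitous case of Gaussian conditionals, $F_c(\cdot\,|\,\theta_i) = \mathrm{N}(\mu_i, \sigma_i^2)$, the expectations in (8) are available in closed form: $Z_i - Z_j \sim \mathrm{N}(\mu_i - \mu_j, \sigma_i^2 + \sigma_j^2)$, and for $W \sim \mathrm{N}(\delta, s^2)$ one has $\mathbb{E}|W| = s\left(z(2\Phi(z) - 1) + 2\varphi(z)\right)$ with $z = \delta/s$. The sketch below (our own, in Python; not the paper's implementation) carries out the resulting $O(m^2)$ computation:

```python
import numpy as np
from math import erf, exp, pi, sqrt

def expected_abs(delta, s):
    # E|W| for W ~ N(delta, s^2).
    z = delta / s
    pdf = exp(-0.5 * z * z) / sqrt(2 * pi)
    cdf = 0.5 * (1 + erf(z / sqrt(2)))
    return s * (z * (2 * cdf - 1) + 2 * pdf)

def crps_mp_gaussian(mu, sigma, y):
    # CRPS of the mixture-of-parameters estimator (8) with Gaussian
    # components N(mu_i, sigma_i^2); costs O(m^2) operations.
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    m = len(mu)
    term1 = sum(expected_abs(y - mu[i], sigma[i]) for i in range(m))
    term2 = sum(
        expected_abs(mu[i] - mu[j], sqrt(sigma[i]**2 + sigma[j]**2))
        for i in range(m) for j in range(m)
    )
    return term1 / m - term2 / (2 * m**2)

# With a single component, this reduces to the Gaussian CRPS:
print(round(crps_mp_gaussian([0.0], [1.0], 0.0), 4))  # 0.2337
```

For large $m$, the double sum is the bottleneck, which is where the numerical integration of (7) discussed in Appendix A becomes attractive.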

Empirical Cumulative Distribution Function-Based Approximation
The ECDF-based approximation $\hat F_m^{\mathrm{ECDF}}$ in (3), which builds on a simulated sample $(X_i)_{i=1}^m$, is consistent relative to the CRPS and DSS under minimal assumptions. We prove the following result in Appendix C; it is the best possible of its kind, as it applies to every distribution in the natural domain and does not invoke any assumptions on the Bayesian model.

Theorem 3. (Consistency of ECDF-based approximations relative to the CRPS and DSS).
Under assumption (A), the ECDF technique is consistent relative to the CRPS at every distribution $F_0$ with finite mean, and consistent relative to the DSS at every distribution $F_0$ with strictly positive, finite variance.

As stated in Table 2, the CRPS under the ECDF-based approximation can be computed exactly and efficiently; see Jordan (2016, Section 6) for details. A special case of Equation (8) suggests one way of computing the CRPS, in that

$$\mathrm{CRPS}(\hat F_m^{\mathrm{ECDF}}, y) = \frac{1}{m} \sum_{i=1}^m |X_i - y| - \frac{1}{2m^2} \sum_{i=1}^m \sum_{j=1}^m |X_i - X_j|. \qquad (9)$$

An equivalent representation in terms of the order statistics $X_{(1)} \le \cdots \le X_{(m)}$ is

$$\mathrm{CRPS}(\hat F_m^{\mathrm{ECDF}}, y) = \frac{2}{m^2} \sum_{i=1}^m \left( X_{(i)} - y \right) \left( m\,\mathbb{1}\{ y < X_{(i)} \} - i + \frac{1}{2} \right). \qquad (10)$$

The representations in (9) and (10) require $O(m^2)$ and $O(m \log m)$ operations, respectively.
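Both representations are straightforward to implement. A sketch (our own, in Python) of the pairwise form (9) and the order-statistics form (10), checked for agreement on a toy sample:

```python
import numpy as np

def crps_ecdf_pairwise(x, y):
    # Representation (9): O(m^2) pairwise form of the CRPS of the ECDF.
    x = np.asarray(x, dtype=float)
    term1 = np.mean(np.abs(x - y))
    term2 = np.mean(np.abs(x[:, None] - x[None, :]))
    return float(term1 - 0.5 * term2)

def crps_ecdf_sorted(x, y):
    # Representation (10): O(m log m) form based on the order statistics.
    x = np.sort(np.asarray(x, dtype=float))
    m = len(x)
    i = np.arange(1, m + 1)
    return float((2 / m**2) * np.sum((x - y) * (m * (y < x) - i + 0.5)))

sample = [0.0, 1.0]
print(crps_ecdf_pairwise(sample, 0.0), crps_ecdf_sorted(sample, 0.0))  # 0.25 0.25
```

The sorted form is the one to prefer in practice, since sorting dominates its cost and the same sorted sample can be reused across forecast cases.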

Kernel Density Estimator
We now discuss conditions for the consistency of the KD estimator $\hat f_m^{\mathrm{KD}}$. In the present case of dependent samples $(X_i)_{i=1}^m$, judicious choices of the bandwidth $h_m$ in (4) require knowledge of dependence properties of the sample, and the respective conditions are difficult to verify in practice.
The score divergence associated with the LogS is the Kullback-Leibler divergence, which is highly sensitive to tail behaviour. Therefore, consistency of $\hat f_m^{\mathrm{KD}}$ requires that the tail properties of the kernel $K$ in (4) and the true posterior predictive density $f_0$ be carefully matched, and any results tend to be technical (cf. Hall, 1987). Roussas (1988), Györfi et al. (1989), Yu (1993) and Liebscher (1996) provide results of this type for dependent samples.

Theorem 5. (Consistency of KD estimator-based approximations relative to the LogS). Under assumptions (A), (B) and (H), the KD estimator-based approximation technique is consistent relative to the LogS at the distribution F 0 .
The result is a direct consequence of Hansen (2008, Theorem 7) who further provides optimal convergence rates. However, the respective conditions are stringent and difficult to check in practice. Indeed, Wasserman (2006, p. 57) opines that 'Despite the natural role of Kullback-Leibler distance in parametric statistics, it is usually not an appropriate loss function in smoothing problems'.
Under the conditions of Theorem 5, consistency of $\hat F_m^{\mathrm{KD}}$ relative to the CRPS follows directly; see Section 2.4. KD estimation approximations are generally not consistent relative to the DSS due to the variance inflation induced by typical choices of the bandwidth. However, adaptations based on rescaling or weighting allow for KD estimation under moment constraints; see, for example, Jones (1991) and Hall & Presnell (1999).
As this brief review suggests, the theoretical properties of kernel density estimators depend on the specifics of both the MCMC sample and the estimator. However, under the CRPS and DSS, a natural alternative is readily available: the ECDF approach is simpler and computationally cheaper than KD estimation and is consistent under weak assumptions (Theorem 3).
In our simulation and data examples, we use a simple implementation of KD estimator-based approximations based on the Gaussian kernel and the Silverman (1986) plug-in rule for bandwidth selection. This leads to the specific form

$$\hat F_m^{\mathrm{KD}}(x) = \frac{1}{m} \sum_{i=1}^m \Phi\!\left(\frac{x - X_i}{h_m}\right), \qquad \text{with } h_m = 1.06\, \hat\lambda_m\, m^{-1/5},$$

where $\Phi$ denotes the CDF of the standard normal distribution, and where $\hat\lambda_m$ is the minimum of the standard deviation and the (scaled) interquartile range $\mathrm{IQR}_m/1.34$ of $(X_i)_{i=1}^m$. The pre-processing costs of the procedure are $O(m)$, as shown in Table 2. This choice of $h_m$, which is implemented in the R function bw.nrd (R Core Team, 2019), is motivated by simulation evidence in Hall et al. (1995). Using the Sheather & Jones (1991) rule or cross-validation-based methods yields slightly inferior results in our experience.³
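The bandwidth recipe above is easy to replicate outside R. The sketch below (our own Python port; it mirrors the 1.06 factor and the 1.34 scaling of the interquartile range used by bw.nrd) evaluates the Gaussian-kernel density estimate (4) on a grid and checks that it integrates to approximately one:

```python
import numpy as np
from math import sqrt, pi

def bw_nrd(x):
    # Bandwidth h_m = 1.06 * min(sd, IQR/1.34) * m^(-1/5), as in R's bw.nrd.
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    spread = min(x.std(ddof=1), (q75 - q25) / 1.34)
    return 1.06 * spread * len(x) ** (-0.2)

def kd_density(x, grid, h):
    # Gaussian-kernel density estimate (4) evaluated at the points in grid.
    x = np.asarray(x, dtype=float)[:, None]
    z = (np.asarray(grid, dtype=float)[None, :] - x) / h
    return np.mean(np.exp(-0.5 * z**2), axis=0) / (h * sqrt(2 * pi))

rng = np.random.default_rng(42)
draws = rng.normal(size=2_000)          # stand-in for an MCMC sample
h = bw_nrd(draws)
grid = np.linspace(-5, 5, 1001)
dens = kd_density(draws, grid, h)
mass = float(np.sum(dens) * (grid[1] - grid[0]))
print(round(mass, 3))  # close to 1
```

The estimated CDF needed for CRPS evaluation is the analogous average of Gaussian CDFs with the same bandwidth.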

Gaussian Approximation
A parametric approximation method fits a member of a fixed parametric family, say $\mathcal{F}$, of probability distributions to the MCMC sample $(X_i)_{i=1}^m$. The problem of estimating the unknown distribution $F_0$ is thus reduced to a finite-dimensional parameter estimation problem. The most important case is the quadratic approximation or GA, which takes $\mathcal{F}$ to be the Gaussian family, so that $\hat F_m^{\mathrm{GA}}$ is of the form (5), where $\hat\mu_m$ and $\hat\sigma_m$ are the empirical mean and standard deviation of $(X_i)_{i=1}^m$. If $F_0$ has a density $f_0$ that is unimodal and symmetric, the approximation can be motivated by a Taylor series expansion of the log predictive density at the mode, similar to GAs of posterior distributions in large-sample Bayesian inference (e.g. Kass & Raftery, 1995; Gelman et al., 2014a, Chapter 4).
If $F_0$ is not Gaussian, $\hat F_m^{\mathrm{GA}}$ fails to be consistent relative to the LogS and CRPS. However, the Ergodic Theorem implies that the GA is consistent relative to the DSS under minimal conditions.

Theorem 6. (Consistency of GAs relative to the DSS). Under assumption (A), the GA technique is consistent relative to the DSS at every distribution $F_0$ with strictly positive, finite variance.
We also note that the LogS for the GA $\hat F_m^{\mathrm{GA}}$ corresponds to the DSS for the ECDF-based approximation, in that

$$\mathrm{LogS}(\hat F_m^{\mathrm{GA}}, y) = \frac{1}{2} \log(2\pi) + \frac{1}{2}\, \mathrm{DSS}(\hat F_m^{\mathrm{ECDF}}, y)$$

for $y \in \mathbb{R}$. Therefore, the GA under the LogS yields model rankings that are identical to those for the ECDF technique under the DSS. From an applied perspective, this equivalence suggests that the inconsistency of the GA may not be overly problematic when the approximation is used in concert with the LogS, an assessment that is in line with empirical findings by Warne et al. (2016). However, researchers should be aware of the fact that they are effectively using a proper, but not strictly proper, scoring rule (viz. the DSS) that focuses on the first two moments of the predictive distribution only.
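This correspondence is easy to verify numerically. With the DSS in its standard form, $\mathrm{DSS}(F, y) = 2 \log \sigma_F + ((y - \mu_F)/\sigma_F)^2$, the LogS of the GA differs from half the DSS of the ECDF only by the additive constant $\frac{1}{2}\log(2\pi)$, which leaves model rankings unchanged. A minimal sketch (our own, in Python; function names are illustrative):

```python
import math
import numpy as np

def logs_gaussian_approx(x, y):
    # LogS of the Gaussian approximation fitted to the sample x.
    mu, sigma = float(np.mean(x)), float(np.std(x))
    return 0.5 * math.log(2 * math.pi * sigma**2) + (y - mu)**2 / (2 * sigma**2)

def dss_ecdf(x, y):
    # DSS of the ECDF of x; depends on the sample only through mean and sd.
    mu, sigma = float(np.mean(x)), float(np.std(x))
    return 2 * math.log(sigma) + ((y - mu) / sigma)**2

rng = np.random.default_rng(7)
x = rng.gamma(shape=2.0, size=1_000)  # deliberately non-Gaussian sample
for y in (0.5, 2.0, 5.0):
    lhs = logs_gaussian_approx(x, y)
    rhs = 0.5 * math.log(2 * math.pi) + 0.5 * dss_ecdf(x, y)
    assert abs(lhs - rhs) < 1e-9
print("LogS(GA) = 0.5*log(2*pi) + 0.5*DSS(ECDF) for all y tested")
```

Since an affine transformation with positive slope preserves orderings of mean scores, the two evaluation schemes are interchangeable for comparative purposes.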

Simulation Study
We now investigate the various approximation methods in a simulation study that is designed to emulate MCMC behaviour with dependent samples. Here, the posterior predictive distribution $F_0$ is known by construction, and so we can compare the different approximations to the true forecast distribution. For simplicity, our choice of $F_0$ is fixed and does not correspond to a particular Bayesian model.⁴ In order to judge the quality of an approximation $\hat F_m$ of $F_0$, we consider the score divergence $d_{\mathrm{S}}(\hat F_m^{(k)}, F_0)$ for Monte Carlo replicates $k = 1, \ldots, K$. In order to simplify notation, we typically suppress the superscript that identifies the Monte Carlo iteration. The results in the succeeding text are based on $K = 1\,000$ replicates.

Data Generating Process
We generate sequences $(\theta_i)_{i=1}^m$ and $(X_i)_{i=1}^m$ in such a way that the invariant distribution,

$$F_0(x) = \int_0^\infty \Phi\!\left(\frac{x}{\sigma}\right) \mathrm{d}H_0(\sigma),$$

where $\Phi$ denotes the standard normal CDF, is a compound Gaussian distribution or normal scale mixture. Depending on the measure $H_0$, which assumes the role of the posterior distribution $P_{\mathrm{post}}$ in the general Bayesian model (1), $F_0$ can be modelled flexibly, including many well-known parametric distributions (Gneiting, 1997). As detailed in the succeeding text, our choice of $H_0$ implies that

$$F_0 = \mathrm{T}(\cdot\,|\,a, b, c), \qquad (13)$$

where $\mathrm{T}(\cdot\,|\,a, b, c)$ denotes the CDF of a variable $Z$ with the property that $(Z - a)/\sqrt{b}$ is standard Student's t distributed with $c$ degrees of freedom. To mimic a realistic MCMC scenario with dependent draws, we proceed as proposed by Fox & West (2011). Given parameter values $n > 0$, $s > 0$ and $\alpha \in (-1, 1)$, we construct a stationary Markov chain for the predictive variances whose invariant distribution is inverse Gamma, where IG is the inverse Gamma distribution, parametrised such that $Z \sim \mathrm{IG}(a, b)$ when $1/Z \sim \mathrm{G}(a, b)$, with $\mathrm{G}$ being the Gamma distribution with shape $a > 0$ and rate $b > 0$. Table 3 summarises our choices for the parameter configurations of the data generating process. The parameter $\alpha$ determines the persistence of the chain, in that the unconditional mean of $\phi_i^2$, which can be viewed as an average autoregressive coefficient (Fox & West 2011, Section 2.3), is given by $(n\alpha^2 + 1)/(n + 1)$. We consider three values, aiming to mimic MCMC chains with different persistence properties. The parameter $s$ represents a scale effect, and $n$ governs the tail thickness of the unconditional Student's t distribution in (13). We consider values of 12 and 20 that seem realistic for macroeconomic variables, such as the growth rate of the gross domestic product, that feature prominently in the empirical literature.

Approximation Methods
We consider the following approximation methods, which have been discussed in detail in Section 3. The first approximation uses a sequence $(\theta_i)_{i=1}^m$ of parameter draws, and the other three employ an MCMC sample $(X_i)_{i=1}^m$.

Mixture-of-parameters estimator

The MP estimator (2) here takes the form

$$\hat F_m^{\mathrm{MP}}(x) = \frac{1}{m} \sum_{i=1}^m \Phi\!\left(\frac{x}{\theta_i}\right),$$

where $\theta_i$ is the predictive standard deviation drawn in MCMC iteration $i$. For each approximation method, we evaluate the divergence $d_{\mathrm{S}}(\hat F_m^{(k)}, F_0)$ by numerical integration as implemented in the R function integrate. This is unproblematic if the scoring rule is the CRPS. For the LogS, the integration is numerically challenging, as the logarithm of the densities needs to be evaluated in their tails. We therefore truncate the support of the integral to the minimal and maximal values that yield numerically finite values of the integrand.
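In the CRPS case, the score divergence coincides with the Cramér distance, $d_{\mathrm{CRPS}}(F, G) = \int \left( F(x) - G(x) \right)^2 \mathrm{d}x$, so the numerical integration is well behaved. A self-contained sketch (our own, in Python, with a simple midpoint rule standing in for R's adaptive integrate):

```python
from math import erf, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def crps_divergence(cdf_f, cdf_g, lower=-30.0, upper=30.0, n=60_000):
    # d_CRPS(F, G) = integral of (F(x) - G(x))^2 dx, computed here by
    # the midpoint rule on a fine grid; an adaptive quadrature such as
    # R's integrate would serve the same purpose.
    dx = (upper - lower) / n
    return sum(
        (cdf_f(lower + (k + 0.5) * dx) - cdf_g(lower + (k + 0.5) * dx))**2
        for k in range(n)
    ) * dx

# The divergence of a distribution from itself is zero ...
assert crps_divergence(norm_cdf, norm_cdf) < 1e-12
# ... and a unit location shift of a standard Gaussian gives about 0.271:
d = crps_divergence(norm_cdf, lambda x: norm_cdf(x, mu=1.0))
print(round(d, 4))
```

The LogS divergence, by contrast, integrates $f_0 \log(f_0/\hat f_m)$, whose integrand blows up wherever the approximating density is nearly zero, which is exactly why the tail truncation described above is needed.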

Main Results
In the interest of brevity, we restrict attention to results for a single set of parameters of the data generating process, namely, $(\alpha, s, n) = (0.5, 2, 12)$. This implies an unconditional Student's t distribution with 14 degrees of freedom, and intermediate autocorrelation of the MCMC draws. The results for the other parameter constellations in Table 3 are similar and available in the supporting information.

Figure 1 illustrates the performance of the approximation methods under the LogS and the CRPS, by showing the distribution of the score divergence $d_{\mathrm{S}}(\hat F_m, F_0)$ as the sample size $m$ grows. The MP estimator dominates the other methods by a wide margin: its divergences are very close to zero and show little variation across replicates. Under the LogS, the performance of the KD estimator is highly variable across the replicates, even for large sample sizes. The variability is smaller under the CRPS, where the KD approach using the Silverman (1986) rule of thumb for bandwidth selection performs similarly to the ECDF-based approximation. Other bandwidth selection rules we have experimented with tend to be inferior, as indicated by slower convergence and higher variability across replicates. Finally, we observe the lack of consistency of the GA.

Figure 2 provides insight into the performance of the MP approximation for small MCMC samples. Using as few as 150 draws, the method attains a lower median CRPS divergence than the KD estimator based on 20 000 draws. The superiority of the MP estimator is even more pronounced under the LogS.

Thinning the Markov Chain Monte Carlo Sample
In Appendix D, we present simulation analyses of the effects of thinning an MCMC sample (i.e. keeping only every $k$-th draw, where $k \in \mathbb{N}$ is the thinning factor), which is often performed in practice with the goal of reducing autocorrelation in the MCMC draws. From a practical perspective, the analysis in Appendix D suggests that thinning is justified if, and only if, a small MCMC sample is desired and the MP estimator is applied. Two arguments in favour of a small sample appear particularly relevant even today. First, storing large amounts of data on public servers (as is often performed for replication purposes) may be costly or inconvenient. Second, post-processing procedures such as score computations applied to the MCMC sample may be computationally demanding (cf. Table 2) and therefore may encourage thinning.
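The mechanics of thinning are simple; the sketch below (our own, in Python, with an AR(1) chain standing in for autocorrelated MCMC output) keeps every fifth draw, which cuts the lag-one autocorrelation from roughly $\alpha$ to roughly $\alpha^5$ at the cost of discarding four fifths of the sample:

```python
import numpy as np

def ar1_chain(m, alpha, rng):
    # Stationary AR(1) chain mimicking autocorrelated MCMC draws.
    x = np.empty(m)
    x[0] = rng.normal()
    for t in range(1, m):
        x[t] = alpha * x[t - 1] + np.sqrt(1 - alpha**2) * rng.normal()
    return x

def lag1_autocorr(x):
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

rng = np.random.default_rng(3)
chain = ar1_chain(50_000, alpha=0.9, rng=rng)
thinned = chain[::5]  # keep every 5th draw (thinning factor k = 5)

print(round(lag1_autocorr(chain), 2), round(lag1_autocorr(thinned), 2))
```

The trade-off discussed above is visible here: the thinned chain is far less autocorrelated, but any estimator built from it uses only a fifth of the draws.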

Economic Data Example
In real-world uses of Bayesian forecasting methods, the posterior predictive distribution $F_0$ is typically not available in closed form. Therefore, computing or estimating the object of interest for assessing consistency, that is, the score divergence $d_S(\hat F_m, F_0)$, is not feasible. In the subsequent data example, we thus compare the approximation methods via their out-of-sample predictive performance and examine the variation of the mean scores across chains obtained from replicates with distinct random seeds. While studying the predictive performance does not allow us to assess consistency of the approximation methods, it does allow us to assess the variability and applicability of the approximations in a practical setting.

Data
We consider quarterly growth rates of US real gross domestic product, as illustrated in the supporting information. The training sample used for model estimation is recursively expanded as forecasting moves forward in time. We use the real-time data set provided by the Federal Reserve Bank of Philadelphia, which provides historical snapshots of the data vintages available at any given date in the past, and consider forecasts for the period from the second quarter of 1996 to the third quarter of 2014, for a total of $T = 74$ forecast cases. For brevity, we present results for a prediction horizon of one quarter only. The supporting information contains results for longer horizons, which are qualitatively similar to the ones presented here.

Probabilistic Forecasts
To construct density forecasts, we consider an autoregressive model with a single lag and a state-dependent error term variance, in that

$y_t = \beta_1 + \beta_2 y_{t-1} + \varepsilon_t$,    (18)

where $\varepsilon_t \sim \mathcal{N}(0, \eta^2_{s_t})$ and $s_t \in \{1, 2\}$ is a discrete state variable that switches according to a first-order Markov chain. The model, which is a variant of the Markov switching model proposed by Hamilton (1989), provides a simple description of time-varying heteroscedasticity. The latter is an important stylised feature of macroeconomic time series (see, e.g. Clark & Ravazzolo, 2015).
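A simulation sketch of such a two-state Markov switching AR(1) model may clarify the data generating process; all parameter values below are illustrative choices, not estimates from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameter values (not estimates from the paper).
beta0, beta1 = 0.5, 0.4          # AR(1) intercept and slope
eta = np.array([2.0, 0.5])       # state-dependent error SDs
P = np.array([[0.95, 0.05],      # transition matrix of the state chain
              [0.10, 0.90]])

T = 200
s = np.empty(T, dtype=int)       # latent states (0-based here)
y = np.empty(T)                  # observed growth rates
s[0], y[0] = 0, 0.0
for t in range(1, T):
    s[t] = rng.choice(2, p=P[s[t - 1]])                     # Markov switch
    y[t] = beta0 + beta1 * y[t - 1] + eta[s[t]] * rng.standard_normal()
```

The persistent high-variance state produces the clusters of volatility that make the model attractive for macroeconomic series.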
We conduct Bayesian inference via a Gibbs sampler, for which we give details in Appendix E. Let $\theta^i$ denote the complete set of latent states and model parameters at iteration $i$ of the Gibbs sampler. Conditional on $\theta^i$, the predictive distribution under the model in (18) is Gaussian with mean $\mu_i = \mu(\theta^i)$ and standard deviation $\sigma_i = \sigma(\theta^i)$, where we suppress time and forecast horizon for simplicity. At each forecast origin date $t = 1, \dots, T = 74$, we produce 10 000 burn-in draws and use 40 000 draws post burn-in. We construct 16 parallel chains in this way. The (time-averaged) mean score of a given approximation method, based on $m$ MCMC draws within chain $c = 1, \dots, 16$, is

$\bar S_{m,c} = \frac{1}{T} \sum_{t=1}^{T} S(\hat F_{m,c,t}, y_t)$,

where $\hat F_{m,c,t}$ is the probabilistic forecast at time $t$ and $y_t$ is the realising observation. The variation of $\bar S_{m,c}$ across chains $c$ is due to differences in random seeds. From a pragmatic perspective, a good approximation method should be such that the values $(\bar S_{m,c})_{c=1}^{16}$ are small and display little variation.
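Given the parameter draws, the MP approximation is an equally weighted mixture of the conditional Gaussian predictive distributions. A sketch of the resulting LogS computation (function names are hypothetical; a production implementation is available in the scoringRules package):

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

def logs_mp(y, mu, sigma):
    """LogS of the mixture-of-parameters (MP) approximation: conditional
    on each draw theta^i the predictive is N(mu_i, sigma_i^2), so the
    approximation is an equally weighted normal mixture. logsumexp is
    used for numerical stability in the tails."""
    lp = stats.norm.logpdf(y, loc=np.asarray(mu), scale=np.asarray(sigma))
    return float(-(logsumexp(lp) - np.log(lp.size)))

def mean_score(scores):
    """Time-averaged mean score over the T forecast cases of one chain."""
    return float(np.mean(scores))
```

With a single parameter draw, the mixture collapses to one Gaussian, so `logs_mp` reduces to the negative Gaussian log density.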

Results
In Figure 3, the mean score is plotted against the size of the MCMC sample. The MP approximation outperforms its competitors: its scores display the smallest variation across chains, for both the CRPS and the LogS, and for all sample sizes. The scores of the MP estimator also tend to be lower (i.e. better) than the scores for the other methods. The KD estimator performs poorly for small sample sizes, with the scores varying substantially across chains. Under the CRPS, the KD estimator is dominated by the ECDF technique, which can be interpreted as KD estimation with a bandwidth of zero.
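The ECDF-based approach referenced above admits a simple sample representation of the CRPS, $\mathbb{E}|X - y| - \frac{1}{2}\mathbb{E}|X - X'|$ with $X, X'$ drawn from the empirical distribution, which the following sketch implements naively in $O(m^2)$ (sorting-based $O(m \log m)$ versions are preferable for large $m$; the function name is hypothetical):

```python
import numpy as np

def crps_ecdf(y, x):
    """CRPS of the empirical CDF of MCMC draws x at observation y,
    via the kernel (energy) representation E|X - y| - 0.5 E|X - X'|."""
    x = np.asarray(x, dtype=float)
    term1 = np.abs(x - y).mean()                      # E|X - y|
    term2 = np.abs(x[:, None] - x[None, :]).mean()    # E|X - X'|
    return term1 - 0.5 * term2
```

Interpreting the ECDF as KD estimation with bandwidth zero, this is the limit of the KD-based CRPS as the bandwidth shrinks.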

Discussion
We have investigated how to make and evaluate probabilistic forecasts based on MCMC output. The formal notion of consistency allows us to assess the appropriateness of approximation methods within the framework of proper scoring rules. Despite their popularity in the literature, GAs generally fail to be consistent. Conditions for consistency depend on the scoring rule of interest, and we have demonstrated that the MP and ECDF-based approximations are consistent relative to the CRPS under minimal conditions. Proofs of consistency relative to the LogS generally rely on stringent assumptions.
In view of these theoretical considerations as well as the practical perspective taken in our simulation and data examples, we generally recommend the use of the MP estimator, which provides an efficient approximation method and outperforms all alternatives. This can be explained by the fact that it efficiently exploits the parametric structure of the Bayesian model. The ECDF-based approximation provides a good alternative if the conditional distributions fail to be available in closed form, or if for some reason the draws are to be made directly from the posterior predictive distribution, as opposed to using parameter draws. The ECDF-based approximation is available under the CRPS and DSS but not under the LogS, where a density is required. Under the LogS, the case for the MP estimator is thus particularly strong. In particular, the score's sensitivity to the tails of the distribution renders KD estimators unattractive from both theoretical and applied perspectives.
Our recommendations have been implemented in the scoringRules package for R (R Core Team, 2019); see Jordan et al. (2019) for details. The functions and default choices aim to provide readily applicable and efficient approximations. The MP estimator depends on the specific structure of the Bayesian model and can therefore not be covered in full generality. However, the implemented analytical solutions of the CRPS and LogS allow for straightforward and efficient computation. The scoringRules package further contains functions and data for replicating the simulation and case study, with details provided at https://github.com/FK83/scoringRules/blob/master/KLTG2020_replication.pdf.

Ferro (2014) studies the notion of a fair scoring rule in the context of ensemble weather forecasts. A scoring rule is called fair if the expected score is optimal for samples with members that behave as though they and the verifying observation were sampled from the same distribution. While certainly relevant in the context of meteorological forecast ensembles, where the sample size m is typically between 10 and 50, these considerations seem less helpful in the context of MCMC output, where m is on the order of thousands and can be increased at low cost. Furthermore, the proposed small sample adjustments and the characterisation of fair scores hold for independent samples only, an assumption that is thoroughly violated in the case of MCMC.
We are interested in evaluating probabilistic forecasts produced via MCMC, so that the predictive performance of a model during an out-of-sample, test or evaluation period can be used to estimate its forecast performance on future occasions. In contrast, information criteria suggest a different route towards estimating forecast performance (Spiegelhalter et al., 2002; Watanabe, 2010; Hooten & Hobbs, 2015). They consider a method's in-sample performance and account for model complexity via penalty terms. Preferred ways of doing so have been the subject of methodological debate, and a consensus has not been reached; see, for example, the comments in Gelman et al. (2014b) and Spiegelhalter et al. (2014). The present analysis does not concern in-sample comparisons and does not address the question of whether these are more or less effective than out-of-sample comparisons. However, our results and observations indicate that out-of-sample comparisons of the type considered here yield robust results across a range of implementation choices.
Necessarily, the scope of this paper is restricted along several dimensions. First, our theoretical results focus on consistency but do not cover rates of convergence. Results on the latter tend to rely on theoretical conditions that are hard to verify empirically, and the plausibility of which is likely to depend on the specifics of the MCMC algorithm. In contrast, many of our consistency results require only minimal conditions that hold across a wide range of sampling algorithms in the interdisciplinary applied literature. Second, we have focused on univariate continuous forecast distributions. The corresponding applied literature is large and features a rich variety of implementation variants (cf. Table 1 of the supporting information). Nevertheless, there are other empirically relevant set-ups, notably simple functionals of a predictive distribution, discrete univariate distributions and continuous multivariate distributions. We briefly discuss each set-up in turn.
Functionals such as quantiles summarise a predictive distribution, thus allowing for simpler interpretation and communication (Raftery, 2016). If the forecast user requires only a specific quantile of the predictive distribution, it seems natural to focus on this quantile for evaluation. Interestingly, the CRPS can be represented as the integral over (twice) the asymmetric piecewise linear scoring function, which is commonly used to evaluate quantile forecasts [Gneiting & Ranjan, 2011, Equations (11) to (13)]. Consequently, the CRPS divergence is the integral over the quantile score divergence. In this sense, results for quantiles are covered by our results in terms of the CRPS. The same argument applies if the functional sought is the exceedance probability at any given threshold value, as an immediate consequence of the standard representation of the CRPS [Equation (7)]. In order to illustrate the argument numerically, Section S3 of the supporting information applies our simulation design to quantiles at two different levels, yielding results that are qualitatively very similar to our CRPS results for full predictive distributions.
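The integral representation can be checked numerically: for the empirical CDF of $m$ draws, averaging the quantile score at the levels $\alpha_i = (i - 1/2)/m$ reproduces the CRPS exactly. A sketch (function names are hypothetical):

```python
import numpy as np

def crps_energy(y, x):
    """CRPS of the empirical CDF via E|X - y| - 0.5 E|X - X'|."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - y).mean() - 0.5 * np.abs(x[:, None] - x[None, :]).mean()

def crps_via_quantile_scores(y, x):
    """CRPS of the empirical CDF as an average of quantile scores: for
    sorted draws x_(1) <= ... <= x_(m), the ECDF quantile at level
    alpha_i = (i - 1/2)/m is x_(i), and averaging the quantile score
    QS_alpha(q, y) = 2 * (1{y < q} - alpha) * (q - y) over these
    levels recovers the CRPS."""
    q = np.sort(np.asarray(x, dtype=float))
    m = q.size
    alpha = (np.arange(1, m + 1) - 0.5) / m
    qs = 2.0 * ((y < q).astype(float) - alpha) * (q - y)
    return float(qs.mean())
```

In this sense, evaluating the full predictive distribution with the CRPS aggregates the evaluation of all its quantiles.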
In relevant discrete settings, such as predicting probabilities of a binary or categorical outcome, the estimation problem becomes considerably simpler than for the real-valued case. The more complex case of integer-valued count data can be handled using methods similar to the ones we discuss. Instead of probability density functions, the count data case involves probability mass functions to which both the LogS and the CRPS transfer naturally (Czado et al., 2009). Furthermore, all of the approximation methods we discuss can be used in the count data case. For example, the MP estimator can be used in concert with a Poisson or negative binomial specification. Similarly, Shirota & Gelfand (2017, Section 4) consider Equation (10) in a count data context, and kernel-type smoothing methods have been proposed for count data as well (Rajagopalan & Lall, 1995).
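For instance, the MP estimator for a Poisson specification replaces the mixture of normal densities by a mixture of Poisson probability mass functions; a minimal sketch (function name hypothetical):

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

def logs_mp_poisson(y, lam):
    """LogS of a mixture-of-parameters approximation for count data:
    the predictive pmf is the equally weighted mixture of Poisson(lam_i)
    pmfs, one per MCMC draw of the rate parameter."""
    lp = stats.poisson.logpmf(y, np.asarray(lam, dtype=float))
    return float(-(logsumexp(lp) - np.log(lp.size)))
```

The ECDF and kernel-type approaches carry over analogously, with the empirical pmf taking the place of the empirical CDF.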
The multivariate case features novel challenges. Perhaps most fundamentally, a consensus on practically appropriate choices of the scoring rule is yet to be reached (Gneiting et al., 2008; Scheuerer & Hamill, 2015). Held et al. (2017, Section 4.2) and White et al. (2019, Section 3.3) propose the use of the ECDF approximation in concert with the multivariate energy score. In this setting, analogues of our Theorem 3 hold, assuring consistency under weak conditions. For KD estimators, the 'curse of dimensionality' applies, and for the MP estimator, we expect numerical challenges when evaluating, say, a log predictive density in a high-dimensional space. Clearly, there is considerable scope and opportunity for future research in these directions.

A potentially much faster, but not exact, alternative is to evaluate the integral in (7) numerically. Here, we provide some evidence on the viability of this strategy, which we implement via the R function integrate, with the arguments rel.tol and abs.tol of integrate set to $10^{-6}$. As a first experiment, we use numerical integration to recompute the CRPS scores of the mixture-of-parameters estimator in our data example for the first quarter of 2011. Figure A1 summarises the results for 16 parallel chains. The left panel shows that the approximate scores are visually identical to the exact ones across all sample sizes and chains. Indeed, the maximal absolute error incurred by numerical integration is $8.0 \times 10^{-8}$. The approximation errors are dwarfed by the natural variation of the scores across MCMC chains. The right panel compares the computation time for exact evaluation versus numerical integration. The latter is much faster, especially for large samples. For a sample of size 40 000, numerical integration requires less than 1.5 s, whereas exact evaluation requires about 2 min on an Intel i7 processor.
To obtain broad-based evidence, we next compare exact evaluation versus numerical integration for all 74 forecast dates, from the second quarter of 1996 to the third quarter of 2014, employing 16 parallel chains for each date. We focus on the two largest MCMC sample sizes, 20 000 and 40 000, and find that across all 2 368 instances (74 dates times 2 sample sizes times 16 chains), the absolute difference of the two CRPS values never exceeds $6.3 \times 10^{-7}$. Therefore, we feel that numerical integration allows for the efficient evaluation of the CRPS for mixtures of normal distributions. The differences to the exact values are practically irrelevant and well in line with the error bounds of R's integrate function.
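The comparison of exact evaluation and numerical integration can be sketched as follows, using the closed-form CRPS for normal mixtures (Grimit et al., 2006) against direct quadrature of the integral in (7). We use scipy rather than R's integrate, so the numbers are merely indicative; function names are hypothetical:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def crps_mixnorm_exact(y, mu, sigma):
    """Closed-form CRPS for an equally weighted normal mixture, using
    A(m, s^2) = m*(2*Phi(m/s) - 1) + 2*s*phi(m/s) (Grimit et al., 2006)."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)

    def A(m, s2):
        s = np.sqrt(s2)
        return m * (2.0 * stats.norm.cdf(m / s) - 1.0) \
            + 2.0 * s * stats.norm.pdf(m / s)

    term1 = np.mean(A(y - mu, sigma ** 2))
    term2 = np.mean(A(mu[:, None] - mu[None, :],
                      sigma[:, None] ** 2 + sigma[None, :] ** 2))
    return float(term1 - 0.5 * term2)

def crps_mixnorm_numeric(y, mu, sigma):
    """CRPS via numerical integration of (F(z) - 1{z >= y})^2,
    splitting the range at the discontinuity z = y."""
    F = lambda z: float(np.mean(stats.norm.cdf(z, loc=mu, scale=sigma)))
    left, _ = quad(lambda z: F(z) ** 2, -np.inf, y)
    right, _ = quad(lambda z: (F(z) - 1.0) ** 2, y, np.inf)
    return left + right

mu = np.array([-1.0, 2.0])       # illustrative mixture parameters
sigma = np.array([1.0, 0.5])
exact = crps_mixnorm_exact(0.3, mu, sigma)
approx = crps_mixnorm_numeric(0.3, mu, sigma)
```

For a single component, the formula collapses to the familiar closed-form CRPS of a Gaussian, which provides a convenient sanity check.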

B1 Proof of Theorem 1
In the case of the CRPS, we prove a stronger, almost sure result. The Ergodic Theorem implies that the first term on the right-hand side of (B1) tends to zero and that the second term converges almost surely as $m \to \infty$. In view of (B1), we conclude that the divergence is bounded by the right-hand side of (B2) almost surely as $m \to \infty$. As the right-hand side of (B2) decreases to zero as $N$ grows without bound, the proof of the claim is complete.
In the case of the DSS, let $\hat H_m(z) = 1 - \hat F_m(z) + \hat F_m(-z)$ denote the two-sided tail function of the approximation. For the second moments, we find similarly that $\int_{\mathbb{R}} z^2 \, d\hat F_m(z) = 2 \int_0^\infty z \, \hat H_m(z) \, dz$. Proceeding as before, the Ergodic Theorem implies almost sure convergence of the first and second moments, and thereby consistency relative to the DSS.
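The argument uses that the DSS depends on a forecast distribution only through its first two moments; for completeness, under the usual definition,

```latex
\mathrm{DSS}(F, y) \;=\; 2 \log \sigma_F \;+\; \left( \frac{y - \mu_F}{\sigma_F} \right)^{2},
```

so almost sure convergence of the mean and variance of the approximation to those of $F_0$ implies $d_{\mathrm{DSS}}(\hat F_m, F_0) \to 0$ almost surely.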

C1 Proof of Theorem 3
In the case of the CRPS, we proceed in analogy to the proof of Theorem 1 and demonstrate a stronger, almost sure result. The Generalised Glivenko-Cantelli Theorem (Dehling et al., 2002, Theorem 1.1) implies that the first term on the right-hand side of (C1) tends to zero almost surely as $m \to \infty$. If $Z_0$ has distribution $F_0$, then the second term converges almost surely as $m \to \infty$. As the right-hand side of (C2) gets arbitrarily close to zero as $N$ grows without bound, the proof of the claim is complete. In the case of the DSS, it suffices to note that the moments of the empirical CDF are the sample moments of $(X_i)_{i=1}^m$ and then to apply the Ergodic Theorem.

C2 Proof of Theorem 4

In this light, the first part of the theorem's statement implies the second part.
Note that the samples in S1 and S3 have the same dynamic properties, whereas S2 will typically produce a chain with less autocorrelation. Furthermore, S2 and S3 require the same computing time, which exceeds that of S1 by a factor of 10. Figure D1 summarises the corresponding simulation results, using parameter values $s = 2$ and $n = 12$, and varying values of the persistence parameter $\alpha$. We report results for four popular combinations of scoring rules and approximation methods.
As expected, S2 tends to outperform S1: when the sample size is held fixed, less autocorrelation entails more precise estimators. While the difference in performance is modest in most cases, S2 attains large (relative) gains over S1 when the mixture-of-parameters estimator is applied to a very persistent sample with $\alpha = 0.9$. This can be explained by the direct effect of the persistence parameter $\alpha$ on the parameter draws $(\theta^i)_{i=1}^m$, whereas the influence is less immediate for the KDE and ECDF approximation methods, which are based on the sequence $(X_i)_{i=1}^m$ obtained in an additional sampling step. Furthermore, S3 outperforms S2 in all cases covered. While the effects of thinning have not been studied in the context of predictive distributions before, this observation is in line with extant reports of the greater precision of unthinned chains (Geyer, 1992; MacEachern & Berliner, 1994; Link & Eaton, 2012). The performance gap between S3 and S2 is modest for the mixture-of-parameters estimator (top row of Figure D1), but very pronounced for the other estimators.
1. Draw $\beta \mid h, s_t$ from a Gaussian posterior. The mean and variance derive from a generalised least squares problem, with observation $t$ receiving weight $\eta^{-2}_{s_t}$.
2. Draw $h \mid \beta, s_t$ from a Gamma posterior. The Gamma distribution parameters for $\eta^2_s$, $s \in \{1, 2\}$, are calculated from the observations $t$ for which $s_t = s$. If necessary, permute the draws such that $\eta^2_1 > \eta^2_2$.
3. Draw $s_t \mid \beta, h, P$ using the algorithm described by Greenberg (2013, pp. 194-195).
4. Draw $P \mid s_t$ from a Dirichlet posterior.
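The first step above can be sketched as follows; all names, priors and toy parameter values are hypothetical illustrations, and Appendix E should be consulted for the exact conditionals:

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_beta(y, X, eta2, b0, B0_inv):
    """One Gibbs step for beta: the Gaussian posterior of a Bayesian
    generalised least squares problem in which observation t carries
    weight 1/eta2[t], the inverse of its state-specific error variance.
    Prior: beta ~ N(b0, B0), passed via its precision matrix B0_inv."""
    w = 1.0 / eta2                          # per-observation precisions
    Xw = X * w[:, None]
    B_inv = B0_inv + X.T @ Xw               # posterior precision matrix
    B = np.linalg.inv(B_inv)
    b = B @ (B0_inv @ b0 + Xw.T @ y)        # posterior mean
    return rng.multivariate_normal(b, B)

# Toy data: intercept and one lag regressor, with each observation
# assigned one of two state-specific error variances.
T = 50
X = np.column_stack([np.ones(T), rng.standard_normal(T)])
eta2 = np.where(rng.random(T) < 0.5, 4.0, 0.25)
y = X @ np.array([0.5, 0.4]) + np.sqrt(eta2) * rng.standard_normal(T)
beta_draw = draw_beta(y, X, eta2, np.zeros(2), np.eye(2) * 0.01)
```

Down-weighting the high-variance observations is exactly what distinguishes this step from an ordinary least squares update.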