Decompounding discrete distributions: A non-parametric Bayesian approach

Suppose that a compound Poisson process is observed discretely in time and assume that its jump distribution is supported on the set of natural numbers. In this paper we propose a non-parametric Bayesian approach to estimate the intensity of the underlying Poisson process and the distribution of the jumps. We provide an MCMC scheme for obtaining samples from the posterior. We apply our method to both simulated and real data examples, and compare its performance with that of the frequentist plug-in estimator proposed by Buchmann and Gr\"ubel. On the theoretical side, we study the posterior from the frequentist point of view and prove that as the sample size $n\rightarrow\infty$, it contracts around the `true', data-generating parameters at rate $1/\sqrt{n}$, up to a $\log n$ factor.

1. Introduction

1.1. Problem formulation. Let $N = (N_t : t \geq 0)$ be a Poisson process with a constant intensity $\lambda > 0$, and let $\{Y_i\}$ be a sequence of independent random variables, each with distribution $P$, that are also independent of $N$. By definition, a compound Poisson process (abbreviated CPP) $X = (X_t : t \geq 0)$ is
$$X_t = \sum_{i=1}^{N_t} Y_i, \qquad t \geq 0, \qquad (1)$$
where here and below the sum over an empty index set is understood to be equal to zero. In particular, $X_0 = 0$. The CPP constitutes a classical model in, e.g., risk theory; see Embrechts et al. (1997). Assume that the process $X$ is observed at discrete times $0 < t_1 < t_2 < \ldots < t_n = T$, where the grid $\{t_i\}$ is not necessarily uniform on $[0, T]$. Based on the observations $X_{t_1}, X_{t_2}, \ldots, X_{t_n}$, our goal is to estimate the jump size distribution $P$ and the intensity $\lambda$. We specifically restrict our attention to the case where $P$ is a discrete distribution, $P(\mathbb{N}) = 1$, and we will write $p = (p_k)_{k \in \mathbb{N}}$ for the probability mass function corresponding to $P$, where $p_k = P(\{k\})$. A similar notation will be used for any other discrete law. The distribution $P$ is called the base distribution. Abusing terminology, we will at times identify it with the corresponding probability mass function $p$. The assumption that $P$ has no atom at zero is made for identifiability: otherwise this atom gets confounded with $e^{-\lambda}$, which does not allow consistent estimation of the intensity $\lambda$. For a discussion of applications of this CPP model in risk theory, see Zhang et al. (2014).
Define the increments $Z_i = X_{t_i} - X_{t_{i-1}}$, $i = 1, \ldots, n$, where $t_0 = 0$. Then $\mathcal{Z}_n = (Z_i : i = 1, \ldots, n)$ is a sequence of independent random variables. When the $\{t_i\}$ are uniformly spaced on $[0, T]$, the random variables $Z_i$ in fact have a common distribution $Q$ satisfying $Q(\mathbb{N}_0) = 1$. As $\mathcal{Z}_n$ carries as much information as $(X_{t_i} : i = 1, \ldots, n)$ does, we can base our estimation procedure directly on the increments $\mathcal{Z}_n$. Since summing up the jumps $Y_j$ amounts to compounding their distributions, the inverse problem of recovering $P$ and $\lambda$ from the $Z_i$ is referred to as decompounding; see Buchmann and Grübel (2003).
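For concreteness, the data-generating mechanism above can be summarised in a few lines of Julia (the language of our implementation; see Section 3). The following is a minimal sketch under the model just described; `simulate_increments` is a hypothetical helper name, not an excerpt from our released code.

```julia
using Distributions, Random

# Simulate increments Z_i = X_{t_i} - X_{t_{i-1}} of a compound Poisson process
# with intensity λ and discrete jump distribution p on {1, ..., length(p)}.
function simulate_increments(rng, λ, p::Vector{Float64}, Δ::Vector{Float64})
    jump = Categorical(p)                     # jump size distribution
    map(Δ) do Δi
        n = rand(rng, Poisson(λ * Δi))        # number of jumps in the interval
        n == 0 ? 0 : sum(rand(rng, jump, n))  # sum over an empty index set is zero
    end
end

rng = MersenneTwister(1)
p = zeros(6); p[[1, 4, 6]] .= 1/3             # uniform base distribution on {1, 4, 6}
Z = simulate_increments(rng, 2.0, p, ones(500))
```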
There are two natural ways to parametrise the CPP model: either in terms of the pair $(\lambda, p)$, or in terms of the Lévy measure $\nu = (\nu_k)_{k \in \mathbb{N}}$ of the process $X$; see Sato (2013). The relationship between the two is $\lambda = \sum_{k=1}^{\infty} \nu_k$ and $p = \nu / \lambda$. Inferential conclusions in one parametrisation are easily translated into the other. However, for our specific statistical approach the Lévy measure parametrisation turns out to be more advantageous from the computational point of view.
1.2. Approach and results. In this paper, we take a non-parametric Bayesian approach to estimation of the Lévy measure $\nu$ of $X$. See Ghosal and van der Vaart (2017) and Müller et al. (2015) for modern expositions of Bayesian non-parametrics. A case for non-parametric Bayesian methods has already been made elsewhere in the literature, and will not be repeated here. On the practical side, we implement our procedure via the Gibbs sampler and data augmentation, and show that it performs well under various simulation setups. On the theoretical side, we establish its consistency and derive the corresponding posterior contraction rate, which can be thought of as an analogue of a convergence rate of a frequentist estimator (see Ghosal and van der Vaart (2017)). The posterior contraction rate, up to a practically insignificant $\log n$ factor, turns out to be $1/\sqrt{n}$, which is the optimal rate for non-parametric estimation of cumulative distribution functions. Our contribution thus bridges practical and theoretical aspects of Bayesian non-parametrics.
1.3. Related literature. To better motivate our model and approach, in this subsection we briefly survey the existing literature. A Bayesian approach to non-parametric inference for Lévy processes is a very recent and emerging topic, with references at the moment limited to Belomestny et al. (2018), Gugushvili et al. (2015), Gugushvili et al. (2018) and Nickl and Söhl (2017). These deal exclusively with the case when the Lévy measure is absolutely continuous with respect to the Lebesgue measure. At least from the computational point of view, these works are of no help in our present context. Related frequentist papers for CPP models with discrete base distributions are Buchmann and Grübel (2003) and Buchmann and Grübel (2004), which, after earlier contributions dating from the previous century, in fact revived interest in non-parametric techniques for Lévy processes. To estimate the base distribution $p$, Buchmann and Grübel (2003) employ a frequentist plug-in approach relying on the Panjer recursion (i.e., an empirical estimate of the distribution $q$ is plugged into the Panjer recursion equations to yield an estimate of $p$; see below on the Panjer recursion). The drawback is that the resulting parameter estimates are not guaranteed to be non-negative. Buchmann and Grübel (2004) fix this problem by truncation and renormalisation. This works, but looks artificial. As noted in Buchmann and Grübel (2004), in practice the latter approach breaks down if no zero values are observed among the $Z_i$. Buchmann and Grübel (2004) establish weak convergence of their modified estimator, but on the downside its asymptotic distribution is too unwieldy to yield confidence statements on $p$. Most importantly, the plug-in approaches in Buchmann and Grübel (2003) and Buchmann and Grübel (2004) do not admit obvious generalisations to non-uniformly spaced observation times $\{t_i\}$. In Lindo et al. (2018), another frequentist estimator of the jump measure is introduced, which is obtained via the steepest descent technique as a solution to an optimisation problem over the cone of positive measures. The emphasis in Lindo et al. (2018) is on numerical aspects; again, no obvious generalisation to the case of non-uniform $\{t_i\}$ is available.
1.4. Outline. The rest of the paper is organised as follows: in Section 2 we introduce our approach and describe an algorithm for drawing from the posterior distribution. In Sections 3 and 4 we study its performance on synthetic and real examples. Section 5 is devoted to the examination of asymptotic frequentist properties of our procedure. An outlook on our results is given in Section 6. Finally, in Appendix A technical lemmas used in the proofs of Section 5 are collected.
1.5. Notation. For two sequences $\{a_n\}$ and $\{b_n\}$ of positive real numbers, the notation $a_n \lesssim b_n$ (or $b_n \gtrsim a_n$) means that there exists a constant $C > 0$, independent of $n$, such that $a_n \leq C b_n$. We write $a_n \asymp b_n$ if both $a_n \lesssim b_n$ and $a_n \gtrsim b_n$ hold. We denote a prior (possibly depending on the sample size $n$) by $\Pi_n$. The corresponding posterior measure is denoted by $\Pi_n(\cdot \mid \mathcal{Z}_n)$. The Gamma distribution with shape parameter $a$ and rate parameter $b$ ($a, b > 0$) is denoted by $\mathrm{Gamma}(a, b)$; its density is $x \mapsto \frac{b^a}{\Gamma(a)} x^{a-1} e^{-bx}$, $x > 0$, where $\Gamma$ is the Gamma function. The inverse Gamma distribution with shape parameter $a$ and scale parameter $b$ is denoted by $\mathrm{IG}(a, b)$; its density is $x \mapsto \frac{b^a}{\Gamma(a)} x^{-a-1} e^{-b/x}$, $x > 0$. We use the notation $\mathrm{Exp}(a)$ for the exponential distribution with mean $1/a$.

2. Algorithm for drawing from the posterior
A Bayesian statistical approach relies on combining the likelihood and the prior on the parameter of interest through Bayes' formula. We start by specifying the prior. As far as the likelihood is concerned, although it is explicit, it is computationally intractable for non-parametric inference in CPP models. We will circumvent the latter problem by means of data augmentation, as detailed below.
2.1. Prior. We define a prior $\Pi$ on $\nu$ through the hierarchical specification
$$\nu_k \mid \beta_k \sim \mathrm{Exp}(1/\beta_k), \qquad \beta_k \sim \mathrm{IG}(c, a), \qquad k = 1, \ldots, m, \quad \text{independently},$$
and $\nu_k = 0$ for $k > m$. Note that the (fixed) hyperparameters $m \in \mathbb{N}$, $a, c > 0$ are denoted by Latin letters. The hyperparameter $m$ incorporates our a priori opinion on the support of the Lévy measure $\nu$, or equivalently, of the base measure $p$. In applications, the support of $p$ may be unknown, which necessitates the use of a large $m$, e.g. $m = \max_{i=1,\ldots,n} Z_i$; the latter is the maximal value suggested by the data $\mathcal{Z}_n$ at hand. Nevertheless, we may simultaneously expect that the 'true', data-generating $\nu$ charges full mass only to a proper, perhaps even small subset of the set $\{1, \ldots, m\}$. In other words, $\nu$ may form a sparse sequence, with many components equal to zero. In fact, there are at least two plausible explanations for the occurrence of a large increment $Z_i$ in the data: either a few large jumps $Y_j$ occurred, which points towards a large right endpoint of the support of $\nu_0$, or $Z_i$ is predominantly formed of many small jumps, which in turn indicates that the intensity $\lambda$ of the Poisson arrival process $N$ may be large. To achieve accurate estimation results, a prior should take possible sparsity of $\nu$ into account. This is precisely the reason for our hierarchical definition of the prior $\Pi$: a small $\beta_k$ encourages a priori shrinkage of the component $\nu_k$ of $\nu$ towards zero.
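The following Julia fragment illustrates a single draw from this hierarchical prior. Since the display above is a reconstruction of the specification, the sketch should be read under that assumption; `draw_prior` is a hypothetical helper, not part of our published code.

```julia
using Distributions, Random

# One draw from the hierarchical prior: β_k ~ IG(c, a) (shape c, scale a) and,
# given β_k, ν_k ~ exponential with mean β_k. Note that Distributions.jl
# parametrises Exponential by its mean, whereas Exp(a) in the paper has mean 1/a.
function draw_prior(rng, m; a = 0.01, c = 2.0)
    β = rand(rng, InverseGamma(c, a), m)
    ν = [rand(rng, Exponential(βk)) for βk in β]
    (ν = ν, β = β)
end

ν, β = draw_prior(MersenneTwister(1), 15)
```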
2.2. Data augmentation. Assume temporarily that $t_i = i$, $i = 1, \ldots, n$, and write $q = (q_k)_{k \in \mathbb{N}_0}$ for $q_k = Q(\{k\})$. Then the $Z_i$ have the common distribution
$$q_k = e^{-\lambda} \sum_{n=0}^{\infty} \frac{\lambda^n}{n!}\, p^{*n}_k, \qquad k \in \mathbb{N}_0, \qquad (2)$$
with $*$ denoting convolution. The compounding mapping $(\lambda, p) \mapsto q$ can be expressed explicitly via the Panjer recursion (see Panjer (1981)):
$$q_0 = e^{-\lambda}, \qquad q_k = \frac{\lambda}{k} \sum_{j=1}^{k} j\, p_j\, q_{k-j}, \quad k \geq 1.$$
This recursion can be inverted to give the inverse mapping $q \mapsto (\lambda, p)$ via
$$\lambda = -\log q_0, \qquad p_k = \frac{1}{\lambda q_0} \left( q_k - \frac{\lambda}{k} \sum_{j=1}^{k-1} j\, p_j\, q_{k-j} \right), \quad k \geq 1.$$
In view of (2), the likelihood in the CPP model is explicit. Nevertheless, an attempt to directly use (2) or the Panjer recursion in posterior computations results in a numerically intractable procedure. Equally important is the fact that a Panjer recursion based approach would not apply to non-uniformly spaced observation times $\{t_i\}$. Therefore, instead of (2) and the Panjer recursion, we will employ data augmentation; see Tanner and Wong (1987). We switch back to the case when the $\{t_i\}$ are not necessarily uniformly spaced. The details of our procedure are as follows: when the process $X$ is observed continuously over the time interval $[0, T]$, so that our observations are a full sample path $X^{(T)} = (X_t : t \in [0, T])$ of the CPP, the likelihood is tractable and is proportional to
$$\prod_{k=1}^{m} \nu_k^{\mu_k}\, e^{-T \nu_k};$$
see Shreve (2004), p. 498. Here
$$\mu_k = \#\{ t \in (0, T] : X_t - X_{t-} = k \}, \qquad (3)$$
i.e. the total number of jumps of size $k$. Then the prior $\Pi$ from Subsection 2.1 leads to conjugate posterior computations. In fact, the full conditionals are
$$\nu_k \mid \beta_k, X^{(T)} \sim \mathrm{Gamma}\left( \mu_k + 1,\ T + \frac{1}{\beta_k} \right), \qquad \beta_k \mid \nu_k \sim \mathrm{IG}(c + 1,\ a + \nu_k), \qquad k = 1, \ldots, m.$$
Therefore, the Gibbs sampler for posterior inference on $\nu$ can be implemented. The Gibbs sampler cycles through the above conditionals a large number of times, generating approximate (dependent) samples from the posterior. See, e.g., Gelfand and Smith (1990) and Section 24.5 in Wasserman (2004) on the Gibbs sampler and its use in Bayesian statistics.
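Although we do not use the Panjer recursion in our posterior computations, a transcription into Julia may help the reader. The sketch below is a direct implementation of the two displayed recursions; `panjer` and `inverse_panjer` are hypothetical helper names, not an excerpt from our released code.

```julia
# Panjer recursion: map (λ, p) to (q_0, ..., q_K).
function panjer(λ, p::Vector{Float64}, K::Int)
    q = zeros(K + 1)
    q[1] = exp(-λ)                        # q_0 = e^{-λ} (Julia arrays are 1-based)
    for k in 1:K
        s = 0.0
        for j in 1:min(k, length(p))
            s += j * p[j] * q[k - j + 1]  # Σ_j j p_j q_{k-j}
        end
        q[k + 1] = λ * s / k
    end
    q
end

# Inverse mapping q ↦ (λ, p), inverting the recursion above.
function inverse_panjer(q::Vector{Float64})
    λ = -log(q[1])
    K = length(q) - 1
    p = zeros(K)
    for k in 1:K
        s = 0.0
        for j in 1:(k - 1)
            s += j * p[j] * q[k - j + 1]
        end
        p[k] = (q[k + 1] - λ * s / k) / (λ * q[1])
    end
    (λ = λ, p = p)
end
```

As a quick check, for the base distribution of Section 3.1 the call `inverse_panjer(panjer(2.0, p, 30))` recovers $\lambda = 2$ and the base probabilities exactly up to the truncation level.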
As we do not observe the process $X$ continuously, we will combine the above with the data augmentation device. First note that we have
$$Z_i = \sum_{j=1}^{m} j\, \mu_{ij}, \qquad i = 1, \ldots, n,$$
where $(\mu_{ij} : i = 1, \ldots, n, \ j = 1, \ldots, m)$ are independent, and $\mu_{ij} \sim \mathrm{Poisson}(\Delta_i \nu_j)$ for $\nu_j = \lambda p_j$ and $\Delta_i = t_i - t_{i-1}$; see Corollary 11.3.4 in Shreve (2004). Furthermore, for $\mu_k$ as in (3) we trivially have $\mu_k = \sum_{i=1}^{n} \mu_{ik}$. Data augmentation iterates the following two steps:
(i) Draw $(\mu_{ij})$ conditional on the data $\mathcal{Z}_n$ and the parameter $\nu$.
(ii) Draw $\nu$ (together with $\beta$) conditional on the imputed data $(\mu_{ij})$, using the full conditionals displayed above.
Once the algorithm has been run long enough, this gives approximate (dependent) samples from the posterior of $\nu$. We already know how to deal with step (ii); it remains to handle step (i). Thus, keeping $\nu$ fixed, for each $i$ we want to compute the conditional distribution of $(\mu_{ij} : j = 1, \ldots, m)$ given $Z_i$, and furthermore, we want to be able to simulate from this distribution. In turn, this will immediately allow us to simulate $\mu_k$ conditional on the data $\mathcal{Z}_n$. Now, with $\Pr(\cdot)$ referring to probability under the parameter $\nu$, it holds that
$$\Pr\left( \mu_{i1} = k_1, \ldots, \mu_{im} = k_m \mid Z_i = z_i \right) = \frac{1}{\Pr(Z_i = z_i)} \prod_{j=1}^{m} \frac{(\Delta_i \nu_j)^{k_j}}{k_j!}\, e^{-\Delta_i \nu_j}, \qquad \text{if } \sum_{j=1}^{m} j k_j = z_i,$$
and zero otherwise. Knowledge of the normalising constant $\Pr(Z_i = z_i)$ will not be needed in our approach.
In general, simulation from a discrete multivariate distribution is non-trivial; some general options are discussed in Devroye (1986), Chapter XI, Section 1.5, but they are unlikely to work easily for a large $m$. We will take an alternative route and use the Metropolis-Hastings algorithm; see, e.g., Section 24.4 in Wasserman (2004). We start by observing that for a fixed $i$, the support of $\Pr(\cdot \mid Z_i = z_i)$ is precisely the set $S_i$ of non-negative integer solutions $(k_1, \ldots, k_m)$ of the Diophantine equation
$$\sum_{j=1}^{m} j k_j = z_i.$$
The software accompanying Arnqvist et al. (2018) implements an algorithm from Voinov and Nikulin (1997) that finds all such solutions for given integers $m$ and $z_i$. By Markovianity of the process $X$, we can simulate the vectors $(\mu_{i1}, \ldots, \mu_{im})$ independently for each $i = 1, \ldots, n$. If $z_i = 0$ or $1$, there is only one solution to the Diophantine equation: the trivial solution $(0, \ldots, 0)$ in the first case, and the solution $(1, 0, \ldots, 0)$ in the second; for such $z_i$, no simulation is required, as $(\mu_{i1}, \ldots, \mu_{im})$ is known explicitly. We thus only need to consider each $i \in I = \{i : z_i \geq 2\}$ in turn, and design a Metropolis-Hastings move on the set of the corresponding solutions $S_i$. Fix once and for all an ordering of the elements of $S_i$ (this could be, e.g., lexicographic, or reverse lexicographic); we use the notation $|S_i|$ for the cardinality of $S_i$. Let $\mu = (\mu_{i1}, \ldots, \mu_{im})$ be the current state of the chain, corresponding to the $\ell$th element $s_\ell$ of $S_i$. A proposal $\mu^{\circ} = (\mu^{\circ}_{i1}, \ldots, \mu^{\circ}_{im})$ is obtained as follows:
(iii) Draw $\mu^{\circ}$ uniformly at random from the two neighbours $s_{\ell-1}$ and $s_{\ell+1}$ of $s_\ell$ in the chosen ordering (with the indices understood cyclically, so that this proposal is symmetric).
Occasionally, one may want to propose another type of move too.
(iv) Draw $\mu^{\circ} = (\mu^{\circ}_{i1}, \ldots, \mu^{\circ}_{im})$ uniformly at random from $S_i$.
The two proposals lead to reversible moves, and one may alternate them with probabilities $\pi$ and $1 - \pi$, e.g. $\pi = 0.8$. The logarithm of the acceptance probability of a move from $(\mu_{i1}, \ldots, \mu_{im})$ to $(\mu^{\circ}_{i1}, \ldots, \mu^{\circ}_{im})$ is
$$\log A = \min\left( 0,\ \sum_{j=1}^{m} \left[ (\mu^{\circ}_{ij} - \mu_{ij}) \log(\Delta_i \nu_j) + \log(\mu_{ij}!) - \log(\mu^{\circ}_{ij}!) \right] \right).$$
The move is accepted if $\log U \leq \log A$ for $U$ an independently generated uniform random variate on $[0, 1]$, and in that case the current state of the chain is reset to $(\mu^{\circ}_{i1}, \ldots, \mu^{\circ}_{im})$. Otherwise the chain stays in $(\mu_{i1}, \ldots, \mu_{im})$.
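To make the imputation step concrete, here is a minimal Julia sketch. The helper names `diophantine_solutions`, `logtarget` and `mh_step` are ours for illustration (our released code differs), and the acceptance computation assumes the reconstructed display above together with strictly positive $\nu_j$.

```julia
using Random, SpecialFunctions  # loggamma

# All non-negative integer solutions (k_1, ..., k_m) of Σ_j j·k_j = z, found by
# a simple recursion; Voinov and Nikulin (1997) describe more refined algorithms.
function diophantine_solutions(m::Int, z::Int)
    sols = Vector{Vector{Int}}()
    k = zeros(Int, m)
    function rec(j, rem)
        if j == m
            if rem % m == 0
                k[m] = rem ÷ m
                push!(sols, copy(k))
            end
            return
        end
        for kj in 0:(rem ÷ j)
            k[j] = kj
            rec(j + 1, rem - j * kj)
        end
    end
    rec(1, z)
    sols
end

# Unnormalised log-probability of a solution k: log Π_j (Δ ν_j)^{k_j} / k_j!
# (assumes all ν_j > 0, as is the case for Gamma-distributed updates).
logtarget(k, ν, Δ) = sum(kj * log(Δ * ν[j]) - loggamma(kj + 1) for (j, kj) in enumerate(k))

# One Metropolis-Hastings update of the index ℓ of the current solution in S,
# mixing the uniform proposal (iv) with a cyclic neighbour move (iii); both are
# symmetric, so A reduces to the ratio of target probabilities.
function mh_step(rng, ℓ::Int, S, ν, Δ; π_unif = 0.2)
    L = length(S)
    ℓ′ = rand(rng) < π_unif ? rand(rng, 1:L) : mod1(ℓ + (rand(rng, Bool) ? 1 : -1), L)
    logA = min(0.0, logtarget(S[ℓ′], ν, Δ) - logtarget(S[ℓ], ν, Δ))
    log(rand(rng)) ≤ logA ? ℓ′ : ℓ
end

S = diophantine_solutions(5, 7)   # solution set for m = 5, z = 7
ℓ = mh_step(MersenneTwister(1), 1, S, fill(0.4, 5), 1.0)
```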

3. Simulation examples
In this section, we test the performance of our approach in a range of representative simulation examples. For benchmarking, we use the frequentist plug-in estimator from Buchmann and Grübel (2004). Two real data examples are given in Section 4. Unless otherwise stated, we took $c = 2$ and $a = 0.01$ as hyperparameters in our prior specification. As can be seen from the update formulae for the Gibbs sampler, as long as $a$ is not taken too large, its precise value has little influence on the posterior, given a reasonable sample size. The value $c = 2$ ensures that the update step for $\beta_k$ has finite variance. At each step of updating the imputed data for an increment of size $z$, we proposed, with probability $0.2$, uniformly from all solutions to the Diophantine equation (for that particular value of $z$).
We implemented our procedure in Julia; see Bezanson et al. (2017). The code and datasets for replication of our examples are available on GitHub and Zenodo; see Gugushvili et al. (2019).
3.1. Uniform base distribution. This simulation example follows, with some extensions, that in Buchmann and Grübel (2004). Let $\lambda_0 = 2$, and let $P_0$ be the discrete uniform distribution on $\{1, 4, 6\}$. We simulated data according to three settings, (a), (b) and (c); in setting (c) the observation times were not uniformly spaced. In all cases this led to $m = 15$, as the value of $Z_{(n)} = \max_{1 \leq i \leq n} Z_i$ was equal to 30, 35 and 40 for the simulated data under settings (a), (b) and (c), respectively. The Gibbs sampler was run for 500,000 iterations, of which the first 250,000 were discarded as burn-in. From the remaining samples, the posterior mean and the 2.5% and 97.5% percentiles were computed for each coefficient $\nu_k$. The results for the first 10 coefficients are shown in Figure 1. For comparison, the estimator from Buchmann and Grübel (2004) is also included in the figure. For setting (b), traceplots of every 50th iteration for a couple of coefficients $\nu_k$ are shown in Figure 2. We measure the error of an estimate $\hat{\nu} = (\hat{\nu}_k)$ by $\mathrm{Err}(\nu, \hat{\nu}) = \sum_{k=1}^{\infty} |\nu_k - \hat{\nu}_k|$. The errors are reported in Table 1. In all settings, for these particular realisations of the simulated data, the Bayesian procedure outperformed the truncated estimator from Buchmann and Grübel (2004). For setting (c), the latter is completely off, as was to be expected, given that it is derived under the assumption $\Delta_i = 1$ for all $i$. An advantage of the Bayesian procedure is the included measure of uncertainty, namely the credible intervals for $\nu_k$. For the Buchmann-Grübel estimator, on the other hand, it is hardly possible to derive confidence intervals, since the limiting distribution of the estimator is fairly complicated.
3.2. Geometric base distribution. The setup of this synthetic data example likewise follows that in Buchmann and Grübel (2004). Assume $q$ is a geometric distribution with parameter $\alpha$, i.e. $q_k = (1 - \alpha)^k \alpha$, $k \in \mathbb{N}_0$, for $0 < \alpha < 1$. Then $\lambda = -\log \alpha$, and
$$p_k = -\frac{(1 - \alpha)^k}{k \log \alpha}, \qquad k \in \mathbb{N},$$
the logarithmic distribution. Hence, $\nu_k = \lambda p_k = (1 - \alpha)^k / k$. We consider two simulation setups: (a) $n = 500$, $\Delta_i = 1$ for $1 \leq i \leq n$ and $\alpha = 1/3$; (b) $n = 500$, $\Delta_i = 1$ for $1 \leq i \leq n$ and $\alpha = 1/6$. We set $m = \min(15, Z_{(n)})$ and ran the sampler according to the settings of Section 3.1. The results for both scenarios (a) and (b) are reported in Figure 3. In Table 2 we also report the estimation errors in one simulation run. For this example and these generated data, the Bayesian procedure gives less precise point estimates than the Buchmann-Grübel method. Note that the estimation error for $\alpha = 1/3$ is smaller than that for $\alpha = 1/6$. This appears intuitive, as a smaller value of $\alpha$ corresponds to a larger value of $\lambda$. The latter implies that on average each $Z_i$ is a superposition of a larger number of jumps, which renders the decompounding problem more difficult. However, this argument is hard to formalise.
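This relationship is easy to verify numerically with the `panjer` sketch from Section 2 (a hypothetical helper; the truncation of the infinite support below is an artefact of the check, not of the model):

```julia
# Compounding the logarithmic base distribution with λ = -log(α)
# recovers the geometric distribution with parameter α.
α = 1/3
λ = -log(α)
p = [(1 - α)^k / (k * λ) for k in 1:60]            # truncated logarithmic pmf
q = panjer(λ, p, 20)
maximum(abs.(q .- [(1 - α)^k * α for k in 0:20]))  # ≈ 0 up to truncation error
```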
3.3. Monte Carlo study. For a more thorough comparison of the Buchmann-Grübel estimator and our Bayesian method, we performed a small Monte Carlo experiment. We considered two settings: (i) the setting from Section 3.1 with $n = 250$, taking $m = \min(15, Z_{(n)})$; (ii) the setting from Section 3.2 with $\alpha = 1/3$, taking $m = \min(20, Z_{(n)})$. In both cases we assumed $\Delta_i = 1$ for all $1 \leq i \leq n$. The number of Monte Carlo repetitions was taken equal to 50. We used 400,000 MCMC iterations and discarded the first half of these as burn-in. In Figure 4 we give a graphical display of the results by means of boxplots of the errors. Here, as before, if the true values are denoted by $\nu_k$ and the estimate within a particular simulation run by $\hat{\nu}_k$, the error is defined by $\mathrm{Err}(\nu, \hat{\nu}) = \sum_{k=1}^{\infty} |\nu_k - \hat{\nu}_k|$ (we truncated the infinite summation at $k = 50$). The results agree with our earlier findings, in that there is no clear "winner" in the comparison. Note that for setting (ii) we considered both $c = 2$ and $c = 0.01$ in the prior specification. Both values give similar performance of the Bayesian method, which provides insight into the sensitivity of our results with respect to the prior specification. A minor difference between the middle and rightmost panels of Figure 4 may be attributed to Monte Carlo error: the 50 simulated datasets on which these panels are based are not the same. Note that our prior promotes sparsity, and in that respect it is not surprising that it does better when the true data-generating Lévy measure is 'sparse'.
3.4. Computing time. In terms of computational effort, the time it takes to evaluate the Buchmann-Grübel estimator is negligible compared to our algorithm for sampling from the posterior. This is not surprising, as that frequentist estimator relies on a plug-in approach, whereas in our case an approximation to the posterior is obtained by MCMC simulation. However, if proper uncertainty quantification is desired, then the Bayesian method is advantageous in the sense that it does not solely produce a point estimate.
Note that the proposed MCMC scheme requires determination of the solutions to the Diophantine equation $\sum_{j=1}^{m} j k_j = z$ for all unique values $z$ in the observation set. For moderate values of $z$, say $z \leq 30$, this is rather quick, but for large values of $z$ the computing time increases exponentially, as does the amount of allocated memory. The computing time of each Metropolis-Hastings step is then very small, but we potentially need a very large number of iterations to reach stationarity. The latter is caused, firstly, by the fact that at a particular iteration our proposals for $\mu_{ij}$ do not take into account the current values of $\nu_1, \ldots, \nu_m$; and secondly, by the fact that the size of the state space that needs to be explored increases exponentially with $m$.
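The growth of the solution sets can be inspected directly with the `diophantine_solutions` sketch from Section 2 (again a hypothetical helper used purely for illustration):

```julia
# Cardinality of the solution set S for m = 15 and increasing z; the counts
# (restricted partition numbers) grow very quickly with z.
for z in (10, 20, 30)
    println("z = ", z, ": |S| = ", length(diophantine_solutions(15, z)))
end
```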

4. Real data examples
4.1. Horse kick data. To further illustrate our procedure, we will use the von Bortkewitsch data on the number of soldiers in the Prussian cavalry killed by horse kicks (available by year and by cavalry corps); this example was also employed in Buchmann and Grübel (2003). Each observation is an integer from 0 to 4, giving the number of deaths for a given year and a given cavalry corps, with the overall counts reported in Table 3. The data are extracted from the table on p. 25 in von Bortkewitsch (1898). Note that von Bortkewitsch corrects for the fact that the Guards and the I, VI and XI cavalry corps of the Prussian army had a different organisation from other units, and justifiably omits the corresponding counts from consideration. It was demonstrated by von Bortkewitsch that the Poisson distribution fits the horse kick data remarkably well. Assuming instead that the observations follow a compound Poisson distribution is a stretch of the imagination, as it would correspond to a horse running amok and killing possibly more than one soldier in one go. Nevertheless, this example provides a good sanity check for our statistical procedure.
The estimation results are graphically depicted in Figure 5. Clearly, point estimates of both methods are in agreement and lend support to the Poisson model for this dataset.
4.2. Plant data. Our second real example is the one used in Buchmann and Grübel (2004). Consider the data in Table 4, taken from Evans (1953). The data were collected as follows: the area was divided into plots of equal size, and in each plot the number of plants was counted; the number of plants in each plot ranges from 0 to 12. The second row of Table 4 gives the counts of plots containing a given number of plants; thus, there were 274 plots that contained no plants, 71 that contained 1 plant, etc. It is customary in the ecological literature to model such count data as i.i.d. realisations from a compound Poisson distribution. Thus, e.g., Neyman (1939) advocated the use of a Poisson base distribution in this context; another option here is a geometric base distribution. Given the existence of several distinct modelling possibilities, performing an exploratory non-parametric analysis appears to be a sensible strategy. The estimation results are graphically depicted in Figure 6. There are some small differences between the posterior mean and the Buchmann-Grübel estimate, but overall they are very similar.

5. Frequentist asymptotics
In this section we assume that the observation times $\{t_i\}$ are equidistant: $t_i = i$, $i = 1, \ldots, n$. To evaluate our Bayesian method from a theoretical point of view, we will verify that it is consistent, and we will establish the rate at which the posterior contracts around the 'true', data-generating Lévy measure $\nu_0$; see Ghosal and van der Vaart (2017) for a thorough treatment of Bayesian asymptotics from the frequentist point of view. From now on, the subscript 0 attached to various quantities will refer to the data-generating distribution.
Our strategy consists in proving that the posterior contraction rate for $\nu_0$, given the sample $\mathcal{Z}_n = (Z_1, \ldots, Z_n)$, can be derived from the posterior contraction rate for $q_0$ given $\mathcal{Z}_n$; the latter problem is mathematically easier, since $Z_1, \ldots, Z_n$ is a sequence of independent and identically distributed random variables with distribution $q_0$. We thereby effectively avoid dealing directly with the inverse nature of the problem of estimating $p_0$. The prior we consider in this section is defined as follows:
• Endow the rate $\lambda$ of the Poisson process with a prior distribution.
• Independently, endow the vector $(p_1, \ldots, p_m)$ with a Dirichlet distribution with parameter $(\alpha_1, \ldots, \alpha_m)$.
• Set a priori $p_k = 0$ for all $k > m$.
This is a somewhat simplified version of the prior used in Section 2, which allows us to concentrate on the essential features of the problem without cluttering the analysis with extra and unenlightening technicalities. Also recall the well-known relationship between the Gamma and Dirichlet distributions: if $\xi_1, \ldots, \xi_m$ are independent Gamma distributed random variables, $\xi_i \sim \mathrm{Gamma}(\alpha_i, 1)$, then for $\eta_i = \xi_i / \sum_{j=1}^{m} \xi_j$, the vector $(\eta_1, \ldots, \eta_m)$ follows the Dirichlet distribution with parameter $(\alpha_1, \ldots, \alpha_m)$; furthermore, $\sum_{j=1}^{m} \xi_j \sim \mathrm{Gamma}\left( \sum_{j=1}^{m} \alpha_j, 1 \right)$, and $(\eta_1, \ldots, \eta_m)$ is independent of $\sum_{j=1}^{m} \xi_j$. In our asymptotic setting, we will make $m = m_n$ dependent on $n$ and let $m_n \to \infty$ at a suitable rate as $n \to \infty$.
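As a small self-contained illustration of this relationship (not part of our procedure):

```julia
using Distributions, Random

# Normalising independent Gamma(α_i, 1) draws yields a Dirichlet(α_1, ..., α_m)
# vector, which is independent of the Gamma(Σ_i α_i, 1) distributed total.
rng = MersenneTwister(1)
αs = fill(0.5, 5)
ξ = [rand(rng, Gamma(αi, 1.0)) for αi in αs]
η = ξ ./ sum(ξ)   # distributed as Dirichlet(αs); sum(ξ) ~ Gamma(sum(αs), 1)
```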
Recall that we write $q = (q_k)_{k \in \mathbb{N}_0}$ for $q_k = Q(\{k\})$. Let $\mathcal{Q}$ denote the collection of all probability measures supported on $\mathbb{N}_0$.
Remark 1. Note that, since the support of $\nu_0$ is not assumed to be known, our CPP model is still genuinely non-parametric. The assumption of compact support of $\nu_0$ does not cover the simulation example of Section 3.2. However, relaxing it appears to pose very difficult technical challenges, and this is not attempted in this work.
The remainder of this section is devoted to the proof of Theorem 1.

5.1. Basic posterior inequality via the stability estimate.
A key step in the proof of Theorem 1 is the stability estimate in Equation (5) below, which bounds the total variation distance between the Lévy measures $\nu, \nu'$ in terms of the total variation distance between the corresponding compound distributions $q, q'$. In principle, it is conceivable that the Panjer recursion should allow one to bound probability distances between $P$-probabilities via distances between $Q$-probabilities; we call such a bound a stability estimate. Nevertheless, explicit as the equations of the Panjer recursion are, they are still somewhat unwieldy for that purpose. Hence we will use another inversion formula from Buchmann and Grübel (2003), which will lead to the stability estimate we are after. First we introduce some notation, and also recall a few useful facts summarised in Buchmann and Grübel (2003). The space of absolutely summable sequences is defined as
$$\ell_1 := \left\{ a \in \mathbb{R}^{\mathbb{N}_0} : \sum_{j=0}^{\infty} |a_j| < \infty \right\},$$
with norm $\|a\|_1 = \sum_{j=0}^{\infty} |a_j|$. For probability vectors $a, b$, the norm $\|a - b\|_1$ is (twice) the total variation distance between $a$ and $b$. For any $a, b \in \ell_1$, we have the inequality
$$\|a * b\|_1 \leq \|a\|_1 \|b\|_1, \qquad (4)$$
where $*$ denotes the convolution of $a$ and $b$. We define a mapping $a \mapsto \exp(a)$ from $\ell_1$ into $\ell_1$ via
$$\exp(a) = \sum_{j=0}^{\infty} \frac{a^{*j}}{j!}, \qquad a^{*0} := \delta_0.$$
The exponential has the following two useful properties: $\exp(a + b) = \exp(a) * \exp(b)$, and $\|\exp(a)\|_1 \leq e^{\|a\|_1}$. Here $\delta_0 = (\delta_{0,k})_{k \in \mathbb{N}_0}$ denotes the sequence such that $\delta_{0,0} = 1$ and all its other entries are equal to zero. Using the above properties of the exponential, we can write the compounding mapping in (2) concisely in terms of convolutions of infinite sequences: $q = \exp(\lambda(p - \delta_0))$. Its convolution inverse, i.e. the sequence $q^{*(-1)}$ such that $q^{*(-1)} * q = \delta_0$, is given by $r = q^{*(-1)} = \exp(-\lambda(p - \delta_0))$. Note that $r \in \ell_1$, and that we have the recursive expressions
$$r_0 = \frac{1}{q_0}, \qquad r_k = -\frac{1}{q_0} \sum_{j=0}^{k-1} r_j q_{k-j}, \quad k \geq 1.$$

Lemma 1. Let $q, q'$ correspond to the pairs $(\lambda, p)$ and $(\lambda', p')$, respectively, with Lévy measures $\nu$ and $\nu'$ (and let $r$ correspond to $q$, i.e. to the pair $(\lambda, p)$). Then, in accordance with the notation introduced above and provided that $\|q - q'\|_1 < \|r\|_1^{-1}$, it holds that
$$\|\nu - \nu'\|_1 \leq \frac{\|r\|_1 \|q - q'\|_1}{1 - \|r\|_1 \|q - q'\|_1}. \qquad (5)$$

Proof. The result is a direct consequence of Lemma 3 in Buchmann and Grübel (2003), which states that
$$\lambda'(p' - \delta_0) - \lambda(p - \delta_0) = \sum_{j=1}^{\infty} \frac{(-1)^{j+1}}{j} \left( r * (q' - q) \right)^{*j}$$
whenever $\|q - q'\|_1 < \|r\|_1^{-1}$. Taking the $\|\cdot\|_1$-norm on both sides and some elementary bounding via (4) imply that
$$\|\nu' - \nu\|_1 \leq \sum_{j=1}^{\infty} \frac{1}{j} \left( \|r\|_1 \|q - q'\|_1 \right)^j \leq \sum_{j=1}^{\infty} \left( \|r\|_1 \|q - q'\|_1 \right)^j,$$
and thus Equation (5) follows by summing the geometric series.
We will use Equation (5) to establish the key inequality (Proposition 1) for the posterior measure $\Pi(\cdot \mid \mathcal{Z}_n)$. We recall once again that the subscript 0 refers to the 'true', data-generating quantities.
In general, stability estimates like the one in Equation (5) are unavailable in the literature on Lévy processes. Consequently, studying Bayesian asymptotics for Lévy models, even in the CPP case, typically necessitates the use of very intricate arguments under restrictive assumptions; see, e.g., Nickl and Söhl (2017).

5.2. Proof of Theorem 1. The usefulness of Proposition 1 above lies in the fact that the posterior contraction rate in the inverse problem of estimating the Lévy measure $\nu_0$ from the indirect observations $\mathcal{Z}_n$ can now be deduced from the posterior contraction rate in the direct problem of estimating the compound distribution $q_0$, which is easier (observe that $r_0$ is determined by $\nu_0$ and is therefore fixed in the proofs). The general machinery developed in Ghosal et al. (2000) can be applied to handle the latter problem, and several inequalities obtained in Gugushvili et al. (2015) are also useful in that respect. In particular, we make use of the following inequality for the Hellinger distance between two compound distributions $q_{\lambda, p}$ and $q_{\lambda', p'}$:
$$h(q_{\lambda, p}, q_{\lambda', p'}) \leq \sqrt{\lambda}\, h(p, p') + h(N, N'),$$
where $N$ and $N'$ denote Poisson random variables with means $\lambda$ and $\lambda'$; cf. Lemma 1 in Gugushvili et al. (2015). To ease the notation, in the sequel we will often write $q$ and $q'$ instead of $q_{\lambda, p}$ and $q_{\lambda', p'}$, respectively. Denote
$$\mathrm{KL}(q_0, q) = \sum_{k=0}^{\infty} q_{0,k} \log \frac{q_{0,k}}{q_k}, \qquad V(q_0, q) = \sum_{k=0}^{\infty} q_{0,k} \log^2 \frac{q_{0,k}}{q_k}.$$
Two further inequalities we will use are the following: let $\lambda, \lambda_0 \in [\underline{\lambda}, \overline{\lambda}]$. Then there exists a positive constant $C$ such that
$$\mathrm{KL}(q_0, q) \leq C \left( \mathrm{KL}(p_0, p) + |\lambda_0 - \lambda|^2 \right), \qquad V(q_0, q) \leq C \left( V(p_0, p) + |\lambda_0 - \lambda|^2 \right);$$
cf. equations (14) and (15) in Lemma 1 in Gugushvili et al. (2015).
These three inequalities can be obtained by adjusting the arguments used in Gugushvili et al. (2015). However, we have opted to give direct proofs in Lemma 5 in Appendix A, under slightly weaker conditions than those required in Gugushvili et al. (2015).
Our proof of Theorem 1 proceeds via verification of the conditions for posterior contraction in Theorem 2.1 in Ghosal et al. (2000). In our setting, the latter result reads as follows.
Theorem 2. Assume $\mathcal{Z}_n = (Z_1, \ldots, Z_n)$, where $Z_1, \ldots, Z_n$ are independent and identically distributed with distribution $q_0$. Let $h$ denote the Hellinger metric on $\mathcal{Q}$, the collection of all probability measures supported on $\mathbb{N}_0$. Suppose that for a sequence $\{\epsilon_n\}$ with $\epsilon_n \to 0$ and $n \epsilon_n^2 \to \infty$, a constant $C > 0$ and sets $\mathcal{Q}_n \subset \mathcal{Q}$, we have
$$\log N(\epsilon_n, \mathcal{Q}_n, h) \leq n \epsilon_n^2,$$
$$\Pi_n(\mathcal{Q} \setminus \mathcal{Q}_n) \leq \exp\left( -n \epsilon_n^2 (C + 4) \right),$$
$$\Pi_n\left( q : \mathrm{KL}(q_0, q) \leq \epsilon_n^2, \ V(q_0, q) \leq \epsilon_n^2 \right) \geq \exp\left( -C n \epsilon_n^2 \right).$$
Then, for a sufficiently large constant $M > 0$, we have that $\Pi_n(q \in \mathcal{Q} : h(q, q_0) \geq M \epsilon_n \mid \mathcal{Z}_n) \to 0$ in $Q_0^n$-probability.

We will now verify the three conditions of this theorem, which we refer to as the entropy condition, the remaining mass condition, and the prior mass condition, respectively. To that end, fix strictly positive sequences $\{\underline{\Lambda}_n\}$ and $\{\overline{\Lambda}_n\}$ with $\underline{\Lambda}_n \leq \overline{\Lambda}_n$, and define the sieves
$$\mathcal{Q}_n = \left\{ q_{\lambda, p} : \lambda \in [\underline{\Lambda}_n, \overline{\Lambda}_n], \ \mathrm{supp}\, p \subseteq \{1, \ldots, m_n\} \right\}.$$
5.2.1. Entropy. We start by bounding the entropy of the sieve $\mathcal{Q}_n$ for $h$-balls of radius $\epsilon_n$.
Lemma 2. Suppose (8) holds. Then
$$\log N(\epsilon_n, \mathcal{Q}_n, h) \lesssim \log \frac{\overline{\Lambda}_n}{\epsilon_n \sqrt{\underline{\Lambda}_n}} + m_n \log \frac{\overline{\Lambda}_n m_n}{\epsilon_n^2}.$$

Proof. If $\lambda, \lambda' \in [\underline{\Lambda}_n, \overline{\Lambda}_n]$ satisfy $|\lambda - \lambda'| \leq \sqrt{\underline{\Lambda}_n}\, \epsilon_n$, and $p, p'$ are supported on $\{1, \ldots, m_n\}$ and satisfy $\|p - p'\|_{\infty} \leq \epsilon_n^2 / (4 \overline{\Lambda}_n m_n)$, then the Hellinger distance between $q_{\lambda, p}$ and $q_{\lambda', p'}$ is bounded by $\epsilon_n$. To cover $[\underline{\Lambda}_n, \overline{\Lambda}_n]$, we need at most $\frac{\overline{\Lambda}_n}{2 \epsilon_n \sqrt{\underline{\Lambda}_n}} + 1$ intervals of length $2 \sqrt{\underline{\Lambda}_n}\, \epsilon_n$. To cover the discrete distributions with support in $\{1, \ldots, m_n\}$, we need at most $\left( \frac{2 \overline{\Lambda}_n m_n}{\epsilon_n^2} + 1 \right)^{m_n}$ $L_\infty$-balls of radius $\epsilon_n^2 / (4 \overline{\Lambda}_n m_n)$. Under assumption (8), the summand 1 in the above displays is asymptotically negligible and can be omitted. In that case, the number of $h$-balls needed to cover $\mathcal{Q}_n$ is of order
$$\frac{\overline{\Lambda}_n}{\epsilon_n \sqrt{\underline{\Lambda}_n}} \left( \frac{\overline{\Lambda}_n m_n}{\epsilon_n^2} \right)^{m_n}.$$
Taking the logarithm and a straightforward rearrangement of the terms give the statement of the lemma.

5.2.2. Remaining prior mass. Next we derive an inequality for the remaining prior mass.

Lemma 3. Suppose that the prior on $\lambda$ has a bounded density whose tail is bounded by a multiple of $e^{-b\lambda}$ for some $b > 0$. Then
$$\Pi_n(\mathcal{Q} \setminus \mathcal{Q}_n) \lesssim e^{-b \overline{\Lambda}_n} + \underline{\Lambda}_n.$$
Proof. We have (with a slight abuse of notation)
$$\Pi_n(\mathcal{Q} \setminus \mathcal{Q}_n) = \Pi\left( \lambda \notin [\underline{\Lambda}_n, \overline{\Lambda}_n] \right) \leq \Pi(\lambda > \overline{\Lambda}_n) + \Pi(\lambda < \underline{\Lambda}_n).$$
Furthermore, $\Pi(\lambda > \overline{\Lambda}_n) \lesssim e^{-b \overline{\Lambda}_n}$ by the assumed tail bound, while $\Pi(\lambda < \underline{\Lambda}_n) \lesssim \underline{\Lambda}_n$, since the prior density of $\lambda$ is bounded. The proof is concluded.

5.2.3. Prior mass. Finally, we lower bound the prior mass in a small Kullback-Leibler neighbourhood of the data-generating compound distribution $q_0$. Define the function $g : (0, \infty) \times (0, 1) \to (0, \infty)$ in terms of the constant $C$ appearing in the statement of Lemma 6 below.
Lemma 4. Assume that (i) there exists $\alpha$ such that $0 < \alpha \leq \alpha_i \leq 1$ for all $1 \leq i \leq m_n$; and (ii) strictly positive sequences $\bar{p}_n \to 0$, $\epsilon_n \to 0$ and $m_n \to \infty$ satisfy the inequalities $m_n g(\epsilon_n, \bar{p}_n) < 1$ and $\bar{p}_n < g(\epsilon_n, \bar{p}_n)^2$. Then the prior mass of the set $\{ q : \mathrm{KL}(q_0, q) \lesssim \epsilon_n^2, \ V(q_0, q) \lesssim \epsilon_n^2 \}$ is bounded from below by
$$\Pi\left( |\lambda_0 - \lambda| \leq \epsilon_n \right) \exp\left( -m_n \log \frac{1}{g(\epsilon_n, \bar{p}_n)^2 - \bar{p}_n} - m_n \log \frac{1}{\alpha} \right).$$
Proof. Define
$$B_n(\epsilon) = \left\{ (\lambda, p) : |\lambda_0 - \lambda| \leq \epsilon, \ \mathrm{KL}(p_0, p) \leq \epsilon^2, \ V(p_0, p) \leq \epsilon^2 \right\}, \qquad \tilde{B}_n(\epsilon) = \left\{ q : \mathrm{KL}(q_0, q) \leq \epsilon^2, \ V(q_0, q) \leq \epsilon^2 \right\}.$$
For all $n$ large enough and $\epsilon$ small, we have $\{\lambda : |\lambda_0 - \lambda| \leq \epsilon\} \subseteq [\underline{\Lambda}_n, \overline{\Lambda}_n]$. Then, by the inequalities in Lemma 5, $B_n(\epsilon) \subset \tilde{B}_n(\epsilon \sqrt{3C})$, with a constant $C$ that can be taken the same for all large enough $n$; see the arguments in Section 4.2 in Gugushvili et al. (2015). Hence, using the a priori independence of $p$ and $\lambda$,
$$\Pi_n\left( B_n(\epsilon_n) \right) \geq \Pi\left( |\lambda_0 - \lambda| \leq \epsilon_n \right) \Pi\left( \mathrm{KL}(p_0, p) \leq \epsilon_n^2, \ V(p_0, p) \leq \epsilon_n^2 \right).$$
Now, $\Pi(|\lambda_0 - \lambda| \leq \epsilon_n) \gtrsim \epsilon_n$. Furthermore, by Lemma 6 from Appendix A, the second factor on the right-hand side is bounded from below by the prior probability that $p$ lies in a suitable neighbourhood of $p_0$ of radius $g(\epsilon_n, \bar{p}_n)$. The statement of the lemma now follows upon applying Lemma 7 from Appendix A with $\eta = \bar{p}_n$ and $\epsilon = g(\epsilon_n, \bar{p}_n)$.
5.3. Using the bounds in Theorem 2. We take
$$m_n \asymp \log n, \qquad \epsilon_n \asymp \frac{\log^{\gamma} n}{\sqrt{n}}, \qquad \bar{p}_n \asymp \frac{1}{n^2}, \qquad \overline{\Lambda}_n \asymp \log^{2\gamma} n, \qquad \underline{\Lambda}_n \asymp \exp\left( -\mathrm{const} \cdot \log^{2\gamma} n \right),$$
with appropriately selected proportionality constants, and verify the conditions of Theorem 2. Firstly, condition (8) is trivially satisfied. Therefore, we can invoke Lemma 2 and conclude that the entropy is upper bounded by a multiple of $\log^{2\gamma} n$, since $\gamma > 1$. Now $\log^{2\gamma} n \asymp n \epsilon_n^2$, and this verifies the entropy condition in Theorem 2. By Lemma 3, for a suitable choice of the constant $C$ the remaining mass condition is likewise satisfied.
Finally, for the prior mass condition in a small Kullback-Leibler neighbourhood to hold, by Lemma 4 we need the quantity
$$\Pi\left( |\lambda_0 - \lambda| \leq \epsilon_n \right) \exp\left( -m_n \log \frac{1}{g(\epsilon_n, \bar{p}_n)^2 - \bar{p}_n} - m_n \log \frac{1}{\alpha} \right)$$
to be lower bounded by $\exp(-C n \epsilon_n^2)$ for some large enough $C > 0$. Take the logarithm of the above display and note that by our conditions
$$\log \Pi\left( |\lambda_0 - \lambda| \leq \epsilon_n \right) \gtrsim \log \epsilon_n \gtrsim -n \epsilon_n^2.$$
Likewise, $m_n \log\left( 1 / (g(\epsilon_n, \bar{p}_n)^2 - \bar{p}_n) \right) + m_n \log(1/\alpha) \lesssim n \epsilon_n^2$, so that the prior mass condition holds.
Thus we have verified all the conditions of Theorem 2, and Theorem 1 follows. The resulting posterior contraction rate is $\epsilon_n \asymp \log^{\gamma} n / \sqrt{n}$.

6. Outlook
In this paper we introduced a non-parametric Bayesian approach to estimation of the Lévy measure $\nu$ of a discretely observed CPP, when the support of $\nu$ is a subset of $\mathbb{N}$. We constructed an algorithm for sampling from the posterior distribution of $\nu$, and showed that in practice our procedure performs well and measures up to the benchmark frequentist plug-in approach of Buchmann and Grübel (2004). Although computationally more demanding and slower than the latter, our method has the added benefit of providing uncertainty quantification for parameter estimates through the spread of the posterior distribution. On the theoretical side, we showed that our procedure is consistent, in that asymptotically, as the sample size $n \to \infty$, the posterior concentrates around the 'true', data-generating distribution. The corresponding posterior contraction rate is $\log^{\gamma} n / \sqrt{n}$ for an arbitrary $\gamma > 1$, which is the optimal rate $1/\sqrt{n}$ up to a practically insignificant logarithmic factor.
Among several possible generalisations of our results, the most promising appears to be the extension of our methodology to CPPs with jump size distributions supported on the set of integers $\mathbb{Z}$. The corresponding model has garnered substantial interest in financial applications; see Barndorff-Nielsen et al. (2012). We leave this extension as a topic of possible future research.

Appendix A. Technical results
Lemma 5. Let $q$ (resp. $q'$) be the law at time 1 of a compound Poisson process with intensity $\lambda$ (resp. $\lambda'$) and jump distribution $p$ (resp. $p'$). Suppose that $p$ and $p'$ are distributions concentrated on $\mathbb{N}$. Then
$$h(q, q') \leq \sqrt{\lambda}\, h(p, p') + h(N, N'), \qquad \mathrm{KL}(q, q') \leq \lambda\, \mathrm{KL}(p, p') + \mathrm{KL}(N, N'),$$
and a similar inequality holds for $V(q, q')$. In particular, if $\lambda, \lambda' \in [\underline{\Lambda}, \overline{\Lambda}]$ with $0 < \underline{\Lambda} \leq \overline{\Lambda} < \infty$, then there exists a positive constant $C$, depending on $\underline{\Lambda}$ and $\overline{\Lambda}$, such that
$$\mathrm{KL}(q, q') \leq C\left( \mathrm{KL}(p, p') + |\lambda - \lambda'|^2 \right), \qquad V(q, q') \leq C\left( V(p, p') + |\lambda - \lambda'|^2 \right).$$
Proof. If $\mathrm{KL}(p, p')$ and $V(p, p')$ are infinite, then the above inequalities are trivially satisfied. Therefore, we may assume that these two divergences are finite. With this in mind, the proof of the lemma is divided into three steps.
Step 1: We begin by proving that, for any $n \geq 1$,
$$h^2\left( p^{*n}, p'^{*n} \right) \leq n\, h^2(p, p').$$
Recall that $1 - h^2(a, b) = \sum_i \sqrt{a_i b_i}$ is the Hellinger affinity, and that the affinity is super-multiplicative under convolution: $1 - h^2(a * c, b * d) \geq (1 - h^2(a, b))(1 - h^2(c, d))$, by the Cauchy-Schwarz inequality. Applying this with $a = p^{*(n-1)}$, $b = p'^{*(n-1)}$, $c = p$, $d = p'$ and expanding the product, we get
$$1 - h^2\left( p^{*n}, p'^{*n} \right) \geq 1 - h^2\left( p^{*(n-1)}, p'^{*(n-1)} \right) - h^2(p, p') + h^2\left( p^{*(n-1)}, p'^{*(n-1)} \right) h^2(p, p').$$
Now note that the last summand satisfies
$$h^2\left( p^{*(n-1)}, p'^{*(n-1)} \right) h^2(p, p') \geq 0.$$
We therefore conclude that $h^2(p^{*n}, p'^{*n}) \leq h^2(p^{*(n-1)}, p'^{*(n-1)}) + h^2(p, p')$, which leads to the desired inequality by an induction argument.
Step 2: Next, we prove the inequality for the KL divergence,
$$\mathrm{KL}(q, q') \leq \lambda\, \mathrm{KL}(p, p') + \mathrm{KL}(N, N').$$
Here $N$ and $N'$ are Poisson random variables with means $\lambda$ and $\lambda'$, respectively, and, with a slight abuse of notation, $\mathrm{KL}(N, N')$, $V(N, N')$ and $h(N, N')$ denote the KL and V divergences and the Hellinger distance between the corresponding laws. Note that
$$q(i) = \sum_{n=0}^{\infty} p^{*n}(i)\, \mathrm{P}(N = n), \qquad q'(i) = \sum_{n=0}^{\infty} p'^{*n}(i)\, \mathrm{P}(N' = n).$$
This gives the required inequality for the $V$ divergence. Finally, we prove the inequality for the Hellinger distance. Denoting the law of $\sum_{j=1}^{N} Y'_j$ by $\tilde{q}$, it holds by the triangle inequality that $h(q, q') \leq h(q, \tilde{q}) + h(\tilde{q}, q')$. Since $g(x) = (1 - \sqrt{x})^2 \mathbf{1}_{[0, \infty)}(x)$ is a convex function,
$$h^2(q, \tilde{q}) \leq \sum_{i \in \mathbb{N}_0} \sum_{n \in \mathbb{N}_0} p^{*n}(i)\, \mathrm{P}(N = n)\, g\!\left( \frac{p'^{*n}(i)}{p^{*n}(i)} \right) = \sum_{n \in \mathbb{N}_0} \mathrm{P}(N = n)\, h^2\left( p^{*n}, p'^{*n} \right) \leq \mathrm{E}[N]\, h^2(p, p') = \lambda\, h^2(p, p'),$$
where the last inequality uses Step 1.
It remains to prove that $h(\tilde{q}, q') \leq h(N, N')$. This again follows by the convexity of $g$, since
$$h^2(\tilde{q}, q') = \sum_{i \in \mathbb{N}_0} \sum_{n \in \mathbb{N}_0} p'^{*n}(i)\, \mathrm{P}(N = n)\, g\!\left( \frac{\sum_{n=0}^{\infty} p'^{*n}(i)\, \mathrm{P}(N' = n)}{\sum_{n=0}^{\infty} p'^{*n}(i)\, \mathrm{P}(N = n)} \right) \leq \sum_{i \in \mathbb{N}_0} \sum_{n \in \mathbb{N}_0} p'^{*n}(i)\, \mathrm{P}(N = n)\, g\!\left( \frac{\mathrm{P}(N' = n)}{\mathrm{P}(N = n)} \right) = h^2(N, N').$$
Proof. Lemma 8 in Ghosal and van der Vaart (2007) ensures that there exists a constant $C$ (not depending on either $p$ or $p'$) such that
$$\mathrm{KL}(p', p) \leq C\, h^2(p', p) \left( 1 + \log \left\| \frac{p'}{p} \right\|_{\infty} \right) \qquad \text{and} \qquad V(p', p) \leq C\, h^2(p', p) \left( 1 + \log \left\| \frac{p'}{p} \right\|_{\infty} \right)^2.$$
This entails the result.