An importance sampling approach for reliable and efficient inference in Bayesian ordinary differential equation models

Statistical models can involve implicitly defined quantities, such as solutions to nonlinear ordinary differential equations (ODEs), that unavoidably need to be numerically approximated in order to evaluate the model. The approximation error inherently biases statistical inference results, but the amount of this bias is generally unknown and often ignored in Bayesian parameter inference. We propose a computationally efficient method for verifying the reliability of posterior inference for such models, when the inference is performed using Markov chain Monte Carlo methods. We validate the efficiency and reliability of our workflow in experiments using simulated and real data and different ODE solvers. We highlight problems that arise with commonly used adaptive ODE solvers and propose robust and effective alternatives, which, accompanied by our workflow, can now be taken into use without losing reliability of the inferences.


Introduction
Implicitly defined quantities that depend on unknown parameters introduce challenges when they are involved in statistical models.Examples of such quantities are solutions to parameterized algebraic equations, optimization problems, integrals, or ordinary differential equations (ODEs).They generally do not have a closed form given the parameters, and to evaluate the model likelihood they have to be approximated using numerical methods (Süli and Mayers, 2003).In the Bayesian context, model inference is commonly done by sampling the posterior distribution of the parameters using Markov chain Monte Carlo (MCMC) (Brooks et al., 2011) techniques.The theory behind MCMC assumes that the model likelihood can be computed exactly, but this is not the case if numerical approximations are required.Some numerical routines can estimate their own error, but as the true error is not known and its magnitude varies in different parts of the parameter space, it is difficult to predict how it affects the MCMC posterior draws.As this bias has not had a lot of attention in the literature, unaware users can blindly use default configurations of the numerical methods implemented in software packages.
Numerically solving an ODE system is a computationally intensive task, and often dominates the cost of one unnormalized posterior density evaluation in Bayesian ODE models.Computational requirements are amplified by the fact that MCMC inference typically requires a large number of these ODE solutions, and this number can be several orders of magnitude larger than the posterior effective sample size (ESS) due to the need to warm-up the sampler and high auto-correlation of intermediate draws.The ratio of ESS to the number of unnormalized posterior density evaluations can be drastically increased by using gradient-based MCMC methods such as Hamiltonian Monte Carlo (HMC) (Duane et al., 1987), and using the gradient information is essential in the case of high-dimensional parameter spaces or complex posterior geometries in order to achieve good sampling performance.In modern statistical software such as Stan (Stan Development Team, 2022), PyMC (Salvatier et al., 2016), and Turing.jl(Ge et al., 2018), gradients are computed using automatic differentiation (Bell and Burke, 2008;Baydin et al., 2018).This poses a challenge for ODE models, for which computing the gradient is computationally significantly more demanding and subtle than the plain likelihood evaluation.
Classic numerical integrators for solving ODE systems are iterative methods that discretize the integral, and their accuracy and stability depend on the discretization step size (Hairer et al., 1993;Griffiths and Higham, 2010;Süli and Mayers, 2003).In theory, the error can be made arbitrarily small by using a step size that approaches zero, but this is not possible in practice due to limited computational resources and floating point arithmetic.Smaller step size means more evaluations of the ODE right-hand side (RHS) function, which means more computation.Selecting the step size therefore involves balancing between a reasonable computation time and good accuracy of the approximation.
Adaptive integrators remove the burden of selecting the step size, as they can tune it automatically.These methods estimate their own local error, and adapt their step size so that given tolerances are satisfied.However, this does not give any guarantees about validity of the related statistical inference results.Moreover, requiring more accuracy typically causes the solver to adapt to smaller step size values, which leads to more computation.The problem of step size selection has thus been replaced by the problem of selecting the tolerances.Regardless, adaptive solvers are the most commonly used methods in statistical software, and have been included in various probabilistic programming and machine learning frameworks that implement gradient-based MCMC samplers with automatic differentiation.
To our best knowledge, there exist no generally applicable frameworks for validating the reliability of an approximate numerical method that is needed for posterior density evaluations in MCMC inference.For ODE solvers, one approach is to gradually use lower and lower step sizes (or stricter and stricter tolerances for adaptive solvers) during inference, until posterior estimates do not change appreciably anymore.However, repeating MCMC sampling like this quickly becomes computationally very expensive.Capistran et al. (2016) recognized the model that uses a numerical approximation as a different model than the true model with the exact ODE solution.In the special case of a Gaussian observation model and a certain type of fixed-step solver with step size h, they showed that the Bayes factor of the two models approaches one with the same rate as the numerical solution approaches the true solution, as h → 0. Relying on this, they estimated the marginal likelihood of the true model based on first estimating it for approximate models, with different h, and then extrapolating to h = 0 using linear regression.However, the required several marginal likelihood approximations are difficult to perform in high dimensions and Capistran et al. (2016) only demonstrated their method in one-dimensional parameter spaces.
Probabilistic numerical methods (Hennig et al., 2015) view the numerical problems probabilistically and can give uncertainty estimates for the solution.Teymur et al. (2021) used Gaussian process regression to estimate a distribution for the exact solution given a series of approximations of different accuracy.These methods are designed for performing a fixed numerical problem probabilistically, and it is not clear how to use them in Bayesian inference of models that contain numerical problems with unknown parameters.For the probabilistic parameter inference problem of ODE models, there exist strategies that completely avoid numerically integrating the ODE (see e.g.Barber and Wang, 2014, and references threrein).However, such approaches lack convergent numerical methods and thus are not asymptotically approximating the true ODE solutions.
We present an efficient, reliable, and generally applicable strategy for MCMC inference of models which require numerically approximating parameter-dependent quantities.It uses importance weights to correct the biases that result from using numerical approximations.These importance weights are cheap to compute compared to the cost of MCMC sampling, and we can diagnose their success using Pareto smoothed importance sampling (Vehtari et al., 2021b).The proposed method is straightforward to implement in probabilistic programming languages, as it does not require modifications to standard MCMC algorithms or classic numerical solvers.We demonstrate its benefits in ODE model inference, using both adaptive and non-adaptive ODE solvers.

Initial value problems
Ordinary differential equation (ODE) models are routinely used in various fields of science to model dynamic phenomena.A D-dimensional ODE system with state variables y(t) ∈ R D is defined as where the right-hand side function (RHS) f ψ : R D × R → R D has parameters ψ.These parameters can be a subset of all parameters of a Bayesian model, in which ODE systems usually appear in the form of initial value problems (IVPs).This means that an initial value y (0) := y(t 0 ) is explicitly defined (either a known value or a model parameter), and evaluating the likelihood of the data requires solving y(t) at several time points t > t 0 .
The solution is implicitly defined by Eq. 1 and the initial value, and can be written using the integral formula which according to the Picard-Lindelöf theorem has a unique solution assuming some smoothness conditions1 for f ψ (Hairer et al., 1993).However, the integral rarely has an explicit closed form and has to be approximated numerically.

ODE solvers
A myriad of different methods exist for numerically approximating the integral in Eq. 2.
We use y M (t) to denote the solution given by a numerical method M .A general strategy used by method M (h) with fixed step size h > 0 is to first compute y M (h) (τ j ) on a grid τ j = t 0 +jh, j ∈ {0, 1, 2, . ..}.This can be done by setting y M (h) (τ 0 ) = y (0) and iteratively computing y M (h) (τ j+1 ) for j ≥ 0 using some update rule.After this, some interpolation method can be used if the solution is required at a time point t which is not on the grid (Hairer et al., 1993).Numerical methods are generally required to be convergent, meaning that the global error y M (h) (t) − y(t) must approach zero as h → 0. A method is called convergent of order p if this happens at rate O(h p ) (Hairer et al., 1993).Smaller step sizes h will give a more accurate solution, but require more iterations and therefore more computation.In practice, one would like to set the step size small enough to achieve good precision, but large enough to avoid unnecessary computation.Adaptive step size methods try to automatically adapt the step size by estimating their own error.As the global error is difficult to estimate, software implementations are usually based on estimating the local truncation error, i.e. the error induced by a single step (Griffiths and Higham, 2010).The step size is adapted so that user-supplied absolute ( abs ) and relative ( rel ) tolerances in the estimated local truncation error are satisfied.While these methods remove the burden of selecting h from the user, the user must still supply the absolute and relative tolerances.These tolerances have virtually the same trade-off as the step size selection itself; lower tolerances give better accuracy, but require smaller step sizes and therefore more computation.
ODE solvers are generally either explicit or implicit.For explicit solvers, the next state is computed explicitly based on the current state.Implicit solvers tend to perform significantly better for stiff problems (Hairer and Wanner, 1996), but the downside is that they require numerically solving a system of algebraic equations on each step.This has to be done using for example Newton iteration, which has its own stopping criterion that affects the result.The ODE solvers used in our experiments are described in more detail in Appendix A.

Sensitivity analysis
Gradient-based MCMC requires computing the gradient of the unnormalized posterior density.In modern probabilistic programming frameworks, gradients are computed using automatic differentiation (AD) (Baydin et al., 2018;Margossian, 2019).This involves building a differentiable computation graph for the target density, where all operations on inputs are recorded.However, computing the sensitivities efficiently and reliably in AD frameworks is not straightforward when iterative numerical solvers are involved (Bell and Burke, 2008;Margossian, 2019;Rackauckas et al., 2021).
There are three main ways to integrate ODE solves into the computation graph built for AD.The direct method treats the ODE solve similarly as any other sequence of operations and directly records these operations into the computation graph.We use this method for all non-adaptive solvers in our experiments.The other two methods rely on continuous sensitivity analysis (Rackauckas et al., 2021).This approach is based on the fact that applying the chain rule of differentiation to the ODE system (Eq. 1) gives d dt from which we get an additional ODE system with DP dimensions.Forward continuous sensitivity analysis solves this extended system simultaneously with the original system using the same adaptive numerical ODE solver.This can be implemented so that also the extended system needs to satisfy the given tolerances, but more sophisticated rules are used with implicit solvers that require also Newton iteration (Feehery et al., 1997).In our experiments, we use forward continuous sensitivity analysis for all adaptive solvers.Adjoint continuous sensitivity analysis (Margossian, 2019;Rackauckas et al., 2021) first solves only the original ODE system forward in time, and uses the obtained solution to solve a different additional system backward in time to get the sensitivities.In this method the additional system has only dimension D, so it theoretically scales better with respect to the number of parameters.However, this method is even harder to configure, and we leave it for future work.

Bayesian models with numerical approximations
We consider MCMC inference of models that define a posterior p(θ | D) ∝ p(D | θ)p(θ), where p(D | θ) is the likelihood of data D given parameters θ, and p(θ) is the prior.The goal of inference is commonly the computation of expectations of the form which can be for instance model predictions or parameter estimates, determined by the function ϕ.When MCMC is used to obtain posterior draws θ s , s = 1, . . ., S, the integral can be estimated as We focus on models whose unnormalized posterior density depends on We define to denote the unnormalized density.Quantities Y(θ) can be defined implicitly through equations that have no closed form solution or for other reasons have to be numerically approximated as , where M denotes the approximation method.Since Y(θ) can only be approximated, P (Y(θ), θ) cannot be evaluated exactly and therefore it is not possible to use MCMC to sample from p(θ | D) exactly.Instead, MCMC will sample from some distribution p M (θ | D) ∝ P Y M (θ), θ .It is therefore crucial to develop methods that can correct this bias and inform users if the approximation method M is so inaccurate that it renders the correction impossible, and re-running MCMC with a more accurate method is needed.
Various numerical methods M have some control parameters that can be used to tune their accuracy.Examples of such methods are ODE solvers M = M (h), where h > 0 is the step size.In ODE solvers, the implicitly defined quantities are Y n (θ) = y ψ (t n ), where ψ are a subset of all parameters θ and y ψ (t n ) is the solution to an ODE initial value problem with parameters ψ, at time t n .The corresponding numerical approximate solution with method M we denote y M ψ (t n ) ≈ y ψ (t n ).

Importance sampling approach
In importance sampling, draws θ s , s = 1, . . ., S are obtained from another distribution q(θ), and the expectation (Eq.5) can be estimated as where r s = p(θ s |D) q(θ s ) are called importance ratios/weights.Our approach is to use MCMC importance sampling with q(θ) = P Y M (θ), θ , where M is the approximation method used during MCMC.The importance ratios are but as we cannot evaluate P (Y(θ s ), θ s ), exact importance sampling is not possible.Instead, we use ratios where M * is a more accurate method than M .We require that M * is a convergent numerical method, meaning that for any fixed θ, for all n = 1, . . ., N , as M * is made more and more accurate.This means that at each point θ.Consequently, the ratios r M,M * s converge to the exact ratios r M s , and posterior estimates (Eq.8) converge, too.Our approach therefore is to incrementally increase the accuracy of M * until the maximum absolute error has converged2 .In general high-dimensional parameter spaces of ODE models, this step can be done with negligible computational effort compared to the initial MCMC sampling.Furthermore, this analysis can be conveniently done simultaneously as we assess whether MCMC needs to be run again with a more accurate method M .

Pareto smoothing and diagnostics
Importance sampling requires that the distribution q(θ) is sufficiently similar to the target distribution p(θ | D), so that the nominator and denominator in Eq. 8 would have finite variance (Geweke, 1989(Geweke, , 2005;;Koopman et al., 2009).Pareto smoothed importance sampling (PSIS) (Vehtari et al., 2021b) modifies the raw ratios so that the estimator of the expectation has finite variance and asymptotic normality in a wider range of problems (Vehtari et al., 2021b).However, any modifications to the ratios cannot correct for a q(θ) that is too far from p(θ | D), which in our case means that the method M used during MCMC sampling has to be sufficiently accurate.Importantly, PSIS provides a diagnostic that we can use to assess whether this is the case.
In PSIS, a generalized Pareto distribution (GPD) is fitted to match the tail of the empirical distribution of the ratios r M,M * s .The probability density function of the GPD is where u ∈ R is a location parameter, k ∈ R is a shape parameter and σ > 0 is a scale parameter.The cutoff value u = û is first determined as explained in Vehtari et al. (2021b), and k, σ are then fitted to the empirical distribution of the tail r M,M * s |r M,M * s > û.The latter part can be done using the method by Zhang and Stephens (2009), and overall fitting the GPD parameters is computationally very cheap.
The estimated shape parameter k can be used as a diagnostic to determine if the importance sampling estimate (Eq.8) is reliable (Vehtari et al., 2021b).Values k < 0.7 indicate small errors with high probability and good convergence rate with increasing sample size (Vehtari et al., 2021b).Moreover, as we increase the accuracy of M * , we can study the convergence of k as an additional safeguard metric to assess whether the distribution of importance ratios, and therefore any posterior estimates, have converged.

The proposed workflow
The proposed algorithm for MCMC inference of models that require numerical approximations, can be summarized in the following steps.
1. Select a reasonable approximation method M .
3. Compute MAE M,M * (Eq.13) and importance weights r M,M * s (Eq.10) using approximation method M * .Fit k as explained in Section 2.4.4. Increase the accuracy of M * and repeat Step 3 until MAE M,M * and k converge.If k converges to a value larger than 0.7, increase the accuracy of M and go back to Step 2.
5. Compute any posterior estimates using Eq. 8, with r s being the values to which r M,M * s finally converged.
The algorithm can be used in two ways, depending on how Step 1 is done.Firstly, it can be used to validate the reliability, and correct the errors of a given method M , which can be for example a software default.Secondly, a smart initial selection of M can provide substantial speed gains compared to often rather conservatively set software defaults, while still maintaining reliability of the inferences.
In the latter case, we generally recommend selecting M initially so that sampling is as fast as possible.For example, for non-adaptive ODE solvers M = M (h), one can first try using the largest sensible step size h that does not result in immediate failure.Sometimes the solver returns infinite or NaN values or values which are inconsistent with the observation model3 and MCMC cannot proceed, but in this case the sampler will fail fast and not much time is wasted.Selecting a good M is more difficult in the case of adaptive solvers M = M ( ) whose adaptation rules are discontinuous and whose gradients also need to be approximated.In their case, using too large tolerances cause inaccurate gradients and ragged likelihood surfaces, which can cause MCMC to struggle and become slower, and these issues are slower to detect.We discuss this effect in more detail in Appendix B.

Results
We developed an R-package called odemodeling for fitting Bayesian ODE models using Stan (Stan Development Team, 2022) and it is available at https://github.com/jtimonen/odemodeling.In addition to the adaptive built-in ODE solvers of Stan, it implements two explicit Runge-Kutta methods using a fixed but tuneable number of steps.Furthermore, it implements our workflow for determining reliability of ODE model inference and other convenience functions.
We demonstrate our proposed workflow using various ODE solvers and two ODE models, first with simulated and then with real data.In all experiments we use the dynamic HMC algorithm (Betancourt, 2018;Stan Development Team, 2022) implemented in Stan for MCMC sampling.We use initial leap frog4 step size 0.1 and otherwise default configuration, unless otherwise mentioned.In each experiment we run 4 independent MCMC chains for 4000 iterations and the first 2000 iterations are discarded as warmup.Code for reproducing the experiments is available at https://github.com/jtimonen/numapprox-is.Full experiment details are in Appendix C.

Target-mediated drug disposition model
In the first experiment, the model is a target mediated drug disposition (TMDD) model (Mager and Jusko, 2001;Aston et al., 2011).This is a common pharmacokineticpharmacodynamic model for reversible binding of drug/ligand (y 1 ) with receptor (y 2 ), where a receptor-ligand complex (y 3 ) is formed.For solving the ODE system, we use the backward differentiation formulae method (BDF, see Appendix A).We denote the BDF solver with absolute and relative tolerances abs = rel = by BDF( ).The ODE Figure 1: Illustration of the ODE system used in the TMDD experiment.The three columns correspond to (concentrations of) the ligand y 1 , receptor y 2 , and receptor-ligand complex y 3 .a) The black line shows the ODE trajectory used in data simulation, solved using BDF(10 −15 ).The black dots are the generated noisy data.b) ODE solutions solved using BDF(0.02),corresponding to 10 parameter draws from the posterior induced by BDF(0.02).c) ODE solutions corresponding to the same 10 posterior parameter draws as in b), but solved using BDF(10 −12 ).The black and red lines visualized the ODE solutions at a dense set of time points t = 0.00, 0.10, 0.11, 0.12, . . ., 10.00.See detailed explanation of the BDF method and the used implementation in Appendix A.3. solution y BDF(10 −15 ) (t) using the simulation parameters (Appendix C) and the simulated noisy data used in this experiment are visualized in Fig. 1a.
To study the effect of tolerances, we first run MCMC sampling using M = BDF( ) different values.Example ODE solutions from the posterior using = 0.02 are shown in Fig. 1b, and corresponding solutions with = 10 −12 in Fig. 1c.The tolerance value = 0.02 is very high and the solutions look unstable compared to = 10 −12 .
The timing results in Fig. 2 (black line) show that for sufficiently small values (roughly < 0.02) of , the MCMC runtime increases consistently as is decreased.For example, with = 10 −10 , which is the default in Stan, MCMC sampling takes around 10 3 seconds.On the other hand, we can sample from the same distribution by first MCMC sampling Figure 2: Runtime comparison in the TMDD experiment.a) The black line corresponds to running MCMC directly with M = BDF( ), where is on the x-axis.The four colored lines correspond to first running MCMC with M = BDF(0.05),BDF(0.04),BDF(0.03) or BDF(0.02), and additionally computing the importance weights needed to correct resulting posterior estimates as if the draws were from M = BDF( ), where is on the x-axis.The same MCMC algorithm and same number of chains and draws is used in all MCMC sampling.b) The MCMC runtime is almost completely explained by the number of times we need to perform automatic differentiation for the ODE RHS function.The red and blue dots each correspond to one MCMC chain, x-axis showing the number of needed RHS calls with AD and y-axis the chain runtime.
with M = BDF(0.02)and then performing importance sampling with M * = BDF(10 −10 ) in just 10 2 seconds.This time comes almost solely from the MCMC sampling, and we could change M * to even more accurate, with virtually no extra cost.We note that plain runtime of MCMC is not always a good measure of sampling efficiency.In this case, however, it is informative as we always use the same MCMC algorithm, the bulk and tail estimated effective sample sizes (Vehtari et al., 2021a) are very similar in all cases, and convergence diagnostics are satisfactory (see Table 1).
Fig. 1b-c demonstrated that the BDF(0.02)gives suspicious ODE solutions, indicating that we cannot trust it as such and need our reliability workflow.However, quantities in Fig. 3 validate that importance sampling can be trusted for all tested tolerances in M , because the estimate of the shape parameter of the generalized Pareto distribution k stabilizes at a value smaller than 0.5 as M * is made more and more accurate.Performing this reliability check takes an insignificantly small amount of time compared to MCMC sampling.The estimated relative efficiency of the importance sampling phase is close to 1 (Fig. 3d), and we do not significantly lose efficiency due to it.
Fig. 4 illustrates the distribution of the importance ratios and fitting the generalized Pareto distribution (GPD) in different cases.We see that the tail of the ratios becomes The runtime starts increasing if the tolerances are made too large (approx.> 0.03).This means that we cannot start our workflow with being arbitrarily large.To understand the reason for this, we have recorded some HMC NUTS metrics in Table 1.The runtime is explained almost completely by the number of ODE RHS evaluations with AD (Fig. 2b), and too large tolerances cause the leap frog step size to adapt to smaller values, meaning that more leap frog steps and therefore more RHS evaluations with AD are needed.Potential reasons for this can be that the discontinuity of the ODE solver step size adaptation rule becomes more evident with large tolerances, which makes the likelihood surface more discontinuous and complicates sampling.The problems are amplified by the fact that the gradient error grows when tolerances are increased (See Section 2.1.3and Appendix B).

model
In the second experiment, we study a model of predator-prey dynamics between Canadian lynx and snowshoe hare.See Appendix C details about the model and data.In this experiment, we use the adaptive RK45 ODE solver (Appendix A), and use RK45( ) denote it with tolerances rel = abs = .
We first run MCMC sampling using M = RK45( ) with different values.The results in Fig. 5a (black line) show that for small values (approx < 10 −3 ) of , the runtime increases as is decreased.For example, with = 10 −6 , which is the default in Stan, MCMC sampling takes around 10 2.15 ≈ 140 seconds.On the other hand, we can sample from the same distribution by first MCMC sampling with M = RK45(0.001)and then performing importance sampling M * = RK45(10 −6 ) in just 10 1.7 ≈ 50 seconds.
We performed MCMC inference and importance sampling using also the RK4 and midpoint methods for solving the ODE (Appendix A).We use notation RK4(K) and midpoint(K) where K is the number of steps taken between subsequent output time points and thus implicitly determines the step size h.The results in Fig. 5b (purple and orange lines) show that for both methods, the runtime increases consistently with K. Furthermore, we find that using M = midpoint(3) is faster than any tested tolerance values for RK45 and M = RK4(2) is a bit slower than RK45 with = 0.001 or = 10 −4 .Smaller K for either midpoint or RK4 in this case caused the solver to break at initial parameter values and not be able to produce MCMC draws.
Quantities in Fig. 6 validate that our importance sampling workflow can be trusted.The MAE M,M * and other metrics converge as M * is made more accurate, and k values Figure 5: Runtime results in the Lotka-Volterra experiment.a) The black line corresponds to just running MCMC directly with M = RK45( ), where is on the x-axis.The green line corresponds to first using MCMC with M = RK45(0.001),and additionally computing the importance weights against the posterior density induced by RK45( ).b) Using the explicit midpoint and RK4 methods, with fixed number of steps K between output time points.converge to < 0.5 in all cases with close to 1 relative efficiency.The convergence is especially smooth for the non-adaptive solvers, as K is increased for M * .

Discussion
We have shown that the proposed approach is useful for efficient and reliable Bayesian inference of general ODE models.We have demonstrated our workflow in ODE problems, but it is generally applicable also to various other types of models that require approximate numerical methods.
We have shown that by using a less accurate numerical method M during MCMC and importance sampling with increasingly more accurate versions M * of the method can be an order of magnitude faster than plain MCMC sampling with a software default method M * .Moreover, after MCMC sampling with some method M , users can easily and without further computational effort check what the posterior of ODE solutions would look like with any given other solver.
Efficiently selecting the initial method M to be used in our workflow still has some challenges if M has to belong to the class of commonly used adaptive ODE solvers.Inference with these methods can become slower as M is made too inaccurate.These methods have not originally been designed to be used in Bayesian inference, but have been widely adopted into even gradient-based MCMC and optimization software despite their discontinuous adaptation rules, and the fact that their local error control itself is not enough for reliable statistical inference.
Our framework enables also rapid testing of different types of new and old numerical solvers, which do not need to have error control.The only requirement is that they are convergent numerical methods whose accuracy can be controlled.We have shown a large potential and obtained promising results using explicit non-adaptive solvers whose gradient is computed using direct automatic differentiation of the solver formula.

A ODE solvers
Here we give the details about the used ODE solvers and their used implementations.We implemented the explicit non-adaptive RK methods as user-defined functions in Stan.The RK45 and BDF solvers are built into Stan, and we describe how Stan uses the Boost Odeint (Ahnert et al., 2011) and SUNDIALS (Hindmarsh et al., 2005) libraries to obtain both the ODE solutions and their sensitivities.The goal is to focus on how the specified tolerances and other factors affect the approximate solution and sensitivities.We used Stan version 2.28, which depends on Boost version 1.75.0 and SUNDIALS version 5.7.0.

A.1 Explicit Runge-Kutta methods
A general R-stage explicit RK method with step size h uses the update rule where and a r,l , b r , and c r are suitable coefficients.For the MIDPOINT method, the number of stages is R = 2 and the coefficients are a 21 = 1 2 , b 1 = 0, b 2 = 1 and c 2 = 1 2 .For RK4, the number of stages is R = 4 and the coefficients are a 21 = 1 2 , a 31 = 0, a 32 = 1 2 , a 41 = 0,

A.2 RK45
Methods with adaptive step size control can have a varying step size h j .An example is the RK45 method (Dormand and Prince, 1980), which estimates its local truncation error and adapts h j so that a certain requirement based on given absolute ( abs ) and relative ( rel ) tolerances is satisfied.On each step j, a test solution for the next state is computed using two RK formulas, which have order (number of stages) 4 and 5.We denote the current approximation at t j by y RK45 j ∈ R D and the test solution given by order R method at the next point t j+1 = t j + h j by y where f j,d is the dth component of f (y RK45 j , t j ).If v j > 1, the test step is rejected as the step size has to be decreased to satisfy the tolerances.This is done by setting recomputing v j until v j ≤ 1.Once an acceptable step size h j is found, the higher order test solution is then used to advance the integrator, i.e. y RK45 j+1 = y (R+1) j+1 .The next step size is then where the first case means that the step size increases.These update rules are based on theory about the optimal step size updating, and safety modifications that ensure that step size does not change too fast (cf.Hairer et al. (1993)).There exist automatic ways for selecting a good initial step size h 0 (Hairer et al., 1993), but Stan does not attempt that and always sets h 0 = 0.1.
Stan uses the so called dense output version of RK45, for which the maximum step size is not restricted based on the time points where the solution is requested, until for the very last output time point.Instead, an interpolation method (Dormand and Prince, 1986) is used to obtain the solution at the intermediate output time points.However, there is a limit for maximum number of steps that can be taken between two consecutive output time points, for which we used the value 105 .If this limit is reached during MCMC, the current proposal is rejected.
Continuous forward sensitivity analysis of the RK45 solutions is in Stan implemented by using the above strategy as such for the extended system that involves also the forward sensitivity equations (see Section 2.1.3).This means that also the extended system dimensions affect the error control and step size updates similarly as the original dimensions.

A.3 BDF
The backward differentiation formulae (BDF) method is a linear multistep method, and Stan uses the implementation in the CVODES package (Hindmarsh et al., 2021) of the SUNDIALS software suite.Each step j is based on the rule where y BDF j = y BDF (t j ).The order q can vary between 1 and 5, and is adapted simultaneously with the step size h j .The coefficients α j,i , β j are determined by the recent history of the step sizes (Byrne and Hindmarsh, 1975;Jackson and Sacks-Davis, 1980).For this method, estimation of the local truncation error and adaptation rules are highly complex and all details are described in (Hindmarsh et al., 2021).A nonlinear rootfinding problem must be solved at each step to obtain y BDF j , and this is done approximately using modified Newton iteration.This in turn requires solving a linear system, for which Stan uses the dense linear solver facilities of SUNDIALS.In addition to controlling the estimated local truncation error, also the Newton iteration convergence has to be controlled.The stopping criterion for the Newton iterations is in CVODES attempted to set so that its error does not interfere with local error control.
The usage of BDF in Stan involves a hard coded initial step size setting rule which depends on the first requested output time.One effect of this is that the returned solution at a given time point depends not only on the selected tolerances, but also on whether the solution output has been requested at earlier time points.This is why when visualizing the ODE solutions at a dense set of time points in Figure 1, we are not able to solve the system at 0 < t < 0.1, because t = 0.1 is the first observation time point.Requesting a dense solution also before the first data time point would cause the ODE solutions at the data time points differ from those we get when we solve only at the data time points (i.e.what we do when fitting the model).
Also this solver has a limit for the maximum number of steps that can be taken between two consecutive output time points, and we used the value 10 5 .
Continuous forward sensitivity analysis of the BDF solutions is in Stan implemented by using the continuous sensitivities provided by CVODES.They are computed so that the same linear multistep formula is used also for solving the sensitivity equations.Stan selects to employ the staggered corrector method (Feehery et al., 1997), where a separate Newton iteration is used to solve the sensitivity system after the convergence of Newton iteration for the original system.The sensitivity dimensions are included in the error control, which means that they affect the adaptation.The relative tolerance for sensitivity dimensions is the same rel as for the original dimensions, but their absolute tolerances are automatically scaled versions of abs .See more details in Hindmarsh et al. (2021).

B Impact of numerical approximations on sampling efficiency
Numerical approximation procedures may introduce issues for gradient based MCMC methods such as HMC and its variants.These methods aim to reduce the autocorrelation of subsequent draws by numerically simulating the trajectory of a fictitious particle which is accelerated by the negative gradient of the unnormalized log posterior density.The sampling efficiency depends on the average energy error of the simulated particle and decreases sharply if the numerical simulation of its trajectory is not accurate enough (Neal, 2011).However, it is usually assumed that the average energy error can be brought arbitrarily close to zero by lowering the step size used for the numerical simulation of the fictitious particle, and this assumptions is usually relied upon to optimize sampling efficiency (Stan Development Team, 2022).Numerical procedures can cause this assumption to fail in two ways.If the procedure has an adaptive control flow (e.g.iterate until some value passes some threshold), the numerical procedure may introduce discontinuities in the posterior density.If the numerical procedure includes continuous adaptivity (e.g.continuously increasing or decreasing the step size) or if the computation of the sensitivities of its approximate result uses a mathematical identity which only holds for the exact result (e.g. when using the implicit function theorem), then the numerical procedure may introduce a gradient mismatch, meaning that the vector field which gets used to accelerate the fictitious particle is not equal to the gradient of the (approximate) log posterior density.
While the direct method generally does not introduce a gradient mismatch, it can still introduce discontinuities if it is applied to an adaptive method.Forward continuous sensitivity analysis generally introduces both a gradient mismatch and discontinuities if applied to adaptive ODE solvers.However, if it is applied to an explicit ODE solver with a fixed step size configuration it need not do either.Except for edge cases or if handled in a special way, any implicit ODE solver generally introduces both a gradient mismatch and discontinuities, as does continuous adjoint sensitivity analysis.
Whether gradient mismatches or discontinuities cause gradient based MCMC methods to struggle depends mainly on the size of the two effects.HMC and its variants can generally tolerate minor discontinuities or gradient mismatches, if the average energy error can still be made sufficiently small.Generally, stationarity of the (approximate) target distribution is not lost due to discontinuities or gradient mismatches, as neither influences the reversibility or the phase space volume preservation of the commonly used leapfrog integrator (Hairer et al., 2003), which are the main necessities for stationarity of the target distribution (Neal, 2011).

C.1 TMDD model
The dynamics are modeled using the three-dimensional ODE system   2: HMC NUTS diagnostics in the TMDD experiment.The accept_stat column is the average the Metropolis acceptance probability over all Hamiltonian trajectories, stepsize is the average leapfrog step size, treedepth is the average depth of the tree used by the NUTS sampler, n_leapfrog is the average number of leapfrog steps per transition, and divergent is the percentage of divergent transitions out of total postwarmup draws (Stan Development Team, 2022).

Figure 3 :
Figure 3: Convergence of different metrics as M * is made more and more accurate in the TMDD experiment.The a) column shows the maximum absolute error (MAE) between posterior ODE solutions y M and y M * (Eq.13), as a function of the tolerance of M * .The b) column shows the largest importance ratio r M,M * .The c) column shows for Pareto smoothed importance sampling the k diagnostic, and d) column its estimated relative efficiency.The rows correspond to different choices of the method M (see the legend in column d) used during MCMC.

Figure 6 :
Figure 6: Convergence of different metrics during our workflow in the Lotka-Volterra experiment, as the method M * is made more and more accurate.The rows correspond to different choices of the initial ODE solver M (see the legend in column d) used during MCMC.
implementation in Boost odeint 5 , where the step size control is based on the maximum relative estimated local truncation error v y 1 − (k on y 1 y 2 − k off y 3 ) k in − k out y 2 − (k on y 1 y 2 − k off y 3 ) (k on y 1 y 2 − k off y 3 ) − k eP y 3 unknown parameters ψ = {k on , k off , k in , k out , k eL , k eP }.The initial state isy (0) = y(t 0 ) = L 0 , k in kout , 0, where t 0 = 0 and L 0 = 10 is the initial drug bolus.