Flexible and efficient Bayesian pharmacometrics modeling using Stan and Torsten, Part I

Abstract Stan is an open-source probabilistic programming language, primarily designed for Bayesian data analysis. Its main inference algorithm is an adaptive Hamiltonian Monte Carlo sampler, supported by state-of-the-art gradient computation. Stan's strengths include efficient computation, an expressive language that offers a great deal of flexibility, and numerous diagnostics that allow modelers to check whether the inference is reliable. Torsten extends Stan with a suite of functions that facilitate the specification of pharmacokinetic and pharmacodynamic models and make it straightforward to specify a clinical event schedule. Part I of this tutorial demonstrates how to build, fit, and criticize standard pharmacokinetic and pharmacodynamic models using Stan and Torsten.


Introduction
Bayesian inference offers a principled approach to learning about unknown variables from data using a probabilistic analysis. The conclusions we draw are based on the posterior distribution which, in all but the simplest cases, is intractable. We can, however, probe the posterior using a host of techniques, such as Markov chain Monte Carlo sampling and approximate Bayesian computation. Writing these algorithms is a tedious and error-prone endeavor, but fortunately modelers can often rely on existing software with efficient implementations.
In the field of pharmacometrics, statistical software such as NONMEM® [1], Monolix® [2], and the R package nlmixr [3] support many routines to specify and analyze pharmacokinetic and pharmacodynamic population models. There also exist more general probabilistic programming languages, such as BUGS [4] and, more recently, Stan [5], to name only a few examples. This tutorial focuses on Stan. Stan supports a rich library of probability densities and mathematical functions, including matrix operations and numerical solvers for differential equations. These features make for an expressive and flexible language; however, writing common pharmacometrics models can be tedious. Torsten extends Stan by providing a suite of functions to facilitate the specification of pharmacometrics models. These functions make it straightforward to model the event schedule of a clinical trial and to parallelize computation across patients for population models.
This tutorial reviews key elements of a Bayesian modeling workflow in Stan, including model implementation, inference using Markov chain Monte Carlo (MCMC), and diagnostics to assess the quality of our inference and modeling. We assume the reader is familiar with compartment models in pharmacokinetics and pharmacodynamics and has experience with data that describe a clinical event schedule. Since Torsten follows the input conventions in NMTRAN®, experience with NONMEM® is helpful, though not essential. Likewise, exposure to Bayesian statistics and inference algorithms is desirable, in particular an elementary understanding of MCMC.
We introduce programming in Stan and Torsten with the assumption that the reader is familiar with R.

Why Stan?
We believe that Stan, coupled with Torsten, can be an important addition to the pharmacometrician's toolkit, especially for Bayesian data analysis.
The most obvious strength of Stan is its flexibility: it is straightforward to specify priors, systems of ODEs, a broad range of measurement models, missing data models, and complex hierarchies (i.e., population models). Examples of how Stan's flexibility may be leveraged in pharmacometrics include:
• Combining various sources of data and their corresponding measurement models into one large model, over which full Bayesian inference can be performed [e.g. 6]. In a similar vein, it is possible to build complex hierarchical structures, which allow us to simultaneously pool information across various groups, e.g. patients, trials, or countries. We will study such an example in Part II of this tutorial.
• Using a sparsity-inducing prior, such as the horseshoe prior [7,8], to fit models with high-dimensional covariates. This approach has, for example, been used in oncology [9] and is a promising avenue in pharmacogenetics [10].
• Incorporating a non-parametric regression, such as a Gaussian process, to build a translational model for pediatric studies [e.g. 11].
Stan's expressive language plays a crucial part here, since more specialized software does not readily handle the relatively complex structures and priors the above examples require. Secondly, Stan supports state-of-the-art inference algorithms, most notably an adaptive Hamiltonian Monte Carlo (HMC) sampler, a gradient-based Markov chain Monte Carlo (MCMC) algorithm [12] based on the No-U-Turn sampler (NUTS) [13], automatic differentiation variational inference (ADVI) [14], and penalized maximum likelihood estimators. Stan's inference algorithms are supported by a modern automatic differentiation library that efficiently generates the requisite derivatives [15]. It is worth pointing out that algorithms such as NUTS and ADVI were first developed and implemented in Stan, before being widely adopted by the applied statistics and modeling communities. As of the writing of this article, new inference algorithms continue to be prototyped in Stan. Recent examples include adjoint-differentiated Laplace approximations [16], cross-chain warmup [17], and path finding for improved chain initialization [18]. Some of Stan's algorithms are now available in specialized pharmacometrics software. NONMEM® supports an HMC sampler, although certain diagnostics required to assess the quality of HMC, notably for population models, are still missing.
Stan indeed provides a rich set of diagnostics, including the detection of divergent transitions during HMC sampling [12] and improved computation of effective sample sizes and scale reduction factors, R̂ [19], as well as detailed warning messages based on these diagnostics. The automated running of these diagnostics makes the platform more user-friendly and provides much guidance when troubleshooting our model and our inference.
Last but not least, both Stan and Torsten are open-source projects, meaning they are free and their source code can be examined and, if needed, scrutinized.The projects are under active development with new features being added regularly.

Bayesian inference: notation, goals, and comments
Given observed data D and latent variables θ from the parameter space Θ, a Bayesian model is defined by the joint distribution p(D, θ). The latent variables can include model parameters, missing data, and more. In this tutorial, we are mostly concerned with estimating model parameters.
The joint distribution admits a convenient decomposition, p(D, θ) = p(D | θ) p(θ), with p(θ) the prior distribution and p(D | θ) the likelihood. The prior encodes information about the parameters, usually based on scientific expertise or results from previous analyses. The likelihood tells us how the data are distributed for a fixed parameter value and, per one interpretation, can be thought of as a "story of how the data is generated" [20]. The Bayesian proposition is to base our inference on the posterior distribution of the parameters, p(θ | D), and more generally on the posterior distribution of any derived quantity of interest, p(f(θ) | D).
For typical pharmacometric applications, the full joint posterior density of the model parameters is an unfathomable object which lives in a high-dimensional space. Usually we cannot even numerically evaluate the posterior density at any particular point! Instead we must probe the posterior distribution and learn the characteristics that interest us the most. In our experience, this often includes a measure of central tendency and a quantification of uncertainty, for example the mean and the variance, or the median and the 5th and 95th quantiles, for any quantity of interest. For skewed or multimodal distributions, we may want a more refined analysis which looks at many quantiles. What we compute are estimates of these quantities. Most Bayesian inference involves calculations based on marginal posterior distributions. That typically requires integration over a high number of dimensions, an integration that is rarely tractable analytically or by numerical quadrature. One strategy is to generate approximate samples from the posterior distribution and then use the sample mean, sample variance, and sample quantiles as our estimators.
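To make the sample-based strategy concrete, the following Python sketch computes these summaries from posterior draws. It is illustrative only: the draws are synthetic (drawn directly from a lognormal), standing in for the output of an actual MCMC run.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws for a clearance-like parameter (L/h).
# In practice these would come from an MCMC sampler.
draws = rng.lognormal(mean=np.log(10.0), sigma=0.25, size=4000)

# Sample-based estimators of posterior summaries.
posterior_mean = draws.mean()
posterior_sd = draws.std()
q05, q50, q95 = np.quantile(draws, [0.05, 0.50, 0.95])
```

With genuine MCMC output, these are exactly the kinds of summaries that sampling software reports.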
Bayes' rule teaches us that

p(θ | D) = p(D | θ) p(θ) / p(D) = p(D, θ) / p(D).

Typically we can evaluate the joint density in the numerator but not the normalizing constant, p(D), in the denominator. A useful method must therefore be able to generate samples from the posterior, p(θ | D), using only the unnormalized posterior density, p(D, θ). Once we generate a sample θ, we can apply a transformation f to obtain a sample from p(f(θ) | D).
Many MCMC algorithms are designed to generate samples from an unnormalized density. Starting at an initial point, these chains explore the parameter space Θ, one iteration at a time, to produce the desired samples. The first iterations of MCMC are used to find and explore the region in the parameter space where the posterior probability mass concentrates. Only after this initial warmup phase do we begin the sampling phase. Hamiltonian Monte Carlo (HMC) is an MCMC method which uses the gradient to efficiently move across the parameter space [21,12]. Computationally, running HMC requires evaluating log p(D, θ) and ∇_θ log p(D, θ) many times across Θ, i.e. for varying values of θ but fixed values of D. For this procedure to be well-defined, θ must be a continuous variable; otherwise the requisite gradient does not exist. Discrete parameters require special treatment, which we will not discuss in this tutorial.
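For intuition, here is a minimal Python sketch of a single HMC transition for a one-dimensional standard normal target. The step size and number of leapfrog steps are illustrative choices; Stan's sampler additionally adapts these during warmup and applies the No-U-Turn criterion.

```python
import numpy as np

def grad_log_p(theta):
    # gradient of log p(theta) for a standard normal target
    return -theta

def hmc_step(theta, rng, step_size=0.1, n_leapfrog=20):
    rho = rng.normal()  # sample an auxiliary momentum
    theta_new, rho_new = theta, rho
    # leapfrog integration of the Hamiltonian dynamics
    rho_new += 0.5 * step_size * grad_log_p(theta_new)
    for _ in range(n_leapfrog - 1):
        theta_new += step_size * rho_new
        rho_new += step_size * grad_log_p(theta_new)
    theta_new += step_size * rho_new
    rho_new += 0.5 * step_size * grad_log_p(theta_new)
    # Metropolis correction based on the change in total energy
    h_old = 0.5 * theta**2 + 0.5 * rho**2
    h_new = 0.5 * theta_new**2 + 0.5 * rho_new**2
    if np.log(rng.uniform()) < h_old - h_new:
        return theta_new
    return theta

rng = np.random.default_rng(1)
theta = 3.0  # deliberately started far from the typical set
draws = []
for _ in range(2000):
    theta = hmc_step(theta, rng)
    draws.append(theta)
draws = np.asarray(draws[500:])  # discard a warmup phase
```

After warmup, the retained draws behave like (correlated) samples from the standard normal target.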
A Stan program specifies a method to evaluate log p(D, θ). Thanks to automatic differentiation, this implicitly defines a procedure to compute ∇_θ log p(D, θ) [22,23,24]. Together, these two objects provide all the relevant information about our model to run HMC sampling and other gradient-based inference algorithms.

Bayesian workflow
Bayesian inference is only one step of a broader modeling process, which we might call the Bayesian workflow [12,25,26]. Once we fit the model, we need to check the inference and, if needed, fine-tune our algorithm or potentially change methods. And once we trust the inference, we naturally need to check the fitted model. Our goal is to understand the shortcomings of our model and motivate useful revisions. During the early stages of model development, this mostly comes down to troubleshooting our implementation; later, this "criticism" step can lead to deeper insights.
Throughout the tutorial, we will demonstrate how Stan and Torsten can be used to check our inference and our fitted model.

Setting up Stan and Torsten
Detailed instructions on installing Stan and Torsten can be found at https://github.com/metrumresearchgroup/Torsten. At its core, Stan is a C++ library, but it can be interfaced with one of many scripting languages, including R, Python, and Julia. Running Stan requires a modern C++ compiler, such as g++ 8.1 (provided by RTools 4.0 on Windows), and the GNU Make utility on Mac or its Windows equivalent, mingw32-make. More details on setting up the working environment can be found in [27]. We will use cmdstanr, a lightweight wrapper of Stan in R, along with the packages posterior [28], bayesplot [29], and loo [30]. We generate most of the figures in this paper using bayesplot, though at times we trade convenience for flexibility and fall back on ggplot2 [31].
The R and Stan code for all the examples are available at https://github.com/metrumresearchgroup/torsten_tutorial_1_supplementary.

Resources
Helpful reads include the Stan User Manual [32] and the Torsten User Manual [33]. Statistical Rethinking by McElreath (2020) [34] provides an excellent tutorial on Bayesian analysis that may be used for self-learning. A comprehensive textbook on Bayesian modeling is Bayesian Data Analysis by Gelman et al (2013) [20], with more recent insights on the Bayesian workflow provided by Gelman et al (2020) [26]. Betancourt (2018) [12] offers an accessible discussion of MCMC methods, with an emphasis on HMC.

Two compartment model
As a starting example, we demonstrate the analysis of longitudinal plasma drug concentration data from a single individual using a linear two-compartment model with first order absorption.The individual receives multiple doses at regular time intervals and the plasma drug concentration is recorded over time.Our goal is to estimate the posterior distribution of the parameters of the model describing the time course of the plasma drug concentrations in this individual.

Pharmacokinetic model and clinical event schedule
Let us assume an individual receives a drug treatment of 1200 mg boluses q12h × 14 doses. Drug concentrations are measured in plasma obtained from blood sampled at 0.083, 0.167, 0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4, 6, and 8 hours following the first, second, and final dose. In addition, we take measurements before each drug intake, as well as 12, 18, and 24 hours following the last dose. We analyze the data using a two-compartment model with first-order absorption, with
• u(t): drug amount in each compartment (mg),
• ka: absorption rate constant (h−1).

Table 1: Variables used to specify an event schedule. (ii: for events with multiple dosing, the inter-dose interval.)
Both intervention and measurement events are described by the event schedule. Stan does not have any reserved variable names, but in this tutorial we follow the NONMEM convention and specify events using the variable names in Table 1. More details can be found in the Torsten User Manual.

Statistical model
Given a treatment, x, and the physiological parameters, {k_a, Q, CL, V_cent, V_peri}, we compute the drug amounts u by solving the two-compartment ODE. We use y to denote the measured drug concentration and ĉ (= u/V_cent) the model-predicted drug concentration. We model the residual error between y and ĉ using a lognormal distribution, y ∼ lognormal(log ĉ, σ), where σ is a scale parameter we wish to estimate. The deterministic computation of ĉ, along with the measurement model, defines our likelihood function p(y | θ, x), where θ = {k_a, CL, Q, V_cent, V_peri, σ} and x is the input data, i.e. the clinical event schedule. Note that we are not limited to the above simple model. Stan provides many distributions [35] and can encode more complex residual models, such as combined proportional and additive error models. It remains to define a prior distribution, p(θ). Our prior should allocate probability mass to every plausible parameter value and exclude patently absurd values. For example, the volume of the central compartment is on the order of ten liters, but it cannot be the size of the Sun. In this simulated example, our priors for the individual parameters are based on population estimates from previous (hypothetical) studies.
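The lognormal measurement model can be sketched in a few lines. The following Python fragment is illustrative only: the values of the predicted concentration and of σ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Lognormal residual-error model: observations scatter around the model
# prediction c_hat, with scale sigma on the log scale.
c_hat = 4.0    # model-predicted concentration (mg/L), illustrative value
sigma = 0.2    # illustrative residual scale

y = np.exp(rng.normal(loc=np.log(c_hat), scale=sigma, size=10000))

# The median of the simulated observations equals c_hat, and all values
# are positive -- a natural fit for concentration data.
```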

Specifying a model in Stan
We can now specify our statistical model using a Stan file, which is divided into coding blocks, each with a specific role.From R, we then run inference algorithms which take this Stan file as an input.

Data and parameters block
To define a model, we need a procedure which returns the log joint distribution, log p(D, θ). Our first task is to declare the data, D, and the parameters, θ, using the coding blocks data and parameters. It is important to distinguish the two. The data is fixed. By contrast, the parameter values change as HMC explores the parameter space, and gradients of the joint density are computed with respect to θ, but not D.
For each variable we introduce, we must declare a type and, for containers such as arrays, vectors, and matrices, the size of the container [37, Ch. 5]. In addition, each statement ends with a semi-colon. It is possible to specify constraints on the parameters using the keywords lower and upper. If one of these constraints is violated, Stan returns an error message. More importantly, constrained parameters are transformed into unconstrained parameters (for instance, positive variables are put on the log scale), which greatly improves computation.

  // observed drug concentration
  vector<lower = 0>[nObs] cObs;
}

parameters {
  real<lower = 0> CL;
  real<lower = 0> Q;
  real<lower = 0> VC;
  real<lower = 0> VP;
  real<lower = 0> ka;
  real<lower = 0> sigma;
}

Model block
Next, the model block allows us to modify the variable target, which Stan recognizes as the log joint distribution. The following statement increments target using the prior on σ, which is a normal density, truncated at 0 to only put mass on positive values.

target += normal_lpdf(sigma | 0, 1);
The truncation is implied by the fact that σ is declared as lower-bounded by 0 in the parameters block. An alternative syntax is the following:

sigma ~ normal(0, 1);

This statement now looks like our statistical formulation and makes the code more readable. But we should be mindful that this is not a sampling statement, rather an instruction on how to increment target. We now give the full model block. The likelihood statement involves a crucial term we have not defined yet: concentrationHat. Additional variables can be created using the transformed data and transformed parameters blocks. We will take advantage of these to compute the drug concentration in the central compartment for each event. Note that for the likelihood, we only use the concentration during observation events, hence the indexing [iObs].

Transformed data and transformed parameters block
In transformed data, we can construct variables which depend only on the data. For this model, we simply specify the number of compartments in our model (including the gut), nCmt, and the number of pharmacokinetic parameters, nTheta, two variables which will come in handy shortly.
transformed data {
  int nCmt = 3;
  int nTheta = 5;
}

Because the data is fixed, this operation is computed only once. By contrast, operations in the transformed parameters block need to be performed (and differentiated) for each new parameter value.
To compute concentrationHat, we need to solve the relevant ODE within the clinical event schedule. Torsten provides a function which returns the drug mass in each compartment at each time point of the event schedule.
matrix<lower = 0>[nCmt, nEvent] mass
    = pmx_solve_twocpt(time, amt, rate, ii, evid, cmt, addl, ss, theta);

The first eight arguments define the event schedule, and the last argument, theta, is an array containing the pharmacokinetic parameters, defined as follows:

real theta[nTheta] = {CL, Q, VC, VP, ka};

It is also possible to have theta change between events, and to specify lag times and bioavailability fractions, although we will not take advantage of these features in the example at hand. The Torsten function we have chosen to use solves the ODEs analytically. Other routines use a matrix exponential, a numerical solver, or a combination of analytical and numerical methods [38]. It now remains to compute the concentration in the central compartment at the relevant times. The full transformed parameters block is as follows:

Calling Stan from R
The R package cmdstanr allows us to run a number of algorithms on a model defined in a Stan file. An excellent place to get started with the package is https://mc-stan.org/cmdstanr/articles/cmdstanr.html.
The first step is to "transpile" the file (call it twocpt.stan), that is, translate it into C++ and then compile it.

mod <- cmdstan_model("twocpt.stan")

We can then run Stan's HMC sampler by passing in the requisite data and providing other tuning parameters, here: (i) the number of Markov chains (which we run in parallel), (ii) the initial value for each chain, (iii) the number of warmup iterations, and (iv) the number of sampling iterations.

fit <- mod$sample(data = data,
                  chains = n_chains,
                  parallel_chains = n_chains,
                  init = init,
                  iter_warmup = 500,
                  iter_sampling = 500)

By default, Stan uses 1,000 warmup iterations and 1,000 sampling iterations. Empirically, these defaults work well across a broad range of models when running an adaptive HMC sampler. For relatively simple models, we may even use shorter warmup and sampling phases, as we have done above. This should be contrasted with random walk MCMC, such as the Gibbs sampler in BUGS, where it is typical to run 5,000 or even 10,000 iterations per phase. Random walk MCMC tends to generate Markov chains with higher autocorrelation than HMC, hence the need to run more iterations. In the next two sections, we will discuss diagnostics which can be used to adjust the length of the warmup and sampling phases.
There are several other arguments we can pass to the sampler, which we will take advantage of throughout the tutorial. For applications in pharmacometrics, we recommend specifying the initial starting points via the init argument, as the defaults may not be appropriate. In this tutorial, we draw the initial points from their priors by defining an appropriate R function.
The resulting fit object stores the samples generated by HMC, from which we can compute the sample mean, sample variance, and sample quantiles of our posterior distribution. This information is readily accessible using fit$summary() and summarized in Table 2. We could also extract the samples and perform any number of operations on them.

Checking our inference
Unfortunately there is no guarantee that a particular algorithm will work across all the applications we will encounter. We can, however, make sure that certain necessary conditions hold. Recall that we estimate posterior expectations with sample estimators,

E[f(θ) | D] ≈ (1/n) Σ_{i=1}^n f(θ^(i)),

for some samples θ^(1), θ^(2), …, θ^(n). When constructing such estimators using MCMC samples, rather than with exact independent samples, we must account for the fact that our samples are correlated and biased.

Checking for convergence with R̂
MCMC samples are biased because Markov chains generate correlated samples, meaning any sample has some correlation with the initial point.If we run the algorithm for enough iterations, the correlation to the initial point becomes negligible and the chain "forgets" its starting point.But what constitutes enough iterations?
To monitor bias, we run multiple Markov chains, each started at a different point, and check that they all converge to the same region of the parameter space. One way to check this is to compute the R̂ statistic, for which we provide an intuitive definition:

R̂ (intuitively) = (total variance across all chains) / (average within-chain variance).
If the chains are mixing properly, both the numerator and denominator measure the posterior variance, and R̂ converges to 1.0 as n increases. Moreover, we want R̂ ≈ 1.0, as is the case in Table 2. Stan uses an improved R̂ statistic described in a recent paper by Vehtari et al (2020) [19]. We can also visually check that the chains are properly mixing using a trace plot (Figure 1). If R̂ ≫ 1 and, more generally, if the chains were not mixing, this would be cause for concern and an invitation to adjust our inference method. One potential solution is to increase the warmup length. Even when R̂ ≈ 1, we should entertain the possibility that all the chains suffer from the same bias.
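The intuitive definition above can be sketched in a few lines of Python. This is a simplified version for illustration; Stan's R̂ additionally splits each chain in half and uses rank-normalization [19].

```python
import numpy as np

def rhat(chains):
    """Intuitive R-hat: total variance over within-chain variance.

    chains: array of shape (n_chains, n_draws).
    """
    n = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()
    between = n * chains.mean(axis=1).var(ddof=1)
    total = (n - 1) / n * within + between / n
    return np.sqrt(total / within)

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))                   # four well-mixed chains
stuck = mixed + np.array([[0.], [0.], [0.], [5.]])   # one chain far from the rest
```

For the well-mixed chains, rhat(mixed) is close to 1; shifting one chain away from the others drives the statistic well above 1.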

Controlling the variance of our estimator
Let's assume that our warmup phase is long enough and the bias negligible. The expected error of our sample estimator is now determined by the variance. Under certain regularity conditions, our estimator follows an MCMC central limit theorem,

(1/n) Σ_{i=1}^n f(θ^(i)) ≈ Normal(E[f(θ) | D], Var[f(θ) | D] / n_eff),   (2)

where n_eff is the effective sample size, denoted ESS_bulk in Table 2. Deviations from this approximation have order O(1/n_eff²). In the limiting case where we generate independent samples, n_eff = n; however, when samples exhibit correlation, n_eff < n and the variance of our sample estimator increases. For CL, we have 2,000 samples, but the effective sample size is 1,580 (Table 2). If n_eff is low, our estimator may not be precise enough and we should lengthen the sampling phase to generate more samples.
Achieving n_eff ≈ 100 is, in our experience, usually sufficient in an applied setting. This means that the variance of the sample estimator is 1% of that of the posterior, as can be seen from Equation (2). At this point the uncertainty is dominated by the intrinsic posterior variance rather than by the error in our inference procedure. If the effective sample size is below 100 for certain quantities, Stan issues a warning message.
The effective sample size is only formally defined in the context of estimators for expectation values. We may also be interested in tail quantities, such as extreme quantiles, which are more difficult to estimate and require many more samples to achieve a desired precision. Vehtari et al (2020) [19] propose a generalization of the effective sample size for such quantities and introduce the tail effective sample size. This is to be distinguished from the traditional effective sample size, henceforth the bulk effective sample size. Both quantities are reported by Stan.
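A crude sketch shows how autocorrelation deflates the effective sample size. The estimator below truncates the autocorrelation sum at the first negative term, a simplified version of the usual rule; Stan's estimator is more careful and also combines information across chains.

```python
import numpy as np

def ess(x):
    """Crude bulk effective sample size from the empirical autocorrelation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    acov = np.correlate(x, x, mode="full")[n - 1:] / n
    rho = acov / acov[0]
    tau = 1.0                      # integrated autocorrelation time
    for k in range(1, n):
        if rho[k] < 0:             # stop at the first negative estimate
            break
        tau += 2 * rho[k]
    return n / tau

rng = np.random.default_rng(0)
iid = rng.normal(size=2000)        # independent draws: n_eff close to n
ar = np.empty(2000)                # AR(1) chain with strong autocorrelation
ar[0] = 0.0
for t in range(1, 2000):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()
```

The independent draws yield an effective sample size near n = 2000, while the highly autocorrelated chain yields a much smaller one.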

Checking the model: posterior predictive checks
Once we develop enough confidence in our inference, we still want to check our fitted model. There are many ways of doing this. We may look at the posterior distribution of an interpretable parameter and see if it suggests implausible values. Or we may evaluate the model's ability to perform a certain task, e.g. classification or prediction, as is often done in machine learning. In practice, we find it useful to do posterior predictive checks (PPCs), that is, simulate data from the fitted model and compare the simulations to the observed data [39, chapter 6]. Mechanically, the procedure is straightforward:
1. Draw the parameters from their posterior, θ̃ ∼ p(θ | y).
2. Draw new observations from the likelihood, conditional on the sampled parameters, ỹ ∼ p(ỹ | θ̃).
This amounts to drawing observations from their posterior distribution, that is, ỹ ∼ p(ỹ | y). Both the uncertainty due to our estimation and the uncertainty due to our measurement model propagate to our predictions. Stan provides a generated quantities block, which allows us to compute values based on the sampled parameters. In our two compartment model example, the following code draws predicted observations from the likelihood:

generated quantities {
  real concentrationObsPred[nObs]
      = lognormal_rng(log(concentrationHat[iObs]), sigma);
}
We generate predictions at the observed points for each sampled parameter, θ^(i). This gives us a sample of predictions, and we can use the 5th and 95th quantiles to construct a credible interval. We may then plot the observations and the credible intervals (Figure 2) and see that, indeed, the data generated by the model is consistent with the observations.
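The quantile-based construction of these predictive intervals can be sketched as follows. The Python fragment is illustrative only: the predicted concentrations and posterior draws of σ are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# For each posterior draw we simulate a replicated observation, then
# summarize the replicates with 5th and 95th percentiles at each point.
n_draws, n_obs = 1000, 12
c_hat = np.linspace(4.0, 1.0, n_obs)                   # predicted concentrations
sigma = rng.lognormal(np.log(0.2), 0.1, size=n_draws)  # posterior draws of sigma

y_rep = np.exp(rng.normal(np.log(c_hat), sigma[:, None]))  # (n_draws, n_obs)
lower, upper = np.quantile(y_rep, [0.05, 0.95], axis=0)

# Observed data should mostly fall inside these 90% predictive intervals.
```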

Comparing models: leave-one-out cross validation
Beyond model criticism, we may be interested in model comparison. Continuing our running example, we compare our two compartment model to a one compartment model, which is also supported by Torsten via the pmx_solve_onecpt routine. The corresponding posterior predictive checks are shown in Figure 3.
There are several ways of comparing models, and which method is appropriate crucially depends on the insights we wish to gain. If our goal is to assess a model's ability to make good out-of-sample predictions, we may consider Bayesian leave-one-out (LOO) cross-validation. The premise of cross-validation is to exclude a point, (y_i, x_i), from the training set, i.e. the set of data to which we fit the model. Here x_i denotes the covariate, in our example the relevant row in the event schedule. We denote the reduced data set y_−i. We then generate a prediction ỹ_i at x_i using the fitted model, and compare ỹ_i to y_i. A classic metric for this comparison is the squared error, (ỹ_i − y_i)².
Another approach is to use the LOO estimate of out-of-sample predictive fit,

elpd_loo = Σ_{i=1}^n log p(y_i | y_−i).

Here, no prediction is made. We instead examine how consistent an "unobserved" data point is with our fitted model. Computing this estimator is expensive, since it requires fitting the model to n different training sets in order to evaluate each term in the sum. Vehtari et al (2016) [40] propose an estimator of elpd_loo which uses Pareto smoothed importance sampling and requires only a single model fit. The premise is to compute log p(y_i | y) and correct this value, using importance sampling, to estimate log p(y_i | y_−i).
Naturally this estimator may be inaccurate. What makes this tool so useful is that we can use the Pareto shape parameter, k, to assess how reliable the estimate is. In particular, if k > 0.7, then the estimate shouldn't be trusted. The estimator is implemented in the R package loo; see [30] for more details, including its connection and comparison to the widely applicable information criterion (WAIC). Conveniently, we can compute the requisite pointwise log likelihoods, log p(y_i | θ), in Stan's generated quantities block.

vector[nObs] log_lik;
for (i in 1:nObs)
  log_lik[i] = lognormal_lpdf(cObs[i] | log(concentrationHat[iObs[i]]), sigma);
These results can then be extracted and fed into loo to compute elpd_loo. The file twoCpt.r in the Supplementary Material shows exactly how to do this. Figure 4 plots the estimated elpd_loo, along with a standard deviation, and shows that the two compartment model has better out-of-sample predictive capabilities.
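For intuition, the raw importance-sampling identity that loo stabilizes with Pareto smoothing can be sketched as follows. The Python fragment is illustrative: the log-likelihood matrix is synthetic, standing in for the extracted log_lik draws, and the raw estimator shown here is less stable than the smoothed version in loo.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_mean_exp(a, axis=0):
    # numerically stable log of the mean of exp(a)
    m = a.max(axis=axis)
    return m + np.log(np.mean(np.exp(a - m), axis=axis))

# log_lik[s, i] stands in for log p(y_i | theta^(s)) across posterior draws s,
# which is what the log_lik variable in generated quantities produces.
n_draws, n_obs = 2000, 12
log_lik = rng.normal(loc=-1.0, scale=0.3, size=(n_draws, n_obs))

# Importance-sampling identity: p(y_i | y_{-i}) is approximately the
# harmonic mean of p(y_i | theta^(s)) over posterior draws.
elpd_loo_i = -log_mean_exp(-log_lik, axis=0)
elpd_loo = elpd_loo_i.sum()
```

By Jensen's inequality, each pointwise LOO term is no larger than the corresponding in-sample log predictive density, reflecting the penalty for predicting a held-out point.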

Two compartment population model
We now consider the scenario where we have data from multiple patients and fit a population model. Population models are a powerful tool to capture the heterogeneity between patients, while also recognizing similarities. Building the right prior allows us to pool information between patients, the idea being that what we learn from one patient teaches us something, though not everything, about the other patients. In practice, such models can frustrate inference algorithms and need to be implemented with care [41]. We start with an example where the interaction between the model and our MCMC sampler is well behaved. In Part II of this tutorial, we will examine a more difficult case, for which we will leverage Stan's diagnostic capabilities in order to run reliable inference.

Statistical model
Let ϑ be the 2D array of body weight-normalized pharmacokinetic parameters for each patient, with ϑ_j the parameters for the jth patient. We construct a population model by introducing random variation to describe otherwise unexplained inter-individual variability. In a Bayesian context, this is sometimes referred to as a prior distribution for the individual parameters,

log ϑ_j ∼ Normal(log ϑ_pop, Ω).

As before, we work on the log scale to account for the fact that the pharmacokinetic parameters are constrained to be positive.
Here, ϑ_pop is the population mean (on the logarithmic scale) and Ω the population covariance matrix. Both ϑ_pop and Ω are estimated. In this example, we start with the simple case where Ω is diagonal. For our example we will also use conventional allometric scaling to adjust the clearance and volume parameters for body weight. The likelihood remains mostly unchanged, with the caveat that it must now be computed for each patient. Putting this all together, we have the following model, as specified by the joint distribution:

ϑ_pop ∼ p(ϑ_pop), (prior on pharmacokinetic parameters)
Ω ∼ p(Ω). (prior on population covariance)
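To illustrate what this population distribution implies, the following Python sketch simulates individual parameters under a diagonal Ω and conventional allometric scaling (exponent 0.75 for clearances, 1 for volumes). All numerical values are hypothetical, not the tutorial's estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative population values for (CL, Q, VC, VP) at a 70 kg reference weight.
theta_pop = np.log([10.0, 15.0, 35.0, 105.0])
omega = np.array([0.25, 0.25, 0.25, 0.25])   # diagonal of Omega (sd, log scale)
allo = np.array([0.75, 0.75, 1.0, 1.0])      # allometric exponents

n_subj = 500
weight = rng.uniform(50, 100, size=n_subj)   # hypothetical body weights (kg)

# individual log parameters: population mean + allometric shift + random effect
eta = rng.normal(size=(n_subj, 4)) * omega
log_theta = theta_pop + allo * np.log(weight / 70.0)[:, None] + eta
theta_ind = np.exp(log_theta)                # strictly positive, as required
```

Heavier individuals receive systematically larger clearances and volumes, while eta captures the remaining unexplained inter-individual variability.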

Specifying the model in Stan
We begin by adjusting our parameters block. The declaration for k_a,pop illustrates that constraints may be expressions which include other variables in the model. In this case, k_a,pop is constrained to avoid identifiability problems due to "flip-flop".
The variable ϑ_pop is introduced in transformed parameters, mostly for convenience. It remains to compute concentrationObs. There are several ways to do this and, depending on the computational resources available, we may compute the concentration for each patient either sequentially or in parallel. For now, we take the simpler sequential approach. In the upcoming Part II of this tutorial, we examine how Torsten offers easy-to-use parallelization for population models.
Sequentially computing the concentration is a simple matter of bookkeeping. In transformed parameters we loop through the patients using a for loop. The code is identical to what we used in Section 2.3.3, with the caveat that the arguments to pmx_solve_twocpt are now indexed to indicate for which patient we compute the drug mass. For example, assuming the event schedule is ordered by patient, the event times corresponding to the jth patient are given by time[start[j]:end[j]], where start[j] and end[j] contain the indices of the first and last event for the jth patient, and the syntax for indexing is as in R. The full for loop ends by converting the drug mass in the central compartment to a concentration:

    mass[2, start[j]:end[j]] / VC[j];
}

Note that the last vector argument in pmx_solve_twocpt is generated using the {} syntax.
Once we have written our Stan model, we can apply the same methods for inference and diagnostics as we did in the previous section.

Posterior predictive checks
We follow the exact same procedure as in Section 2.6, even using the same lines of code, to simulate new observations for the same patients we analyzed. Figure 5 plots posterior predictions for each individual patient. In addition, we simulate new observations for hypothetical new patients by: (i) drawing pharmacokinetic parameters from our population distribution, (ii) solving the ODEs with these simulated parameters, and (iii) using our measurement model to simulate new observations. Those predictions are also shown in Figure 5 for each individual.
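The three steps above might be coded as follows in generated quantities (a sketch under assumed conventions: a diagonal Ω expressed through the standard deviations omega, and an observation index iObs mapping measurements to events):

```stan
generated quantities {
  row_vector[nt] concentrationNew;
  real cObsNewPred[nObs];

  for (j in 1:nSubjects) {
    // (i) draw pharmacokinetic parameters for a hypothetical new patient
    //     from the (log-normal, diagonal Omega) population distribution
    real CLNew = exp(normal_rng(log(CL_pop), omega[1]));
    real QNew  = exp(normal_rng(log(Q_pop),  omega[2]));
    real VCNew = exp(normal_rng(log(VC_pop), omega[3]));
    real VPNew = exp(normal_rng(log(VP_pop), omega[4]));
    real kaNew = exp(normal_rng(log(ka_pop), omega[5]));

    // (ii) solve the ODEs for this patient's event schedule
    matrix[3, end[j] - start[j] + 1] massNew
      = pmx_solve_twocpt(time[start[j]:end[j]], amt[start[j]:end[j]],
                         rate[start[j]:end[j]], ii[start[j]:end[j]],
                         evid[start[j]:end[j]], cmt[start[j]:end[j]],
                         addl[start[j]:end[j]], ss[start[j]:end[j]],
                         {CLNew, QNew, VCNew, VPNew, kaNew});
    concentrationNew[start[j]:end[j]] = massNew[2] / VCNew;
  }

  // (iii) simulate new observations with the measurement model
  for (i in 1:nObs)
    cObsNewPred[i] = lognormal_rng(log(concentrationNew[iObs[i]]), sigma);
}
```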
It is worth noting that the computational cost of running operations in generated quantities is relatively small. While these operations are executed once per iteration in order to generate posterior samples of the generated quantities, operations in the transformed parameters and model blocks are run and differentiated multiple times per iteration, meaning the latter amply dominate the computation. Hence the cost of doing posterior predictive checks, even when it involves solving ODEs, is marginal. The computational scaling of Stan, notably for ODE-based models, is discussed in the article by Grinsztajn et al. (2021) [42].
For this simple population PK modeling example, with a uniform study design for all individuals, the PPCs shown in Figures 5 and 6 are arguably sufficient model diagnostics. In cases where the study design and patient populations are more heterogeneous, methods that adjust for such heterogeneity are desirable. NPDEs [43] are commonly used in the maximum likelihood context and could be applied to point predictions from Bayesian models, e.g., posterior mean or median predictions. A similar approach, termed probability integral transforms (PITs), is used for Bayesian model checking [26,25].
Standard PPCs that use the same data for model fitting and model checking may be over-optimistic, particularly when applied to highly flexible or overparameterized models. This may be remedied by using out-of-sample predictions for PPCs and PITs. In the context of population models, this means fitting the model to data from a subset of individuals and predicting outcomes for the remaining individuals. This may be done for an entire data set using cross-validation. However, generating a cross-validation predictive distribution is computationally expensive and may often be impractical.

Nonlinear pharmacokinetic / pharmacodynamic model

Now let us consider a PKPD model described in terms of a nonlinear ODE system that requires the use of a numerical solver. The patient receives multiple doses at regular time intervals and the drug plasma concentration is recorded over time.

Nonlinear ODE model in pharmacokinetics / pharmacodynamics
In this last example, we return to the single-patient two-compartment model and extend it with a PD model. Specifically, we examine the Friberg-Karlsson semi-mechanistic model for drug-induced myelosuppression [44,45,46,47,48,49], with the goal of modeling the relation between neutrophil counts and drug exposure.
The model describes a delayed feedback mechanism that keeps the absolute neutrophil count (ANC) at the baseline (Circ_0) in a circulatory compartment (y_circ), as well as the drug effect that perturbs this mechanism. The delay between proliferative cells (y_prol) and y_circ is modeled by three transit compartments with mean transit time MTT = (3 + 1)/k_tr, where k_tr is the transit rate constant. Figure 7 summarizes the model (see also [44, Figure 2]). The PD likelihood is ANC ∼ logNormal(log(y_circ), σ_ANC), where ĉ = y_cent / V_cent is the drug concentration calculated from the PK model and f_FK denotes the solution of the nonlinear ODE system. We use E_drug = min(1, α ĉ) to model the linear effect of the drug once it has been absorbed in the central compartment. This effect reduces the proliferation rate and induces a reduction in neutrophil count. The upper bound of 1 on E_drug excludes the scenario where the feedback loop is flipped if ĉ becomes too large. While we expect that for any reasonable parameter values E_drug < 1, we should anticipate the possibility that our Markov chains may encounter less well-behaved values as they explore the parameter space. Encoding such constraints can lead to improved numerical stability when solving the ODE. We obtain the complete ODE system for the PKPD model by coupling Equations (1) and (4). Because the system is nonlinear, we can no longer resort to analytical solutions as we have done in the previous sections.

Numerically solving ODEs
To numerically solve an ODE in Stan, we first need to define a function that returns the right-hand side of the ODE, i.e. the derivative of the solution, in the functions block. The functions block allows users to define functions and is written at the top of the Stan file, before the data block.

functions {
  vector twoCptNeutModelODE(real t, vector y, real[] parms,
                            real[] rdummy, int[] idummy) {
    real CL = parms[1];
    real Q = parms[2];
    real V1 = parms[3];
    real V2 = parms[4];
    real ka = parms[5];
    real mtt = parms[6];
    real circ0 = parms[7];
    real gamma = parms[8];
    real alpha = parms[9];
    real k10 = CL / V1;
    real k12 = Q / V1;
    real k21 = Q / V2;
    real ktr = 4 / mtt;
    vector[8] dydt;
    real conc = y[2] / V1;
    real EDrug = fmin(1.0, alpha * conc);
    real prol = y[4] + circ0;
    real transit1 = y[5] + circ0;
    real transit2 = y[6] + circ0;
    real transit3 = y[7] + circ0;
    real circ = fmax(machine_precision(), y[8] + circ0);
    dydt[1] = -ka * y[1];
    dydt[2] = ka * y[1] - (k10 + k12) * y[2] + k21 * y[3];
    dydt[3] = k12 * y[2] - k21 * y[3];
    dydt[4] = ktr * prol * ((1 - EDrug) * (circ0 / circ)^gamma - 1);
    dydt[5] = ktr * (prol - transit1);
    dydt[6] = ktr * (transit1 - transit2);
    dydt[7] = ktr * (transit2 - transit3);
    dydt[8] = ktr * (transit3 - circ);
    return dydt;
  }
}

The above function is an almost direct translation of Equations (1) and (4). The first three components of dydt describe the PK. The next five components of dydt describe the PD minus the baseline Circ_0. Writing the ODE as a difference from the baseline means the initial PD conditions are 0, as opposed to a parameter-dependent value. This results in better computation, because derivatives of the ODE solution with respect to the initial conditions no longer need to be computed; for more details, see [42, Section 5.2]. In addition, we encode a constraint on the circulatory compartment, y_circ ≥ ε, where ε is the machine precision and can be interpreted as the smallest nonzero number the computer can handle. This improves numerical stability, especially during the early stages of MCMC exploration, when we may need to handle somewhat implausible parameter values.

Stan and Torsten provide several numerical solvers. In this example we use the Runge-Kutta solver pmx_solve_rk45 [33, Section 3.4]. The signature of pmx_solve_rk45 is a bit more sophisticated than that of pmx_solve_twocpt and requires the following arguments:
1. the name of the user-defined ODE function (twoCptNeutModelODE),
2. the number of states/compartments in the ODE,
3. the event schedule,
4. the bioavailability fraction, F, and the dosing lag time, t_lag, for each compartment (optional),
5. the tuning parameters for the ODE solver (optional).

Because arguments are nameless in Stan, we can only pass the ODE tuning parameters if we also pass F and t_lag. By setting F to 1 and t_lag to 0 for each compartment, we essentially ignore their effect. This is best done in the transformed data block:

int<lower = 1> nCmt = 8;
real F[nCmt] = rep_array(1.0, nCmt);
real tLag[nCmt] = rep_array(0.0, nCmt);

Numerical solvers in Stan and Torsten admit three tuning parameters:
• rtol: the relative tolerance used to determine solution convergence,
• atol: the absolute tolerance used to determine solution convergence,
• max_num_step: the maximum number of steps allowed.
Though Stan and Torsten provide default values, we highly recommend that the user define the ODE solver control parameters in the data block:

real<lower = 0> rtol;
real<lower = 0> atol;
int<lower = 0> max_num_step;

Users should make problem-dependent decisions on rtol and atol, according to the expected scale of the unknowns, so that the numerical error does not affect our inference. For example, when an unknown can be neglected below a certain threshold without affecting the rest of the dynamic system, setting atol greater than that threshold avoids spurious and error-prone computation. For more details, see [37, Chapter 13], [33, Section 3.7.5], and the references therein.
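With these controls in place, the solver call in transformed parameters might look as follows (a sketch: the argument order follows the list above, and theta is an assumed array holding the nine ODE parameters):

```stan
transformed parameters {
  matrix[nCmt, nt] x;
  row_vector[nt] concentration;
  row_vector[nt] neutrophil;

  x = pmx_solve_rk45(twoCptNeutModelODE, nCmt,
                     time, amt, rate, ii, evid, cmt, addl, ss,
                     theta,    // {CL, Q, V1, V2, ka, mtt, circ0, gamma, alpha}
                     F, tLag,  // bioavailability and lag time (set to 1 and 0)
                     rtol, atol, max_num_step);

  concentration = x[2] / V1;     // drug concentration in the central compartment
  neutrophil = x[8] + circ0;     // add back the baseline to recover the ANC
}
```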
As before, we solve the ODE within the event schedule in the transformed parameters block. This approach applies to all models that involve ODE solutions, but we will not use it here. An astute observer may have noticed that the PKPD model at hand exhibits a particular one-way coupling structure: the PK (Equation 1) and the PD (Equation 4) are coupled through the proliferative cell count y_prol and E_drug, such that the PK can be solved independently of the PD. This is what motivates Torsten's coupled solvers, which analytically solve the PK ODEs before passing the PK solution to the PD ODE. The PD ODE is then solved numerically. Since the dimension of the numerically solved ODE system is reduced, this coupled strategy is in general more efficient than numerically solving the full ODE system. To see it in action, let us apply the coupled solver pmx_solve_twocpt_rk45 [33, Section 3.5] to the same model. We need only make two changes. First, we modify the ODE function to reflect that only the PD states are to be solved numerically.
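The reduced function might look as follows (a sketch under assumed conventions: with Torsten's coupled solvers the analytically computed PK states are made available to the ODE function, here through a vector we call y_pk, and the function returns derivatives for the five PD states only):

```stan
vector twoCptNeutModelReducedODE(real t, vector y, vector y_pk,
                                 real[] parms, real[] rdummy, int[] idummy) {
  real V1 = parms[3];
  real mtt = parms[6];
  real circ0 = parms[7];
  real gamma = parms[8];
  real alpha = parms[9];
  real ktr = 4 / mtt;

  // Drug concentration comes from the analytical PK solution
  real conc = y_pk[2] / V1;
  real EDrug = fmin(1.0, alpha * conc);

  real prol = y[1] + circ0;
  real transit1 = y[2] + circ0;
  real transit2 = y[3] + circ0;
  real transit3 = y[4] + circ0;
  real circ = fmax(machine_precision(), y[5] + circ0);

  vector[5] dydt;
  dydt[1] = ktr * prol * ((1 - EDrug) * (circ0 / circ)^gamma - 1);
  dydt[2] = ktr * (prol - transit1);
  dydt[3] = ktr * (transit1 - transit2);
  dydt[4] = ktr * (transit2 - transit3);
  dydt[5] = ktr * (transit3 - circ);
  return dydt;
}
```

The second change is then, presumably, swapping the call to pmx_solve_rk45 for a call to pmx_solve_twocpt_rk45, with the number of numerically solved states reduced from 8 to 5.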

Discussion
Stan provides an expressive language to build models, state-of-the-art algorithms to fit these models, and a host of easy-to-use diagnostics. Torsten complements Stan with a suite of routines that solve ODEs within the context of a clinical event schedule. Together, Stan and Torsten are potent tools when working through the tangled steps of a Bayesian workflow for PKPD modeling.

Current and potential role for Stan and Torsten for pharmacometrics applications
We can apply Stan/Torsten to a large palette of generative models, both for inference and simulation. Applications range from simple linear regression to complex multi-scale Quantitative Systems Pharmacology models. Compared to specialized pharmacometrics tools such as NONMEM®, Stan/Torsten is particularly well-suited for cases where more flexibility is desired. This includes models with:
• random effects distributions other than normal,
• prior distributions other than the limited set available in existing pharmacometrics tools,
• multiple submodels with different random effect structures.
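As a minimal sketch of the first point (variable names are hypothetical), swapping the conventional log-normal population model for a heavy-tailed alternative is a small change in the model block:

```stan
model {
  nu ~ gamma(2, 0.1);  // prior on the degrees of freedom
  for (j in 1:nSubjects) {
    // Heavy-tailed random effects: a Student-t population model on the
    // log scale, in place of the usual normal (i.e. log-normal) model.
    logTheta[, j] ~ student_t(nu, log(thetaPop), omega);
  }
}
```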
It is important to recognize that MCMC, including the HMC scheme used by Stan/Torsten, can be computationally intensive, notably when fitting hierarchical models that require us to numerically solve ODEs. This can be especially frustrating during the initial model exploration stage of a project. For such exploratory analyses, access to a rapid approximate Bayesian inference engine may be desirable. Stan/Torsten includes two optimization-based inference engines: one for estimation of posterior modes and one for variational inference. These algorithms attempt to simultaneously optimize over the entire joint posterior distribution of all model parameters. This process can be relatively slow and error-prone when trying to optimize over the large number of population and individual-level parameters of a typical population pharmacometrics model. This contrasts with typical mixed effects modeling programs, which use algorithms specialized for a more limited range of models, usually employing an alternating sequence of lower-dimensional optimization problems.
For applications that may be implemented with typical pharmacometrics tools, the choice between those and Stan/Torsten comes down to the trade-offs among flexibility, accuracy of the Bayesian inference, and computation time.
We would also like to point out that Stan is not the only probabilistic programming language under active development. PyMC3 [50], TensorFlow Probability [51,52], and Turing [53], among others, provide similar modeling capabilities. A full review and comparison of these languages is, however, beyond the scope of this paper.

Preview of part 2
In part 2 of this tutorial, we plan to build on the material we have covered thus far and tackle more advanced topics, including:
• Improving the performance of HMC, using within-chain parallelization for population models and Torsten's dedicated group solvers.
• Advanced diagnostic tools, namely divergent transitions, which can flag bias in our posterior samples. Stan makes these diagnostics readily available.
• Fake data simulation and analysis: in particular, prior predictive checks as a way to understand and build priors, fitting the model to fake data as an imperfect tool to troubleshoot Bayesian inference, and an overview of the more sophisticated but computationally demanding simulation-based calibration [54].
• Performance tuning of ODE models, such as solver selection, accuracy control, and stability issues.
We will dive into these subjects by examining more advanced models and utilizing techniques such as reparameterization, within-chain parallelization, and pooling multiple data sources.We will also discuss ongoing developments with Stan and Torsten, such as tools to handle larger scale ODEs and plans to leverage parallelization.

Figure 1 :
Figure 1: Trace plots. The sampled values for each parameter are plotted against the iterations during the sampling phase. Multiple Markov chains were initialized at different points. However, once in the sampling phase, we cannot distinguish the chains.

Figure 2 :
Figure 2: Posterior predictive checks for the two compartment model. The circles represent the observed data and the shaded areas the 50th and 90th credible intervals based on posterior draws.

Figure 3 :
Figure 3: Posterior predictive checks for the one compartment model. The circles represent the observed data and the shaded areas the 50th and 90th credible intervals based on posterior draws. A graphical inspection suggests the credible intervals are wider for the one compartment model than for the two compartment model.

Figure 4 :
Figure 4: Leave-one-out estimate of out-of-sample predictive fit. Plotted is the estimate, elpd_loo, for the one and two compartment models, as well as the difference in elpd between the two models. Clearly, the two compartment model has superior predictive capabilities.

Figure 5 :
Figure 5: Population two compartment model: posterior predictive checks for each individual. Key: black dots = observed data; red curve and shaded area = posterior median and 90% credible intervals for prediction of new observations in the same individual; blue curve and shaded area = posterior median and 90% credible intervals for prediction of new observations in a hypothetical new individual with the same body weight.

Figure 6 :
Figure 6: Population two compartment model: posterior predictive checks for all individuals. Key: black circles = observed data; blue curves and shaded areas = posterior median and 80% credible intervals for the population median; red curve and shaded area = posterior median and 80% credible intervals for the 10th and 90th population percentiles.

Figure 8 :
Figure 8: Posterior predictive checks for the PKPD model.

The Stan file contains all the coding blocks in the following order: data, transformed data, parameters, transformed parameters, model. The full Stan code can be found in the Supplementary Material.

Table 2 :
Summary of results when fitting the two compartment model. The first columns report sample estimates of the posterior mean, median, standard deviation, median absolute deviation, and 5th and 95th quantiles, based on our approximate samples. The next three columns report the R̂ statistic and the effective sample sizes for bulk and tail estimates, which can be used to identify problems with our inference.