Drug sensitivity prediction with normal inverse Gaussian shrinkage informed by external data

Abstract In precision medicine, a common problem is drug sensitivity prediction from cancer tissue cell lines. These types of problems entail modelling multivariate drug responses on high‐dimensional molecular feature sets in typically >1000 cell lines. The dimensions of the problem require specialised models and estimation methods. In addition, external information on both the drugs and the features is often available. We propose to model the drug responses through a linear regression with shrinkage enforced through a normal inverse Gaussian prior. We let the prior depend on the external information, and estimate the model and external information dependence in an empirical‐variational Bayes framework. We demonstrate the usefulness of this model in both a simulated setting and in the publicly available Genomics of Drug Sensitivity in Cancer data.


Variational Bayes derivations
For clarity we have indexed the variational density functions with their respective parameters, which we omitted in the MD. In the following, all expectations are taken with respect to the variational posterior Q_d.
The variational posterior for β_d is found as follows:
The variational posterior for γ²_jd is given by:
Similarly, we derive the variational posterior for τ²_d as:
Lastly, we have the variational posterior for σ²_d:
Filling in the expectations and variances, we arrive at the estimating equations in MD equations (3).
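Each of the updates above is an instance of the generic coordinate-ascent mean-field rule. For reference, the standard form is (this is the textbook result, not the paper-specific expressions, which are given in the MD):

```latex
% Generic mean-field (CAVI) update for a block of parameters \theta_k:
q^{\star}(\theta_k) \;\propto\; \exp\!\Big\{\,\mathbb{E}_{Q_{-k}}\big[\log p(y_d, \theta)\big]\Big\},
% where the expectation is taken over the variational factors
% of all blocks other than \theta_k.
```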

Evidence lower bound
The evidence lower bound that is maximised during the iterations is:
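The generic form underlying this quantity is standard (the model-specific terms appear in the display in the supplement):

```latex
\mathrm{ELBO}(Q_d)
  \;=\; \mathbb{E}_{Q_d}\big[\log p(y_d, \theta_d)\big]
        \;-\; \mathbb{E}_{Q_d}\big[\log Q_d(\theta_d)\big]
  \;=\; \log p(y_d) \;-\; \mathrm{KL}\big(Q_d \,\|\, p(\theta_d \mid y_d)\big),
% so maximising the ELBO minimises the KL divergence between the
% variational posterior Q_d and the true posterior.
```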

Efficient computation
The empirical Bayes updates require the quantities e_jd, b_jd, g_d, and f_d, which in turn depend on the following. The first two quantities are efficiently calculated as:
and g_d = µ_d^T diag(b_jd) µ_d is easily calculated from the first two. For the remaining one we have:
Both (1) and (2)

Ratios of modified Bessel functions
Ratios of modified Bessel functions of the second kind K_{α−1}(x)/K_α(x) are prone to under- and overflow for large α. In our case, α increases linearly with p. Since p may be large, this may cause numerical issues in the calculation of various quantities, which we alleviate as follows.
We let n_1 = p/2 and n_2 = (p − 1)/2 and use the well-known recurrence K_{α+1}(x) = K_{α−1}(x) + (2α/x) K_α(x) to rewrite the ratio as K_α(x)/K_{α+1}(x) = [2α/x + K_{α−1}(x)/K_α(x)]^{−1}. The ratio K_0(x)/K_1(x) is well-behaved, so that an arbitrary ratio K_{α−1}(x)/K_α(x) may be calculated as a sequence of numerically stable scalar sums, products, and inverses.
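The recursion can be sketched as follows (a minimal Python illustration with hypothetical function names; integer orders only, with the base ratio K_0(x)/K_1(x) obtained from the integral representation of K_ν so the sketch is self-contained):

```python
import math

def bessel_k(nu, x, t_max=12.0, n=24000):
    """Modified Bessel function of the second kind via the integral
    representation K_nu(x) = \int_0^inf exp(-x cosh t) cosh(nu t) dt,
    evaluated with the trapezoidal rule (adequate for small nu, moderate x)."""
    h = t_max / n
    total = 0.5 * (math.exp(-x)
                   + math.exp(-x * math.cosh(t_max)) * math.cosh(nu * t_max))
    for i in range(1, n):
        t = i * h
        total += math.exp(-x * math.cosh(t)) * math.cosh(nu * t)
    return total * h

def bessel_ratio(alpha, x):
    """Compute K_{alpha-1}(x)/K_alpha(x) for integer alpha >= 1 via the
    recurrence K_{alpha+1} = K_{alpha-1} + (2 alpha / x) K_alpha, i.e.
    r_{alpha+1} = 1 / (2 alpha / x + r_alpha), starting from r_1 = K_0/K_1.
    This avoids evaluating K_alpha itself, which under/overflows for large alpha."""
    r = bessel_k(0.0, x) / bessel_k(1.0, x)  # r_1 = K_0(x)/K_1(x)
    for a in range(1, alpha):
        r = 1.0 / (2.0 * a / x + r)
    return r
```

Each step involves only a scalar sum, product, and inverse, so the sequence stays in a numerically safe range even when K_α itself would overflow.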

Empirical Bayes
For the feature scale parameter λ_feat updates to be non-negative, we have to show that:
which holds if all summands on the left-hand side of (3) are non-negative.
We note that, through Jensen's inequality,
which we use to show that:
Similar reasoning holds for the drug scale parameter λ_drug.
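The Jensen step invoked above is the standard inequality; for reference (a generic statement, since the specific expectation appears in the omitted display):

```latex
% Jensen's inequality for a convex function \varphi and a random variable X:
\varphi\!\left(\mathbb{E}[X]\right) \;\le\; \mathbb{E}\!\left[\varphi(X)\right],
% e.g. with \varphi(x) = 1/x on x > 0:
\frac{1}{\mathbb{E}[X]} \;\le\; \mathbb{E}\!\left[\frac{1}{X}\right].
```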
10 Gibbs sampler
MCMC samples from the posterior corresponding to model (2) may be generated for each equation independently, by iteratively sampling the following conditional distributions:
In high-dimensional settings, the p × p matrix inversions in (4b) and (4c) are significant computational bottlenecks. Bhattacharya et al. (2016) describe a method to sample from (4a) without explicit calculation of this inverse, thereby offering a significant speed-up compared to naive sampling.
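The Bhattacharya et al. (2016) scheme draws exactly from N(A^{-1} X^T y, A^{-1}) with A = X^T X + D^{-1} by solving an n × n system instead of inverting the p × p matrix A, reducing the cost from O(p³) to O(n² p). A minimal sketch (illustrative names; unit noise variance assumed, the σ²_d scaling in the conditionals is analogous):

```python
import numpy as np

def fast_gaussian_sample(X, y, d, rng):
    """Exact draw from N(A^{-1} X^T y, A^{-1}) with A = X^T X + diag(d)^{-1},
    without forming the p x p inverse (Bhattacharya et al., 2016).
    d holds the p prior variances; cost is O(n^2 p) rather than O(p^3)."""
    n, p = X.shape
    u = rng.normal(0.0, np.sqrt(d))      # u ~ N(0, D), D = diag(d)
    delta = rng.normal(size=n)           # delta ~ N(0, I_n)
    v = X @ u + delta
    M = X * d @ X.T + np.eye(n)          # n x n system matrix X D X^T + I_n
    w = np.linalg.solve(M, y - v)        # no p x p inversion needed
    return u + d * (X.T @ w)             # theta = u + D X^T w
```

The identity D X^T (X D X^T + I_n)^{-1} = (X^T X + D^{-1})^{-1} X^T (a Woodbury-type rearrangement) guarantees the draw has exactly the target mean and covariance.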
Sampling from (4) requires either estimating (as in MD Section 3.2) or specifying the hyperparameters α_feat, α_drug, λ_feat, and λ_drug, or endowing them with an extra layer of hyperpriors. In general, specifying them requires detailed subject knowledge that is rarely available, so we may resort to endowing them with hyperpriors (as in Upadhyay and Agrawal (1996)):
with hyperparameters ν²_feat, ν²_drug, κ_feat, ξ_feat, κ_drug, and ξ_drug. Improper, flat priors are obtained in the limits ν_feat, ν_drug → ∞ and κ_feat, κ_drug, ξ_feat, ξ_drug → 0. The corresponding full conditionals are given by:
which may be sampled together with (4).

Simulations
The lasso and ridge models in this section were fit separately per drug, using the R package glmnet (Friedman et al., 2010) with cross-validated penalty parameters.
Tables 1-3 display averaged estimation mean squared error (EMSE) and prediction mean squared error (PMSE), calculated on independent test data, for simulation Scenarios 1-3. PMSE is calculated as in Section 5 of the MD, while EMSE is calculated as EMSE = D^{-1} p^{-1} \sum_{d=1}^{D} \sum_{j=1}^{p} (\hat{β}_{jd} − β_{jd})², with \hat{β}_{jd} the estimator for β_jd. In the NIG models, which provide full posteriors, the posterior mean E(β_jd | y_d) is used as point estimate. EMSE is further split into the contribution of the bottom 10% of the β_jd in terms of size (in absolute value) and the top 10% of the β_jd in size.
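The EMSE and its top/bottom splits can be computed as follows (a minimal pure-Python sketch with a hypothetical helper name; beta and beta_hat are D × p nested lists of true and estimated coefficients):

```python
def emse_split(beta, beta_hat, frac=0.10):
    """Average estimation MSE over all (j, d), plus the contributions of
    the bottom and top `frac` of coefficients ranked by |beta_jd|."""
    pairs = [(abs(b), (bh - b) ** 2)
             for row, row_hat in zip(beta, beta_hat)
             for b, bh in zip(row, row_hat)]
    pairs.sort(key=lambda t: t[0])       # rank by true effect size
    k = max(1, int(frac * len(pairs)))   # size of each 10% slice
    errors = [e for _, e in pairs]
    emse = sum(errors) / len(errors)
    emse_bottom = sum(errors[:k]) / k    # smallest |beta_jd|
    emse_top = sum(errors[-k:]) / k      # largest |beta_jd|
    return emse, emse_bottom, emse_top
```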
Focussing on estimation, we see that the NIG_f, NIG_d, and NIG_f+d models (depending on the Scenario), which include the external covariates, outperform the other methods in terms of EMSE and PMSE. The lower EMSEs confirm that the NIG models that include the external data (NIG_f, NIG_d, and NIG_f+d) learn the underlying structure in the data better than the models that do not (NIG_−f, NIG_−d, and NIG_−f+d). Furthermore, as expected, the sparser models lasso and NIG are better able to learn the β_jd with larger magnitude, as seen from the lower EMSE_top. On the other hand, ridge is better able to capture the smaller β_jd, as evident from the lower EMSE_bottom. Figure 1 shows that, as expected, the performance of the NIG_f+d model deteriorates with increasing noise level in the external covariates and converges to that of the NIG_−f+d model.
This leads us to conclude that the variational approximation to the posterior is quite accurate in this Scenario. The simulation Scenario was set up to mimic the real GDSC data, so that we expect an accurate variational posterior in the real GDSC data as well.
For completeness, Figure 3 displays samples from the full Bayesian NIG posterior, as introduced in Section 10. Unfortunately, we were not able to generate enough samples for a reliable posterior using the Gibbs procedure described in that Section. We generated the samples from the full Bayesian posterior using Stan through the R package rstan (Guo et al., 2018) (see https://github.com/magnusmunch/NIG for the implemented Stan code). The Figure shows that the posteriors for φ and χ have unexpected shapes and seem to have explored only a small part of the posterior space. This indication that posterior inference is not reliable is confirmed by the average and minimum effective sample sizes over all parameters of 66 and 2, respectively. For reliable inference, Kruschke (2015) recommends an effective sample size of at least 10 000. Inspection of the computation times and a quick calculation shows that we would have to sample from the posterior for at least 37 hours to obtain an average effective sample size of 10 000. Moreover, to achieve a

GDSC data
the CPO for cell line i and drug d is CPO_id = p(y_id | y_−id), the Bayesian version of the leave-one-out cross-validated likelihood. Congdon (2005) suggests 0.01 as an outlier cut-off. Figure 4 displays these CPOs (calculated by numerical integration of (5)) for the four Analyses. For y_id | y_−id following a Gaussian distribution with standard deviation 0.83 (the average posterior mean of σ²_d from our analyses), this implies an expected proportion of outliers of about 0.003. In our analyses we find 0.006, 0.006, 0.006, and 0.006 for the four analyses, respectively, so about double the number of outliers expected under a Gaussian posterior predictive distribution. However, in reality, due to posterior uncertainty in β_d and σ²_d, the predictive distribution is more heavy-tailed than the Gaussian, and the corresponding expected proportion of outliers is higher than 0.003. Unfortunately, calculation of this true expected proportion is infeasible due to the intractability of the posterior predictive distribution.
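When posterior draws are available, the CPO can alternatively be estimated via the standard harmonic-mean identity CPO_id = {E_post[1/f(y_id | θ)]}^{-1}, rather than by numerical integration. A minimal sketch (illustrative names; a Gaussian likelihood with per-draw mean and standard deviation is assumed):

```python
import math

def cpo_estimate(y_obs, mus, sigmas):
    """Harmonic-mean CPO estimate from posterior draws (mus[s], sigmas[s])
    of the mean and standard deviation of a Gaussian likelihood:
    CPO = 1 / mean_s( 1 / N(y_obs | mu_s, sigma_s^2) )."""
    inv_liks = []
    for mu, sigma in zip(mus, sigmas):
        # Gaussian density of the observation under draw s
        dens = (math.exp(-0.5 * ((y_obs - mu) / sigma) ** 2)
                / (sigma * math.sqrt(2.0 * math.pi)))
        inv_liks.append(1.0 / dens)
    return len(inv_liks) / sum(inv_liks)
```

The harmonic mean of the per-draw likelihoods is a consistent estimator of the CPO, though it can be unstable when some draws assign very low likelihood to the observation, which is why numerical integration of (5) is preferred in the analyses above.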