Gaussian Process Regression and Classification using International Classification of Disease Codes as Covariates

International Classification of Disease (ICD) codes are widely used for encoding diagnoses in electronic health records (EHR). Automated methods have been developed over the years for predicting biomedical responses using EHR that borrow information among diagnostically similar patients. Relatively little attention has been paid to developing patient similarity measures that model the structure of ICD codes and the presence of multiple chronic conditions, where a chronic condition is defined as a set of ICD codes. Motivated by this problem, we first develop a string kernel function for measuring the similarity between a pair of subsets of ICD codes, which uses the definition of chronic conditions. Second, we extend this similarity measure to define a family of covariance functions on subsets of ICD codes. Using this family, we develop Gaussian process (GP) priors for Bayesian nonparametric regression and classification using diagnoses and other demographic information as covariates. Markov chain Monte Carlo (MCMC) algorithms are used for posterior inference and predictions. The proposed methods are free of any tuning parameters and are well-suited for automated prediction of continuous and categorical biomedical responses that depend on chronic conditions. We evaluate the practical performance of our method on EHR data collected from 1660 patients at the University of Iowa Hospitals and Clinics (UIHC) with six different primary cancer sites. Our method has better sensitivity and specificity than its competitors in classifying different primary cancer sites, and it provides estimates of the marginal associations between chronic conditions and primary cancer sites.


Introduction
EHR data are useful for organizing information about clinical care and developing tools for personalized patient care. They are also used for answering interesting and important research questions that cannot be practically addressed through traditional prospective research. We develop a Bayesian model based on GP priors for predicting biomedical responses that depend on the presence of a set of chronic conditions. We assume that the EHR encode diagnoses using the 10th revision of the ICD billing codes (hereafter called ICD codes), so a chronic condition is defined using one or more specific codes from a predefined set of ICD codes [2]. An ICD code is a three to seven character string, where the first three characters define the diagnosis category, the next three define the etiology, anatomic site, and severity, and the final character identifies the encounter. We develop a family of kernel functions on pairs of ICD code subsets that is based on the definition of ICD codes and chronic conditions. Using any such kernel as a covariance function for a GP prior, we perform Bayesian nonparametric regression and classification using patients' chronic conditions and demographic information as covariates.
Encoding diagnoses using the ICD coding system has become the most widely adopted medical diagnostic classification practice since it was endorsed by the World Health Assembly in 1990. Automated extraction of data from the EHR using ICD codes allows us to obtain large amounts of precise medical information about chronic conditions. This is a cheap and efficient alternative to the significantly more expensive and slower manual chart review by a healthcare provider. Unfortunately, ICD codes are prone to encoding errors, so care must be exercised in fitting a model with ICD codes as covariates [6]. This has motivated significant interest in automating the ICD encoding process to eliminate sources of error [21]. In a properly curated EHR, models that use ICD codes as covariates in predicting disease risks and related medical conditions are known to be more accurate than those that do not [7]. Building on these ideas, we achieve an additional and significant improvement in predictive accuracy by exploiting the grouping of diagnoses into chronic conditions and the hierarchical structure of ICD codes. Our Bayesian methods require minimal tuning and provide results with uncertainty estimates, so they can be easily used by a healthcare provider to assist with patient care.
EHR data have been analyzed using Bayesian nonparametric methods, but the literature on using ICD codes as covariates is sparse. Henao et al. [5] have developed a deep hierarchical factor model for identifying latent subgroups of patients. They summarize the EHR data in terms of counts of different medication usages, laboratory tests, and diagnoses specified using ICD codes. If the EHR data only contain information about the presence or absence of disease, then Bayesian nonparametric extensions of low-rank matrix factorization perform better in discovering latent disease groups [11]. Recently, deep learning has been widely used for automated regression and classification using EHR data, but the major focus remains on mining clinical notes rather than on using the information encoded in ICD codes or on providing uncertainty estimates [15].
A major limitation of all these approaches is that they ignore the additional structure provided by the membership of diagnoses in different chronic conditions.
Addressing this gap, we first develop a similarity measure on diagnoses encoded as ICD codes that accounts for the grouping of diagnoses into chronic conditions and the hierarchical structure of ICD codes. A widespread practice is to define the similarity of two patients to be proportional to the number of common ICD codes in their diagnoses or their Jaccard index. Our similarity measure is superior to this practice in that its value can be positive for a pair of non-identical ICD codes, and its magnitude depends on the degree of overlap between the pair. In fact, our similarity measure belongs to the class of string kernels and extends a restricted version of the boundrange kernel, which has been widely used in text mining and information retrieval [9, 18]. The similarity between a pair of ICD codes is defined as the total number of common substrings between them, where a substring always begins at the first character of an ICD code. This similarity measure is extended to a pair of subsets as a weighted average of the similarities of all ICD code pairs formed using the two subsets.
Our second major contribution is to use the similarity between patients for nonparametric regression and classification using GP priors. This is done in two stages. First, using the similarity measure, we define the equivalents of polynomial, exponential, and squared exponential kernels on subsets of ICD codes, which represent chronic conditions. Second, we use these kernels as covariance functions for defining GP priors indexed by subsets of ICD codes. These GP priors are used for fitting Bayesian nonparametric regression and classification models that are tuned for EHR data analysis [16]. We develop MCMC algorithms for automated fitting of these models that provide theoretically guaranteed estimation of posterior uncertainty [4]. The minimal tuning requirements make our method attractive for routine applications in analyzing EHR data and for assisting healthcare providers in predicting risks related to chronic conditions. We verify our claims through empirical studies and show that the proposed method provides a more detailed quantification of the dependence of primary cancer sites on chronic conditions.

Motivating EHR Data
Multimorbidity is the simultaneous occurrence of more than one chronic condition. Developing operational measures of multimorbidity is an active research area due to its importance in improving patient care. Calderón-Larrañaga et al. [2] have operationalized the concept of multimorbidity into a list of 918 ICD codes that map onto a list of 60 chronic conditions. This serves as a motivation for developing a classifier that accounts for the grouping of diagnoses into chronic conditions. We show that this yields better accuracy in predicting primary cancer sites in the UIHC EHR data.
The data for this analysis came from the EHR at UIHC for the set of patients who received cancer treatment beginning in 2017. The cohort included patients 18 years or older who were diagnosed with a malignant solid neoplasm of any type or site. The data included the diagnoses of the patients encoded as ICD codes and the race and marital status of every patient. We retained patients who had at least one of the 58 non-cancerous chronic conditions, resulting in a sample size of 1660. The main question is to identify chronic conditions or diagnoses that are predictive of a primary cancer site. Unfortunately, the Bayesian toolkit provides limited options for answering such biomedical questions. Using the UIHC EHR data as a representative example, we develop GP-based regression and classification models tuned for solving such biomedical problems, where subsets of ICD codes are covariates.
The main goal is to estimate marginal associations between chronic conditions and primary cancer sites, and a secondary goal is to predict the primary cancer site using the ICD codes, which are structured strings; for example, J301 encodes allergy, and D50 and D51 denote different kinds of anaemia. Clearly, using dummy variables to represent these ICD codes is inappropriate because D50 and D51 denote the same disease but are assigned different dummy variables. Motivated by the text mining literature, a useful similarity measure for strings is defined by counting the number of matching substrings of different lengths. Accounting for the hierarchical structure of ICD codes, if we modify this definition by requiring the substrings to always start at the first character, then the similarities of the three pairs (J301, D50), (J301, D51), and (D50, D51) are 0, 0, and 2, respectively, which is more realistic. In the next section, we develop this idea further to account for the extra structure provided by the 58 chronic conditions while down-weighting the contributions of commonly occurring diagnoses.
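As a concrete illustration, the following R sketch (a hypothetical helper, not the paper's implementation) counts these first-character-anchored substring matches for a pair of ICD codes:

# Count common contiguous substrings of two ICD codes that start at the
# first character, i.e., the number of matching prefixes.
prefix_similarity <- function(s, t) {
  k <- min(nchar(s), nchar(t))
  if (k == 0) return(0)
  matches <- 0
  for (i in seq_len(k)) {
    if (substr(s, 1, i) == substr(t, 1, i)) matches <- matches + 1 else break
  }
  matches
}

prefix_similarity("J301", "D50")  # 0: no common prefix
prefix_similarity("J301", "D51")  # 0: no common prefix
prefix_similarity("D50",  "D51")  # 2: "D" and "D5" match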
Kernel Functions for Diagnoses

Kernel Function on ICD Codes
Consider the structure of ICD codes. Let t be an ICD code, t* be the longest ICD code, |t| denote the number of characters in t, and t_{1:i} be the contiguous substring of length i that starts at the first character of t. The first character of t is t_{1:1}, and t_{1:l} = t if l > |t|. Let T be the set of all ICD codes, A and D be the sets of letters and numbers used to represent an ICD code, and B be the set of all alphanumeric substrings that have a letter as their first character and have lengths 1, 2, ..., or |t*|. Then, B contains all contiguous substrings of ICD codes in T that start at the first character of an ICD code.

The kernel function on two ICD codes is defined as a Euclidean dot product between their feature maps. Let ψ(t) be the feature map of an ICD code t ∈ T and F = {ψ(t) : t ∈ T} be the feature space. Then, for every t ∈ T, ψ(t) ∈ {0, 1}^{|t*||B|} and is defined as ψ(t) = {1(t_{1:i} = b) : b ∈ B, i = 1, ..., |t*|} (1), where 1(·) is 1 if · is true and 0 otherwise. The kernel function on T × T is defined using ψ(·) as κ_0(t, t' | Λ) = ψ(t)ᵀ Λ ψ(t') (2), where Λ = diag{Λ_b : b ∈ B} is a block diagonal matrix. Two useful special cases of (2) are obtained by setting Λ_b to be a diagonal matrix in (2) such that (i) Λ_b = λ_b I and (ii) (Λ_b)_{aa} = λ^a for every b ∈ B and a = 1, ..., |t*|, where I is an identity matrix, λ_b > 0 for every b ∈ B, and λ ∈ (0, 1). We prove in the next section that κ_0(t, t' | Λ) in (2) is a valid kernel function on T × T.
The kernel in (2) belongs to the class of string kernels that are widely used in text mining and information retrieval [18]. Two popular string kernels are the spectrum and boundrange kernels [9].
The k-spectrum kernel defines the similarity between two strings to be the number of matching substrings of length exactly k. The k-boundrange kernel instead defines the similarity to be the number of matching substrings of length less than or equal to k. The substrings can be non-contiguous in both kernels, and the substring matches can be weighted by a factor that decays exponentially with the substring length. The kernel in (2) is a restricted version of the |t*|-boundrange kernel in which the substrings always start at the first character of an ICD code.

Kernel Function on Subsets of ICD Codes
We now define the kernel function on subsets of ICD codes that represent diagnoses. Let 2^T be the power set of T, t = {t_1, ..., t_r} be an element of 2^T, Ψ(t) be the feature map of t, and {Ψ(t) : t ∈ 2^T} be the feature space of subsets of ICD codes. The feature vector Ψ(t) and the kernel function κ_1 on 2^T × 2^T are defined using ψ(·) in (1) and κ_0(·, · | Λ) in (2), respectively, as Ψ(t) = Σ_{i=1}^r ψ(t_i) and κ_1(t, t' | Λ) = Ψ(t)ᵀ Λ Ψ(t') = Σ_{i=1}^r Σ_{j=1}^{r'} κ_0(t_i, t'_j | Λ) (3).

In the motivating EHR data, the covariates include subsets of ICD codes that have an additional structure specified by their relation to a chronic condition. Let C be the total number of chronic conditions and T_c ∈ 2^T be the subset of ICD codes that defines chronic condition c (c = 1, ..., C). A patient has chronic condition c if the patient's diagnoses contain an ICD code that belongs to T_c. Accounting for the C chronic conditions, represent a patient's diagnoses as T = (t_1, ..., t_C) ∈ (2^T)^C, where t_c collects the patient's ICD codes that belong to T_c. The kernel that accounts for the structure imposed by the C chronic conditions is κ_2(T, T' | w, Λ) = Σ_{c=1}^C w_c κ_1(t_c, t'_c | Λ) (4), where Λ is defined in (2), w = (w_1, ..., w_C), and w_c indicates the importance of chronic condition c in defining the similarity between two patients with diagnoses T and T', respectively. The kernel κ_2 assumes that the effect of the C chronic conditions on the response is additive, but other valid methods for combining the kernels κ_1(t_1, t'_1 | Λ), ..., κ_1(t_C, t'_C | Λ) can also be used in (4). The kernel κ_2 in (4) is affected by the numbers of elements in T and T'. Let ||·|| be the Euclidean norm; the value of κ_1 in (3) is impacted by t in that ||Ψ(t)|| increases with the cardinality of t. This implies that κ_2 in (4) is large if the cardinality of T is large; therefore, the length of the feature maps impacts the similarity measure instead of the chronic conditions. This undesirable effect is removed by normalizing the feature maps of T, T' before computing the dot product, which results in the kernel κ(T, T' | w, Λ) = κ_2(T, T' | w, Λ)/{κ_2(T, T | w, Λ) κ_2(T', T' | w, Λ)}^{1/2} (5). All kernels in this paper are derived from κ in (5), which depends on the kernels κ_0, κ_1, and κ_2 defined earlier.
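To fix ideas, here is a minimal R sketch of κ_1, κ_2, and the normalized kernel κ under the simplifying assumption Λ = I, reusing the prefix_similarity helper from the earlier sketch; the function names are ours:

kappa1 <- function(codes1, codes2) {
  # sum of kappa_0 over all ICD code pairs formed from the two subsets
  if (length(codes1) == 0 || length(codes2) == 0) return(0)
  sum(outer(codes1, codes2, Vectorize(prefix_similarity)))
}

kappa2 <- function(T1, T2, w) {
  # T1, T2: lists of length C; element c holds the patient's codes for
  # chronic condition c (possibly empty); w: chronic-condition weights
  sum(mapply(function(t1, t2, wc) wc * kappa1(t1, t2), T1, T2, w))
}

kappa_norm <- function(T1, T2, w) {
  # normalization removes the dependence on the number of diagnoses;
  # assumes each patient has at least one coded diagnosis
  kappa2(T1, T2, w) / sqrt(kappa2(T1, T1, w) * kappa2(T2, T2, w))
}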
The following theorem proves that κ is a valid kernel function on (2^T)^C × (2^T)^C and defines a metric and the equivalents of polynomial, exponential, and squared exponential (SE) kernels on (2^T)^C × (2^T)^C.

Theorem 3.1. Let {Λ_b}_{b∈B} and w be defined as in (2) and (4), T, T' ∈ (2^T)^C, σ² > 0, φ > 0, and d(T, T') = {2 − 2κ(T, T' | w, Λ)}^{1/2}. Then, given Λ and w,

1. κ is a valid kernel function on (2^T)^C × (2^T)^C and d(·, ·) is a metric;

2. if Λ = RᵀR, where R is an upper triangular matrix, then the feature map of κ is the normalized version of Ψ_{w,R}(T), where Ψ_{w,R}(T) is a (C|t*||B|)-dimensional vector; and

3. the equivalents of the polynomial kernel of order s and the γ-exponential kernel on (2^T)^C × (2^T)^C, respectively, are defined as κ_s(T, T' | σ², w, Λ) = σ²{κ(T, T' | w, Λ)}^s (6) and κ_γ^e(T, T' | σ², φ, w, Λ) = σ² exp[−{φ d(T, T')}^γ] (7), where σ² and φ play the roles of the variance and inverse length-scale parameters, respectively, and κ_γ^e reduces to the exponential and SE kernels when γ equals 1 and 2, respectively.
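Because the normalized kernel satisfies κ(T, T | w, Λ) = 1, the distance in Theorem 3.1 can be computed directly from κ, and the γ-exponential kernel follows; below is a sketch under the same simplifying assumptions as the previous snippet, with the parameterization of (7) as we have reconstructed it:

kappa_gamma_exp <- function(T1, T2, w, sigma2 = 1, phi = 1, gamma = 2) {
  # d(T, T') = sqrt(2 - 2 * kappa(T, T')) because kappa(T, T) = 1
  d <- sqrt(pmax(2 - 2 * kappa_norm(T1, T2, w), 0))  # guard against rounding
  sigma2 * exp(-(phi * d)^gamma)  # gamma = 1: exponential; gamma = 2: SE
}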
The proof of this theorem is in the supplementary material along with the other proofs. We illustrate the application of six kernel functions in capturing the similarity of patients in the UIHC EHR data, where we expect patients with common primary sites to have similar sets of chronic conditions (Figure 1). Let T_i be the ICD codes for patient i, n be the sample size, and k be a kernel function on (2^T)^C × (2^T)^C. Then, the kernel matrix defined using k is an n × n matrix with (i, j)-th entry k(T_i, T_j). To highlight the similarity of patients, we group the patients into six blocks depending on their primary cancer sites. The kernel matrix with k = κ_2 in (4) fails to capture the similarities of patients with common primary sites (Figure 1a). On the other hand, the kernel matrix with k = κ in (5), which is the normalized version of κ_2, captures the patient similarities, where the diagonal blocks denote patients with common primary sites and the off-diagonal blocks capture similarities of patients with cancers at two different sites (Figure 1b).
The same observation is true for the γ-exponential kernel in (7) with γ equal to 1 and 2 (Figures 1c and 1d). The diagonal and off-diagonal blocks are clearest for the SE (radial basis function) kernel. The spectrum and boundrange kernel matrices fail to capture the similarities of patients except for those with cancer of the brain and nervous system, whose diagnoses are very similar compared to cancers at other sites (Figures 1e and 1f).
Modeling With Subsets of ICD Codes as Covariates

Setup
Consider the setup for regression and classification problems with ICD codes as covariates. Let n be the sample size and (y_i, x_i, T_i) be the data for subject i (i = 1, ..., n), where y_i is the response, x_iᵀ = (x_{i1}, ..., x_{ip}) is the p-dimensional vector of covariates excluding ICD codes with x_{i1} = 1, and T_i = {t_{i1}, ..., t_{iC}} ∈ (2^T)^C. For every i, t_{ic} denotes the diagnoses of subject i that belong to chronic condition c, and t_{ic} can be an empty set (c = 1, ..., C). In the regression and classification problems, y_i ∈ R and y_i ∈ {0, 1}, respectively, and (x_{i2}, ..., x_{ip})ᵀ ∈ R^{p−1} for every i. We also predict the responses and estimate the covariate effects for a given collection of covariates including ICD codes (x*_j, T*_j) (j = 1, ..., m), where T*_j ∈ (2^T)^C. We model the effects of x_i and T_i independently. Let β ∈ R^p and f : (2^T)^C → R. Then, the covariate effects of x_i and T_i are x_iᵀβ and f(T_i), respectively, and the overall covariate effect for subject i is x_iᵀβ + f(T_i). In practice, β and f are unknown. Choosing a Bayesian approach, we estimate the posterior distributions of β and f given {(y_i, x_i, T_i)}_{i=1}^n. We use a parametric model for estimating the effects of x_i because our focus is on nonparametric regression and classification using T ∈ (2^T)^C.
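For illustration, one natural in-memory layout for T_i keeps one character vector of codes per chronic condition; the sketch below (names hypothetical) builds this representation from a patient's raw code list:

patient_T <- function(codes, condition_sets) {
  # condition_sets: list of C character vectors T_c defining the conditions;
  # returns t_{i1}, ..., t_{iC}, with character(0) for absent conditions
  lapply(condition_sets, function(Tc) intersect(codes, Tc))
}

conds <- list(anaemia = c("D50", "D51"), allergy = c("J301"))
patient_T(c("D50", "J301"), conds)
# $anaemia: "D50"    $allergy: "J301"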

Regression Using Subsets of ICD Codes as Covariates
Consider the nonparametric regression model. Let ε_i denote the idiosyncratic error for subject i, where the ε_i are independent and identically distributed (iid) with mean 0 and τ² > 0 denotes the error variance. Then, the model assumes that y_i = x_iᵀβ + f(T_i) + ε_i, with ε_i ~ N(0, τ²) (i = 1, ..., n) (8), where N(a, b) denotes the Gaussian density with mean a and variance b. We put a GP prior on f(·) with zero mean function and covariance function κ_f(·, · | θ), where θ are the parameters specifying κ_f. The model is completed by putting a prior distribution on (β, τ², θ), whose form depends on the choice of κ_f.
We develop posterior inference algorithms for κ_f equal to κ_γ^e defined in Theorem 3.1. We exclude the polynomial kernel because it yields a finite-dimensional regression model, which is inflexible compared to the nonparametric regression model obtained using κ_γ^e; if s > 1, then the polynomial kernel κ_s is obtained by a non-linear embedding of the kernel κ, and the dimension of its feature space is of the order of the binomial coefficient (C|t*||B| + s choose s). Setting w = (1, ..., 1) and Λ_b to be an identity matrix (b ∈ B) in (6) and (7), we define κ_exp and κ_SE as the γ-exponential kernel κ_γ^e with γ = 1 and γ = 2, respectively. The parameter for κ_exp and κ_SE is θ = {σ², φ}, where σ² and φ are positive scalars. Denote the kernel matrix as K_θ, with (K_θ)_{ij} = κ_f(T_i, T_j | θ) for i, j ∈ {1, ..., n}.
Step (3) of the MCMC algorithm draws f* given y, σ², β, φ, α from N(m*, V*), where the expressions for m* and V* are in the supplementary material. We collect MCMC draws of β, f*, y*, σ², φ, τ² post convergence. The derivation of this algorithm is in the supplementary material along with the others. The algorithm in steps (1)-(4) is a variant of the Gibbs sampling algorithm for posterior inference in univariate spatial linear models [1], in that we replace the Metropolis-Hastings step with an elliptical slice sampling (ESS) step in (2) and T replaces a spatial location.
Unlike the Metropolis-Hastings step, the ESS step is free of any proposal tuning, which is preferred in automated applications.
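For a quantity with a zero-mean Gaussian prior, one ESS update takes the following generic form (after Murray et al., 2010); this is a sketch with our own names, not the paper's implementation:

ess_step <- function(f, log_lik, L) {
  # f: current value; log_lik: log-likelihood function of f;
  # L: lower Cholesky factor of the prior covariance matrix
  nu <- as.vector(L %*% rnorm(length(f)))  # auxiliary draw from the prior
  log_y <- log_lik(f) + log(runif(1))      # log of the slice height
  theta <- runif(1, 0, 2 * pi)
  lower <- theta - 2 * pi
  upper <- theta
  repeat {
    f_new <- f * cos(theta) + nu * sin(theta)
    if (log_lik(f_new) > log_y) return(f_new)
    if (theta < 0) lower <- theta else upper <- theta  # shrink the bracket
    theta <- runif(1, lower, upper)
  }
}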

Classification Using Subsets of ICD Codes as Covariates
The classification model is based on logistic regression. It assumes that log{Pr(y_i = 1)/Pr(y_i = 0)} = x_iᵀβ + f(T_i) (16), where y_i is the ith binary response. The choice of priors on β, f(·) and the kernels remain the same as in (10), but the prior on θ = (φ, σ²) is specified in (17) with hyperparameters a_φ, b_φ, b_1, b_2, where b_1, b_2 > 0. We use Pólya-Gamma data augmentation (PG-DA) for posterior inference on β, f*, K** and prediction of y* [13]. This setup ensures that the MCMC algorithms for inference and predictions in the classification and regression models are very similar. Specifically, we introduce Pólya-Gamma random variables ω_1, ..., ω_n specific to every observation such that the ω_i are marginally distributed as PG(1, 0), where PG denotes the Pólya-Gamma distribution with parameters b = 1 and c = 0, respectively. Define ȳ = (y_1 − 1/2, ..., y_n − 1/2)ᵀ, Ω = diag(ω_1, ..., ω_n), and z = Ω^{−1}ȳ. Then, the conditional likelihood of b given y, ω and the associated augmented Gaussian model are defined as in (18). Furthermore, Theorem 1 in Polson et al. [13] implies that ω_i given b and the ith row of A, denoted as a_i, is distributed as PG(1, a_iᵀb). The MCMC algorithm for posterior inference and predictions in (16) follows from arguments similar to those used in Section 4.2. Marginalizing over f in (18) implies that z given y, β, ω, φ, σ² is distributed as N(Xβ, C_zz), where C_zz = K_θ + Ω^{−1} plays the same role as C_yy in (10).
Following (12), let L be a lower triangular matrix such that C_zz = LLᵀ. The MCMC algorithm for posterior inference on β, f*, K** and for predicting y* cycles through the following six steps until convergence to its stationary distribution.

Draw y*_i given X*, β, f* from Bernoulli(p*_i) for i = 1, ..., m, where p*_i = e^{v*_i}/(1 + e^{v*_i}) and v*_i is the linear predictor x*_iᵀβ + f*_i. We collect MCMC draws of β, f*, y*, σ², φ post convergence. Our MCMC algorithm is similar to other algorithms based on the PG-DA strategy. First, Polson et al. [13, Section 5, Tables 2 and 3] present an application of the PG-DA strategy for nonparametric regression with a negative binomial response and a GP prior, but a linear predictor similar to x_iᵀβ in (8) is absent in their model. Second, Wang and Roy [19] develop a sampling algorithm based on the PG-DA strategy for posterior inference in a Bayesian logistic linear mixed model with independent Gaussian random effects and prove its geometric ergodicity. Our MCMC algorithm is similar to theirs in that f(·) plays the role of the random effects and the GP prior on f(·) induces dependence among patients that have similar diagnoses. Overall, the MCMC algorithm above broadens the range of application of the PG-DA strategy.
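The core PG-DA updates can be sketched as follows, assuming the BayesLogit package for Pólya-Gamma draws; this compresses the six-step sampler into its two central moves and omits the updates of β and the covariance parameters:

library(BayesLogit)  # rpg() draws Polya-Gamma random variables

pg_gibbs_step <- function(y, X, beta, f, K) {
  psi <- as.vector(X %*% beta) + f           # linear predictor
  omega <- rpg(length(y), h = 1, z = psi)    # omega_i | . ~ PG(1, psi_i)
  z <- (y - 0.5) / omega                     # pseudo-Gaussian responses:
  # z | . ~ N(X beta + f, diag(1 / omega)), so f has a conjugate Gaussian
  # full conditional under the prior f ~ N(0, K):
  V_f <- solve(solve(K) + diag(omega))
  m_f <- V_f %*% (omega * (z - as.vector(X %*% beta)))
  f_new <- as.vector(m_f + t(chol(V_f)) %*% rnorm(length(y)))
  list(omega = omega, f = f_new)
}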

Theoretical Properties
The convergence rates of the posterior distributions of f(·) in Sections 4.2 and 4.3 follow from the general theoretical setup for GP regression and classification in Ghosal and van der Vaart [4, Section 11.4.4]. Our focus is on stationary GPs with the Matérn and SE covariance kernels because κ_γ^e is based on them. For stationary GP priors, the posterior convergence rates are described using tail-decay properties of the spectral measure associated with the covariance kernel and the regularity of the true f(·). In this section, we adapt existing results for our feature maps Ψ_{w,R}, assuming that w and R are known.

Consider the Matérn GP prior and its spectral measure. Following Rasmussen and Williams [16, Page 84], if d = C|t*||B|, then the Matérn covariance kernel with smoothness and range parameters ν and φ, respectively, is κ_Mat(r | ν, φ) = {2^{1−ν}/Γ(ν)} (√(2ν) φ r)^ν K_ν(√(2ν) φ r), where K_ν is the modified Bessel function of the second kind; the kernel is equivalently defined through its spectral measure on R^d. If ν ∈ N, then the sample paths of the Matérn GP have partial derivatives up to order ν and all of them are square integrable.
The GP prior in Theorem 3.1 is indexed by the feature vectors Ψ_{w,R}(T) ∈ [0, 1]^d. This is equivalent to embedding the subset T in [0, 1]^d using the feature map, and the GP covariance kernels are defined using this embedding. The equivalent of the Matérn covariance kernel on (2^T)^C × (2^T)^C is σ² κ_Mat(||Ψ_T − Ψ_{T'}|| | ν, φ), where Ψ_T denotes Ψ_{w,R}(T). If ν = 1/2, then this kernel equals κ_1^e. Taking the point-wise limit of σ² κ_Mat(||Ψ_T − Ψ_{T'}|| | ν, φ) as ν → ∞, we obtain the equivalent of the SE covariance kernel.

We now define the regularity of functions using three function spaces. Let ⌊ν⌋ denote the largest integer less than or equal to ν and ǧ(s) denote the Fourier transform of g : R^d → R. First, the Hölder space C^ν[0, 1]^d is the space of all functions on [0, 1]^d whose partial derivatives of order (k_1, ..., k_d) exist for all nonnegative integers k_1, ..., k_d such that k_1 + ... + k_d ≤ ⌊ν⌋ and whose ⌊ν⌋-th order derivatives are Lipschitz of order ν − ⌊ν⌋. Second, the Sobolev space H^ν[0, 1]^d is the space of all functions on [0, 1]^d that are restrictions of a function g for which ∫ (1 + ||s||²)^ν |ǧ(s)|² ds is finite. Finally, the space of infinitely smooth functions, denoted A^{r,γ}[0, 1]^d with r ≥ 1 and γ > 0, consists of restrictions to [0, 1]^d of a function g for which ∫ exp(γ||s||^r) |ǧ(s)|² ds is finite. A function is ν-regular on [0, 1]^d if it belongs to C^ν[0, 1]^d ∩ H^ν[0, 1]^d, and ∞-regular if it belongs to A^{r,γ}[0, 1]^d for some r ≥ 1 and γ > 0.

We introduce some notation for stating the theoretical results. Reformulating the models in Sections 4.2 and 4.3 in terms of feature maps Ψ ∈ [0, 1]^d gives the population versions of (8) and (16), denoted jointly as (23), where τ_0² is the error variance in the regression model, T ∈ (2^T)^C, and the effect of T on y is modeled through the feature map Ψ_{w,R}(T), which is fixed for a given w, R. The model in (23) is well-defined because Ψ is defined in Theorem 3.1 and is fixed if w, R are known. The true relation between the feature map and the response is denoted as µ_0(·), which maps [0, 1]^d to R. The training data y_1, ..., y_n are independently generated following (23) in the two models for fixed T_1, ..., T_n that have been determined a priori. Every T_i is embedded in [0, 1]^d as Ψ_i = Ψ_{w,R}(T_i) and (23) is satisfied for all n embeddings. The true distribution of y_i is N{µ_0(Ψ_i), τ_0²} in (8) and Bernoulli[1/{1 + e^{−µ_0(Ψ_i)}}] in (16). The n-fold product of these distributions is denoted as P_0^n in both models and y ~ P_0^n. The GP prior on f(·) in (8) and (16) is induced using the Matérn or SE covariance kernel according as µ_0(·) in (23) has finite or infinite regularity. The posterior distributions of f given y are denoted as Π^Reg_{Mat,n}, Π^Reg_{SE,n} and Π^Clas_{Mat,n}, Π^Clas_{SE,n} in (8) and (16) depending on the GP prior. The convergence rates are described in terms of posterior probabilities assigned to small neighborhoods of µ_0 as n → ∞. Using (23) and the embedding of (2^T)^C in [0, 1]^d based on the feature map Ψ_{w,R}, the distance between f(·) and µ_0(·) is defined as the empirical norm ||f − µ_0||_n = [n^{−1} Σ_{i=1}^n {f(Ψ_i) − µ_0(Ψ_i)}²]^{1/2}, where ||·||_n conditions on the covariates T_1, ..., T_n and its dependence on w and R is suppressed.
Given w, R, the true and estimated effects of T_i on y_i are µ_0(Ψ_{w,R}(T_i)) and f(Ψ_{w,R}(T_i)), respectively. The next theorem quantifies the rate of convergence of the posterior distribution in terms of the ||·||_n norm.

Theorem 4.1. Assume that the regression and classification models defined in (23) are true, σ and φ are known, and the Matérn covariance kernel has smoothness ν_1. If µ_0 is ν_2-regular, then, for a sufficiently large constant M, Π(f : ||f − µ_0||_n ≤ M ε_n | y) → 1 in P_0^n-probability for every w, R with ε_n = n^{−ν_2/(2ν_1+d)}.
The proof of Theorem 4.1 is based on the standard setup [4, Theorems 11.23 and 11.22]. The theorem is also applicable when β is non-zero if we absorb x as a part of the Ψ_T vector. Specifically, we scale the covariates x, x' so that they lie in [0, 1]^p, redefine d to be p + C|t*||B|, and define the Matérn and SE kernels on the concatenated vector (x, Ψ_T).

Experiments

Setup
We evaluate the performance of GP-based classification in (16) using three simulation studies and an analysis of the UIHC EHR data. In this section, the covariance kernel of the GP is the SE kernel, and the focus is on classification because our main goal is to address questions relevant to the UIHC EHR data; the supplementary material contains the results for GP-based regression using (8). In the simulated and real data analyses, the cutoff for predicting the response is estimated using the Receiver Operating Characteristic (ROC) curve. We compare the predictive performance of all the methods using accuracy, area under the ROC curve (AUC), sensitivity, and specificity on the test data.
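For instance, the cutoff can be obtained from a ROC curve as sketched below, assuming the pROC package and placeholder objects y_valid, p_hat, and p_hat_test; pROC's "best" threshold maximizes Youden's J by default:

library(pROC)
roc_obj <- roc(response = y_valid, predictor = p_hat)  # validation data
cutoff <- as.numeric(coords(roc_obj, "best", ret = "threshold"))
y_pred <- as.integer(p_hat_test > cutoff)              # test-set labels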
We compare GP-based classification with parametric and nonparametric models. The parametric competitors are logistic regression and its regularized versions that penalize the regression coefficients using the lasso and ridge penalties. The covariates in these models are dummy variables that indicate the presence of ICD codes in a patient's diagnoses. Because the SE kernel in (5) is a restriction of the boundrange kernel, we use support vector machines (SVM) and kernel ridge regression (KRR) with the spectrum and boundrange kernels as the nonparametric competitors.
We also use the kernel function in (5) for defining feature vectors. Set φ = 1 and σ² = 1 in the kernel matrix K_θ. If UDUᵀ is the singular value decomposition of K_θ, then the first r columns of UD^{1/2} are the feature vectors that are used as covariates in random forest, logistic regression, and its penalized extensions, where r is selected using the scree plot.
The experiments are performed in R. We used glmnet [3] for regularized logistic regression, ranger [20] for random forest, and kernlab [8] for SVM and KRR. The performance metrics are evaluated using the pROC package [17]. The tuning parameters in all the methods are selected using the recommended settings. We use the spectrum and boundrange kernels in kernlab with 3 and 4 as the lengths of the matching substrings, respectively. This means that the matches in the two kernels include the first three and four contiguous characters in a pair of ICD codes, which are also the most informative [2]; therefore, we expect the results for kernlab and our method to be very similar. We use ESS after setting a_φ, b_φ, b_1, and b_2 in (17) to be 0, 5, 1, and 2, respectively.
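A sketch of how the kernlab competitors can be set up, with placeholder data objects code_sets and labels; each patient's ICD codes are concatenated into a single string so that stringdot() can consume them:

library(kernlab)
sk <- stringdot(type = "spectrum",   length = 3)  # k-spectrum kernel
bk <- stringdot(type = "boundrange", length = 4)  # k-boundrange kernel
docs <- sapply(code_sets, paste, collapse = " ")  # one string per patient
svm_spec <- ksvm(as.list(docs), factor(labels), kernel = sk)
svm_bdr  <- ksvm(as.list(docs), factor(labels), kernel = bk)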
The MCMC algorithms run for 10,000 iterations. The draws from the first 5,000 iterations are discarded as burn-in, and every fifth draw in the remaining chain is used for posterior inference and prediction.

Simulated data analysis
We present three simulation studies. The first two are based on simple parametric models, whereas the third is based on a nonparametric model derived from the UIHC EHR data. We expect all methods to have comparable performance in the first two simulations and to differ significantly in the third. The three simulations are replicated ten times.
First two simulations.Both simulations have four chronic conditions.The first chronic condition is denoted using the codes in {A, B, AA, BB, BA, AB}, where every code denotes a "diagnosis".
Replacing (A, B) with (C, D), (E, F ), and (G, H) in this set gives the set of codes for the second, third, and fourth chronic conditions, respectively.The chronic conditions are further structured into two groups: the first group includes the first and third chronic conditions and the second group includes the remaining two.We assign nine codes to every patient, where six out of the nine codes are from the 12 codes defining first or second group of chronic conditions and the remaining three come from the other chronic condition group.The probability of y = 1 is higher than that of y = 0 if codes from the second chronic condition group are in majority and vice-versa.
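The code assignment can be sketched in R as follows (our names, matching the design just described):

make_codes <- function(a, b) c(a, b, paste0(a, a), paste0(b, b),
                               paste0(b, a), paste0(a, b))
conditions <- list(make_codes("A", "B"), make_codes("C", "D"),
                   make_codes("E", "F"), make_codes("G", "H"))
group1 <- unlist(conditions[c(1, 3)])  # conditions 1 and 3 (12 codes)
group2 <- unlist(conditions[c(2, 4)])  # conditions 2 and 4 (12 codes)

simulate_patient <- function() {
  major <- sample(c(1, 2), 1)  # which group supplies six of the nine codes
  if (major == 1) c(sample(group1, 6), sample(group2, 3))
  else            c(sample(group2, 6), sample(group1, 3))
}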
The first simulation study is based on logistic regression. Let z^{(c)}_{ij} be a dummy variable indicating the presence of the jth code of the cth chronic condition in the ith patient (c = 1, ..., 4; j = 1, ..., 6; i = 1, ..., n + m), where the codes follow the dictionary order, n is the training data size, and m is the testing data size. The first and second group dummy variables for the ith patient are {z^{(c)}_{ij} : j = 1, ..., 6; c = 1, 3} and {z^{(c)}_{ij} : j = 1, ..., 6; c = 2, 4}. The response y_i is simulated independently from Bernoulli(p_i), where p_i follows the logistic regression model in (27), for i = 1, ..., n + m. The second simulation is a slight modification of the first: dropping the x_iᵀβ term in (27), we modify the linear predictor to include only the interactions of {AB, BA}, {CD, DC}, {EF, FE}, and {GH, HG}, as specified in (28).
In the first two simulations, the GP with SE covariance kernel is among the best performers (Table 1). The first simulation has a linear decision boundary in terms of the 24 codes. On the other hand, the second simulation has a quadratic decision boundary depending on the interactions between {AB, BA}, {CD, DC}, {EF, FE}, and {GH, HG}. Logistic regression and its regularized versions perform the best in the first simulation, but their performance deteriorates slightly in the second simulation due to the presence of interactions between dummy variables. Random forest using the codes as covariates has the worst performance in both simulations. After including the SE kernel-based covariates, random forest achieves the same level of performance as the other methods in both simulations. The remaining methods are based on non-linear kernels, so they are more flexible in adapting to both linear and non-linear decision boundaries. Compared to the spectrum and boundrange kernels, the SE kernel in (5) is better tuned for modeling the extra structure provided by the chronic conditions. This implies that kernel-based methods perform well in both simulations and that the GP with SE kernel performs better than SVM and KRR with the spectrum and boundrange kernels.
Third simulation. For this simulation, we used the ICD codes of patients in the UIHC EHR data with brain and other nervous system or breast as the primary cancer sites. For the ith (pseudo) patient in this subset, let z_i = ||Ψ(T_i)||² and δ_i = 1 if the patient with codes T_i has brain and other nervous system cancer and δ_i = −1 otherwise, where Ψ(T_i) is defined in (3). The response y_i is simulated independently from Bernoulli(p_i), where p_i is a non-linear periodic function of z_i and δ_i specified in (29), for i = 1, ..., n + m. We simulate the data after setting n = 1000 and m = 100.
The GP with SE covariance kernel is still among the top performers (Table 1). This simulation has a non-linear periodic decision boundary; therefore, the performance of random forest and the parametric models, including logistic regression and its regularized versions, deteriorates significantly. While all the kernel-based methods are suited for modeling the non-linear periodic decision boundary in (29), the SE kernel is better tuned than the spectrum and boundrange kernels for modeling the hierarchical structure of ICD codes. This implies that all kernel-based methods perform well, but the GP with SE kernel performs better than SVM and KRR with the spectrum and boundrange kernels.
Summary. Our simulation studies suggest that kernel-based methods are better suited for applications in practice where we expect interactions among the diagnoses. Furthermore, the GP with the SE kernel is easily extended to account for additional biomedical information, is tuned for modeling the structure of ICD codes, and produces results similar to SVM or KRR with the spectrum and boundrange kernels in the absence of any additional structure. The distinguishing feature of the proposed method is that the posterior MCMC draws can be used for quantifying uncertainty in predictions and parameter estimates, which is key in biomedical applications; therefore, we conclude that the GP with SE covariance kernel is better than the other kernel-based methods in quantifying uncertainty and in modeling the effect of chronic conditions on the response.
Table 1: Performance comparisons for the three simulation studies. Every entry in the table is the average of its values across ten replications. SE-GP is the proposed method, and SE kernel-based features are obtained from the kernel matrix estimated using the proposed method. If a method fails to produce results, then we indicate this using '-'.

UIHC EHR Data Analysis

SE-GP also performs better than SVM with the spectrum and boundrange kernels because the SE kernel accounts for the additional structure provided by the 58 chronic conditions. The SE kernel is a restriction of the boundrange kernel that is tuned for EHR data analysis, so the sensitivity and specificity of SE-GP are much higher than those of the SVM in all five classification models (Tables 2-6). Most importantly, the estimates of marginal associations between chronic conditions and primary cancer sites agree closely with those obtained using the SVM with the spectrum and boundrange kernels (Table 7). A key feature of the SE-GP approach is that it provides 95% credible intervals in addition to the point estimates, which are important for characterizing uncertainty in biomedical applications.
Additionally, these credible intervals cover the corresponding estimates obtained using the SVM. Based on our results in the simulations and this section, we conclude that SE-GP outperforms its competitors in estimating the marginal associations between chronic conditions and primary cancer sites and in predicting the primary cancer site using diagnoses and demographic information as predictors.

Discussion
Our approach can be extended in several ways. First, the sample size in the motivating application is small, so repeated computation and evaluation of the kernel function is relatively inexpensive. This, however, becomes problematic as the sample size becomes moderately large. The kernels developed in this work can be immediately extended for nonparametric regression and classification in large-sample settings using low-rank kernels based on inducing points [14]. Second, every element of the kernel matrix is a similarity measure between a pair of patients. This can be used as an input in algorithms for unsupervised learning using ICD codes, such as clustering of subsets, data visualization, and dimension reduction [18, Chapters 6, 8]. Third, the PG-DA strategy outlined in Section 4.3 is easily extended to model zero-inflated negative binomial responses following Neelon [10]. Finally, we are exploring the application of a product partition model with regression on diagnoses, where the similarity measure on diagnoses is defined using the kernel matrix.
A Proof of Theorem 1

Proof of (1): Writing Λ = RᵀR for an upper triangular matrix R, (i) follows from (38); therefore, κ_0(·, · | Λ) is a kernel function with feature map Rψ(t) for any t ∈ T. Similarly, (32) implies the analogous result for any t, t' ∈ 2^T. Following the same arguments used in (39), the closure properties of kernel functions imply that κ_2 in (33) is a valid kernel function.
Proposition 3.22 on Page 75 in Shawe-Taylor and Cristianini [18] implies that κ_2(T, T' | w, Λ) is a kernel function because each κ_1(t_c, t'_c | Λ) is a kernel function and w_c > 0 (c = 1, ..., C). The same proposition also implies that κ_2(T, T | w, Λ) ≥ 0 for any T ∈ (2^T)^C, which is obtained following the same steps as in (39) and (40). The dimension of Ψ_{w,R}(T) is C times the dimension of ψ(t) for any t ∈ T, which equals C|t*||B|.
Finally, the kernel κ in (34) is obtained by normalizing κ_2. First, normalize the feature map Ψ_{w,R}(T) for any T ∈ (2^T)^C by its Euclidean norm, where step (i) of the derivation follows from the Cauchy-Schwarz inequality. Second, noting that ||Ψ_{w,R}(T)|| = {κ_2(T, T | w, Λ)}^{1/2}, define the normalized κ_2 kernel using the normalized feature map. Using the normalized feature map in the definition of the distance function in (35), d(T, T') equals the Euclidean distance between the normalized feature maps of T and T'.
Proof of (2): Using (43), we have that the normalized version of Ψ_{w,R}(T) is the feature map of κ.
Proof of (3): Proposition 3.24 on Page 76 in Shawe-Taylor and Cristianini [18] implies that the polynomial kernel κ_s is a valid kernel function whenever κ is. For the γ-exponential kernel, write κ_γ^e in terms of the distance d(T, T') using its definition in (42). Section 4.2.1 in Rasmussen and Williams [16] shows that this is a valid kernel function, so κ_γ^e(T, T' | σ², φ, w, Λ) is a valid kernel function on (2^T)^C × (2^T)^C; see Equation 4.18 on Page 86 in Rasmussen and Williams [16].

B Proof of Theorem 2
Recall that the population regression and classification models are given in (45), where τ_0² is the error variance, T ∈ (2^T)^C, the feature map is defined in (42), and the effect of T on y is modeled through the feature map Ψ_{w,R}(T), which is fixed for a given w, R. The distance between f(·) and µ_0(·) is defined as the empirical norm ||f − µ_0||_n = [n^{−1} Σ_{i=1}^n {f(Ψ_i) − µ_0(Ψ_i)}²]^{1/2}, where ||·||_n conditions on the covariates T_1, ..., T_n and its dependence on w and R is suppressed.
We now restate Theorem 2.
Theorem B.1. Assume that the regression and classification models in (45) are true, the Matérn covariance kernel has smoothness ν_1, and the σ and φ parameters of the covariance function are known. If µ_0 is ν_2-regular, then the posterior distributions of f concentrate around µ_0 at the rate ε_n = n^{−ν_2/(2ν_1+d)} in the ||·||_n norm in P_0^n-probability for every w, R.

Proof. The feature map Ψ_{w,R} embeds (2^T)^C in [0, 1]^d with d = C|t*||B|, where C, |t*|, and |B| are defined in Theorem A.1; therefore, we can directly apply Theorem 11.22, Theorem 11.23, and the results in Section 11.4.4 of Ghosal and van der Vaart [4] for Matérn and SE GP priors indexed by a covariate lying in [0, 1]^d. Specifically, Lemma 11.36 and Lemma 11.37 imply that (47) is satisfied with ε_n = n^{−ν_2/(2ν_1+d)} for a given w, R. Similarly, Lemma 11.38 and Lemma 11.41 imply that (48) is satisfied with the same ε_n.

D Simulations for GP Regression
We now summarize the simulation results for GP regression with the SE kernel, which are based on Section 5.2 of the main paper. The three regression models are based on the three classification models in Equations (27), (28), and (29) of the main paper. We use the same methods that were used for evaluating the performance of GP classification in the main manuscript. The performance of all the methods is compared using the mean square error (MSE), mean square prediction error (MSPE), and the predictive coverage based on a 95% confidence or credible interval.
First two simulations. Recall that in the first two simulations there are four chronic conditions. The first chronic condition is denoted using the codes in {A, B, AA, BB, BA, AB}, where every code denotes a "diagnosis". Replacing (A, B) with (C, D), (E, F), and (G, H) in this set gives the sets of codes for the second, third, and fourth chronic conditions, respectively. The chronic conditions are further structured into two groups: the first group includes the first and third chronic conditions, and the second group includes the remaining two. Our simulation assigns nine codes to every patient, where six of the nine codes are from the 12 codes defining the first or second group of chronic conditions and the remaining three come from the other group.
Our regression models add error terms to the linear predictors in the classification models of the main paper. Let z ~ N(0, 1) and the errors ε_i be independent and identically distributed.

Figure 1: Heatmaps of kernel matrices for patients in the UIHC EHR data. The color palette represents 0 and 1 using dark blue and dark red, respectively. The first four kernel matrices in (a)-(d) set Λ = I, σ² = 1, φ = 1, and w_c ∝ κ_1(T_c, T_c) for the kernels κ_2, κ, κ_1^e, and κ_2^e, where T_c is the set of ICD codes defining the cth chronic condition. The last two kernel matrices are obtained using the spectrum and boundrange kernels, which are popular string kernels in text mining, with substring length equal to 3. The block correlation structures are better captured in (b), (c), and (d) than in (a), (e), and (f), indicating the superiority of the proposed kernel functions.



Table 2: Performance in the classification model with y = 1 denoting the presence of cancer in the brain and other nervous system. Every entry in the table is the average of its values across ten replications.

Table 3: Performance in the classification model with y = 1 denoting the presence of cancer in the breast. Every entry in the table is the average of its values across ten replications.

Table 4: Performance in the classification model with y = 1 denoting the presence of cancer in the urinary system. Every entry in the table is the average of its values across ten replications.

Table 5: Performance in the classification model with y = 1 denoting the presence of cancer in the respiratory system. Every entry in the table is the average of its values across ten replications.

Table 6: Performance in the classification model with y = 1 denoting the presence of cancer in the female genital system. Every entry in the table is the average of its values across ten replications.