Smooth semi‐nonparametric (SNP) estimation of the cumulative incidence function

This paper presents a novel approach to estimation of the cumulative incidence function in the presence of competing risks. The underlying statistical model is specified via a mixture factorization of the joint distribution of the event type and the time to the event. The time to event distributions conditional on the event type are modeled using smooth semi‐nonparametric densities. One strength of this approach is that it can handle arbitrary censoring and truncation while relying on mild parametric assumptions. A stepwise forward algorithm for model estimation and adaptive selection of smooth semi‐nonparametric polynomial degrees is presented, implemented in the statistical software R, evaluated in a sequence of simulation studies, and applied to data from a clinical trial in cryptococcal meningitis. The simulations demonstrate that the proposed method frequently outperforms both parametric and nonparametric alternatives. They also support the use of ‘ad hoc’ asymptotic inference to derive confidence intervals. An extension to regression modeling is also presented, and its potential and challenges are discussed. © 2017 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.


Introduction
Competing risks refer to the situation where multiple event types are of interest and the time T and the event type D of the first occurring event, possibly subject to censoring or truncation, is analyzed. Competing risks are frequent in medical research, especially in multimorbid or critically ill study populations [1]. The most important descriptive quantity in the analysis of competing risks data is the cumulative incidence function (CIF), which measures the absolute risk of different event types j over time: CIF j (t) = P(T ≤ t, D = j). In the presence of right censoring or left truncation, the CIF is usually estimated by the nonparametric Aalen-Johansen estimator, which gives a step-function [2]. Non-parametric estimates of the CIF for interval-censored data have also been suggested, but they are non-unique over certain sub-intervals of the follow-up period and may have nonstandard asymptotic behavior with convergence rates slower than n −1∕2 [3][4][5]. Parametric methods for CIF estimation provide smooth estimates, can easily be applied to arbitrary censoring and truncation schemes, have standard asymptotic properties, and may be more efficient if the assumed parametric model is correct [6,7]. However, they rely on parametric assumptions, which may not be tenable.
In econometrics and survival analysis, smooth 'semi-nonparametric' (SNP) densities have been successfully used as building blocks for flexible models that combine some of the advantages of nonparametric and parametric approaches [8][9][10]. In this work, we propose a novel SNP estimator of the CIF. The rest of this paper is organized as follows: Section 2 gives a brief overview of SNP density estimation. Section 3 describes the underlying model for the CIF, and Section 4 proposes an estimation algorithm and methods for 'ad hoc' asymptotic inference. A test statistic to compare the CIFs for a specific event type between independent groups is given in Section 5. Section 6 evaluates the finite-sample performance of the proposed methods in a simulation study, and Section 7 gives an application to a trial in cryptoccocal meningitis. The potential and challenges of extending our CIF-estimation approach to regression modeling are discussed in Section 8, followed by concluding remarks.

Smooth semi-nonparametric density estimation -overview
The SNP densities were first introduced in the field of econometrics by Gallant and Nychka [8] to estimate the distribution of a nuisance random part in a statistical model where little is known about the shape of the distribution. For univariate problems, an SNP density is of the form g K (z) = P 2 K (z) (z) + 0 g 0 (z) where P K (z) = ∑ K i=0 a i z i , (z) is a known base density such as the standard normal density, and g 0 (z) is a known strictly positive density with expectation zero, which together with the small positive constant 0 governs the tail behavior of g K (z). For fixed K, g K (z) is fully parametrized by the finite set of polynomial coefficients ( a 0 , … , a K ) , which can be estimated using standard techniques such as maximum-likelihood estimation (MLE). If the polynomial degree K = K n is adaptively chosen depending on the sample size n such that lim n→∞ K n = ∞, then g K n (z) can consistently estimate any density belonging to a rich class of smooth densities allowing for skewness, kurtosis, multi-modality, or non-normal tail behavior under certain regularity conditions [8]. Moreover, it was demonstrated in an extensive simulation study that SNP density estimation is qualitatively similar to kernel density estimation [11].
Smooth semi-nonparametric densities were introduced as building blocks for regression modeling of time-to-event data and estimation of the survival function by Zhang, Doehler, and Davidian [9,10]. For estimation of the survival function of a random variable T, a simple model of the form is used with and governing the location and scale of log T, respectively, and Z is assumed to have a SNP density of the form h K (z) = P 2 K (z) (z), where (z) is the density function of the standard normal distribution and the 'tail' 0 g 0 (z) is ignored as in most practical implementations of the methodology. Alternatively, the authors suggested to parametrize the distribution of e Z by a SNP density of the form l K (z) = P 2 K (z) (z) where (z) is an exponential density with rate 1. For fixed K, the distribution of T in (1) only depends on , and the polynomial coefficients of the SNP density, leading to a tractable implementation of the likelihood function under arbitrary censoring and truncation. The choice of K as well as the choice between a normal or exponential base density is made adaptively based on an information criterion such as the criterion proposed by Akaike. In extensive simulation studies, several authors [9,10] demonstrated the advantages of the SNP approach compared with traditional parametric, semi-parametric and nonparametric methods. Moreover, they ascertained earlier claims (e.g. [11]) that 'ad hoc' asymptotic inference that ignores the adaptive choice of K frequently provides confidence intervals with acceptable coverage.
The marginal event probabilities P(D = j) are assumed to follow a simple multinomial model that we parametrize by a multinomial logistic model with intercept terms only: where j , j = 1, … , J − 1, are the parameters, and J is set to 0 to ensure model uniqueness.
Conditional on D = j, T | D = j is a 'proper' time-to-event variable, and we parametrize its distribution using SNP densities as suggested in [9], that is, we assume that for j = 1, … , J where Z j are random variables whose distribution is modeled using SNP densities, and j and j govern the location and scale of log (T | D = j), respectively. As in [9], we consider two different SNP models for Z j : (1) a direct model that assumes that Z j follows a SNP density of degree K j with a standard normal base density, or (2) an indirect model that models e Z j based on a SNP density with an exponential base density with rate 1. For parsimony reasons, we assume that all conditional time-to-event distributions use the same base density but possibly different polynomial degrees K j and parametrize the polynomial coefficients of the SNP densities by spherical coordinates as detailed in the web-appendix of [9]. Of note, for the special case of K j = 0, the model specifies a log-normal or a Weibull distribution for the conditional time-to-event distributions.

Likelihood construction for fixed polynomial degrees
Direct simultaneous estimation of all parameters in the model specified previously including the SNP degrees is not feasible. However, for fixed base densities and a fixed set of SNP polynomial degrees , the underlying model is parametric with a total parameter set given by . Hence, standard MLE of ( ) can be employed. Importantly, the likelihood can easily be specified for arbitrary censoring and truncation mechanisms. For example, for competing risks data subject to independent right or interval censoring as well as left-truncation, the respective likelihood contribution from a subject is for 0 ≤ l < t l ≤ t r . Here, D ′ refers to the observed event type indicator with D ′ = D if an event was observed at time t l = t r (for uncensored data) or in the interval [ t l , t r ] (for interval-censored data), and D ′ = 0 refers to a right-censored observation at time t l . l refers to the truncation time, Explicit formulas for P(T ≤ t | D = j) can be derived using integration by parts as described in [9].

Estimation procedure
For a fixed base density, we propose to choose the polynomial degrees based on an information criterion using the following greedy step-wise forward algorithm: Step 0: Perform MLE for the parametric model with all K j = 0. This is the current best fit. Go to Step 1.
Step 1: Calculate MLE for the J parametric models that correspond to increasing the polynomial degrees K 1 , … , K J , one at a time, by 1 compared with the current best fit. Among these, we call the model with the best information criterion the new best model. Step 2: If the new best model is better than the current best fit, replace the current best fit with the new best model and repeat Step 1. Otherwise, go to Step 3.
Step 3: Stop and return the current best fit.
In practice, we truncate the maximum allowed polynomial degrees at K max for computational reasons. In survival analysis, K max = 2 is generally sufficient to achieve an excellent fit based on extensive simulation studies [9]; in other settings, values of K max from 2 to 4 have been suggested [11]. Commonly used information criteria are of the form 2 × {−l( ( )) + qc}, where l( ( )) is the log-likelihood evaluated at the MLE for fixed polynomial degrees and q is the total number of parameters in the model. Typical choices of c are c = 1 (Akaike's information criterion, AIC), c = log(n)∕2 (Bayesian information criterion, BICn), and c = log{log(n)} (Hannan-Quinn's criterion, HQCn). Alternatively, we also considered replacing n, the sample size, by d, the total number of events of any type, in the definition of BIC and HQC and denote the resulting information criteria by BICd and HQCd, respectively. A schematic of the greedy stepwise forward algorithm for 2 competing risks with K max = 2 and AIC as the information criterion is given in Appendix A of the Supporting Information. Finally, the choice between the normal and the exponential base density is made by comparing the best fits from both base densities using the same information criterion.
For fixed polynomial degrees, the log-likelihood function is still relatively complex and may have multiple extrema. Hence, good starting values for the numerical optimization algorithm are crucial. We detail the proposed selection of starting values in Appendix B of the Supporting Information. In brief, for step 0, starting values for j and j are determined based on parametric log-normal or Weibull survival models including all subjects with an observed (or interval-censored) event of type j. Right-censored observations are also included but weighted based on a crude estimate of their probability of ultimately experiencing event type j. Initial values for j ( j = 1, … , J − 1) are then obtained by optimizing the likelihood function with respect to these parameters only. For step 1, starting values for j are the estimates from the current best fit. For the new polynomial coefficients (in spherical coordinates), multiple starting values are selected from a regular grid on [ − ∕2, ∕2 ] K j +1 as suggested in [9]. Finally, corresponding new starting values for j and j are selected such that the first two moments of the distribution of log T | D = j (which depend on these parameters and the spherical coordinates) remain unchanged compared with the current best fit. For each set of starting values, numerical optimization of the log-likelihood is then implemented using a quasi-Newton algorithm. An implementation of the proposed estimation algorithm by the first author using the statistical software R [16] has been uploaded to GitHub (see Section 9 for details).

Ad hoc statistical inference for cumulative incidence function estimates
A rigorous proof of asymptotic normality of smooth semi-nonparametric models based on SNP densities is still lacking in general. However, empirical evidence from several studies on density estimation and survival analysis suggest that the use of standard asymptotic inference for MLE based on the final fit and ignoring the adaptive choice of the degrees of the SNP polynomials yield acceptable performance [9,11]. Accordingly, we base the calculation of standard errors and confidence intervals on the observed Fisher information matrix I(̂) with respect tô≡̂( ) with being the SNP polynomial degrees of the final fit. Specifically, to obtain pointwise confidence for CIF j (t), we propose to first determine a Wald-type confidence interval for the complementary log-log (cloglog) transform of this quantity using the delta-rule and then to transform this confidence interval back to the original scale. Operating on the cloglog-scale is expected to improve the validity of the normal approximation and avoids having out of range confidence intervals. Of note, we also considered the usage of sandwich type robust variance estimator (e.g., equations (4.5.2) and (4.5.4) in [17]) for the aforementioned calculations, but they did not lead to improved coverage in our simulation studies.

Between-group comparisons of cumulative incidence functions
Doehler and Davidian suggested methods for comparing survival functions between two independent groups based on SNP estimates [10]. A similar approach is possible for between-group comparisons of CIFs. Without loss of generality, assume that the CIF corresponding to the first event type is of interest. The null hypothesis and alternative hypothesis are where superscripts denote the groups. A suitable test statistic is the integrated weighted difference (IWD): where W(⋅) is a positive weight function and is a suitably chosen time horizon. In principle, the weight function W(⋅) can be manipulated to make the test more sensitive to a specific alternative hypothesis, and for non-parametric CIF estimates, W(⋅) also serves as a stabilization tool to down-weight time points where the risk set is small and hence the test statistic unstable [18]. Instability is less of an issue for parametric or SNP estimates. Thus, if we are not interested in any specific alternative hypothesis, we can simply set W(⋅) = 1. Two possibilities for deriving the null distribution are available. First, one can estimate the variance of the test statistic based on ad hoc asymptotic inference and the delta-rule and use this as the basis of a Wald-type test. Alternatively, the null distribution and the resulting p-value can be evaluated exactly without relying on a normal approximation by implementing a permutation test (or a Monte-Carlo approximation to it). Details regarding the Wald-type test as well as the permutation test are given in Section F of the Supporting Information.

Simulation study
Simulation study for two event types (J = 2) -Setting: Several competing risks scenarios with right or interval censoring were investigated to compare the proposed SNP-CIF estimation method to alternative parametric and nonparametric approaches in a series of simulations. All scenarios considered two competing events only, that is, J = 2, and uncensored data were simulated either according to the mixture factorization (2) or a latent failure time model. For the scenarios based on the mixture factorization, we either chose a Weibull distribution for both conditional time to event distributions or a log-normal distribution for the first event type and a SNP distribution with K = 1 and a standard normal base density for the second event type, respectively. For simulation from a SNP distribution, we used rejection sampling as suggested in [19] and detailed in Appendix C of the Supporting Information. The corresponding marginal distribution of the first event type was chosen as 25% or 50%, respectively. The last scenario was based on two independent latent failure times T 1 and T 2 leading to T = min ( . We chose a mixture log-normal distribution for T 1 and a Weibull distribution for T 2 . Exact parametric choices for all scenarios are summarized in Table I. Table I. Specification of the distribution of uncensored competing risks data for the scenarios investigated in the simulation study.
Mixture representation based scenario Scenario ) 50 Latent failure times-based scenario Scenario Note: T | D = j is the time to event distribution for event type j conditional on the occurrence of that event type. P 1 is the marginal probability of event type 1. T j is the independent latent failure time distribution associated with event type j used in the latent failure times-based scenario. W( −1 , e ) refers to a Weibull distribution with shape −1 and scale e and LN ( , 2 ) to a log-normal distribution with mean and variance 2 for the log-transformed data. SN( , , ) is a random variable T satisfying log T = + Z, where Z has a smooth semi-nonparametric distribution with a standard normal base density and polynomial degree K = 1 with a corresponding spherical coordinate .
For scenarios with independent right-censoring, a censoring time C was simulated and the observed right-censored data defined as ( , where Δ = 1 if T ≤ C and Δ = 0 otherwise. The censoring distribution C was simulated as C = min{t m , C E } where the maximum followup duration t m was chosen as the 50% or 90% quantile of the total survival time T and C E was simulated according to an exponential distribution with rate chosen appropriately to ensure the desired overall right censoring probability of 65% or 35%, respectively. The parameters t m and were determined via simulation using a large competing risks dataset of size n = 10 6 from the respective scenario. In the presentation of the simulation results, the two different right-censoring schemes are denoted as '65% RC' and '35% RC', respectively. For scenarios with independent interval censoring, we first determined the maximum follow-up time t m as the 50% or 90% quantile of the total survival time T as before. Interval censoring was then simulated as follows: If t m corresponded to the 90% quantile (respectively median) of T, we split the interval [0, t m ] into 10 (respectively 7) equally spaced sub-intervals. For each observation, the corresponding observation process was defined as occurring at the sub-interval cut-points plus some observation-specific normally distributed 'noise' with mean 0 and standard deviation equal to 1/5 of the sub-interval length for time points other than 0 and t m . If a subject had an event before time t m , then their data was considered interval-censored between the two adjacent time points of the observation process; otherwise, they were considered right-censored at t m . In the presentation of the simulation results, the two different interval censoring schemes are denoted as 'IC-1' (t m =90% quantile, 10 sub-intervals) and 'IC-2' (t m =50% quantile, 7 sub-intervals), respectively.
Finally, the sample size was varied between 100 and 500 leading to 12 simulation settings with right censoring and 12 with interval censoring. Reported simulation results are based on 1000 simulated datasets per setting and maximum allowed polynomial degrees K max = 3.
For the adaptive choice of the polynomial degrees and base density, the following information criteria were investigated: AIC, BICn, BICd, HQCn, and HQCd. As simulation results based on the sample size n and the number of events d were very similar, only results based on n are reported, and the best fits according to the different information criteria are denoted as SB-AIC, SB-BICn, and SB-HQCn. Competing methods included parametric and non-parametric estimators. Parametric estimators were based on the mixture factorization assuming log-normal or Weibull conditional time to event distributions that are available as a special case of our algorithm with polynomial degrees restricted to K = 0. For rightcensored data, the standard nonparametric Aalen-Johansen estimator of the CIF was used and estimates for interval-censored data were based on a nonparametric maximum likelihood estimator for bivariate distribution (X, Y) described in [20] and implemented in function computeMLE of the R package MLEcens [21]. As the nonparametric maximum likelihood estimator can only assign mass to a finite number of disjoint intervals, we chose a unique representation by distributing the assigned mass uniformly across the respective intervals.
Simulation study for two event types (J = 2) -Results: Detailed results for interval-censored settings are provided here. Right-censored settings are only summarized but detailed results are provided in Appendix D of the Supporting Information. Table II summarizes how frequently our method selected both the correct base density and the correct polynomial degrees of the corresponding conditional time to event distributions for mixture representation-based scenarios. For the lower sample size of n = 100, the average proportion of correctly chosen base densities and polynomial degrees was quite low, ranging from 32% for AIC to 58% for BICn. For n = 500, these proportions increased sharply and ranged from 60% for AIC to 91% for BICn with an intermediate value of 80% for HQCn. Slightly higher proportions and a similar pattern were observed for right-censored settings; for n = 500 and HQCn, the proportion of correctly chosen models was 85% ( Table 1 of Appendix D of the Supporting Information). The last two rows in Table II show the frequency with which fits to data from latent failure time models (which do not follow a SNP mixture model) used the maximum allowed polynomial degree of K = 3. In general, AIC was more likely to exhaust the maximum complexity, although for right-censored data this was not true for one right-censored setting with n = 500, which can be explained by the fact that AIC frequently chose a different base density compared with the other criteria ( Table 1 of Appendix D of the Supporting Information).
The precision of CIF estimates was evaluated based on the integrated mean squared error (IMSE) from time 0 to t m , which was defined for event type j as For the other scenarios, the frequency with which the maximal allowed polynomial degree was chosen (i.e., K 1 = 3 or K 2 = 3) is reported (rows 5 and 6). All results are for scenarios with interval censoring. All frequencies are based on 1000 simulated data sets per scenario. IC-1 refers to interval censoring with 10 sub-intervals and right-censoring at the 90% quantile. IC-2 refers to interval censoring with seven sub-intervals and right-censoring at the 50% quantile.
where N s is the total number of simulated datasets,ĈIF ji refers to the estimated CIF for event type j from the ith dataset of the considered simulation setting and CIF j is the respective true CIF. Table III shows the relative IMSE of parametric and nonparametric methods compared with the SB-HQCn model and absolute IMSE values for SNP-based models for scenarios with interval censoring. If a parametric Weibull or log-normal model was the true model for the respective CIF, SNP models based on HQCn were slightly outperformed by the true parametric model. If this was not the case, the SNP-based model usually outperformed the parametric models, sometimes dramatically so. Moreover, our method consistently outperformed the non-parametric estimator. Conclusions were similar for right-censored data ( Table 2 in Appendix D of the Supporting Information), although in this setting the non-parametric estimator outperformed our estimator for one scenario. Regarding the investigated information criteria, BICn often performed best, although for some complex right-censored scenarios, AICn performed considerably better. HQCn, the recommended information criterion in survival analysis [9,10], provided intermediate performance and CIF estimates that resembled the truth quite closely ( Figure 1). Furthermore, to see if the IMSE might be driven by single time regions, we also looked at the mean squared error at the specific time points t m and t m ∕2. However, this leads to qualitatively identical conclusions to the results for the IMSE reported previously (data not shown). Table IV and Table 2 in Appendix D of the Supporting Information give point-wise Monte Carlo coverage probabilities of nominal Wald-type 95% confidence interval for all CIF estimators at two selected time points, 0.5t m and t m , for interval-censored and right-censored scenarios, respectively. All confidence intervals were based on the complementary log-log transformation, and for SNP-based estimates, they used ad hoc asymptotic inference as described in Section 4.3. Our model based on HQCn showed some undercoverage, but observed coverage was larger than 90% for 45/48 reported values for interval-censored settings and 43/48 reported values for right-censored settings with the lowest observed coverage being 82%. Models based on BICn and AICn showed comparable although overall slightly worse coverage.
Parametric models usually performed well if the parametric model reflected the truth but demonstrated dramatic undercoverage in some other settings. Coverage for the non-parametric estimator in the presence of interval censoring was not evaluated as confidence intervals are not provided by the used software and the estimator has non-standard asymptotic properties [21]. For right-censored settings, the nonparametric estimator performed well with observed coverage of at least 93.5% across all scenarios except for two very low observed values for the Weibull mixture scenario with n = 100. The latter is potentially due to the very low observed frequency of events of type 2 before time 0.5t m in these settings. Note: LN/SB-HQCn, WB/SB-HQCn, and NP/SB-HQCn are the ratios (in bold) of the integrated mean squared error (IMSE) of the parametric lognormal, Weibull, and the nonparametric models, respectively, versus the SB-HQCn model with corresponding bootstrap standard errors. For each scenario, the last three rows give IMSE values of smooth semi-nonparametricbased models for all information criteria. IC-1 (IC-2) refers to interval censoring with 10 (7) sub-intervals and right-censoring at the 90% (50%) quantile. Bootstrap-based confidence interval for SNP-estimators might lead to improved coverage as demonstrated in the survival setting in [10], but we did not explore this in our simulations as the estimation algorithm is computer intensive. Specifically in our simulation, median (mean) computing times for CIF estimation based on HQCn with a sample size of n = 500 were 91 (88) and 127 (212) seconds across the six interval-censored and right-censored settings, respectively (on a Windows PC with an Intel Core i7-3770 CPU and 10GB Ram).
Additional simulation study for three event types (J = 3): An additional simulation study was conducted for a scenario with three competing risks based on SNP distributions having standard normal base densities and polynomial degrees of 0, 2, and 3, respectively. The simulation scenario and results are detailed in Appendix E of the Supporting Information. A lower frequency of choosing the correct base densities and polynomial degrees (reported in Table 5 of the Supporting Information) than for the simpler scenarios with J = 2 and true K max = 1 reported in Table II was noted. This is not unexpected and a similar effect could already be seen when contrasting results from the competing risks settings with J = 2, shown in Table II, against results from the less complex survival settings investigated by Doehler and Davidian [10]. However, performance in terms of the integrated mean-squared error ( Table 6 of the Supporting Information) remained superior to the parametric and non-parametric alternatives and true coverage of nominal 95% confidence intervals (Table 7 of the Supporting Information) was also generally acceptable acknowledging some mild undercoverage. Note: LN and WB refer to parametric log-normal and Weibull models. SB-AIC, SB-BICn, and SB-HQCn refer to smooth semi-nonparametricbased models using different information criteria. Estimated Monte Carlo standard error of observed coverage ≈ 0.69%. IC-1 refers to interval censoring with 10 sub-intervals and right-censoring at the 90% quantile. IC-2 refers to interval censoring with seven sub-intervals and right-censoring at the 50% quantile.
Type I error for the between-group comparisons of cumulative incidence functions: We used our main simulation study (with J = 2) to investigate whether the asymptotic IWD-based test proposed in Section 5 protects the type I error at the nominal 5% significance level. Specifically, we used the 1000 simulation runs from each scenario to form 500 pairs of two-group comparisons under the null hypothesis that both groups are equal. According to normal plots (provided in Figure 3 of Section G of the Supporting Information), the resulting 500 IWD statistics using the simple weight function W(⋅) = 1 had an approximate normal distribution for all scenarios and both event types. However, estimates of its standard error calculated based on ad hoc asymptotic inference and the delta-rule tended to be too large. The consequence of this was that the simulated type-I errors were conservative, that is, below 5% for all scenarios and event types, but they were generally too small with 8∕20(40%) and 8∕20(40%) of values < 1% for simulations with n = 100 and n = 500 per group, respectively. Quantile plots of the p-values against the target uniform distribution are provided in Figure 2 of Section G of the Supporting Information. Based on these results, we suggest using the proposed (exact) permutation test instead acknowledging the higher computational burden.

Analysis of a trial in cryptococcal meningitis
We used data from a randomized controlled clinical trial comparing antifungal therapies in HIV-infected patients with cryptococcal meningitis [22], and for simplicity, we restricted attention to two of the three randomized interventions: Amphotericin B monotherapy (a frequent treatment in Asia) and combination therapy with amphotericin B and flucytosine (an expensive treatment recommended by treatment guidelines). The competing risks endpoint of interest here was the time from randomization to clearance of quantitative fungal counts in cerebrospinal fluid (beneficial event of interest) or death without prior fungal clearance (harmful competing event) during a follow-up period of 30 days. Quantitative fungal counts were only measured weekly, and hence, the time of fungal clearance was only known to have occurred between the last positive and the first zero measurement, that is, it is interval-censored. The date of death was known exactly, and the time to death was treated as interval-censored during the corresponding 24-h interval. Patients without an event were right-censored at the time of their last positive fungal count measurement. Of note, because of interval-censoring, the data could in principle have missed unobserved fungal clearances for patients who died without observed fungal clearance. However, based on the available quantitative fungal count measurements for these patients, this seemed highly unlikely for almost all cases. The dataset contained 175 patients: 103 reached fungal clearance, 43 died without prior clearance, and 29 were right-censored.
Cumulative incidence function estimates are displayed in Figure 2 and show good agreement between the SNP-based estimator using HQCn and the nonparametric estimator. The figure suggest both a faster Figure 2. Estimated cumulative incidence function for the time to fungal clearance (increasing lines) and one minus cumulative incidence function for prior death (decreasing lines) by treatment arm for the smooth semi-nonparametric-based estimator using HQCn and the nonparametric estimator. Black lines correspond to amphotericin B monotherapy, gray lines to amphotericin B plus flucytosine. Solid lines refer to smooth seminonparametric estimates, dashed lines to nonparametric estimates. As the nonparametric maximum likelihood estimator for interval-censored outcomes can only assign mass to a finite number of disjoint intervals, we chose a unique representation by distributing the assigned mass uniformly across the respective intervals.
time to clearance and a lower risk of prior death for the combination therapy. For a formal comparison, we calculated p-values for the IWD-based test statistic introduced in Section 5 using a trivial weight function and ad hoc asymptotic inference or, alternatively, a permutation test (approximated using 1000 Monte Carlo samples). P-values were p < 0.001 (asymptotic and permutation test) for the comparison of CIFs for fungal clearance and p = 0.020 (asymptotic) or p = 0.026 (permutation test) for prior death, respectively.

Discussion
In this paper, we have developed a new SNP CIF estimation technique for competing risks data. The underlying statistical model is specified via a mixture factorization of the joint distribution of the event type and time. The time to event distributions conditional on the event type are modeled using SNP densities. One key strength of the approach is that it can handle arbitrary censoring and truncation while producing smooth estimates with only mild parametric assumptions. We further suggested a stepwise forward procedure for model estimation. This procedure returns parametric mixture log-normal or Weibull models as starting points, and gradually increases model complexity until the optimal model based on some information criterion is found.
In an extensive simulation study, we demonstrated that the proposed approach provides smooth estimates of the CIF over the observed follow-up period that are frequently more accurate than parametric and nonparametric alternatives even under heavy censoring. The simulations also support the use of 'ad hoc' asymptotic inference to derive confidence intervals although some undercoverage was observed. This is in accordance with other statistical areas such as density estimation [11], time-series analysis [19], random effects modeling [23], and survival analysis [9,10] where the SNP approach has been successfully applied. Of note, the CIF is identifiable over the observed follow-up period, but the marginal event probabilities, P(D = j), are not identifiable in a non-parametric sense. However, the proposed model is weakly identifiable for each SNP polynomial degree because of the additional parametric assumptions. As shown in our simulations, this weak identifiability does not appear to corrupt estimation of the CIF over the observed time range. In principle, our model also allows for the extrapolation of the CIF beyond the observed follow-up period and the estimation of marginal event probabilities. However, we caution against this usage of our method except if there is minimal censoring and these probabilities are truly identifiable from the data. Not unexpectedly, the simulations also revealed that it is more difficult to identify the true data-generating SNP models when the number of competing risks increased from two to three. However, in the setting of three competing risks, our proposed method remained competitive compared with parametric and non-parametric alternatives in terms of the accuracy of the estimated CIFs.
An extension of the proposed model to the regression setting is straightforward by including covariates into the accelerated failure time sub-model (4) and the multinomial logistic sub-model (3). This leads to a flexible regression model for competing risks data that can handle arbitrary censoring and truncation, which has been investigated in detail in the PhD thesis of the first author [24]. Of note, in common with other proposed competing risks models based on the mixture factorization [12,14], the model has interpretational and identifiability issues, especially if there is insufficient follow-up relative to the timing of the events.
Despite the strong performance of models based on SNP densities in simulation studies across statistical areas, a formal theory that would allow to derive asymptotically valid confidence intervals and hypothesis tests is still largely lacking. This provides a potential area for future research that could benefit from recent advances in asymptotic theories for the broader class of sieve MLE estimators [25,26]. Moreover, while the proposed algorithm worked well, it is computationally intensive. The development of faster and more robust algorithms would allow for the exploration of bootstrap methods to improve the coverage of confidence intervals and of models with larger polynomial degrees K, which could further improve estimation if the true CIFs are very irregular. Finally, this paper focused on the CIF as the target of inference. However, competing risks models based on modeling the cause-specific hazards are also popular [27]. One approach to applying SNP densities to cause-specific hazards models would be to extend the SNP Cox proportional hazards model developed by [9].

Software and datasets
An implementation of the proposed estimating algorithm by the first author in the statistical software R [16] has been uploaded to GitHub: https://github.com/nguyenducanhvn101087/R_SNP_Competing_ Risks We thank our clinical colleagues [22] for allowing us to make the cryptococcal meningitis dataset discussed in Section 7 freely available. This dataset and corresponding analysis code has also been uploaded to the same GitHub location.