Penalized likelihood estimation of a mixture cure Cox model with partly interval censoring—An application to thin melanoma

Time‐to‐event data in medical studies may involve some patients who are cured and will never experience the event of interest. In practice, those cured patients are right censored. However, when data contain a cured fraction, standard survival methods such as Cox proportional hazards models can produce biased results and therefore misleading interpretations. In addition, for some outcomes, the exact time of an event is not known; instead an interval of time in which the event occurred is recorded. This article proposes a new computational approach that can deal with both the cured fraction issues and the interval censoring challenge. To do so, we extend the traditional mixture cure Cox model to accommodate data with partly interval censoring for the observed event times. The traditional method for estimation of the model parameters is based on the expectation‐maximization (EM) algorithm, where the log‐likelihood is maximized through an indirect complete data log‐likelihood function. We propose in this article an alternative algorithm that directly optimizes the log‐likelihood function. Extensive Monte Carlo simulations are conducted to demonstrate the performance of the new method over the EM algorithm. The main advantage of the new algorithm is the generation of asymptotic variance matrices for all the estimated parameters. The new method is applied to a thin melanoma dataset to predict melanoma recurrence. Various inferences, including survival and hazard function plots with point‐wise confidence intervals, are presented. An R package is now available at Github and will be uploaded to R CRAN.

and non-cured patients is of high importance to thin melanoma patients for their management decisions and appropriate follow-up schedules.
Multiple studies have analyzed death due to melanoma to evaluate the prognosis or predictive value of patient and tumor characteristics at the time of diagnosis. However, very few studies have analyzed time to melanoma recurrence in this context. To our view, one reason for why time to recurrence is underused is because the exact time of recurrence is challenging to determine (interval-censored). To alleviate this issue, the time at which recurrence is diagnosed is usually analyzed instead, but the interpretation is still made in terms of the time to recurrence. Alternative methods that model the recurrence time need to be developed and made accessible to the community for more accurate prognospis and predictive analyses.
In standard survival analysis, patients who do not experience the event of interest (eg, death or recurrence) during the follow-up time are simply right censored. However, in some circumstances, some individuals might be cured and therefore would never experience the event no matter the length of follow-up. Those patients constitute a cured fraction and should be considered as such. In general, a cured fraction is never observed directly; its information is contained in the right-censored times of a dataset. A common feature for a dataset having a cured fraction is the presence of a group of long-term survivors. Ignoring the cured fraction, and thus treating all the observed right censored times as true right censoring of event times, may lead to biased inferences.
The most popular approach to deal with a cured fraction is called the mixture cure model, and was first proposed by Farewell. 3 This model considers the population of interest to be a mixture of two sub-populations, one of which is susceptible to the event of interest, and another which is cured. Models for these two components are called latency and incidence models, respectively. The incidence is most commonly modeled using a logistic regression as in Reference 3. For the latency there has been consideration of parametric survival models, as well as much focus on the use of semi-parametric survival models such as the Cox model. [4][5][6][7][8] Such a model is appropriate in cases where there is evidence of a possible cured fraction in the data, based on biological feasibility and the presence of a large number of long-term survivors in the sample even after a lengthy follow-up time (see discussion in, for instance, Reference 9). When fitted to a sample without a cured fraction, the mixture cure model reduces to a standard Cox proportional hazards model.
In this article, we propose to extend the mixture cure Cox model to accommodate data with a more general partly interval censoring scheme that consists of exact event times and left, interval and right censoring times. We will develop a maximum penalized likelihood (MPL) method to estimate the model parameters, including the logistic regression parameters, the Cox regression coefficients and nonparametric baseline hazard. A penalty function is used for a smooth estimate of the baseline hazard.
Parameter estimation for a mixture cure Cox model is difficult to carry out using Cox's partial likelihood. A variety of mixture cure Cox model estimation methods for event times with right censoring have been proposed, including References 10-12. In more recent work, there has also been consideration of interval censoring. [13][14][15][16][17][18] Previous research on this topic has only rarely considered a smooth estimate of the baseline hazard function. The method in Reference 18 allows for a piecewise linear approximation to the cumulative baseline hazard of a mixture cure model for interval-censored data. Alternatively, Corbière et al 19 consider a smooth estimate of the baseline hazard function directly via the use of a penalized likelihood, but was limited to right censored samples only.
This article sits within a wider body of research concerning penalized likelihood estimation of proportional hazards models. [20][21][22] Generally, this approach introduces a roughness penalty term to the likelihood to produce a smooth baseline hazard function estimate that is approximated using a set of non-negative basis functions. Ma et al 23 demonstrate that a MPL approach can easily incorporate partly interval censored data. To date, there has been extremely limited consideration of methods for fitting a proportional hazards mixture cure model to partly interval censored data that provides a smooth estimate of the baseline hazard function. This article aims to address this gap by drawing on the MPL method for fitting a proportional hazards model to partly interval censored data.
The rest of this article is organized as follows. In Section 2, the mixture cure Cox model is introduced. It also explains an approximation of the baseline hazard function using M-spline basis functions, and presents the penalized likelihood function. In Section 3, we lay out the simultaneous estimation of the regression parameters and baseline hazard function using an alternating algorithm for the constrained MPL estimation. An automatic smoothing parameter estimation procedure is also presented in this section. Section 4 details the asymptotic properties of the estimates. These asymptotic results enable inferences on both regression parameters and survival quantities without computationally intensive methods such as bootstrapping. In Section 5, we report and discuss the results of two simulation studies, and in Section 6, an application to the thin melanoma study is illustrated. Finally, concluding remarks are included in Section 7.

MIXTURE CURE COX MODELS AND PENALIZED LIKELIHOOD
Let Y i be a random variable denoting the time to the event of interest of individual i (i = 1, … , n), where it is possible that some individuals are not susceptible to experiencing the event. Under a scenario of partly interval censored survival data, we may observe event times and also right, left and interval censoring times. In this article, the recorded survival time for each individual i will be denoted as a random vector , and we may have T R i = +∞ for a right-censored time and T L i = 0 for a left-censored time. If the event time Y i is observed directly, then we have T L i = T R i . Let U i be an unobserved random variable where U i = 1 indicates that individual i is susceptible to the event or is non-cured (ie, patient will experience the event of interest given a sufficient follow-up), and U i = 0 indicates otherwise (ie, patient is cured after treatment and will never experience the event of interest regardless of the length of his/her follow-up). Using a Cox proportional hazards regression model, we can denote the hazard function of Y i in the non-cured fraction as where h 0 (t) is an unknown baseline hazard function, x i is a vector for the values of of covariates, and is a q-vector of proportional hazards regression parameters. We may denote h(t|U i = 1, x i ) simply as h(t|x i ) when there is no confusion. Additionally, we can model the probability of having U i = 1 using a logistic regression model, such that where (z i ) = P(U i = 1|z i ) is the probability of being non-cured, z i is a set of covariates that may be identical to, or be completely different from x i . It is also possible that z i share some components with x i . In (2), is a p-vector for the logistic regression parameters. Now, we can specify the survival function for the whole population, consisting of both the cured and non-cured fractions, as Clearly, when (z i ) = 1 (ie, all members of the population are susceptible to the event and there is no cured fraction) the above survival function reduces to that of a standard Cox model. Also, we may denote the conditional survival function S(t|U i = 1, x i ) simply as S(t|x i ) when its meaning is clear in the context. For our derivations, it is convenient to denote the information of event, left, right, and interval censoring times using a set of indicator values for each i. Let i , L i , R i , and I i be, respectively, the indicators for event time, left, right, and interval censoring time for i. Thus, for each i, the set of observed information available is . Note that for the survival times corresponding to event, left or interval censoring we have U i = 1. Conversely, for the right-censored times their U i values are unknown.
Similar to the method of sieves (eg, Reference 24), the nonparametric baseline hazard function h 0 (t) can be approximated using some m non-negative basis functions, where m is a positive integer, such that where each coefficient u ≥ 0 and each u (t) is a non-negative basis function. Let the vector of u be denoted by . Here, the baseline hazard function will be approximated using cubic M-splines, following from previous work on MPL estimation of a proportional hazards model such as References 20 and 23. One convenience of using M-splines to approximate h 0 (t) is that ensuring the estimate of the baseline hazard function is non-negative requires only that the spline coefficient vector be non-negative. Another benefit is that it makes the computation of the cumulative baseline hazard function H 0 (t) straightforward: where Ψ u (t) are I-splines. 25 M-splines and their corresponding I-splines are readily available in, for example, the R splines2 package.
The proposed method in this article is to estimate the three parameter vectors, , , and simultaneously by maximizing a penalized likelihood function. Let = ( , , ).
Under independent interval censoring (eg, Reference 26), the log-likelihood is given by Direct maximization of l( ) for estimation of is not ideal. This is because: (1) h 0 (t) is usually a smooth function and this information should be incorporated into the estimation of ; and (2) it is possible that knots selected to approximate h 0 (t) may not be optimal, where particularly some knots may be unnecessary so that their corresponding u 's should be zero. We use a penalized log-likelihood to obtain a smooth estimate for h 0 (t) and to force the unnecessary u 's close to zero. We adopt a roughness quadratic penalty function and the penalized likelihood is now given by where J( ) is a roughness penalty function and ≥ 0 is a smoothing parameter. The roughness penalty is given by Given h 0 (t) is now given by (4), we can conveniently express this roughness penalty as J( We comment that in general the above roughness penalty is effective for imposing smoothness on h 0 (t) but less ideal for constraining unnecessary u to zero. A composite penalty employing both quadratic and l 1 -norm (equivalent to lasso) penalty could be more efficient. Further investigations are necessary to explore this option.

A constrained optimization algorithm
The MPL estimate of , denoted bŷ, is obtained bŷ= Given the constraint that ≥ 0, we have the following Karush-Kuhn-Tucker (KKT) conditions for a constrained optimal solution: These conditions are solved iteratively using an algorithm similar to the Newton-MI algorithm of Reference 23. This algorithm requires the score vector and the Hessian matrix for and , but for it only demands its score vector. Details of score vectors and Hessian matrices can be found in the Supplementary Materials. Before describing this algorithm we first introduce some notations. Let (k) , (k) , and (k) be, respectively, the estimates of , , and at iteration k. Also, for any function a(x), we let a(x) + and a(x) − be respective the positive and negative components of a(x), so that a(x) + − a(x) − = a(x). Iteration k + 1 of our algorithm is obtained in a three-step process as follows. First, obtain (k+1) using a modified Newton algorithm: where 1 ∈ (0, 1] is the line search step size used to ensure that Φ( (k+1) , (k) , (k) ) ≥ Φ( (k) , (k) , (k) ). The value of the line search step size can be determined by using, for instance, Armijo's rule. 27 Second, we compute (k+1) using again a modified Newton algorithm: where 2 is defined similarly to 1 . Finally, we get the update (k+1) using the multiplicative-iterative algorithm: where 3 is defined similarly to 1 and 2 and D (k) is a diagonal m × m matrix with elements (k) u ∕d (k) u for u = 1, … , m, and where Referring to the Supplementary Materials for the score vector of , we can see that for the problem considered in this article, d u becomes Note that u ≥ 0 is a small constant included simply to avoid the numerical issue of a zero denominator in the calculation of D (k) and this value does not have any impact on the final solution for .

Estimation of the smoothing parameter
A marginal likelihood method for the automatic selection of the smoothing parameter, previously outlined in, for example, References 23 and 28, can be implemented to the model of this article. In this method, the penalty function J( ) = T R is related to a normal prior distribution for the vector with ∼ N(0 m×1 , 2 R −1 ), where 2 = 1∕2 . We can then obtain the log-posterior: The marginal likelihood for 2 may be difficult to obtain directly, and as such we can approximate it using Laplace's method. Applying the Laplace approximation and substituting in the MPL estimates of , , and , we can obtain the approximated log-marginal likelihood for 2 : whereĜ is the negative Hessian matrix from l( , , ) evaluated at the MPL estimateŝ,̂, and̂, and An approximate maximum marginal likelihood solution for 2 is: where = tr{(Ĝ + Q(̂2)) −1 Q(̂2)}, which can be considered as equivalent to the model degrees of freedom. Given that the estimates of , , and depend on 2 , this approximate solution for 2 allows for the development of an iterative procedure with two steps. First, with a current 2 , the corresponding MPL estimates for , , and are obtained. Then, 2 is updated using the current̂2, and the just obtained̂,̂, and̂on the right-hand side of (13). These two steps are repeated until is stabilized, such as the difference between two consecutive values is less than 1.

ASYMPTOTIC PROPERTIES AND INFERENCE
Development of the asymptotic properties of the proposed model allows for large sample inference to be conducted without reliance on bootstrapping or other computationally intensive methods. Following from Reference 23, it is possible to demonstrate asymptotic consistency for the MPL estimates of both sets of regression parameters, and , and the baseline hazard function h 0 (t). We adopt 0 , 0 , and h 00 (t) to denote the true model parameters (ie, the parameters that gave rise to the observed data). Theorem 1 states this asymptotic property; the proofs can be found in the Supplementary Materials.

Theorem 1. Assume that conditions B1 to B4 (see the Supplementary Materials) hold. Assume that h 0 (t) is bounded and has some number r
Additionally, it is desirable to develop asymptotic normality results for all three parameters, , , and as this allows for inference to be made not only on regression parameters but also on other quantities, such as survival probabilities. In order to develop these results, however, it is necessary to restrict m to be a finite number, similar to References 23 and 29. Note that this fixed m is not predetermined as it depends on the given sample size n. Usually, a practical guide for m is m = n 1∕3 0 . where n 0 denotes the non-right censored samples size. Another issue we face when developing asymptotic normality results is that we must take into account the possibility of encountering active constraints in the estimation of ≥ 0. This is particularly likely to occur when the number of knots is larger than strictly necessary, or some knots are placed at non-important locations. The penalty function will push the corresponding u to zero. 23 Ignoring active constraints often leads to undesirable results, such as negative variances.
Recall that we have defined the parameter vector = ( , , ), which has a length of p + q + m, and that we can express the penalized likelihood function in terms of such that Let̂be the MPL estimate of . Let the true value of be represented by 0 . Without loss of generality, we assume that the estimates of first r elements of are zero, and so that they are actively constrained. Define where 0 is a matrix of zeros, I is an identity matrix. Clearly, U T U = I (m−r+p+q)×(m−r+p+q) is satisfied. Theorem 2 states the asymptotic normality results, with relevant proofs in the Supplementary Materials.

Theorem 2.
Let n = ∕n. Assume that n = o(n 1∕2 ) and that we have the first r active constraints in the MPL estimate of . Define matrix U as above. Let where matrix U is defined in (14). Under these conditions, when n → ∞, In order to implement the result of Theorem 2, it is necessary to define a method for identifying active constraints when they arise in the MPL estimation of . The method used here follows that proposed by Ma et al. 23 Active constraints can be identified by inspecting both the value of̂u and the corresponding gradient for each u. After the Newton-MI algorithm has reached convergence, somêu may be exactly zero with negative gradients, and thus are clearly subject to an active constraint. Furthermore, there may be somêu that are very close to, but not exactly, zero. For thesêu, a corresponding negative gradient value is indicative that they are also subject to an active constraint. In practice, active constraints are defined where, for a given u,̂u < 10 −3 and the corresponding gradient is less than − , where is a positive threshold value such as 10 −2 . After the indices associated with active constraints are identified, obtaining the matrixF( 0 ) −1 is a very straightforward computation. The matrix U T F( )U is obtained by removing the rows and columns of F( ) associated with the active constraints. The result is then inverted, and then padded with zeros in the deleted rows and columns to obtainF( 0 ) −1 .
To make use of these asymptotic results for inference on finite samples, it is necessary to approximate the distribution for̂when n is large. Doing so also incorporates nonzero values for the smoothing parameter into the inference on the parameter estimates. The necessary results are presented below in Corollary 1.

Corollary 1. Assume that the smoothing parameter ≪ n. Define
Then, when n is large, the distribution for the MPL estimatê− 0 can be approximated by a multivariate normal distribution having mean zero and covariance matrixv These results allow for inferences to be made not only on both sets of regression parameters but also on quantities associated with the baseline hazard function; see the results report in Section 5.

SIMULATION STUDIES
The performance of the proposed method is illustrated via two Monte Carlo experiments. The first simulation study evaluates the performance of our proposed MPL method for partly interval censored data with a comparison to the generalized odds rate mixture cure model proposed by Zhou et al, 17 implemented in the GORCure R package, which includes the mixture cure Cox model as a special case. This method uses an expectation-maximization (EM) algorithm with a gamma-Poisson data augmentation for the regression parameters, and a spline approximation to a transformed cumulative hazard function. The second simulation study compares the EM based method of Reference 12 with our approach as the former is already implemented in the R package smcure. This method uses an EM algorithm for both Cox and logistic regression parameter estimation and a Breslow estimator for the baseline survival function. However, smcure can only be implemented to right-censored survival datasets, so this simulation considers only right censoring.

Simulation study design and data generation
In order to generate event or censoring time(s) for individual i, we first computed the non-cured fraction probability (z i ) given by the following logistic model: where z i1 and z i2 are logistic regression covariates. Note that the value of controls the size of the cured fraction in the sample. Then, the cured indicator value u i was obtained according to where U i ∼ unif(0, 1). Note that u i = 1 means individual i was in the non-cured fraction (those who will experience the event of interest), and otherwise was in the cured fraction.
For u i = 1 in either study, we first simulated a follow-up time C i1 ∼ unif(0.5, ), where could be adjusted to control the extent of right censoring. Afterward, we simulated an event time Y i from a specified distribution. In Study 1, event times were drawn from one of three possible distributions (Weibull, exponential, and log-logistic). In the smaller Study 2, only a Weibull distribution was considered. For u i = 0, we simulated a follow-up time C i2 ∼ unif(0.5, ), where has been defined before. Afterwards, an observed survival time, denoted by T i , was simulated using the procedure described below.
For u i = 0, we took T i = C i2 and recorded it as a right censoring time. For u i = 1, T i was generated depending on Study 1 or Study 2. In Study 2 (right censoring), T i was simply generated using T i = min{Y i , C i1 }. In Study 1 (partly interval censoring), T i was generated based on a sequence of simulated "examination times" as described below. We first generated a number of "scheduled examinations" from a Poisson(8) (ie, mean 8) distribution, and denote this number by d i . Next, we simulated d i uniform random numbers from unif(0.3, 0.7), and denoted these numbers by w i1 , w i2 , … , w id i . Then, a series of "scheduled examination" times were then given by: However, the follow-up time C i1 meant not all of these examination times might be used. In fact, when C i1 was less than one of the examination times, the examination process would be terminated at C i1 and the corresponding final examination times formed time intervals and we then checked if the simulated Y i fell into one of these time intervals. If yes, then Y i was interval censored and interval censoring times were given by the two end-points of this interval. If no, Y i could be on the left or right of these intervals: if on the left then the minimum examination time was a left censoring time; if on the right then the maximum examination time was a right censoring time. Moreover, for an interval censoring, if the length of the interval was less than some then we recorded corresponding Y i as an event time rather than an interval-censoring time. Details of all simulation scenarios considered can be found in Table 1 for Study 1 (partly interval censoring) and Table 2 for Study 2 (right censoring). We will report in this article the simulation results associated with the Weibull distribution with sample sizes n = 200 and n = 1000 and cured fractions sizes 0.2 and 0.6. Results for other distributions, the other sample size (n = 50), and the other cured fraction size (0.05) can be found in the Supplementary Materials. TA B L E 1 Simulation study 1 (partly interval censoring) specifications, and the consequent cured and censoring proportions Event times distribution Weibull Exponential Log-logistic Covariates Covariates We adopted cubic M-splines (with some n knots) to approximate the baseline hazard function. Define a and b as the minimum and maximum observed survival times, respectively. The observed survival times included interval, left, and right censoring times. External knots for the M-splines were placed at a and b. Define c and d as the respective minimum and maximum of a set of times (called pseudo times) consisting of event times, the mid-point of any left censoring intervals, and the mid-point of any interval censoring intervals. The internal knots were placed at equal quantiles across the 5th and 95th percentiles of these pseudo times between c and d. Following Reference 23, we calculated the number of (internal) knots using a rough guideline of the cubic root of the number non-right censored individuals. Ma et al 23 remarked that the MPL method is fairly robust (due to the penalty function) to the number of knots as long as the smoothing parameter is appropriate. The smoothing parameter was selected automatically using the marginal likelihood function laid out above in Section 3.
Note that we have presented results based on estimates of the marginal (non-cured fraction only) baseline survival function rather than the marginal baseline hazard function, as quantities of the marginal baseline survival function are more readily available in the GORCure and smcure packages. Computing the marginal baseline survival function and associated asymptotic and Monte Carlo standard errors for the MPL method is straightforward. At time t, the MPL estimate of S 0 (t) can be computed by taking exp(− (t) T̂) , where (t) denotes a vector of I-spline values at t. The asymptotic variance estimate can be produced using the delta method, such that Var(Ŝ 0 (t)) = [Ψ(t) ⊤Ŝ is the first derivative ofŜ 0 (t) with respect tô. The Monte Carlo variance can be obtained simply by replacing Cov(̂) with the Monte Carlo covariance matrix for in the equation for Var(Ŝ 0 (t)). Table 3 shows the results for the estimation of and for Study 1 (partly interval censoring) for sample sizes n = 200 and n = 1000, for the MPL method and the GOR method, using a Weibull distribution for the baseline hazard. It includes, for each parameter, the absolute bias, the relative bias (in brackets beneath the absolute bias), the asymptotic standard error estimate, the Monte Carlo standard error estimate (in brackets beneath the asymptotic standard error), and the asymptotic 95% coverage probability. The MPL method appears to perform reasonably in a variety of the scenarios considered, with small biases, good agreement between the asymptotic and Monte Carlo standard errors, and reasonable coverage probabilities. In particular, when the sample size is small, the MPL estimates for the proportional hazards regression parameters ( 1 and 2 ) consistently have smaller biases than the GOR estimates. This is especially noticeable in the 1 − (z) = 0.6 scenarios, that is, when there is a larger number of cured individuals in the sample. Additionally, across all scenarios, the GOR estimate of the 0 has a large negative bias and has very low coverage probabilities, while the MPL estimate of that parameter is reasonable. This equates to an overestimation of the cured fraction size by the GOR method.     Table 4 shows the results for the estimation of the baseline survival function for Study 1 (partly interval censoring) for sample sizes n = 200 and n = 1000, for the MPL method and the GOR method, using a Weibull distribution for the baseline hazard. For the MPL method it reports the bias, asymptotic and (Monte Carlo) standard errors, and asymptotic and (Monte Carlo) 95% coverage probabilities. For the GOR method, it reports the bias, (Monte Carlo) standard errors and (Monte Carlo) 95% coverage probabilities, as asymptotic estimates for the standard error are unavailable from the GOR-Cure package. The ability to make fast and efficient inferences on survival model quantities using asymptotic standard errors can be considered as a key strength of the proposed MPL method when compared with the existing alternatives. The estimated survival function was evaluated at three time points, t 1 , t 2 , and t 3 , corresponding to the first, second, and third quartile of the observed survival times, excluding 0 and ∞. Across all scenarios, the bias for the MPL estimate of the baseline survival function was small. There is a fair agreement between the asymptotic and Monte Carlo standard error estimates for the MPL method across the scenarios. We note that in some instances, the coverage probabilities produced by the MPL method are low; specifically, this tends to occur at the later time point t 3 where event times are sparser. The baseline survival function estimates from the GOR method tend to have larger biases than the MPL estimates, particularly when E = 0. The Monte Carlo standard errors for the GOR estimates are generally larger than both the asymptotic and Monte Carlo standard error estimates from the MPL method, indicating that the MPL method is less variable.

TA B L E 3 (Continued)
We have included in the Supplementary Materials some additional simulation study results for partly interval censored data. These include results from scenarios where we have used different distributions for the baseline hazard function (namely, exponential and log-logistic), scenarios where the sample size was very small (n = 50), and scenarios where the cured fraction was very small (approximately 5% of the sample). The exponential and log-logistic baseline hazard distribution scenarios have generally produced comparable results to those discussed above. When the sample size is very small, both methods tend to produce larger biases in the regression parameter estimates. However, biases and coverage probabilities appear to be more reasonable for the MPL estimates, particularly for the proportional hazards regression parameter estimates ( 1 and 2 ) when the cured fraction in the sample is larger. When the cured fraction was very small, the MPL estimates of regression parameters and the baseline survival function were still reasonable. Table 5 exhibits the results for the estimation of and for Study 2 (right censoring) with n = 200 and n = 1000 for both the MPL and EM methods. It includes, for each parameter, the absolute bias, relative bias (in brackets below absolute bias), the average standard error (asymptotic and bootstrap for, respectively, MPL and EM), the Monte Carlo standard error (in brackets below average SE), and the 95% coverage probability. Note that the EM method in the smcure R package does not provide directly an asymptotic standard error, instead it provides a computationally intensive bootstrapping standard error. The results from Table 5 indicate the two methods appear largely equivalent under the scenario with cured fraction 1 − (z) = 0.7 and E = 0.85, with both methods producing small biases in the parameter estimates, having good agreement between the estimated (asymptotic for MPL and bootstrap for EM) and Monte Carlo standard errors, and giving reasonable coverage probabilities. When the cured fraction is 1 − (z) = 0.7 but E = 0.5, the bias for most parameter estimates is fairly large for both methods, although it is generally larger for the EM estimates compared to the MPL estimates, and the EM coverage probabilities for this scenario are particularly low. However, when the sample size is large, the MPL estimates for the proportional hazards regression parameters ( 1 and 2 ) have small biases, while the equivalent estimates for the EM method produce very large biases. However, for these MPL estimates the coverage probabilities are high, reflecting the larger asymptotic SE estimates compared to the Monte Carlo estimates. When the simulated data contained a larger cured fraction, both methods similarly yield little bias in the parameter estimates with the exception of the estimate for 0 . The EM estimate of 0 has a large negative bias and that is less severe in the MPL estimate. We observe from the results that in general the MPL standard error matches well the Monte Carlo standard error for the higher censoring scenario, but less well for the scenario with less censoring, especially for the proportional hazards parameters 1 and 2 . For a large sample size (ie, n = 1000), the asymptotic standard error produced by the MPL method are generally close to with the bootstrap standard errors from the EM method, except 0 . However, computations of MPL asymptotic standard errors are much faster than the EM bootstrap standard errors. Table 6 reports the biases, Monte Carlo standard errors and Monte Carlo coverage probabilities for the estimated baseline survival functionŜ 0 (t) from both the MPL and EM methods. The estimated survival function was evaluated at three time points, t 1 , t 2 , and t 3 , corresponding to the first, second, and third quartile of the observed event times, excluding

APPLICATION TO MELANOMA RECURRENCE DATA
Our new method was applied to the real dataset introduced in Section 1. Data from a cohort of 2968 patients diagnosed with thin melanoma (defined as patients diagnosed with a Breslow thickness ≤1 mm) between January 1992 and December 2014 were extracted from the prospectively maintained research database at the Melanoma Institute Australia (MIA). Information collected included baseline characteristics (age, sex, and body site of lesion), pathological factors (Breslow thickness, ulceration, and mitoses count), date of follow-up visit, date of diagnosis of first recurrence (either local, regional, or distant), and survival status of patient at last contact. All patients with known date of recurrence or last survival status were analyzed. All patients had given consent for use of their de-identified information for research purposes. Ethical approval was provided by the Sydney Local Health District Ethics Committee. The primary clinical outcome was time to first melanoma recurrence calculated from the date of the initial primary diagnosis. Patients who experienced melanoma recurrence were subject to interval censoring, as the exact time of recurrence was unknown. The time interval in which the recurrence occurred was denoted (t n , t r ) where t n was the preceding follow-up visit date of the recurrence, also called the last known recurrence free time, and t r was the date recurrence was first diagnosed. Some patients did not have recorded values of t n and t r , but had a status at their last follow up of either "alive with melanoma" or "dead, melanoma," indicating that they had experienced a tumor recurrence at some point and thus left-censored. These patients therefore had a left censoring interval of (0, t f − t d ), where t d and t f denote, respectively, the starting and last follow-up time. All other patients were right-censored at the time of their last follow up.
Six categorical covariates were considered in this analysis and coded as: Breslow thickness (≤0.8 mm; >0.8 mm), tumor ulceration (yes; no), age group (<50; ≥50), sex (male; female), tumor mitoses (yes; no), and site of tumor (arm; head & neck; leg; trunk). Thin melanoma recurrence was rare, with only 6.69% of the sample having experienced the event at some point during the follow-up time. The remaining individuals were right-censored. This prevalence of right-censored observations suggested that there was likely a cured fraction. A mixture cure Cox model was fitted with all six covariates in the latency model and the incidence model. The incidence and latency models were then reduced using backwards step-wise selection with P-values, until all covariates left in the model were significant at the 5% level. Table 7 shows the odds ratios (from the incidence logistic regression model) and hazard ratios (from the latency proportional hazards model), 95% confidence intervals (CIs) and P-values after implementing our proposed method. Results from the incidence model indicate that the odds of thin melanoma recurrence increase significantly in patients with Breslow thickness >0.8 mm, with mitoses, or who are male. Conversely, the odds of thin melanoma recurrence are significantly lower in patients who had a tumor on the leg or the trunk, compared to on the arm (the reference category). The most striking feature of the latency model for non-cured patients was that Breslow thickness was not significantly associated with recurrence. Tumor ulceration and having a tumor on the head & neck, the leg or the trunk instead of the arm significantly increased the risk of melanoma recurrence. Sex was also significantly associated with risk of recurrence, with males in the non-cured population having a significantly lower risk of melanoma recurrence than females. Age was not significant in either the incidence or the latency model.
Interpretation of the outputs of a mixture cure model demands some extra attention. Particularly, notice should be paid to the fact that the latency model is a conditional model, only relevant to the non-cured sub-population. For example, the latency model of this melanoma example suggests that, if we consider at the non-cured (ie, melanoma recurrence) sub-population, males have lower risk of recurrence than females. However, the results from the incidence model suggest males have higher odds of being classified into the non-cured sub-population. These two interpretations of sex in the incidence and latency model do not contradict each other. Corbière et al 19 similarly found that some parameter estimates for the same covariates had different signs between the latency and the incidence models in their analysis of tonsil tumor recurrence data. Table 7 also contains the results of fitting a proportional hazards mixture cure model using the GORCure package, and the Cox model results without a cured fraction (standard Cox model), fitted using the survivalMPL R package. The same six covariates were considered, and the same backwards step-wise selection was carried out to select significant covariates. There are substantial differences between the MPL and GOR fitted models. There was no overlap between the MPL and GOR incidence models in terms of significant predictors of recurrence susceptibility, and only one (body site: head & neck) shared between the MPL and GOR latency model for risk of recurrence. Notably, according to the GOR model there was no significant difference between males and females for either incidence or latency, while sex was F I G U R E 1 Estimated baseline hazard function for the non-cured population (with point-wise 95% confidence intervals) F I G U R E 2 Estimated conditional baseline survival function for susceptible sub-population (with point-wise 95% confidence interval) significant in both the latency and incidence MPL models. It is of interest to compare the results of the standard Cox model results with the latency model from the mixture cure model. For the standard Cox model, there was no significant difference between males and females in terms of melanoma recurrence risk, which contradicts the latency model result. Also, for the standard Cox model, Breslow thickness and mitoses are significant, but were not significant in the latency model. The magnitudes of the hazard ratios and the significance of the body site categories also differ between the latency sub-model and the standard Cox model. Age group was not significant in either the latency or the standard Cox model.
The estimate of the baseline hazard function for the melanoma recurrence sub-population can be seen in Figure 1. There is a notably rapid drop in the risk of tumor recurrence over the first 3 to 4 years, after which the risk of recurrence decreases slowly. Between 20 and 25 years there appears to be a small increase in the risk, although the point-wise confidence intervals are wide for this time period. Figure 2 exhibits the baseline survival function with 95% point-wise CIs for the melanoma recurrence sub-population. For this group of individuals, their baseline survival function decreases faster over the first five years, and then more slowly for the remainder of the follow-up period.
The ability to make predictions based on an estimated mixture cure survival function is a key strength of the method presented in this article. In fact, values of the mixture cure survival function can be easily computed from our regression parameter estimates (both incidence and latency) and the estimate for the conditional baseline hazard function. For example, the estimated probability of a person with a Breslow thickness ≤0.8 mm (all other covariates are fixed at their mean values) having no recurrence for 5 years is 0.95 (95% CI: 0.93, 0.96), while this probability for a person with a Breslow thickness >0.8 mm is down to 0.91 (95% CI: 0.86, 0.94). Furthermore, the probabilities of a patient having no recurrence Similarly, based on the fitted mixture cure model, we can also easily compute entire predictive survival functions, and their point-wise CIs. For example, Figure 3 displays mixture cure predictive survival functions illustrating the effect of Breslow thickness (top panel) and ulceration (bottom panel) with 95% point-wise CIs. These plots explain that, at the population level, those with a Breslow thickness ≤0.8 mm are more likely to be free of recurrence at any time t than those with a Breslow thickness >0.8 mm, when all other covariates are set to the sample mean values. Similarly at the population level, those with no ulceration are more likely to be free of recurrence at any time t than those with ulceration. However, the survival differences between these groups may not be significantly different, as the associated point-wise 95% CIs slightly overlap.
We calculated again the predicted survival curves using GOR model from GORCure as well as the standard Cox model (so there was no cured fraction). The mixture cure survival functions from the MPL and GOR models and the standard Cox survival curves are displayed in Figure 4, comparing again the Breslow thickness groups (top panel) and the ulceration groups (bottom panel). The estimated survival functions from the GOR model are noticeably higher than their MPL counterparts. As seen in Section 5, the GOR model produces consistently negative bias in the estimation of the intercept of the incidence logistic regression model, which is equivalent to over-estimating the size of the cured fraction and underestimating the risk of the event in the whole population. As a result, it appears that the survival function estimates produced by the GOR mixture cure model are too high. There are also clear discrepancies between the Cox and mixture cure predictive survival functions. Clearly, failing to account for the presence of a cured fraction also leads to an overestimation of survival probabilities in this thin melanoma example.

DISCUSSION AND CONCLUDING REMARKS
In this article, we propose a new penalized likelihood estimation method for fitting a mixture cure Cox model where observations are assumed partly interval-censored. Our approach finds the MPL estimation of regression parameters for the latency and incidence models as well as estimation of the latency baseline hazard function which is approximated using M-spline basis functions. The baseline hazard estimate is constrained to be non-negative. One advantage of our method, when compared with the existing EM methods, is that it also yields an asymptotic covariance matrix for all the parameters, and hence allowing inference on both regression parameters and survival quantities. Another advantage of our method is that it produces smooth baseline hazard estimates. The results of simulation studies indicate that this method produces regression parameter and baseline hazard function estimates that perform well in terms of bias, variance and coverage probability. Our method avoids computationally intensive methods like bootstrapping which is adopted by the competitor EM-algorithm for computing variances of the estimates. A package to implement the proposed method is available on Github, and we intend to upload the package to R CRAN in the near future. Existing methods for checking the diagnostics of mixture cure models are sparse, particularly for cases involving partly interval censored data. Schoenfeld residuals 30 are inappropriate for mixture cure models as the marginal hazards of any mixture cure model will be nonproportional. Wileyto et al 31 developed pseudo-residuals modeled on Schoenfeld to assess the fit of the latency model in the non-cured fraction, but considered parametric mixture cure models for right censored data only. Peng and Taylor 32 developed a modified martingale residual and a modified Cox-Snell residual appropriate for checking the latency model of a mixture cure model, but again only considered right censored samples. Scolas et al 33 developed a Cox-Snell residual applicable to a parametric mixture cure model for interval censored data, which can be used to check both the uncured and mixture population survival distributions. These authors also developed a deviance residual appropriate for checking the linearity of both the incidence and latency parts of the model. However, the methods