Penalized maximum likelihood inference under the mixture cure model in sparse data

When a study sample includes a large proportion of long‐term survivors, mixture cure (MC) models that separately assess biomarker associations with long‐term recurrence‐free survival and time to disease recurrence are preferred to proportional‐hazards models. However, in samples with few recurrences, standard maximum likelihood can be biased.


Background
In modeling time-to-event data, the observation of heavy censoring after a sufficient follow-up indicates the existence of "long-term survivors" (or "cured subjects"), who may never experience the event of interest. For instance, in some clinical studies in cancer there is a strong rationale for the existence of cured subjects, because if the treatment is successful, the subject will not experience recurrence of the disease. Particularly, this can occur more often among patients in early cancer stages. However, the challenge of handling censoring is heightened in analyzing survival data when there is a possibility of being cured, in that the proportional hazards assumption of conventional Cox proportional hazards (Cox-PH) model can no longer hold for heavily censored data. The mixture cure (MC) model is a natural and more suitable approach, since it incorporates the cure fraction, and simultaneously models the probability of being cured and the distribution of time to disease recurrence among non-cured. Let T be the time to recurrence, and let C be the cure indicator with C = 1 if the patient is cured and 0 if they are susceptible to having a recurrence. Then MC is defined as a generalized form of the conventional proportional hazards (PH) model with a cure part and a survival part: where p(X) = P(C = 1|X) is the probability of being cured conditional on a vector of covariates X (ie, prognostic factors), such that X = (1, X 1 , X 2 , … X r ) (r the total number of covariates) and S 0 (t|X) is the survival among non-cured conditional on X, which are considered as the incidence part and latency part of a MC model, respectively. The model was first proposed by Boag to estimate the proportion of patients cured by cancer therapy using a lognormal distribution to approximate the baseline hazard function, 1 and was also adopted by Berkson and Gage using a full parametric exponential model in the latency part of the model. 2 Farewell introduced further developments by replacing the constant in the incidence part of the original model with logistic regression, as well as including covariates, and expanding the baseline hazard approximation to other parametric functions. 3,4 In particular, Farewell used the Weibull distribution which is more flexible for modeling the hazard baseline. Later, non-parametric forms for the hazard baseline of the latency part (ie, resembling Cox-PH) were pursued by Kuk and Chen, 5 as well as Peng and Dear, 6 where the former and latter groups took approaches different than parametric MC model in solving for parameter estimates. We use logistic regression to model the association with cure probability conditional on X with corresponding effects expressed as a parameter vector ⊤ = ( 0 , 1 , 2 , … , r ): and a PH Weibull regression to model the time-to-event function conditional on X with effect parameter vector ⊤ = ( 0 , 1 , 2 , … , r ), where = exp ( ⊤ X ) , and > 0. Therefore, it allows for separately assessing prognostic factor associations, where a factor X j in X may exert dissimilar effects (ie, different values or signs between j and j ) on the probability of being cured and the distribution of time to recurrence for the susceptible (not cured) individuals. Parameter estimation of such MC models is usually based on standard ML. Under the framework of the MC model, the likelihood function can be expressed as the product of likelihoods for those individuals who experienced an event during the observation period and likelihoods for those who did not experience the event. Expectation-maximization (EM) and Newton-Raphson (NR) methods are the most common employed methods for parameter estimation.

Motivating study
Our methodology developments are motivated by research designed to investigate molecular genetic prognostic factors in a cohort of patients newly diagnosed with axillary lymph-node-negative (ANN) breast cancer, an early-stage breast cancer that was reported to have a disease recurrence rate of 20%. Patients were prospectively ascertained and enrolled from eight Toronto hospitals from September 1987 to October 1996. 7 Information regarding traditional prognostic factors (TPFs) of the patients was also collected at the time of diagnosis, including tumor size, tumor histologic grade, lymphatic invasion, menopausal status, and treatment received. Recurrence-free survival (RFS) time was recorded as the time from diagnosis to confirmation of non-breast recurrence (ie, distant metastasis). These studies identified neu-erbB2 amplification and P53 alterations as prognostic markers for early disease recurrence. 8,9 In subsequent studies, patients were followed up to 15 years, and 887 patients contributed archived tumor samples for tissue microarray (TMA) analysis to measure expression of certain breast cancer molecular biomarkers at diagnosis; 111 among them experienced distant disease recurrences. In these investigations, multiple molecular prognostic TMA biomarkers, such as estrogen receptor (ER), progesterone receptor (PR), CK5, HER2, EGFR, Ki67, TP53, podocalyxin, and related subtypes, were associated with the risk of distant recurrence alone and in multivariate survival models with TPFs. [8][9][10][11] Given the low event rate and long-term follow-up observed, the authors compared analyses of the conventional PH model to those of the MC model, and in some cases found the latter to be more informative. 10,11 However, as more prognostic covariates are included in the MC regressions while the number of events remains fixed, the potential increases for finite sample estimation bias and non-finite effect estimates. The latter can occur when one of the biomarkers defines a subgroup free of recurrences, leading to data separation and monotone likelihood. For instance, when indicators for multiple tumor subtypes as well as TPFs were evaluated jointly in a sparse MC model, some estimates did not converge and some seemed to be inflated. Therefore, in this report we propose penalized likelihood solutions, develop methods for estimation and inference, and evaluate performance by simulation and application. The aim is to improve inference under the MC model in sparse data settings and enable more robust identification of novel biomarkers with enhanced prognostic value.

Penalized likelihood for reducing finite sample bias
Maximum likelihood estimation in logistic and Cox-PH regression models sometimes can be troubled by data separation and/or monotone likelihood leading to perfect prediction of the outcome, particularly when the number of observed events is low. [12][13][14] This produces one or more regression coefficients that converge toward plus or minus infinity and invalid asymptotic test statistics. However, unlike in logistic and survival models where the issue of data separation and finite sample bias have been well studied, it has received little attention in MC modeling. In survival regression, we can denote f (z| ) the joint density or probability mass function implied by the distribution of the variable of interest, Z (ie, time-to-observation, and censoring status), via a parameter .̂is the value of which maximizes the loglikelihood function. Under usual regularity conditions,̂is consistent with bias of asymptotic order O ( n −1 ) , which means that its bias vanishes as n → ∞. Thus, when a sample size is finite, with resulting new estimator̃, the asymptotic bias, B( ) =̂−̃, might become non-negligible. 15,16 How to approximate the asymptotic bias B( ) under this circumstance is of essential importance in model parameter estimation.
Penalization is an established method to deal with sparse data and the problem of separation in discrete outcome and censored survival regression, for which the estimating function (eg, the score function of the original likelihood) is augmented and thus partially constrained by additional parameter functions. Penalization can reduce bias and/or mean-squared error (MSE). These augmenting constraints can be derived from prior densities when the original estimating function is the likelihood gradient (score function), penalized estimation is equivalent to posterior mode finding. 17 A well-known and widely used example is the Firth-type method for reducing finite sample bias, which can be derived from the Jeffreys prior. 16,18 First shown by Firth, this approach removes the first order term ( O ( n −1 )) in the asymptotic bias expansion of the MLEs in the family of generalized linear models (GLMs) by modifying the score equation. Therefore, different from some common penalization approaches (eg, lasso or ridge regression) that aim to shrink estimates toward zero for variable selection, the Firth-type method is designed to reduce estimation bias due to deviation from the asymptotic condition. Specific Firth-type approaches have been developed in binary and multinomial logistic regression, and in survival regression. 13,14,19 The Jefferys prior, which takes the prior density to be proportional to the square root of the determinant of Fisher information matrix, has been shown to be a proper alternative, 20 in contrast to uniform priors which are used as conventional noninformative priors. It has other attractive features, including parameterization invariance and being a uniform measure in an information metric which provides a posterior distribution that best reflects the information brought by the data. 20 It has also been demonstrated that the distribution of estimates from FT-PL and ML are equivalent as n → ∞, and they are equivalent under first order asymptotics since the Firth-type method removes the first order term of the asymptotic bias B( ) in its non-linear expansion. 16,18 Thus, the likelihood ratio test (LRT) is a valid means of hypothesis testing for both usual MLE and FT-PLE, and it is also more robust compared to the commonly applied Wald test, in that the former does not rely on a normality assumption for the distribution of estimated parameters. Although penalized likelihood ratio (LR) and profile likelihood are recommended to perform hypothesis testing and construct likelihood based confidence intervals for single or multiple parameters, in binary and multinomial logistic regressions, 13,21,22 methods for model comparison have not been as well evaluated. 23 An alternative method related to Firth-type penalization, also known to remove bias, is based on the log-F(m, m) prior, a class of penalty functions specified as ln indexed by m ≥ 0. The antilog of the penalty term of binomial logistic regression is proportional to a log-F(m, m) density for , which is the conjugate family for binomial logistic regression. 17 Under a single-parameter logistic model, the Firth method is equivalent to imposing a log-F (1) prior. Previous studies in logistic and survival regressions suggest that log-F (2) can reduce bias as does the Firth-type method, and introduce greater shrinkage due to the prior distribution. 17,24,25 A related adjustment to the FT-PL involves applying the Firth-type penalization excluding the intercept (noint-PL). The rationale is that accuracy of effect estimates does not have to be traded for better predictions and the intercepts are not necessarily always small, so enhanced parameter estimation may be achieved via noint-PL method. 26 Therefore, in this work we develop improved methods for MC modeling on highly censored data in which the low event rate creates a high degree of sparseness. We extend theory and methods for FT-PL into MC models to handle data separation and reduce bias due to data sparseness or finite sample size. In Section 2, we first develop estimation via a Newton-type algorithm and hypothesis testing procedures that we implement in R functions. In Section 3, using simulations, we evaluate whether FT-PL provides finite estimates in datasets when the ML method fails to, and improves estimation by biased-reduced point estimation with smaller MSE. We also evaluate statistical power for hypothesis testing. With the same simulations, we compare parameter estimation and hypothesis testing under the Cox-PH model using FT-PL and ML. In addition, we compare FT-PL with two aforementioned alternative methods for parameter estimation under MC model using simulations. In Section 4, we apply the FT-PL method to obtain bias-reduced effect estimates for tumor subtypes in the motivating case study. We conclude that the proposed methodology developments improve the validity and efficiency of MC modeling of multiple prognostic factors in disease recurrence, such as for ANN breast cancer, and will also be useful in other types of cancer with low event rate and be readily transferable to similar problems in genetic association analysis of disease susceptibility.

Derivation of FT-PL in MC models
In a time-to-event study sample, for each individual i, the vector of measured covariates is x i , the event-free survival time is recorded as t i , and censoring indicator observed is be the vector of all parameters in a MC model, following the notation of Cai et al 27 the loglikelihood is expressed as is the hazard baseline ( is the shape parameter of Weibull distribution).
The MLE vectors of regression parameter estimates are obtained as the solution to the score equations log L( ) = l( ) = U( ) = 0, where L is the likelihood function of MC model, and 0 is a vector of the same length as . Estimation of the proposed Firth-type penalization is based on the modified score equations for the parameters j , j = 1, … , r (jth parameter of ), where I( ) −1 is the inverse of information matrix evaluated at . The modified score equations U * ( ) correspond to the penalized loglikelihood and likelihood functions, that is, l * ( ) = log L * ( ) = l( ) + 1 2 log |I( )| and L * ( ) = L( ) × |I( )| 1 2 , where |I( )| 1 2 is known as Jefferys invariant prior. The penalty function |I( )| 1 2 is asymptotically negligible. For a number of models under the exponential family, Firth (1993) showed that the O ( n −1 ) bias of a MLE can be removed by modifying the score functions to U * ( + a j that coincides with the modified score functions under FT-PL (Equation 5). Although not explicitly shown for nonexponential family models, the approach was demonstrated to also remove the first order term of the nonlinear expansion of the asymptotic bias, as long as the expression of third derivatives of the loglikelihood can be specified. 14,16 For MC models, the elements of third derivatives of loglikelihood are straightforward to derive and are provided in Data S1.1.
An earlier study by Ibrahim et al demonstrated the propriety of the Jefferys prior and support for the resulting posterior in GLMs by establishing (i) sufficient and (ii) necessary and sufficient conditions. 20 Under the MC framework, by specifying the profile likelihood functions for regression parameters j and j (for a variable X j ) as an exponential density of the outcome variables, Y i and T i , we can demonstrate similarly that the Jefferys prior is proper for the regression parameters of an MC model. We show that (i) the sufficient condition as well as the necessary and sufficient condition for the existence of posterior moment generating function of regression parameters ( j and j ) of an MC model are met; and (ii) the sufficient condition, as well as the necessary and sufficient condition for Jefferys prior to be proper on the regression parameters are also fulfilled (see Data S1.2).

Parameter estimation and hypothesis testing
In the general formula of penalized loglikelihood (l * ( )) presented in Section 2.1.1, for MC models with the list of variables X and associated regression parameters for the incidence part ( j ) and latency part ( j , ), the determinant of information matrix inside the penalty term, |I( )| = |I( , , )| (detailed expression in Appendix) can be partitioned into four block matrices A, B, M AB , and M BA , such that, A and B are square matrices that respectively correspond to the Hessian matrices of parameters from the incidence part and latency part. M AB = M ⊤ BA , is related to the covariance between parameters from the two parts of the model (shown in Data S1.3).
Given the design matrices X = [1, x 1 , · · · , x r ], Z = [X, ln(t)], application of matrix operations to these block matrices leads to the following matrix decomposition: , y ⊤ = [y 1 , · · · , y n ], ln (t) ⊤ = [ln(t 1 ) , · · · , ln(t n )] are vectors of length n; i = ] −1 . It follows that )} , and The modified score function is then expressed in matrix operation, using the diagonal elements of the above diagonal matrix expressions as where U and U are vectors of score equations (U( )) for parameters from the incidence part and the latency part; V tr and V tr are vectors of tr for parameters from the two parts of the model as well; U and V tr are the score equation and tr for parameter . The exact expressions of the score vectors (U and U ) and the trace vectors (V tr and V tr ) are shown in the Appendix.
Under the assumption that the maximum penalized likelihood estimates exist, which satisfies U * ( * ) = 0, an iterative adjustment (ie, NR method) can be applied in the penalized score function to obtain the penalized estimates: * ) .
In situations of sparse data where separation can affect ML estimation, and the profile of the penalized likelihood function is also asymmetric, Wald tests are unsuitable as they assume estimated parameter values are normally distributed.
For hypothesis testing, we use LR statistics as the primary means for testing association of a covariate in MC models. Under FT-PL, to test the null hypothesis that variable X k is not associated with the survival outcomes (neither probability of cure nor distribution of time to recurrence), that is k = k = 0, the non-restricted loglikelihood is l * (̂), and the restricted loglikelihood is l * (̃0) (maximized under the constraints of k = k = 0). Assuming that under the null hypothesis test, a P value for rejecting the null hypothesis is obtained using the estimates k =̂ * k and k =̂ * k under the non-restricted model. To test the null hypothesis that a single variable X k is not associated with the probability of cure, k = 0, a non-restricted loglikelihood l * (̂) and a restricted loglikelihood l * (̃0 | k = 0 ) ) are used to construct the loglikelihood ratio * , that is, such that, Similar procedures apply for the hypothesis testing of k = 0. Like the LRT between the non-restricted (full) model and restricted (reduced) model under ML, the penalized loglikelihood ratio statistic (PLRS) has been applied for inference under FT-PL. 22,25,26,28 However, for the penalized loglikelihood, comparison of the full and reduced models under FT-PL depends on two possible realizations in the penalty term. For example, under null hypothesis k = k = 0, the two approaches are either (I) both parameters are equal to zero and the covariate X k is removed in calculating the determinant of Fisher info; or (II) the parameters are equal to zero but the covariate X k is still retained in the determinant calculation for penalty term.
For implementation I (known as "PLR" direct comparison [PLR-DC] method), 13,18,21 the reduced information matrix in the penalty term has two columns and two rows (correspond to k , k ) less than that of the full model (Equation A1 in Appendix). The resulting penalized loglikelihood for the reduced model, under In the implementation II (known as "nested" deviance method, 23,26 ) zeros are substituted for k , k with the dimension of information matrix in the penalty kept the same as the information matrix of full model. This leads to the final expression of the penalized loglikelihood under the same null, as For instance, instead of removing l 2 ( ) 0 k in Equation (A1), based on the original expression (Data S1.3) that l 2 ( ) are vectors of length r). For LRTs of single and multiple parameters, or model fit comparison between hierarchically nested models, the nested deviance method is recommended by Heinze et al, 23 in that it provides a better approximation to the 2 distribution under non-asymptotic condition compared to the other method.

Implementation and software
Following the approach of Yilmaz et al, 11 we apply the optimization function nlm (R function), which is a Newton-type algorithm to find the vector of parameter estimates that maximizes the defined loglikelihood, that is, the Firth-type penalized loglikelihood l * ( ), or the usual loglikelihood l( ). Within the nlm function, a Fortran function is called on to seek a joint solution for multiple parameters under unconstrained minimization; we set the maximum number of iterations to 200, and by default the gradient tolerance (gradtol) is set to 10 −6 , as the degree of approximation to zero accepted for terminating the iteration process. Aside from the parameter estimates, the nlm function also outputs a hessian matrix at and the determinant of the information matrix can be calculated by Using this result in our implementation, we reduce the complexity of computing inverse and determinants substantially.
We have implemented a new R package "mixcuref" compiled with built-in functions to estimate the model parameters under both ML and FT-PL methods for MC models, (procedures as described above). By carefully choosing starting values for the nlm optimization, the loglikelihood function can be minimized at̂ * ⊤ ,̂ * ⊤ ,̂ * for FT-PL, and̂⊤,̂⊤,̂for ML. Moreover, a separation detection algorithm, similar to the procedure described in "brglm" function in R 18 is implemented to detect cases of separation. 13 Although more sophisticated but computationally demanding procedures have been developed, 28,29 we implement a relatively simple practical procedure 13 : the estimated variance in each iteration step of refitting the parameter values (until convergence) is monitored by the sequence of ratios of SEs between the first iteration and the current iteration as a trace of divergence. For simplicity, we declare a dataset as a case of separation if there is any single parameter with SE ratio exceeding 10. (For the "mixcuref" package, the online link for download is https:// github.com/ChangchangXu-LTRI/mixcuref.)

Alternative penalization methods in MC
Under the log-F(m,m) prior method, the penalized likelihood is L * log f ( ) = L( ) × |I( )| m∕2 . In addition to the Firth-type penalization method (m = 1), we also adopt the log-F (2,2) penalized likelihood (logF (2,2)-PL) and Firth-type without intercept penalization (noint-PL) for MC model estimation. For logF (2,2)-PL, the penalty function is log |I( )|, and the penalized loglikelihood 17,24 is Under the noint-PL, the penalty function is modified such that the intercept parameters ( 0 , 0 ) are not included in calculating the determinant of the information matrix. The penalized loglikelihood is The same optimization function nlm is used to obtain maximized estimates for logF(2,2)-PL and noint-PL, through built-in functions in package "mixcuref." Using simulations under single variable MC models, we compare the bias and MSE of parameter estimates among the three penalization methods.

Simulation design
To resemble the ANN breast cancer study data, we simulate time-to-event outcomes with a sample size of n = 1000 patients across a range of event rates (and n = 2000 at lower event rates), and include prognostic regression variables based on the motivating cohort study TMA data introduced in Section 1.2. According to the binary expression levels (positive or negative) of four IHC biomarkers (ie, HER2, CK5, EGFR, and Ki67) quantified by TMA, and two ligand-binding assay measured biomarkers (ie, ERs and PRs determined at time of surgery), we define four intrinsic subtypes 8 : Her2, Luminal A, Luminal B, and Triple Negative (TN). Then, we generate prognostic covariate values for each patient according to the observed distributions of tumor subtype variable and two TPFs (tumor size and menopausal status), and survival outcomes (ie, probability of cure, and time to recurrence) under the MC model with various specifications of covariate associations with the survival outcomes, focusing particularly on the effect of variation in the event rate. In each dataset, we fit the univariate and multivariate MC models (correctly specified as the data generating model) using ML and FT-PL to estimate parameters and construct LR statistics to test for prognostic importance of the subtypes. We generate 5000 replicate datasets at each simulation setting.

Data generation
To be more precise, marginal and joint distributions of the binary variables follow those in the TMA data, with approximately 11.5% Her2 positivity, 22% TN positivity, 33% Luminal A positivity and 33.5% Luminal B positivity for the tumor subtypes, with 42% pre/peri-menopausal status (vs post), and 51% tumor size greater than 2 cm (vs less than 2 cm). The joint covariate distribution is generated from a multinomial factor with 16 categories created from the observed tumor subtype variable with four categories and two binary TPF variables, based on the distributions in the complete case ANN dataset. The subtypes and TPF are then coded as binary indicators. Average Pearson correlations of the resulting multi-correlated binary variables are reported in Table S2.
Cure status (C, s.t. C = 1 for cured and C = 0 for not cured), RFS time (T), and censoring indicator (Y , s.t. Y = 1 for observed event/recurrence and Y = 0 for censored) are generated for a pool of patients with mixed cure status. The cure indicator, C, is generated via logistic regression using the coefficients in the incidence part of MC model; coefficient values are based on fitting to the TMA data. The patients assigned to the cured group are assumed to enter the study following a Uniform(0, 60)-distribution and to have remained recurrence free until censored at the end of a subsequent follow-up period of up to 120 months.
For the susceptible (ie, non-cured) patients, a Weibull distribution is assumed for the baseline of the survival distribution in generating the time-to-disease recurrence. Explicitly, if we let the cdf of T be such that U ∼ Uniform(0, 1), then Thus the survival time T can be obtained as Then time to censoring is generated assuming a 60-month recruitment period, and a subsequent follow-up period of maximal 120 months. Values for the coefficients ( ⊤ and parameter estimates) are based on the fit of the latency part of MC to the TMA data. The event rate is determined by adjusting the intercept values, 0 and 0 , for example, high recurrence rate of 25%, medium rate of 15%, low rate of 10%, and extremely low rate of 5%.

Simulation settings
The true parameter values of log odds ratios ( j ) (or log OR) and log hazard ratios ( j ) (or log HR) for the tumor subtypes are given two sets of values of simulation scenarios, according to underlying model assumptions: (i) a "null" simulation created by assuming no association between the tumor subtype variable with either of the probability of being cured and distribution of time to recurrence, such that the coefficients of the tumor subtypes are all set to zeros in both the incidence part and latency part of the MC; (ii) an "alternative" simulation generated by setting the coefficients close to values fitted in the MC model for TMA data. In particular, the simulation scenarios are labeled by the following notations: that is, "1" for univariate "5" for multivariate MC models; "A" or "N" for alternative or null hypothesis for the effects of subtypes; and "25," "15," "10," and "5" for four event rates (%). For instance, Simulation-1A25-Her2, Simulation-1N25-Her2, Simulation-1A15-Her2, Simulation-1N15-Her2, Simulation-1A10-Her2, Simulation-1N10-Her2, Simulation-1A5-Her2, and Simulation-1N5-Her2 of 5000 replicates each, are generated under the above eight scenarios, based on the univariate MC "Model 1-Her2" with a highly unbalanced binary variable Her2 as covariate: Incidence part ∶ logit (P (C = 1|X Her2 )) = 0 + Her2 X Her2 , Likewise, eight scenarios were generated for each of two other univariate MC models "Model 1-LuminalA" and "Model 1-TN" are respectively For simulations of a 5-variable MC "Model 2" with three binary tumor subtype variables and two binary TPFs in the model, the eight scenarios are Simulation-5A25, Simulation-5N25, Simulation-5A15, Simulation-5N15, Simulation-5A10, Simulation-5N10, Simulation-5A5 and Simulation-5N5, with the model specified as: The higher prevalence Luminal B subtype is assumed as the reference category. The parameter values specified for the various simulation scenarios are summarized in Table 1. Subsequently, the cure variable C and outcome variables T, Y are generated according to the coefficient values. The four intended levels of recurrence rates realized with slight random variations across 5000 replicates of datasets are shown in Table S3.

Model fitting
In total, we fit the correctly-specified MC model and a Cox-PH model including the same covariates to each of the datasets simulated under eight scenarios, that is, the combination of null and alternative hypotheses and four recurrence rates. Distributions of individual parameter estimates are examined by box plots reporting median, quartiles, and outliers. For MC, the FT-PL parameter estimates and corresponding bias with respect to the true simulation values are recorded as an average over 5000 simulation replicates (ie, mean bias); the MSE is calculated as the mean of squared bias, and the number of non-separated cases (out of 5000 simulations) is recorded as counts for each simulation scenario. For ML, mean bias and MSE are calculated excluding datasets having one or more non-converged estimates. In the presence of simulation replicates where non-convergence occurs, mean bias and MSE are estimated by ML or FT-PL only in all converged cases (out of 5000) under ML method; and also estimated by FT-PL in all 5000 datasets (FT-PL always converged). For Cox-PH, mean bias and MSE are calculated similarly but limited to the null case for the subtype parameters. We use the existing "coxphf" package 14 to obtain ML and FT-PL estimates under Cox-PH.
Using LRT statistics for ML and FT-PL which can be calculated in all datasets (regardless of separations), T1E and power at different nominal test size (ie, 0.01, 0.02, and 0.05) are obtained by counting the percentage of simulation replicates achieving P values less than or equal to 1%, 2%, and 5% for an overall test of association of a covariate (both and ) with long-term and short-term survival (df = 2) and for a single parameter (one of or ) test of association (df = 1). The statistical power at varying nominal testing levels (ie, 0.005, 0.01, 0.02, and 0.05) is plotted for each simulation scenario. In the evaluations of hypothesis testing that follow, we focus mainly on bivariate tests of parameter pairs (both and ) associated with the same covariate. Results for single parameter tests are reported in Data S1.
In addition, using Simulation-1A25 to 1N5 for each of Her2, Luminal A and TN, we compare estimation bias with respect to the true parameter value under univariate MC models, the MSE, and the counts of non-separated datasets among the FT-PL, log-F (2) prior, and noint-PL methods.

Point estimation bias comparison between two methods
We compare methods with respect to parameter estimation and MSE fitted by the correctly-specified multivariate MC model (Model 2) to each data under Simulation-5A25 to 5N5. The conclusions from datasets of Simulation-1A25 to 1N5 that are fitted by corresponding univariate MC models are very similar to those from the multivariate simulations, with FT-PL converged in all datasets even when ML does not (see Figure S1 and Table S4). The contrast between the performance of FT-PL and ML is pronounced: 1. At high event rate (ie, 25%), FT-PLEs and MLEs are essentially equivalent under both alternative simulations (Figures 1  and 2 and Table S5) and null simulations (Figures S2 and S3 and Table S5) for all model parameters. However, the bias and variation grow in magnitude as the event rate decreases. FT-PL generates consistently smaller absolute mean bias than ML for event rate of 15% or lower. 2. In general for both ML and FT-PL, MSE grows as the expected event rate increases (Table 2, Figures 1, 2, S2, and S3). For all the covariates, FT-PL generates equivalent or smaller MSE than ML under various simulation scenarios ( Table 2). The difference between the two methods increases as event rate decreases, with largest MSE difference observed at the 5% rate (Table 2 and Figures 2 and S3), indicating substantial shrinkage by FT-PL when the event rate is very low. 3. The ML does not converge in 137 of Simulation-5N5 datasets and in 2 of Simulation-5A5 datasets, while FT-PL converges in all 5000 simulated datasets ( Table 2). For datasets with ML complete separation (non-convergence in one or more parameters), FT-PL generally shrinks the estimated value and reduces bias, but the degree of shrinkage varies. For instance, with Her2 , TN , Her2 and TN in null simulations, a few outlier FT-PLEs are still relatively large ( Figures S2  and S3).
We also compare the mean and variance of estimates fitted by the Cox-PH model with the same covariates as the MC under both ML and FT-PL (Figures 1, 2, S2, and S3). As in MC, FT-PL reduces variance of the Cox-PH log HR estimates for various simulation scenarios, and subtype estimates are unbiased under the null hypothesis of no subtype differences (Figures S2 and S3). However under the alternative of subtype differences, the Cox-PH subtype HRs tend to  be underestimates of the MC subtype HRs (which are conditional on being susceptible to recurrence) because the cured subgroup is recurrence free, 10,11 and a high cured fraction (greater than 70%) is specified in the simulation. Table 3 and Figure 3 report T1E and power of 2 df covariate-specific LRTs for ML and FT-PL (nested deviance method) for simulations under null and alternative hypotheses in the multivariate MC models across the range of event rates in sample sizes of 1000 or 2000. The T1E and power of 1 df LRTs for single parameters of the multivariate Cox-PH models fitted to the same datasets are shown in Table S11 and Figure 3 as well.

Hypothesis testing of two estimators by likelihood inference
For Simulations 5N25 to 5N5 under the joint null for all subtypes (as specified in Table 1), the LRTs for the corresponding multivariate MC analysis are valid for both ML and FT-PL at 25% event rate in a sample size of 1000, but tend to be liberal otherwise. Unlike the univariate scenarios (Table S8 and Figure S4) where tests are more often valid at lower event rates depending on the biomarker frequency, the ML and FT-PL LRTs show T1E slight elevation at 25% (ie, at 0.02 and 0.01 nominal levels). The Q-Q plots ( Figure S6) for the joint LRTs (df = 2) of j = j = 0 for each subtype variable under Simulation-5N25, 5N15, and 5N10 nevertheless indicate overall alignment of the observed with the expected P values for both FT-PL and ML, with some minor deviations at the tail. As the event rate decreases to 5% (Table 3), the T1E elevation becomes somewhat more evident for FT-PL than for ML in a sample size of 1000 under all testing levels, particularly for Her2 (Table 3). However, when the sample size increases to 2000, T1E is controlled at 10% and 5% event rates (   Table 1 for true parameter values), at nominal testing levels, for ML and FT-PL. The simulation scenarios (each 5000 replicates) include 5A25, 5A15, and 5A10 for sample size 1000 analyzed by MC (small circle) and Cox-PH (small triangle), and 5A10 and 5A5 for sample size 2000 analyzed by MC (large circle) and Cox-PH (large triangle) single parameters, using the same simulations for univariate (Table S10 and Figure S6) and multivariate analyses (Table S13 and Figure S11), reveals that T1E deviations are more frequent for the latency parameters for time-to-recurrence distribution. This suggests the T1E elevations seen in the 2 df LRT can be attributed mainly to the latency component estimation which may be more sensitive to low event numbers.
For Simulation 5A25-5A5 under the alternative model (Figure 3), power varies across event rates and nominal testing levels as expected, with lower power at decreasing event rates, higher power at increasing test levels, and higher power when the sample size increases from 1000 to 2000. Power of the 2 df LRT of FT-PL is consistently higher than LRT of ML for multivariate MC model simulations as well as for univariate simulations ( Figure S5 and Table S12), with similar findings for the 1 df LRTs (Figures S7 and S12).
Under Cox-PH analysis, the observed T1Es for Simulations 5N25 to 5N5 exhibit a similar pattern of only slight deviation at the tail for event rate 5% (Table S11 and Figure S9), indicating overall test validity under the null hypothesis. Compared to the 2 df MC LRTs in Simulation 5A25 to 5A5, however, the 1df LRTs for the same covariate yield consistently lower statistical power at varying nominal testing levels ( Figure 3). The loss of power arises from the existence of cured fraction that is not accounted for in the Cox-PH analysis.

Comparison of FT-PL and two other penalizations
The comparisons of FT-PL with logF(2,2)-PL and noint-PL focus on estimation bias in univariate model simulations of each subtype indicators under null and alternative hypotheses (Tables 4 and 5). For Model 1-Her2, which contains Her2 as the single covariate (notably the most sparse subtype), all three methods yield relatively large mean bias. Although all three penalization methods converge in all simulation scenarios, the mean bias by FT-PL is more consistently minimized among the three methods. With similar or smaller MSEs, logF(2,2)-PL generates comparable or slightly larger mean bias in some parameters in the null simulation, compared to FT-PL, and it produces noticeably larger mean bias in the alternative simulation scenarios. Noint-PL yields mean bias close to FT-PL for LuminalA and LuminalA (notably the parameters of least sparse subtype) under both null and alternative simulations, with comparable MSEs. As the event rate decreases (ie, ≤ 10%), noint-PLE tends to generate larger mean bias than FT-PL in both null and alternative simulations, especially for parameters of Her2 and TN. Although the parameter values for 0 set in the simulations are large in some cases (ie, | 0 | > 5), noint-PL does not necessarily generate smaller mean bias or MSE than FT-PL. We also noted under these univariate models, the three methods do not reduce mean bias in simulations under alternative as much as under null models. In general, compared to logF(2,2)-PL and noint-PL, FT-PL consistently exhibits the smallest mean bias across different simulation scenarios, with comparable MSEs.

APPLICATION OF FT-PL IN MC ANALYSIS OF TMA BREAST CANCER DATA
Among the 887 patients in the breast cancer cohort with tissue samples available for TMA biomarker expression analysis, four established tumor subtypes (ie, Her2, Luminal A, Luminal B and TN) were characterized for 589 patients, leaving the remaining as undefined (due to incomplete biomarker values). The classification of different tumor subtypes as combinations of biomarkers aims to better stratify patients with different prognosis. Among the 589 patients with defined subtype categories, 87 experienced distant recurrence during the observation period. The issue of missing tumor biomarker and/or subtype data will be re-visited in the discussion (Section 5).
To illustrate the differences between ML and FT-PL in application to the TMA study data, we present two alternative frames for MC analysis: in Section 4.1, the analysis includes nested models that examine the main effects of categorical subtype factors, together with TPFs in all 589 patients; in Section 4.2, the analysis examines a categorical variable derived from the complete combinations of four tumor subtypes and two tumor size categories, within the subset of patients who underwent hormonal therapy. The analysis in Section 4.1 represents a comprehensive approach to evaluate multiple molecular and histopathologic tumor factors as well as patient characteristics. On the other hand, the analysis in Section 4.2 represents an approach to stratify patients into more refined, homogeneous subgroups which better discriminate prognosis and may account for interactions between risk factors. In parallel to the MC analysis, we perform the Cox-PH analysis using the same models as in the comprehensive and interaction regression analyses.
A practical issue in both approaches relates to the choice of the baseline category in specifying categorical subtype factors in the incidence and latency regressions. For analyses that assess tumor subtype differences, we define Luminal B to be the baseline category; as a frequent subtype with the largest number of recurrence events it provides relatively stable and interpretable estimates. Although the contribution of the subtype factor to the likelihood is unchanged by recoding subtype category indicators (Figure 4 or Table 6 compared to Tables S14 and S16), the low event rate in the Luminal A category may recommend it as a more natural baseline category to characterize subtypes associated with poor prognosis.
Out of practicality and convention 10,11 in specification and interpretation of regression coefficients, we report logistic regression coefficients and odds ratios for long-term susceptibility to disease recurrence, the complement probability TA B L E 4 Mean bias of MC model parameter estimates of FT-PL, log-F(2,2)-PL, and noint-PL, as well as MSE of all datasets among 5000 simulation replicates (sample size 1000) based on univariate analyses in Model 1-Her2, Model 1-Luminal A, and Model 1-TN, under null hypothesis and four event rates (in bold values) Parameters under H 0 and recurrence rates 25%, 15%, 10%, and 5% are from simulations-1N25, 1N15, 1N10, and 1N5. All 5000 replicates are non-separated datasets using FT-PL, log-F(2,2)-PL and noint-PL.

Mean bias MSE
of cure (as originally formulated). For the latency component we report coefficients and hazard ratios for the expected distribution of time to recurrence in those susceptible to recurrence. Thus a positive-valued estimate of the incidence log-odds associated with a subtype means a higher long-term odds of recurrence compared to the reference subtype, and a positive estimate of the latency log hazard ratio corresponds to shorter time to recurrence in those susceptible to disease recurrence.

Prognostic relevance of tumor subtype and TPFs
When the survival data stratified by tumor subtype shown in the Kaplan-Meier (K-M) plot ( Figure 4A) are analyzed as a categorical factor in a comprehensive multivariate MC model including multiple TPFs ( Figure 4B and Table 6), the 2 df LRTs detect significant subtype differences, and confirm the known TPF associations. The FT-LRT P values are consistently smaller than the ML-LRT P values. In the presence of TPFs, inclusion of subtype improves the model   Table S15). It is to be noted that LRTs of the multivariate MC models here should be interpreted cautiously given the evidence from the simulation study that T1E can be elevated when the number of events is small. We observe that some covariates without statistically significant association under Cox-PH (ie, 1 df LRT P value ≤.05) exhibit significant P values for both or either one of HR and OR under MC. The conventional analysis fails to detect any subtype effects which reflects the simulation study findings on reduced power with Cox-PH for covariates with non-zero underlying values. Examination of the MC OR and HR effect estimates ( Figure 4B and Table 6) suggests that in comparison to Luminal B patients: (i) Her2 patients have similar long-term probability of recurrence but shorter time to recurrence; (ii) Luminal A patients exhibit lower long-term probability of recurrence and similar time to recurrence; (iii) TN subtype patients have lower long-term probability of recurrence but shorter time to recurrence. We note that compared to ML, FT-PL does not always shrink single parameter estimates toward the null. For single parameter estimates, we also report profile likelihood confidence intervals for subtype differences in ANN TMA data for both ML and FT-PL (Table S17). In the Cox-PH analysis, the effect estimates largely reflect the findings for long-term risk of recurrence under the MC model (ie, HRs of Cox-PH resemble the ORs of MC), due to the inability of the Cox model to capture differences between long-term and short-term risks of recurrence, 10,11 such as those in the Her2 and TN subtypes. When the event rate is low and there is a cured fraction, the Cox-PH analysis is fundamentally mis-specified and the log HR estimates can be misleading.

MC analysis with FT-PL in the presence of recurrence free subgroup
In this MC analysis that includes 259 patients with 27 recurrences, the reference category is set to the subgroup of Luminal B with tumor size <2 cm. Two of the eight subgroups obtained by stratification of tumor subtype and tumor size are completely event free as shown in the K-M plot ( Figure 5A): patients of tumor size <2 cm with either Her2 or TN subtype remain recurrence free until end of the observation. This leads to complete separation in the multivariate model. As a result, the estimates that compare the event free subgroups to the reference group do not converge under ML. In Figure 5B and Table 7, we report truncated ML estimates acquired by relaxing the original criterion (to SE ratio = 20) for the separation detection algorithm: that is, Her2 with tumor size <2 cm subgroup (log OR = −12.98 and log HR = −5.54) and TN with tumor size <2 cm subgroup (log OR = −12.2 and log HR = −6.28). In contrast, FT-PL provides sensible estimates for both log OR and log HR. Meanwhile, ML estimation of other subgroups are also adversely affected by complete separation. For instance, the log HR of Luminal A subgroup with tumor size <2 cm seems to have its sign reversed under ML, but corrected by FT-PL, given the later time to recurrence in Luminal A subgroup that of the Luminal B. The Cox-PH analysis with the same covariates also produces non-finite log HRs for the two event-free groups which have inflated log ORs and log HRs under MC analysis due to the condition of complete separation.
Utilization of FT-PL solves the problem of ML separation and provides more reasonable estimates and stronger test results. In light of simulation results, very small P values for the between-subgroup comparison need to be interpreted with caution due to the small event number counts, but FT-PL estimates are expected to be unbiased on average under such conditions. It is also noted that changing the reference category does not improve the problematic ML estimates, and can even lead to estimates of coefficients that are more difficult to interpret (Table S18).

DISCUSSION
One merit of adopting FT-PL for treating or preventing infinite estimates due to problems of separation is that, in several cases, finiteness of FT-PLE can be guaranteed especially when the MLEs are infinite. This has been proven for logistic regression models and a few other GLMs as well as nonlinear models, for example, in recent work by Kosmidis and Firth. 30 Through their proof for the finiteness of FT-PLE, it is demonstrated that a form of concave shape can be achieved by the penalized likelihood/loglikelihood, with bounding on the two ends of the likelihood profile as the parameter of interest goes to ∞∕ − ∞. Assuming L * ( ) = L( ) × |I( )| 1∕2 , if the penalty function (|I( )| 1∕2 ) can be expressed as some product of observed variable X and certain function of , and it can be shown that as → ∞∕ − ∞, |I( )| → 0, then the finiteness of the penalized estimates can be ensured in exponential family models (eg, logistic regression). Using the mathematical criteria proposed by Kosmidis and Firth, we show that FT-PL can guarantee finite estimates for MC model parameters under certain assumptions (Data S1.4), and we demonstrate in algebraic derivation, the propriety of Jefferys prior under MC framework (Data S1.2). Empirically, we validate finiteness of FT-PL under stringent conditions by simulations down to 25 events out of 1000 subjects.
In application of the MC model parameter estimation procedure of MC model in our "mixcuref" package that utilizes the nlm function, we have observed that optimization to the global maximum depends on the initial value input. This sensitivity to initial input increases as the degree of the model complexity increases (ie, many variables, or variables in nonlinear form). We have encountered this challenge for MC modeling when the number of covariates reaches 9 or more in the ANN TMA data. One solution for locating the global maxima for nlm is through an automatic procedure we developed that iteratively attempts a number of combinations of initial input values with small increments at the beginning of optimization process. Alternatively, we also applied the optimx function, in particular, the simulated annealing mechanism (SANN), which belongs to the class of stochastic global optimization methods and uses the Metropolis function. The SANN algorithm has been shown to robustly locate the global maxima of given parameters, 31 and also outputs a numerically differentiated Hessian matrix. However, the trade-off is that the maximization process can be highly time consuming (compared to nlm), so it was not the primary means we adopted in our work. In the recent literature of cure rate models, the EM algorithm is widely relied on for parameter estimation within available software packages 27,32 (Table S1). A recent paper 33 published while our work was in progress, adopted EM optimization for a Weibull-PH cure model under Firth adjusted score function for bias reduction. Nevertheless, unlike nlm which directly outputs the variance-covariance matrix, EM usually involves extra steps to obtain the SE of estimates. The model output reported in this article was acquired using the nlm function exclusively.
In Section 3.3, we compare point estimation bias (and MSE) of FT-PL with logF(2,2)-PL and noint-PL. Previous work by Greenland et al 17 compared the penalized estimates as the posterior mode from different priors, including log-F(m, m) with varying m, and Jefferys prior (log-F(1)). In their single parameter logistic model, the asymptotic variance decreases as m increases and the bias of the estimator increases as m moves away from 1 and true parameter value moves away from 0. These are consistent with our findings from the simulations, as log-F(2) tends to generate smaller MSE but not smaller mean bias compared to FT-PL. Compared to FT-PL, we did not observe smaller mean bias for the regression parameters of noint-PL when intercepts were set to relatively large values. This may imply more complicated trade-offs between estimation bias and the inclusion of intercept parameters for penalization in the MC model than other models. Risk prediction as one of the important metrics was explored by previous studies 24,26 that compared penalized methods including Firth-type method. For sparse or small data sets, evidence suggests that the risk predictive performance of logistic models benefit more from noint-PL, 26 and log-F prior excels over Firth-type methods in risk predictive performance in logistic regressions. 24 As a potential future direction, our comparisons could be extended to the risk prediction by the log-F prior method and FT-PL methods in MC models.
Unlike previous studies of MC models 11,27,32,33 that relied on Wald statistics for inference, we consider MLE and FT-PLE LR statistics, which are expected to be more robust under a non-quadratic (penalized) likelihood function due to small samples. In this report, we focus on evaluation of a two degree-of-freedom test conducted to jointly assess the contribution of incidence and latency parameters associated with the same prognostic variable. Compared to marginal testing of individual parameters, we find it more informative to assess the effects on incidence and latency outcomes jointly, followed by consideration of whether a factor strongly associated with long-term recurrence also has an effect on time-to-recurrence distribution, and vice versa. 11,34 Evaluation of LR statistics under a penalized likelihood approach warrants on-going research. The PLRS approach, that directly compares the maximum penalized loglikelihood of a full model and a reduced model with constraint on the variable(s) of interest, has been applied often in GLMs, 22,25,26,28 such as logistftest and brglm2. Since the MLE is asymptotically unbiased, the penalized likelihood approach is asymptotically equivalent to maximum likelihood and the loglikelihood derivative dominates the modified score function; it follows that the penalized loglikelihood ratio between the full and the reduced models can be assumed to follow an asymptotic 2 . However, in finite samples the implementation of the PLRS approach can differ (ie, PLR-DC and nested deviance), resulting in different expressions in the penalty term of the reduced model. We investigated both implementations by simulations, and found the PLR-DC method (a) generates very similar pattern of results for simulations under the null hypothesis with more elevated T1E under low event rates, and (b) produces higher power than the nested deviance for simulations under alternative hypothesis. Nevertheless, we have presented results using the nested deviance method, which uses the same penalty for two hierarchically nested models, according to evidence that it yields a better approximation to the asymptotic distribution than the PLCR-DC. 23 However, when non-nested models are compared, the PLR-DC implementation is preferred. 23 In summary, empirical studies in finite samples show that, in the presence of relatively low event rate, FT-PL generates estimates with smaller MSE in most cases, and tends to yield estimates with smaller mean bias than that of ML in most cases when the true parameter value is close to zero. What is more, FT-PL is able to provide finite estimates of non-null parameters in small samples when ML fails to converge. These results are consistent with findings in literature across a broad range of regression models. 13,14,18,19,25,26,33 Utilizing LRTs under alternative MC model, we find that hypothesis testing for FT-PL (via nested deviance method) always yields greater power than that of ML at varying nominal testing levels. Because the Cox-PH model assumes all subjects are susceptible without a cured fraction, the results of Cox-PH analysis of the data simulated under the MC model, including the associated HR estimates, can be misleading when the underlying parameter values are non-zero, for both FT-PL and ML; statistical power is also compromised.
In prognostic applications, where multiple prognostic factors are jointly analyzed in a multivariate model to model the risk of recurrence after initial diagnosis, one of the primary goals is to identify subgroup(s) with high or low event rates through the identification of new prognostic factors such as tumor biomarkers, which can be used to stratify patients into varying risk groups. As the number of prognostic factors and relevant interactions among them increase in regression model specifications, complete separation is more likely to occur as a result of imbalanced sub-categorical distribution and missing values in the covariates, leading to biased and inefficient analyses under ML analysis. It follows from our findings in this report that another major motivation for the use of FT-PL in molecular prognostic applications of MC models is to improve robustness in multiple imputation (MI) procedures for missing covariates. In applying MI in the ANN dataset, we found MI estimation and testing susceptible to complete separation and finite sample bias under ML and Wald statistics. In on-going investigations, we are incorporating FT-PL into the analysis of multiply imputed data, to improve the performance of MI for the MC models, as we have shown it addresses the accuracy (unbiasedness) and efficiency in parameter estimation when standard ML is unsatisfactory.