On near-redundancy and identifiability of parametric hazard regression models under censoring

We study parametric inference on a rich class of hazard regression models in the presence of right-censoring. Previous literature has reported some inferential challenges, such as multimodal or flat likelihood surfaces, in this class of models for some particular data sets. We formalize the study of these inferential problems by linking them to the concepts of near-redundancy and practical non-identifiability of parameters. We show that the maximum likelihood estimators of the parameters in this class of models are consistent and asymptotically normal. Thus, the inferential problems in this class of models are related to the finite-sample scenario, where it is difficult to distinguish between the fitted model and a nested non-identifiable (i.e., parameter-redundant) model. We propose a method for detecting near-redundancy, based on distances between probability distributions. We also employ methods used in other areas for detecting practical non-identifiability and near-redundancy, including the inspection of the profile likelihood function and the Hessian method. For cases where inferential problems are detected, we discuss alternatives such as using model selection tools to identify simpler models that do not exhibit these inferential problems, increasing the sample size, or extending the follow-up time. We illustrate the performance of the proposed methods through a simulation study. Our simulation study reveals a link between the presence of near-redundancy and practical non-identifiability. Two illustrative applications using real data, with and without inferential problems, are presented.


Introduction
The analysis of time-to-event data is of interest in many areas, including medicine, biology, and engineering.Survival data are typically right-censored, as not all of the times-to-event of interest are observed within the follow-up time (administrative censoring) or there may be loss of follow-up of some individuals (random censoring).In addition to the times-to-event, individual characteristics (covariates) x ∈ R p are often available.Thus, it is of interest to incorporate the information in the covariates for modelling the times-to-event.With this aim, a number of survival regression models have been proposed in the literature.Such models are commonly formulated on the hazard function, leading to different hazard structures.The Proportional Hazards (PH) model represents the most popular hazard structure in practice [Cox, 1972].This model is formulated at the hazard level h(t; x, β), and assumes that the covariates x have a multiplicative effect on the baseline hazard function h 0 (•), as follows h(t; x, β) = h 0 (t) exp x β , arXiv:2305.05641v1 [stat.ME] 9 May 2023 where β ∈ R p are the regression coefficients.Another popular survival regression model is the Accelerated Failure Time (AFT) model, which can be formulated at the hazard level assuming that the covariates have a simultaneous multiplicative effect on the time scale and the hazard function [Kalbfleisch and Prentice, 2011] h(t; x, β) = h 0 t exp x β exp x β .
The AFT model can also be formulated as a log-linear regression model.A richer hazard structure, referred to as the General Hazard (GH) hereafter, that contains the PH and AFT models, as particular cases, was proposed independently by Etezadi-Amoli and Ciampi [1987] and Chen and Jewell [2001].The GH model is also formulated at the hazard level h(t; x, α, β), and incorporates multiplicative effects on the time scale and the hazard scale: where α ∈ R p and β ∈ R p are the regression coefficients, and h 0 (•) is the baseline hazard.This sort of models have been used in survival analysis applications, as they represent a simple alternative for including time-level effects (through the coefficients α), while allowing for a tractable estimation.These applications include the analysis of the survival of cancer patients [Li et al., 2015, Rubio et al., 2021, Rubio and Drikvandi, 2022, Ribeiro-Amaral et al., 2023], recurrent event data [Xu et al., 2020], joint models for longitudinal and survival data [Rubio et al., 2019, Alvares andRubio, 2021], reliability analysis [Gámiz et al., 2011], and as a mean for comparing competing nested sub-models (e.g.PH and AFT) of interest [Zhao et al., 2009].
Inference on the parameters of the GH model can be conducted using parametric [Rubio et al., 2019] and semiparametric approaches [Chen and Jewell, 2001].The parametric approach represents an attractive option as parametric models are interpretable and relatively easy to implement.However, since parametric models impose specific assumptions on the shape of the baseline hazard h 0 (t; ξ), it is desirable to use flexible models that can capture a variety of shapes of the hazard function to avoid imposing restrictive assumptions.Flexible parametric models require the inclusion of additional parameters.This inclusion of parameters typically carries a cost in the finite-sample inference.For instance, by increasing the uncertainty on the model parameters (wider confidence intervals or wider profile likelihoods) [Raue et al., 2009].For GH models, some references have reported the presence of flat likelihoods or a "challenging" optimization of the likelihood function [Rubio et al., 2019, Burke et al., 2020], including non-convergence of some numerical optimization methods or singular Hessian matrices, when the sample size is small or moderate, or with high censoring rate.We focus on the study of these inferential problems from a more formal perspective, linking them to the concepts of "near-redundancy of parameters" and "practical non-identifiability" [Catchpole and Morgan, 1997, Catchpole et al., 1998, Raue et al., 2009, Cole, 2020].
Near-redundancy and practical non-identifiability refer to cases where the statistical inference may fail due to the inability to estimate the parameters of a model.This may be due to data quality, which makes the problem sampledependent rather than a general shortcoming of a parametric model.A near-redundant model is a model that, formally (i.e.theoretically), is not parameter-redundant, but might behave as a parameter-redundant model, because it is very similar to a model that is parameter-redundant for a particular data set [Cole, 2020].This occurs when the parameter estimates are close to a redundant nested model [Cole, 2012] (that is, a model that can be written in terms of a smaller set of parameters).On the other hand, a similar and related problem is that of practical non-identifiability.A model is practically non-identifiable if the log-likelihood has a unique maximum, but the length of the individual parameter's likelihood-based confidence region tends to infinity in either or both directions [Raue et al., 2009].These concepts have been largely used in Statistical Ecology and Biology [Gimenez et al., 2003, Cole et al., 2012, Cole and McCrea, 2016, Simpson et al., 2022].Although these two definitions are presented as separate concepts, we will show that there is an intersection between them in the context of the GH model (1).
The inferential challenges reported in the use of GH models, discussed above, can be formally classified as problems with near-redundancy of parameters and practical non-identifiability.Therefore, it is of interest to develop tools to identify these types of inferential problems in the GH model.Methods for detecting near-redundancy or practical non-identifiability are classified into symbolic or numerical methods [Catchpole andMorgan, 1997, Cole, 2020].The symbolic method or symbolic differentiation method, is a method based on symbolic algebra that can be used to determining whether or not a model is parameter-redundant or non-identifiable [Catchpole et al., 2001].The symbolic method basically consists in determining the rank of matrix of derivatives of exhaustive summary terms [Cole, 2020].This method has been used for different classes of models.However, for structurally complex models, computer algebra packages may not be able to calculate the required symbolic rank of the derivative matrices, due to computer memory limitations [Cole et al., 2010, Cole, 2012, Cai et al., 2021].In such cases, numerical methods have been used as an alternative.These numerical methods include the Hessian method, the simulation method, data cloning, and the profile log-likelihood method (see Cole, 2020 for a discussion on these methods).
In this article we focus on identifying inferential problems associated with the parameter estimation in GH models in the presence of right-censored observations.For GH models, we are able to pinpoint the nested non-identifiable model (i.e., parameter-redundant). Consequently, we can link the inferential problems with the closeness of the estimated model to the nested non-identifiable model.More specifically, we propose a general method for detecting near redundancy based on measuring the distance between the fitted model and the nested non-identifiable model.We develop two criteria, based on the Kullback-Leibler divergence and the Hellinger distance, which take into account the sample size and censoring rate.In section 2, we present the parametric general hazard regression model and discuss the identifiability of parameters in this model.Section 3 presents a result on the consistency and asymptotic normality of the maximum likelihood estimators of the parameters in the GH regression model, and describes some methods used to detect near-redundancy or practical non-identifiability.This section also presents our proposed methodology to detect near-redundancy, which is based on using the Hellinger distance and/or the Kullback-Leibler divergence.In section 4, we conduct a simulation study to investigate the effect of the sample size and the censoring rate on the inference on the parameters of the GH regression model, as well as the performance of the proposed methodology for detecting inferential problems.In section 5, we apply the proposed methodology for detecting near-redundancy and practical non-identifiability of parameters in two real-data examples.Finally, we discuss the results obtained in this work in section 6.

Parametric General Hazard Regression Models
In this section, we describe the GH model and discuss several choices for the baseline hazard.We also discuss identifiability of parameters in this model, which we will later connect with the concepts of near-redundancy and practical non-identifiability.
Let us first introduce some notation.Let o i ∈ R + be the survival times (or times-to-event) for individuals i = 1, . . ., n.
Let c i ∈ R + be right-censoring times, such that we only observe o i ≤ c i , δ i = I(o i ≤ c i ) be the indicator that observation i is uncensored, t i = min{o i , c i } be the observed times, and n o = n i=1 δ i be the number of uncensored observations.Let x i = (x i1 , . . ., x ip ) ∈ R p denote the vectors of available covariates.Consider the parametric general hazard (GH) structure [Chen andJewell, 2001, Rubio et al., 2019]: where θ = (ξ , α , β ) ; x i ∈ R p are the covariates that have an effect at the hazard-level; xi ∈ R q are the covariates that have an effect at the time-level, with xi ⊆ x i typically; h 0 (•; ξ) is a parametric baseline hazard function, with parameter ξ ∈ Ξ ⊆ R r .The hazard structure (2) is slightly more general than (1), as (2) allows for the inclusion of different covariates as time-level and hazard-level effects.The GH structure (2) contains the Proportional Hazards (PH, α = 0), the Accelerated Failure Time (AFT, α = β and x = x), and the Accelerated Hazards (AH, β = 0) models as particular cases (see Rubio et al., 2019 for a more extensive discussion on the interpretation of this hazard structure).
The corresponding cumulative hazard function can be written as: Common (three-parameter) choices for the baseline hazard are: the Power Generalized Weibull (PGW), Exponentiated Weibull (EW), Generalized Gamma (GG), among other distributions [Alvares and Rubio, 2021] (see the Supplementary Material for more information on these distributions).These distributions can capture the basic shapes of the hazard function: increasing, decreasing, unimodal, and bathtub.Simpler (two-parameter) models such as the gamma, lognormal, and log-logistic distributions can also be employed, although they impose more restrictions on the shape of the baseline hazard.These models are implemented in the R package 'HazReg' (https://github.com/FJRubio67/HazReg)for several choices of the parametric baseline hazard.

Identifiability
Recall that a model is identifiable if two different sets of parameter values do not result in the same model [Lehmann andCasella, 2006, Cole, 2020].More formally, let P = {P θ : θ ∈ Θ} be a statistical parametric model with parameter space Θ.The model P is said to be identifiable if the mapping θ → P θ is one-to-one [Lehmann and Casella, 2006].That is, P θ1 = P θ2 implies that θ 1 = θ 2 for all θ 1 , θ 2 ∈ Θ.The model defined by the GH structure (2) is identifiable except for the case when the baseline hazard corresponds to the Weibull distribution [Chen and Jewell, 2001], provided that there is no collinearity between the covariates and the baseline hazard is identifiable.Indeed, for the Weibull baseline hazard the PH, AFT, and AH models coincide [Chen and Jewell, 2001], making (some or all of) the parameters α and β redundant.Consequently, in the context of the GH model (2), the concepts of parameter-redundancy and non-identifiability are equivalent.To obtain a theoretically identifiable GH model, it is required to use an identifiable baseline hazard which does not belong to the Weibull family, and that there is no collinearity between the covariates.However, some flexible parametric distributions such as the PGW, EW and GG distributions contain the Weibull distribution as a particular case (when the parameter γ = 1 in these distributions, see Supplementary Material).Consequently, the GH model with baseline hazard given by these distributions contains a nested non-identifiable model, for γ = 1, while it is identifiable for γ = 1.General families of distributions containing the Weibull distribution as a particular case (or as a limit case) are discussed in Sinner et al. [2022].We also refer the reader to Ley et al. [2021] for a general discussion on basic desiderata in the use of flexible parametric models.Indeed, non-identifiability, along with near redundancy and practical non-identifiability, make parametric models fail one of their key requirements, namely, "Straightforward parameter estimation".
3 Inference in the General Hazard model In this section, we present the likelihood function associated with the GH model (2) and discuss maximum likelihood estimation (MLE) of the parameters of this model.We present a result on the consistency and asymptotic normality of the MLEs, under standard regularity conditions, which shows that the GH model has good asymptotic properties.Then, we present several methods for detecting practical non-identifiability and near-redundancy of parameters in this model.

Maximum likelihood estimation
The availability of the hazard and cumulative hazards (2)-(3) allows for a tractable implementation of the log-likelihood function, where X = (x 1 , . . ., x n ) is the design matrix, t = (t 1 , . . ., t n ) , and δ = (δ 1 , . . ., δ n ) .Thus, parameter estimation can be performed by using numerical methods, provided the baseline hazard is not Weibull, to guarantee identifiability of parameters.Although this is not an onerous condition, in practice, inferential problems with these models have been reported for small and moderate samples or high censoring rates [Rubio et al., 2019, Burke et al., 2020], even when using baseline hazards different from the Weibull distribution.We will show that these inferential problems are related to flat likelihoods and/or likelihood surfaces with multiple local maxima, which complicate the numerical estimation of the parameters.This allows us to connect these inferential challenges with the concepts of near-redundancy and practical non-identifiability of parameters [Cole, 2020], which are finite-sample problems.In contrast, the following result shows that the maximum likelihood estimators of the parameters in the GH model are consistent and asymptotically normal.
Theorem 1.Consider the hazard regression model defined by the hazard structure (2) with parametric baseline hazard h 0 (•; ξ).Let θ = (ξ , α , β ) be the true values of the parameter.Under conditions C1-C7 in the Supplementary Material, it follows that (i) Consistency: (ii) Asymptotic normality: Conditions C1-C3 are satisfied by models of practical interest, such as the PGW, GW, and EW distributions.The remaining conditions are standard regularity conditions on parametric models.This theorem indicates that the model defined by the GH structure (2) has good asymptotic properties.However, near-redundancy and practical nonidentifiability represent finite-sample problems, which is the scenario of interest in real-life problems.We now discuss general and ad hoc methods for detecting these inferential problems in the GH model.

Detecting Practical Non-Identifiability: the Profile Likelihood
As explained in Section 1, if a model exhibits practical non-identifiability, the likelihood surface will contain flat, or nearly flat, ridges.Thus, naturally, a method for detecting these inferential problems consists of visualizing or evaluating the profile likelihood function associated with each parameter.This allows for individually identifying which parameters are involved in the practical non-identifiability problem.This method has been used in biological models [Raue et al., 2009], epidemiological models [Tönsing et al., 2018], or capture-recapture models [Lebreton and Cefe, 2002].For completeness, we present the definition, using our notation on hazard regression models, of the profile likelihood below.We refer the reader to Sprott [2008] for an extensive discussion on the use of the relative profile likelihood in statistical inference.
Suppose that the vector of model parameters θ can be decomposed into two subsets of parameters (ψ , λ ) , where ψ = (ψ 1 , . . ., ψ d ψ ) denotes the parameters of interest and λ = (λ 1 , . . ., λ d λ ) are the nuisance parameters.Let also L(θ; X, t, δ) = exp { (θ; X, t, δ)} denote the likelihood function of θ.The profile likelihood function of ψ is Now, the relative profile likelihood function of ψ is a standardized version of (4), which takes a value of one at the maximum of the profile likelihood function of ψ, .
A level c profile likelihood region for ψ is given by where 0 ≤ c ≤ 1.Since −2 log R P (ψ; X, t, δ) follows an asymptotically χ 2 d ψ distribution [Sprott, 2008], a profile likelihood region for ψ with approximately 95% confidence is obtained when c ≈ 0.147.
The profile likelihood method, for identifying practical non-identifiability, consists of checking if the length of the profile likelihood intervals, of a level of interest c, for a specific parameter are infinite in either or both directions.The profile likelihood function is rarely available in closed-form, but evaluating this function numerically is usually feasible [ Van der Vaart, 2000].The profile likelihood (equivalently, profile log-likelihood) method is a suitable numerical method for detecting practically non-identifiable models, as it allows inspecting each parameter individually and detecting the parameters producing inferential problems.We will explore the use of the profile likelihood for detecting practical non-identifiability in our simulation study and real-data applications.

Detecting near-redundancy: Distance-based Methods
Inferential problems may also appear when the true generating model is theoretically identifiable but it is close to a non-identifiable model.In the GH model, the nested non-identifiable model corresponds to the case where the baseline hazard belongs to the Weibull family.Inferential problems associated with this closeness are thus related to the concept of near-redundancy of parameters.Indeed, the PGW, EW, and GG distributions contain the Weibull distribution as a particular case, and consequently are prone to producing near-redundancy.
A possible way for detecting when the fitted model is close to the nested non-identifiable model consists of measuring the similarity between the fitted baseline hazard and the Weibull family.The similarity between the fitted model and the closest Weibull model can be calculated by using a distance or a divergence between probability measures.Let F 0 = F ξ be the parametric cumulative distribution associated with the baseline hazard in model (2).Let F 0 = F ξ be the fitted distribution function (associated with the fitted baseline hazard), with corresponding density function f 0 , and let F W be the Weibull distribution function with parameters η = (σ, ν) , with density function f W .Following this line of thought, let us define the minimum distance between the fitted model and the Weibull family as: for some distance or divergence between probability measures d(•, •).In order to use this criterion in practice, we need to specify a distance or divergence between distributions, and a threshold to identify when the two distributions are close.That is, we would like to define a threshold U (n, c, d) (which depends on the sample size n, the number of censored observations c, and the distance or divergence d) such that if the following inequality holds, then the sample and the model are classified as near-redundant (subject to stochastic error).
Regarding the choice of d(•, •), there exist a number of distances (or divergences) between distributions that could be employed [Gibbs and Su, 2002], but it is also desirable to use an interpretable distance.The threshold to identify problematic cases should also be linked to the sample size since, as the sample size grows, inferential problems also disappear or become less likely, given that the GH model has good asymptotic properties.Note that if the true value of the parameter ξ (see Theorem 1) is such that F ξ belongs to the Weibull family, then Consequently, a desirable property of the threshold Next, we present some specific choices for the distance d(•, •), and discuss some heuristic approaches to establish a threshold U (n, c, d) to identify near redundant cases.The performance of these criteria will be assessed later in the simulation study.
(a) The Kullback-Leibler (KL) divergence, where To establish a threshold to identify near redundant cases, we borrow some concepts from information theory.
In information theory, a quantity of interest, related to the KL divergence, is the KL minimax redundancy.Under our notation, this is defined as [Cover and Thomas, 2006] where F ξ is the parametric distribution associated with the baseline hazard.Several works in information theory have shown that, under regularity conditions, R * decreases in probability to zero at a rate k log(n)/(2n), where k is the number of parameters (see Acharya et al., 2012, and the references therein, for a general review on this kind of results and Barron and Hengartner [1998] for a theoretical treatment of this result).
Given that the sample may contain censored observations, instead of using the sample size n, we will use the effective sample size n e = n − ρc, where ρ ∈ (0, 1), which accounts for the loss of information due to censoring.Then, motivated by the KL minimax redundancy convergence rate, we propose the threshold U = M k log(n e )/(2n e ), where M > 0 is a positive constant specified by the user (arising from the convergence rate of the KL minimax redundancy, which is only defined up to a proportionality constant).Following this line of thought, we propose the following heuristic criterion to detect near-redundancy.We say that, if the following inequality holds, then the corresponding GH model (2) and the data are classified as near-redundant.In our applications, and based on our simulations, we will focus on the choice M = 0.05, but other choices are also possible.We also assume that k is the total number of parameters in the GH model (rather than just those in the baseline hazard), as the model parameters are estimated jointly.(b) The Hellinger distance, where We will now apply some results from Le Cam [1973] to establish a criterion to assess the closeness of F 0 and the Weibull family based on this distance.Following the results in Le Cam [1973], we cannot distinguish between two probability distributions F and G, without incurring in a maximal error κ ∈ (0, 1), if where n is the sample size (see Baraud [2021] for a discussion on this result).Recall that we are interested in distinguishing between F 0 and the closest F W , based on a sample of size n with n − c uncensored observations.Replacing the sample size n by the effective sample size n e and d by D in Le Cam's criterion (7), we obtain the following heuristic criterion to detect near-redundancy.We say that, if the following inequality holds then the corresponding GH model (2) and the data are classified as near-redundant.Equivalently, The parameter κ > 0 controls the maximal error in distinguishing the two probabilities.However, its interpretation as a maximal error is only approximate as the distribution F 0 is based on parameter estimates (from a censored sample) and we are considering the (minimum) distance to an entire family of distributions (Weibull).By fixing κ and for a fixed sample size, this inequality provides a heuristic criterion for identifying the closeness of the estimated model to the non-identifiable Weibull case.In our applications and simulations we will explore the use of κ = 0.05.
We note that these criteria impose different thresholds than the original criteria as we are using the effective sample size n e .Following the rule in Liu [2012], the conventional choice for ρ = 0.5, which we adopt in both criteria to define the effective sample size.The performance of criteria ( 6) and ( 8) under this choice of the values for M , κ, and ρ will be assessed in the simulation study.However, we emphasize that other choices are also possible.
Note also that the criteria ( 6) and ( 8) can be used to either classify near redundant models based on a specific value of D and n e , or to identify the minimum number of uncensored observations required to reduce problems with near-redundancy based on a specific value of D. Of course, other distances (such as the Total Variation or Wasserstein-d distances) could be used instead.However, this would require proposing appropriate thresholds to identify near-redundancy.
One of the limitations of criteria ( 6) and ( 8) is that they are based on point estimates of the parameters, and thus are prone to misclassification in cases where the estimates of the parameters exhibit large uncertainty (e.g. for small samples, high censoring rates, or short follow-up).In our applications in Section 5, we will also estimate the probability that inequalities ( 6) and ( 8) hold by using a nonparametric bootstrap sample of the maximum likelihood estimates (i.e.obtained by resampling with replacement).

Detecting Near-Redundancy: The Hessian Method
In addition to the distance-based criteria presented in the previous subsection, in our applications and simulations we will consider the Hessian matrix method [Gimenez et al., 2004, Little et al., 2010, Cole, 2020].This method consists of calculating the ratio of the smallest eigenvalue and the largest eigenvalue of the Fisher Information matrix or the Hessian (of the log-likelihood function) matrix evaluated at the maximum likelihood estimate (see Cole, 2020 for a detailed discussion on this criterion).For completeness, we briefly describe the steps in the Hessian method in Algorithm 1, which we quote verbatim from Cole [2020].
1: Find the Hessian matrix using a numerical method.
2: Find the eigenvalues of the Hessian matrix.
3: Standardise the eigenvalues by taking the modulus of the eigenvalues and dividing through by the largest eigenvalue.
4: The model with that data set is near-redundant if the smallest eigenvalue is less than 0.001.
Several authors have used the threshold 0.001 to investigate near-redundancy (see Cole, 2020, chapter 4 for a discussion).This threshold is based on the concept of "sloppiness" of a model (see Cole, 2020, chapter 3 for a discussion on this topic), and the threshold value proposed by Chis et al. [2016] to detect sloppiness.Other rules for selecting the threshold are discussed in Cole [2020], who also suggested that this threshold should depend on the model.We will explore the use of the Hessian method in our simulations and applications in sections 4 and 5.
Although the concepts of near-redundancy of parameters and practical non-identifiability are studied separately, in the context of the GH model there is an overlap between these two definitions.As shown in the following simulation study and applications, when the baseline hazard is sufficiently close to the Weibull family, the model is classified as near-redundant and we generally observe flat and multimodal profile likelihoods.Keeping this in mind, the following simulation study illustrates the link between these two definitions.

Simulation study
In this section we present a simulation study which aims at evaluating the performance of the distance-based methods proposed in section 3.3, as well as the Hessian method in Algorithm 1, for detecting near-redundancy in the GH model.Practical non-identifiability is diagnosed using the profile likelihood, as discussed in section 3.2.We evaluate the effect of the sample size and the censoring rate on the presence of near-redundancy and practical non-identifiability.This simulation study also aims at understanding the link between the presence of near-redundancy and practical non-identifiability problems.

Simulation scenarios
We present seven simulation scenarios in total.In each case, we simulate M = 250 samples of varying sizes n = 250, 500, 1000.For each scenario, we select administrative censoring times that produce censoring rates of approximately 30% and 50%.In addition, in all scenarios, we only include one covariate as this is the simplest structure where, if any inferential problems are detected, such problems could only become more severe with the inclusion of more covariates.This covariate is generated from the normal distribution with mean 0 and standard deviation 1.We employ the regression parameter values α = 1.5 and β = 2.5 for all scenarios.
In the first three scenarios, the times-to-event are simulated from the GH model ( 2) with PGW baseline hazard (denoted as PGW-GH) model using the probability integral transform [Rubio et al., 2019].In Scenario 1, we select parameter values that produce a unimodal baseline hazard function σ = 0.3, ν = 1.5, γ = 5.This represents a scenario where the baseline hazard shape cannot be captured by the Weibull baseline hazard, and thus no inferential problems would be expected.In Scenario 2, we select parameter values that produce an increasing baseline hazard function σ = 1.2, ν = 1.3, γ = 0.85.This hazard shape can indeed be obtained with a Weibull hazard, although the hazard tails between the PGW and Weibull distributions differ for γ = 1, and thus the model is theoretically identifiable.In Scenario 3, we select parameter values that produce a decreasing baseline hazard function σ = 0.1, ν = 0.9, γ = 4.This hazard shape can also be captured by the Weibull baseline hazard.In Scenario 4, the data are simulated from a GH model (2) with lognormal (LN) baseline hazard (denoted as LN-GH).The shape of the baseline hazard function is unimodal, and we select the parameters μ = 0 and σ = 1.5 (mean and standard deviation of the distribution on the log scale).This represents a hazard shape that cannot be captured by the Weibull hazard, and the baseline hazard is outside the PGW family.In Scenario 5, the data are simulated from a GH model ( 2) with Exponentiated Weibull (EW) baseline hazard (denoted as EW-GH).The shape of the baseline hazard function is increasing, and we select the parameters σ = 0.7, ν = 1.2, γ = 0.85.This represents a hazard shape that can be captured by the Weibull hazard, and the baseline hazard is outside the PGW family.For Scenarios 1-5, we fit the PGW-GH model.Thus, for Scenarios 1-3 we have correct model specification, while for Scenarios 4-5 there is model misspecification.However, the fitted model (PGW-GH) can capture the baseline hazard shapes from the true generating model in these scenarios.In Scenario 6, we use the data simulated in Scenario 2 from the PGW-GH model but we fit a GH model ( 2) with Exponentiated Weibull (EW) baseline hazard (denoted as EW-GH).In Scenario 7, we use the data simulated in Scenario 2 from the PGW-GH model but we fit a GH model ( 2) with Generalized Gamma (GG) baseline hazard (denoted as GG-GH).The aim of Scenarios 6-7 is to illustrate that the inferential problems (near-redundancy and practical non-identifiability) are present for models that contain the Weibull baseline hazard, and that these problems are not specific to the PGW-GH model.Figure 1 in the Supplementary Material shows the shape of all the baseline hazards used in this simulation study.
All models are fitted using the R package 'HazReg', which is available at https://github.com/FJRubio67/HazReg.The integral required for criteria (6) and ( 8) is calculated using the R command integrate.The profile likelihood functions are implemented directly and the optimization step is performed using the R command nlminb.
The Hessian matrix, required to implement the Hessian method, is calculated using the R package numDeriv.

Results
To evaluate the performance of the proposed distance-based criteria ( 6) and ( 8), we calculate the proportion of times that these criteria classify a simulated sample as a sample producing near-redundancy of parameters of the fitted models.For the KL divergence criterion (6) we use M = 0.05, and for the Hellinger distance criterion (8) we choose κ = 0.05.We have compared other values of M , and we found that this value produced good results in terms of accurate classification and indeed producing very similar results to those obtained with criterion (8).Thus, a sample is classified as a sample producing near redundancy if the inequalities in criteria ( 6) and/or (8) hold.We also employ the Hessian method in Algorithm 1 to identify near-redundancy.
An important aim of this simulation study is to compare the classification of samples producing near-redundancy against the classification of samples producing practical non-identifiability.To detect practical non-identifiability, we assess the flatness of the profile likelihoods of α and β.More specifically, we evaluate the profile likelihoods at α ± ∆ α and β ± ∆ β , with ∆ α = ∆ β = 3, and verify if these values are below 0.147 (the level used to construct 95% profile likelihood confidence intervals).The values of ∆ α and∆ β where chosen by visually inspecting several profile likelihood functions and identifying a suitable conservative length.
Tables 1-3 show the 2-way classification of samples producing "near-redundancy" (NR) and/or "practical nonidentifiability" (PNI) problems, using the aforementioned methods.The cases where no NR nor PNI problems are detected, using the aforementioned methods, will be referred to as "identifiable" (I).The reason why results are presented in 2-way tables follows the main aim of the simulation study, which is about connecting and comparing the presence of the two inferential problems of interest (NR and PNI).Note that the true classification of NR and PNI cases is not known beforehand.Thus, the results presented in this section should be read considering the profile method (used to classify PNI problems) as a reference, as this method is based on directly detecting flat ridges in the profile likelihood.
For Scenario 1 (Table 1), corresponding to a baseline hazard with unimodal shape that cannot be captured by the Weibull hazard function, we can see that very few samples induce inferential problems, and that there is an interplay between censoring and sample size that produces problematic samples.As expected, the larger the sample or the lower the censoring rate, the fewer problematic samples are observed.For Scenario 2 (Table 2), we observe a very large proportion of samples producing near-redundancy and practical non-identifiability, and that the proportion of samples without inferential problems very slowly decreases as the sample size grows or the censoring rate decreases.This is a challenging scenario where it is very difficult to distinguish the fitted model from the nested Weibull (non-identifiable) model.We also notice that there is a non-negligible proportion of cases that are classified as identifiable using criteria (6) and/or (8).This indicates that the proposed distance-based methods for detecting near redundancy may not necessarily be able to detect practical non-identifiability problems in all samples.That is, these methods are based on point estimates of the parameters, which can be inaccurate (e.g.γ far from 1) for some samples, provide no indication of near-redundancy.On the other hand, the profile likelihood is able to detect practical non-identifiability problems as it takes the (large) uncertainty in the estimation of the parameters into account.Another interesting behaviour is that 30% censoring rate seems to produce more samples that induce inferential problems than 50% censoring rate, which sounds counter-intuitive.The reason for this is that the estimates of γ have much more variability when the censoring rate is 50%, so some estimates of this parameter are further away from γ = 1, but also far way from the true value of the parameter (see Tables 3-4 in the Supplementary Material).This scenario shows that not even samples of size n = 1000 are enough to substantially reduce the presence of inferential issues.In this case, fitting a simpler, identifiable, model would be an alternative.For instance, fitting an AFT model with Weibull baseline hazard, and comparing those models using a model selection tool.We will explore this idea in the applications presented in section 5.For Scenario 3 (Table 3), we notice a much lower proportion of problematic samples (compared to Scenario 2), and a clear reduction in the proportion of inferential problems when the sample size grows or the censoring rate decreases.This is in line with the intuition that suggests that the larger the sample, or the lower censoring rate, the more information is contained in the data to estimate the model parameters (and thus being able to distinguish the fitted model from the nested non-identifiable model).Tables 1-6 in the Supplementary Material show summaries of the MLEs for Scenarios 1-3 that complement the information and comments provided here.Interestingly, in Scenarios 1-3 the Hessian method produces very similar performance to the distance-based methods.This is both, reassuring about the results obtained with the distance-based methods, and a favorable outcome.Indeed, rather than considering distance-based methods and the Hessian method as competitors, they represent tractable and interpretable tools to identify near-redundancy problems, which may be jointly used.
Results for Scenario 4 are presented in Tables 7-9 in the Supplementary Material.We observe a similar performance of the distance-based measures and the profile likelihood function as in Scenario 1, despite model misspecification.
The reason for this is that the baseline hazard is also unimodal, even though it is not PGW, a hazard shape that can be closely captured by the PGW distribution.Results for Scenario 5 are presented in Tables 10-12 in the Supplementary Material.We observe a similar performance of the distance-based measures and the profile likelihood function as in Scenario 2. Again, the reason for this is that the baseline hazard is also increasing and close to the Weibull distribution, even though it is not PGW.Results for Scenario 6 and 7 are presented in Tables 13-15 and 16-18, respectively, in the Supplementary Material.In both scenarios, we observe high proportions of cases with near-redundancy and practical non-identifiability.MLEs for both scenarios exhibit high bias and large variability.These scenarios show that even under model misspecification, if the fitted model has a baseline hazard that cannot be distinguished from the Weibull hazard, it is possible to observe problems of near-redundancy and practical non-identifiability.
We have found that the proposed distance-based methods and the Hessian method have a comparable performance in classifying samples into NR and I cases.Moreover, Tables 1-3 indicate a considerable agreement between the classification of "PNI vs. NR" and "I vs. I" cases.Consequently, our results reveal a link between the presence (or absence) of practical non-identifiability and near-redundancy problems in the GH model.However, we have also found cases where the true model is very close to the non-identifiable model (Scenario 2), and the profile likelihood contains flat ridges, but the distance-based methods and the Hessian method do not classify such samples as NR.This indicates that the two concepts (PNI and NR) are not equivalent, at least with the detection methods studied here.Indeed, in Scenario 2, the Hessian method leads to a slightly larger overlap in the classification of "PNI vs. NR" cases than that obtained with distance-based methods.Apparently, a reason for this difference is that the Hessian method incorporates information about the curvature of the likelihood function (Hessian matrix) at the MLE, while distance-based methods only utilize the parameter values that maximize the likelihood function (i.e. the MLE).Moreover, distance-based methods and the Hessian method do not account for uncertainty on the parameters as they are based on point estimates (which might be biased).In our applications in section 5, we will explore the idea of incorporating uncertainty about the estimation of the parameters using nonparametric bootstrap.Overall, we recommend combining the distance-based methods and the Hessian method with the inspection of the profile likelihood curves to aid the detection of inferential problems.

Applications
This section presents two real-data examples that illustrate the use of the proposed methodology for detecting nearredundancy and practical non-identifiability of parameters.We apply the distance-based methods and the Hessian method presented in section 3, as well as the evaluation of the profile likelihood function.In the first example, we illustrate a case where the GH model exhibits near-redundancy and practical non-identifiability of parameters due to the closeness of the baseline hazard to the Weibull family.In the second example we present a case without near-redundancy and where the fitted model can be easily distinguished from the Weibull family.The code and data for these examples are available at https://github.com/FJRubio67/NRPNISurv

Case study I: Lung cancer data
We analyze survival data of patients with advanced lung cancer from the North Central Cancer Treatment Group.The data set was obtained from the survival R package, and contains information about n = 227 patients.For each patient, the following variables were recorded: survival time in days (converted to years for our analysis), vital status, age in years (standardized for our analysis), gender and ECOG performance score.The 25%, 50% and 75% quantiles of the patients' survival time were 0.456, 0.699, and 1.084 years.Among the patients, 63 had an ECOG performance score of 0, 113 had an ECOG performance score of 1, 50 had an ECOG performance score of 2, and 1 had an ECOG performance score of 3. Female patients corresponded to 39% of the total sample.
To analyze this data set, we fit the GH model (2) with PGW baseline hazard (denoted as PGW-GH) with time-level covariate age, and hazard-level covariates age, sex and ph.ecog (ECOG performance score).The maximum likelihood estimates of the parameters of this model are shown in the Table 4.In this case γ = 0.861, which indicates that the fitted model is not a (theoretically) parameter-redundant model (i.e.γ = 1) at this parameter value.The relative profile likelihoods of the parameters are presented in Figure 1.In these plots we can notice that the maximum likelihood estimates exist, in the sense that an overall maximum of the likelihood function is attained.However, we can also notice a number of issues with the profile likelihoods.Regarding the estimation of the parameters of the baseline hazard, the profile likelihood of σ has a second mode at a value σ < 1, and an inflection point around σ ≈ 1.The profile likelihood of ν is unimodal, but it contains a point at ν ≈ 1 where the derivative changes abruptly.From the profile likelihood  of γ, we notice the presence of local maxima, together with an inflection point at γ = 1.We emphasize that these problems are not related to numerical issues [Cole, 2019], but to changes in the direction of the profile likelihood at specific values of the parameters associated with the nested non-identifiable model (γ = 1).For the parameters α 1 and β 1 , the relative profile likelihood is flat in both directions, indicating problems of practical non-identifiability of these parameters.The parameters α 1 and β 1 are indeed associated with the same covariate (age).Therefore, even though the maximum likelihood estimator of γ is not exactly one, flat and multimodal relative profile likelihoods are obtained for the parameters associated with this covariate.In contrast, the profile likelihoods for the parameters β 2 and β 3 are unimodal and concentrated around their maximum.
We now look at the proposed distance-based criteria.The minimum KL divergence between the fitted PGW baseline hazard and the Weibull family is equal to 0.00036, while the upper bound, with M = 0.05, in criterion ( 6) is 0.033, and the effective sample size in 195.5.Thus, the KL divergence criterion (6) suggests near-redundancy of parameters of the PGW-GH model.Also, the minimum Hellinger distance between the fitted PGW baseline hazard and the Weibull family is equal to 0.0096.Thus, the Hellinger distance criterion (8), with κ = 0.05, also indicates near-redundancy of the parameters of the PGW-GH model.To incorporate the uncertainty in the estimation of the parameters into the distance based criteria ( 6) and ( 8), we apply these criteria to B = 1000 bootstrap samples of the MLEs (i.e.obtained via re-sampling with replacement).We estimate the probability of these criteria by taking the proportion of times the inequalities ( 6) and ( 8) hold.We found that the KL divergence criterion (6) indicates near-redundancy with probability 0.997, while the Hellinger criterion (8) indicates near-redundancy with probability 0.92.Thus, there is a high probability that ( 6) and (8) hold, suggesting the presence of near-redundancy of parameters.On the other hand, the smallest standardized eigenvalue of the Hessian Matrix of the log-likelihood function is 6.5 × 10 −5 .Consequently, the Hessian method (with 0.001 threshold) indicates near-redundancy of parameters of this model and data.The bootstrap probability of the Hessian method with 0.001 threshold is 1, providing further evidence of near-redundancy.
To complete our analysis, we consider alternative models: the GH model (2) with exponentiated Weibull (EW) baseline hazard (denoted as EW-GH), which represents a competitor of the PGW-GH model in terms of hazard structure and flexibility, a PH model with PGW baseline hazard (PGW-PH), an AFT model with PGW baseline hazard (PGW-AFT), and an AFT model with Weibull baseline hazard (W-AFT).The EW distribution, like the PGW, is equal to a Weibull distribution if the shape parameter γ is equal to one (see Supplementary Material).The maximum likelihood estimates of the parameters of these models are also shown in Table 4.We notice that the MLE of γ in the EW model and the XXXX A PREPRINT PGW model are all different from one.From Table 4, the W-AFT model is favored by the Akaike information criterion (AIC), followed by the PGW-PH model, showing that indeed simpler models are favored by the data.Although AIC is not a method to detect near-redundancy, we can see that model selection techniques can also be useful to identify simpler models without these inferential problems.

Case study II: Leukemia data
We now analyze the LeukSurv data set from the spaBayesSurv R package.This data set contains information about n = 1043 patients with acute myeloid leukemia.For each case, the following variables were recorded: survival time in days (converted to years for our analysis), vital status, age in years (standardized for our analysis), gender, white blood cell count (standardized) at diagnosis and Townsend score (standardized).The 25%, 50% and 75% quantiles of the patients' survival time were 0.112, 0.507 and 1.468 years.Female patients corresponded to 48% of the sample.
We fit the GH model (2) with PGW baseline hazard (denoted as PGW-GH) with the time-level covariates age, wbc (white blood cell count at diagnosis) and tpi (Townsend score, which is a measure of deprivation), and hazard level covariates age, sex, wbc and tpi.Table 5 shows the values of the maximum likelihood estimates of the parameters of this model.We first notice that the estimate of the parameter γ is far from one but, in order to discard inferential issues, we need to inspect the different criteria.In this line, Figure 2 shows the relative profile likelihood functions of the parameters of the PGW-GH model.Indeed, in all cases we notice that the profile likelihood functions are unimodal and seem to quickly decrease to zero.Thus, in this example we do not observe problems with multimodality or flat ridges of the profile likelihoods.Consequently, there is no evidence of practical non-identifiability for this model and this data set.Now, the minimum KL divergence between the fitted PGW baseline hazard and the Weibull family is equal to 0.056, while the upper bound in criterion ( 6) is 0.0178, and the effective sample size in 961.Thus, the KL divergence criterion (6), with M = 0.05, does not indicate near-redundancy of parameters of the PGW-GH model.Also, the minimum Hellinger distance between the fitted PGW baseline hazard and the Weibull family is equal to 0.101.Thus, the Hellinger distance criterion (8) with κ = 0.05 does not indicate near-redundancy of the parameters of the PGW-GH model.The estimated probability (based on B = 1000 bootstrap samples) that criteria ( 6) and ( 8) hold is zero.On the other hand, the smallest standardized eigenvalue of the Hessian Matrix of the log-likelihood function is 0.0035.Consequently, the Hessian method (with threshold 0.001) does not indicate near-redundancy of parameters of this model and data.The bootstrap probability of the Hessian method with 0.001 threshold is 0. Thus, the Hessian method does not indicate near-redundancy of parameters of this model and data, even after accounting for uncertainty.
For comparison, we now fit the GH model (2) with Exponentiated Weibull baseline hazard (EW-GH), as an alternative flexible model as well as simpler identifiable models such as the AFT model with PGW baseline hazard (PGW-AFT), PH model with PGW baseline hazard (PGW-PH), and the AFT model with Weibull baseline hazard (W-AFT).Table 5 shows the MLEs and Akaike information criterion (AIC) for the different models.The AIC favors the EW-GH model overall, followed by the PGW-GH, respectively.This indicates that, although the different distributions can capture similar shapes, they are theoretically different and one of them may offer a better fit for a specific data set.

Discussion
Parametric models have regained popularity in survival analysis thanks to the availability of richer formulations of survival regression models that allow for capturing complex effects, such as time-level, hazard-level, and nonlinear effects of the covariates (see Eletti et al., 2022 for a general overview).However, an increase in model complexity also carries an inferential cost, which may be reflected, in mild scenarios, as wider confidence intervals or, in more extreme scenarios, in flat or multimodal likelihood functions.These types of inferential problems have been reported for the GH model ( 2), and we have shown that such problems can be classified through the concepts of near-redundancy of parameters and practical non-identifiability.More specifically, we have shown that near-redundancy and practical non-identifiability problems appear in the GH model when the baseline hazard of the fitted model is close to the Weibull family.
We have introduced a method for detecting near-redundancy in GH regression models based on the distance of the fitted model to a nested non-identifiable model.In addition, we have explored the use of the Hessian method to detect near-redundancy, while practical non-identifiability has been studied using of the profile likelihood function.We have also introduced the use of bootstrap methods to incorporate the uncertainty on the parameter estimation into the methods for detecting near-redundancy.In practice, detecting near-redundancy and practical non-identifiability is useful to understand the limitations in fitting a more complex model.For instance, the use of asymptotic results (such as standard errors and normal confidence intervals) may not be valid or may produce inaccurate results in the presence of these problems.Moreover, the presence of these inferential problems may put into question the use of such complex models for important tasks such as prediction, policy or decision making.
The simulation study shows that the distance-based methods and the Hessian method have comparable performance in detecting near-redundancy.Moreover, comparing the cases classified as near-redundant with those cases classified as practically non-identifiable reveals a strong connection between the presence of these two inferential problems in the GH model.The simulation results also show that the interplay of model complexity, sample size, and censoring rate plays an important role in the appearance of near-redundancy and practical non-identifiability in the GH model.In problematic samples in our simulation study and case studies (i.e.leading to near-redundancy and/or practical non-identifiability), we noticed that the profile likelihood may not only contain flat ridges, but also local maxima and abrupt changes in the derivative of the profile likelihood.
Whenever a case with near-redundancy or practical non-identifiability is obtained, a possible alternative is to use a simpler identifiable model that can be written in terms of a smaller set of parameters (i.e.not containing a parameterredundant model).These simpler models include the PH and AFT models.Indeed, recent references [Simpson et al., 2022] have suggested the use of model selection aided by information about practical non-identifiability of the models under comparison, and that more complex models should be used when there is enough data to estimate the model parameters.However, there are other possible solutions, such as increasing the effective sample size (either by adding new samples or increasing the follow-up time).In summary, the three possible solutions are: reducing censoring, increasing the sample size, or using a simpler identifiable model.We also recommend looking for other problems in the likelihood function, beyond flat ridges.Indeed, the presence of flat ridges also depends on the parameterization, as one parameterization may lead to flat ridges while another parameterization of the same model may lead to a different behavior.
Although we have focused on the study of right-censoring (the most common type of censoring), our conclusions are likely to apply to samples with left-censoring, interval-censoring, and truncation.Other potential extensions of our work consist of analyzing near-redundancy and practical non-identifiability in other types of flexible parametric survival models, including generalized additive models [Eletti et al., 2022], other general classes of survival models [Muse et al., 2022], cure models [Hanin and Huang, 2014], or distributional-regression models [Rigby andStasinopoulos, 2005, Burke et al., 2020].The fact that simpler models represent an alternative to avoid problematic inferential cases, points to the need for developing formal model selection tools in the context of the GH model (see Rossell and Rubio, 2023 for a general overview on model and variable selection in survival models).Finally, other areas in Statistics where flat likelihoods appear, such as inference in circular models [Miyata et al., 2022, Johnson, 2022], may also benefit from linking inferential problems in those models with the concepts of near-redundancy and practical non-identifiability.
Simulation study: Scenario 1. Classification of samples into Identifiable (I), Practical non-identifiable (PNI), and Near-redundant (NR).The rows represent the classification of NR and I cases, and the columns represent the classification of PNI and I cases.H indicates the NR classification based on the Hellinger distance, KL denotes the NR classification based on the Kullback-Leibler divergence, and Hessian denotes the NR classification based on the Hessian method.
(I), Practical non-identifiable (PNI), and Near-redundant (NR).The rows represent the classification of NR and I cases, and the columns represent the classification of PNI and I cases.H indicates the NR classification based on the Hellinger distance, KL denotes the NR classification based on the Kullback-Leibler divergence, and Hessian denotes the NR classification based on the Hessian method.
study: Scenario 3. Two-way classification of samples into Identifiable (I), Practical non-identifiable (PNI), and Near-redundant (NR).The rows represent the classification of NR and I cases, and the columns represent the classification of PNI and I cases.H indicates the NR classification based on the Hellinger distance, KL denotes the NR classification based on the Kullback-Leibler divergence, and Hessian denotes the NR classification based on the Hessian method.

Figure 1 :
Figure 1: Lung cancer data.Profile likelihood plots for the parameters of the PGW-GH model.

Figure 2 :
Figure 2: Leukemia cancer data.Profile likelihood plots for the parameters of the PGW-GH model.

Table 2 :
Simulation study: Scenario 2. Classification of samples into Identifiable

Table 4 :
Lung cancer data.Maximum likelihood estimation summary.

Table 5 :
Leukemia data.Maximum likelihood estimation summary.