Most Likely Transformations
Abstract
We propose and study properties of maximum likelihood estimators in the class of conditional transformation models. Based on a suitable explicit parameterization of the unconditional or conditional transformation function, we establish a cascade of increasingly complex transformation models that can be estimated, compared and analysed in the maximum likelihood framework. Models for the unconditional or conditional distribution function of any univariate response variable can be set up and estimated in the same theoretical and computational framework simply by choosing an appropriate transformation function and parameterization thereof. The ability to evaluate the distribution function directly allows us to estimate models based on the exact likelihood, especially in the presence of random censoring or truncation. For discrete and continuous responses, we establish the asymptotic normality of the proposed estimators. A reference software implementation of maximum likelihood‐based estimation for conditional transformation models that allows the same flexibility as the theory developed here was employed to illustrate the wide range of possible applications.
1 Introduction
In a broad sense, we can understand all statistical models as models of distributions or certain characteristics thereof, especially the mean. All distributions for at least ordered responses Y can be characterized by their distribution, quantile, density, odds, hazard or cumulative hazard functions. In a fully parametric setting, all these functions are specified up to unknown parameters, and the ease of interpretation can guide us in choosing the appropriate function to look at. In the semi-parametric and non-parametric contexts, however, the question arises as to how we can obtain an estimate of one of these functions without assuming much about their shape. For the direct estimation of distribution functions, we deal with monotonic functions in the unit interval, whereas for densities, we need to make sure that the estimator integrates to one. The hazard function comes with a positivity constraint, and monotonicity is required for the positive cumulative hazard function. These computationally inconvenient restrictions disappear completely only when the log-hazard function is estimated, and this explains the plethora of research papers following this path. However, the lack of any structure in the log-hazard function comes at a price. Too-erratic behaviour of estimates of the log-hazard function has to be prevented by some smoothness constraint; this makes classical likelihood inference impossible. The novel characterization and subsequent estimation of distributions via their transformation function in a broad class of transformation models that is developed in this paper can be interpreted as a compromise between structure (monotonicity) and ease of parameterization, estimation and inference. This transformation approach to modelling and estimation allows standard likelihood inference in a large class of models that have so far commonly been dealt with by other inference procedures.
Since the introduction of transformation models based on non-linear transformations of some response variable by Box & Cox (1964), this attractive class of models has received much interest. In regression problems, transformation models can be understood as models for the conditional distribution function and are sometimes referred to as 'distribution regression', in contrast to their 'quantile regression' counterpart (Chernozhukov et al., 2013). Traditionally, the models were actively studied and applied in the analysis of ordered categorical or censored responses. Recently, transformation models for the direct estimation of conditional distribution functions for arbitrary responses received interest in the context of counterfactual distributions (Chernozhukov et al., 2013), probabilistic forecasting (Gneiting & Katzfuss, 2014), distribution and quantile regression (Leorato & Peracchi, 2015; Rothe & Wied, 2013), probabilistic index models (Thas et al., 2012) and conditional transformation models (Hothorn et al., 2014). The core idea of any transformation model is the application of a strictly monotonic transformation function h for the reformulation of an unknown distribution function FY as FY(y) = FZ(h(y)), where the unknown transformation function h is estimated from the data. Transformation models have received attention especially in situations where the likelihood contains terms involving the conditional distribution function FY(y|x) = FZ(h(y|x)) with inverse link function FZ, most importantly for censored, truncated and ordered categorical responses. For partially linear transformation models with transformation function h(y|x) = hY(y) + hx(x), much emphasis has been given to estimation procedures treating the baseline transformation hY (e.g. the log-cumulative baseline hazard function in the Cox model) as a high-dimensional nuisance parameter. Prominent members of these estimation procedures are the partial likelihood estimator and approaches influenced by the estimation equations introduced by Cheng et al. (1995). Once an estimate for the shift hx is obtained, the baseline transformation hY is then typically estimated by the non-parametric maximum likelihood estimator (see, e.g. Cheng et al., 1997). An overview of the extensive literature on the simultaneous non-parametric maximum likelihood estimation of hY and hx, that is, estimation procedures not requiring an explicit parameterization of hY, for censored continuous responses is given in Zeng & Lin (2007).
An explicit parameterization of hY is common in models of ordinal responses (Tutz, 2012). For survival times, Kooperberg et al. (1995) introduced a cubic spline parameterization of the log-conditional hazard function with the possibility of response-varying effects and estimated the corresponding models by maximum likelihood. Crowther & Lambert (2014) followed up on this suggestion and used restricted cubic splines. Many authors studied penalized likelihood approaches for spline approximations of the baseline hazard function in a Cox model, for example, Ma et al. (2014). Less frequently, the transformation function hY was modelled directly. Mallick & Walker (2003), Chang et al. (2005) and McLain & Ghosh (2013) used Bernstein polynomials for hY, and Royston & Parmar (2002) proposed a maximum likelihood approach using cubic splines for modelling hY and also time-varying effects. The connection between these different transformation models is difficult to see because most authors present their models in the relatively narrow contexts of survival or ordinal data. The lack of a general understanding of transformation models made the development of novel approaches in this model class burdensome. Hothorn et al. (2014) decoupled the parameterization of the conditional transformation function h(y|x) from the estimation procedure and showed that many interesting and novel models can be understood as transformation models. The boosting-based optimization of proper scoring rules, however, was only developed for uncensored and right-censored observations in the absence of truncation and requires the numerical approximation of the true target function. In a similar spirit, Chernozhukov et al. (2013) applied the connection FY|X=x(y) = FZ(h(y|x)) for estimation in the response-varying effects transformation model FY|X=x(y) = FZ(hY(y) − x̃⊤β(y)); this approach can be traced back to Foresi & Peracchi (1995).
A drawback of all but the simplest transformation models is the lack of a likelihood estimation procedure. Furthermore, although important connections to other models have been known for some time (Doksum & Gasko, 1990), it is often not easy to see how broad and powerful the class of transformation models actually is. We address these issues and embed the estimation of unconditional and conditional distribution functions of arbitrary univariate random variables under all forms of random censoring and truncation into a common theoretical and computational likelihood-based framework. In a nutshell, we show in Section 2 that all distributions can be generated by a strictly monotonic transformation of some absolutely continuous random variable. The likelihood function of the transformed variable can then be characterized by this transformation function. The parameters of appropriate parameterizations of the transformation function, and thus the parameters of the conditional distribution function in which we are interested, can then be estimated by maximum likelihood under simple linear constraints that allow classical asymptotic likelihood inference, as will be shown in Section 3. Many classical and contemporary models are introduced as special cases of this framework. In particular, all transformation models sketched in this introduction can be understood and estimated in this novel likelihood-based framework. Extensions of classical and contemporary transformation models as well as some novel models are derived from our unified theoretical framework of transformation functions in Section 4, and their empirical performance is illustrated and evaluated in Section 5.
2 The likelihood of transformations
Let (Ω, 𝒜, ℙ) denote a probability space and (Ξ, 𝒞) a measurable space with at least ordered sample space Ξ. We are interested in inference about the distribution PY of a random variable Y, that is, the probability space (Ξ, 𝒞, PY) defined by the (𝒜, 𝒞) measurable function Y : Ω → Ξ. For the sake of notational simplicity, we present our results for the unconditional case first; regression models are discussed in Section 4.2. The distribution PY is dominated by some measure μ and characterized by its density function fY, distribution function FY(y) = ℙ(Y ⩽ y), quantile function F⁻¹Y(p) = inf{y ∈ Ξ : FY(y) ⩾ p}, odds function OY(y) = FY(y)/(1 − FY(y)), hazard function λY(y) = fY(y)/(1 − FY(y)) or cumulative hazard function ΛY(y) = −log(1 − FY(y)). For notational convenience, we assume strict monotonicity of FY, that is, FY(y1) < FY(y2) ∀y1 < y2 ∈ Ξ. Our aim is to obtain an estimate F̂Y of the distribution function FY from a random sample Y1,…,YN of independent observations from PY. In the following, we will show that one can always write this potentially complex distribution function FY as the composition of a much simpler and a priori specified distribution function FZ and a strictly monotonic transformation function h. The task of estimating FY is then reduced to obtaining an estimate ĥ. The latter exercise, as we will show in this paper, is technically and conceptually attractive.
Let (ℝ, 𝔅) denote the Euclidean space with Borel σ-algebra and Z : Ω → ℝ an (𝒜, 𝔅) measurable function such that the distribution PZ is absolutely continuous (μL denotes the Lebesgue measure) in the probability space (ℝ, 𝔅, PZ). Let FZ and F⁻¹Z denote the corresponding distribution and quantile functions. We furthermore assume 0 < FZ(z) < 1 and 0 < fZ(z) < ∞ for all z ∈ ℝ for a log-concave density fZ, as well as the existence of the first two derivatives of the density fZ(z) with respect to z; both derivatives shall be bounded. We do not allow any unknown parameters for this distribution. Possible choices include the standard normal, standard logistic (SL) and minimum extreme value (MEV) distributions with distribution functions FZ(z) = Φ(z), FZ(z) = FSL(z) = (1 + exp(−z))⁻¹ and FZ(z) = FMEV(z) = 1 − exp(−exp(z)), respectively. In the first step, we will show that there always exists a unique and strictly monotonic transformation g such that the unknown and potentially complex distribution PY that we are interested in can be generated from the simple and known distribution PZ via Y = g(Z). More formally, let g : ℝ → Ξ denote a (𝔅, 𝒞) measurable function. The composition g∘Z is a random variable on (Ξ, 𝒞, PY). We can now formulate the existence and uniqueness of g as a corollary to the probability integral transform.
Corollary 1. For all random variables Y and Z, there exists a unique strictly monotonically increasing transformation g such that Y and g(Z) have the same distribution.
Proof. Let U := FZ(Z) ∼ U[0,1] and g := F⁻¹Y ∘ FZ. Then g(Z) = F⁻¹Y(U) ∼ PY by the probability integral transform. Let h := F⁻¹Z ∘ FY, such that FY(y) = FZ(h(y)). From the strict monotonicity of FZ, we get the uniqueness of h and therefore g. The quantile function F⁻¹Z and the distribution function FY exist by assumption and are both strictly monotonic and right continuous. Therefore, h is strictly monotonic and right continuous and so is g.
Corollary 2. For μ = μL, we have g = h⁻¹ and fY(y) = fZ(h(y)) h′(y).
This result for absolutely continuous random variables Y can be found in many textbooks (e.g. Lindsey, 1996); Corollary 1 also covers the discrete case.
Corollary 3. For the counting measure μ = μC, the transformation function h = F⁻¹Z ∘ FY is a right-continuous step function because FY is a right-continuous step function with steps at y ∈ Ξ.
Let H denote the space of all strictly monotonic transformation functions. With the transformation function h, we can evaluate FY as FY(y|h) = FZ(h(y)) ∀y ∈ Ξ. Therefore, we only need to study the transformation function h; the inverse transformation g = h⁻¹ (used by Bickel et al., 1993, to define a 'group model') is not necessary in what follows. The density for absolutely continuous variables Y (μ = μL) is now given by fY(y|h) = fZ(h(y)) h′(y). For discrete responses Y (μ = μC) with finite sample space Ξ = {y1,…,yK}, the density is fY(yk|h) = FZ(h(yk)) − FZ(h(yk−1)), k = 1,…,K. With the conventions h(y0) := F⁻¹Z(0) = −∞ and h(yK) := F⁻¹Z(1) = ∞, we use the more compact notation fY(yk|h) = FZ(h(yk)) − FZ(h(yk−1)) in the sequel.
For a given transformation function h, the likelihood contribution of a datum C = (y_, ȳ] ⊂ Ξ is defined in terms of the distribution function (Lindsey, 1996): L(h | Y ∈ C) := ∫_C fY(y|h) dμ(y) = FZ(h(ȳ)) − FZ(h(y_)). This 'exact' definition of the likelihood also covers discrete responses, with ȳ = yk and y_ = yk−1, such that L(h | Y ∈ C) = fY(yk|h).
For absolutely continuous random variables Y, we almost always observe an imprecise datum C = (y_, ȳ] ⊂ Ξ and, for short intervals (y_, ȳ], approximate the exact likelihood L(h | Y ∈ C) = FZ(h(ȳ)) − FZ(h(y_)) by the term (ȳ − y_) fY(y|h) or simply fY(y|h) with y = (y_ + ȳ)/2 (Lindsey, 1999). This approximation only works for relatively precise measurements, that is, short intervals. If longer intervals are observed, one speaks of 'censoring' and relies on the exact definition of the likelihood contribution instead of using the aforementioned approximation (Klein & Moeschberger, 2003). In summary, the likelihood contribution of a conceptually 'exact continuous' or left-censored, right-censored or interval-censored continuous or discrete observation C is given by

L(h | Y ∈ C) ≈ fZ(h(y)) h′(y) for C = {y} ('exact continuous'),
L(h | Y ∈ C) = FZ(h(ȳ)) for C = (−∞, ȳ] (left censored),
L(h | Y ∈ C) = 1 − FZ(h(y_)) for C = (y_, ∞) (right censored),
L(h | Y ∈ C) = FZ(h(ȳ)) − FZ(h(y_)) for C = (y_, ȳ] (interval censored).
The likelihood contribution L(h | Y = yk) of an ordered factor in category yk is equivalent to the term FZ(h(yk)) − FZ(h(yk−1)) contributed by an interval-censored observation (y_, ȳ], when category yk is defined by the interval (y_, ȳ] = (yk−1, yk]. Thus, the expression FZ(h(ȳ)) − FZ(h(y_)) for the likelihood contribution reflects the equivalence of interval censoring and categorization at corresponding cut-off points.
For truncated observations in the interval (yl, yr] ⊂ Ξ, the aforementioned likelihood contribution is defined in terms of the distribution function conditional on the truncation, FY(y | Y ∈ (yl, yr]) = (FZ(h(y)) − FZ(h(yl)))/(FZ(h(yr)) − FZ(h(yl))) ∀y ∈ (yl, yr], such that the likelihood contribution of an observation C ⊆ (yl, yr] under truncation is L(h | Y ∈ C)/(FZ(h(yr)) − FZ(h(yl))).
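These likelihood contributions are elementary to compute once h and FZ can be evaluated. The following base R sketch illustrates this for a single observation; the helper lik_contrib and its arguments are our own illustration, not part of any package.

```r
## Likelihood contribution of a single censored, truncated or 'exact'
## observation; lik_contrib and its arguments are hypothetical helpers.
lik_contrib <- function(h, FZ, exact = NULL, left = -Inf, right = Inf,
                        trunc = NULL, fZ = NULL, hprime = NULL) {
  if (!is.null(exact)) {                  ## 'exact continuous' approximation
    L <- fZ(h(exact)) * hprime(exact)
  } else {                                ## left/right/interval censoring:
    L <- FZ(h(right)) - FZ(h(left))       ## FZ(h(-Inf)) = 0, FZ(h(Inf)) = 1
  }
  if (is.null(trunc)) return(L)           ## truncation to (trunc[1], trunc[2]]
  L / (FZ(h(trunc[2])) - FZ(h(trunc[1])))
}
## toy example: standard logistic FZ, h(y) = log(y), right censoring at y = 2
lik_contrib(h = log, FZ = plogis, left = 2)               ## 1 - FZ(log(2))
## an interval-censored datum (2, 8] under truncation to (1, 10]
lik_contrib(h = log, FZ = plogis, left = 2, right = 8, trunc = c(1, 10))
```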
Definition 1. (Most likely transformation) Let C1,…,CN denote an independent sample of possibly randomly censored or truncated observations from PY. The estimator

ĥN := argmax_{h̃ ∈ H} Σ_{i=1}^N log L(h̃ | Y ∈ Ci)

is called the most likely transformation. Log-concavity of fZ ensures concavity of the log-likelihood (except when all observations are right censored) and thus ensures the existence and uniqueness of ĥN.
Many distributions are defined by a transformation function h, for example, the Box–Cox power exponential family (Stasinopoulos & Rigby, 2007), the sinh‐arcsinh distributions (Jones & Pewsey, 2009) or the T‐X family of distributions (Alzaatreh et al., 2013). In what follows, we do not assume any specific form of the transformation function but parameterize h in terms of basis functions. We now introduce such a parameterization, a corresponding family of distributions, a maximum likelihood estimator and a large class of models for unconditional and conditional distributions.
3 Transformation analysis
We parameterize the transformation function h(y) as a linear function of its basis-transformed argument y using a basis function a : Ξ → ℝ^P, such that h(y) = a(y)⊤ϑ, ϑ ∈ ℝ^P. The choice of the basis function a is problem specific and will be discussed in Section 4. The likelihood L only requires evaluation of h; only the approximation of the likelihood using the Lebesgue density of 'exact continuous' observations makes the evaluation of the first derivative of h(y) with respect to y necessary. In this case, the derivative with respect to y is given by h′(y) = a′(y)⊤ϑ, and we assume that a′ is available. In the following, we will write h = a⊤ϑ and h′ = a′⊤ϑ for the transformation function and its first derivative, omitting the argument y, and we assume that both functions are bounded away from −∞ and ∞. For a specific choice of FZ and a, the transformation family of distributions consists of all distributions PY whose distribution function FY is given as the composition FZ∘a⊤ϑ; this family can be formally defined as follows.
Definition 2. (Transformation family) The distribution family PY,Θ := {PY,ϑ : ϑ ∈ Θ} with parameter space Θ ⊆ ℝ^P is called the transformation family of distributions PY,ϑ with transformation functions a⊤ϑ ∈ H, μ-densities fY(y|ϑ), y ∈ Ξ, and error distribution function FZ.
The classical definition of a transformation family relies on the idea of invariant distributions, that is, only the parameters of a distribution are changed by a transformation function but the distribution itself is not changed. The normal family characterized by affine transformations is the most well‐known example (e.g. Fraser, 1968; Lindsey, 1996). Here, we explicitly allow and encourage transformation functions that change the shape of the distribution. The transformation function a⊤ϑ is, at least in principle, flexible enough to generate any distribution function FY=FZ∘a⊤ϑ from the distribution function FZ. We borrow the term ‘error distribution’ function for FZ from Fraser (1968), because Z can be understood as an error term in some of the models discussed in Section 4. The problem of estimating the unknown transformation function h, and thus the unknown distribution function FY, reduces to the problem of estimating the parameter vector ϑ through maximization of the likelihood function. We assume that the basis function a is such that the parameters ϑ are identifiable.
Definition 3. (Maximum likelihood estimator)

ϑ̂N := argmax_{ϑ ∈ Θ} Σ_{i=1}^N log L(a⊤ϑ | Y ∈ Ci).

Based on the maximum likelihood estimator ϑ̂N, we define plug-in estimators of the most likely transformation function and the corresponding estimator of our target distribution FY as ĥN := a⊤ϑ̂N and F̂Y,N := FZ ∘ a⊤ϑ̂N. Because the problem of estimating an unknown distribution function is now embedded in the maximum likelihood framework, the asymptotic analysis benefits from standard results on the asymptotic behaviour of maximum likelihood estimators. We begin with deriving the score function and Fisher information. The score contribution of an 'exact continuous' observation y from an absolutely continuous distribution is approximated by the gradient of the log-density

s(ϑ | y) = ∂ log fY(y | ϑ)/∂ϑ = a(y) fZ′(a(y)⊤ϑ)/fZ(a(y)⊤ϑ) + a′(y)/(a′(y)⊤ϑ). (1)

For an interval-censored or discrete observation (y_, ȳ] (for left- and right-censored observations, the constant terms FZ(h(−∞)) = 0 and FZ(h(∞)) = 1 vanish from the gradient), the score contribution is

s(ϑ | (y_, ȳ]) = (fZ(a(ȳ)⊤ϑ) a(ȳ) − fZ(a(y_)⊤ϑ) a(y_))/(FZ(a(ȳ)⊤ϑ) − FZ(a(y_)⊤ϑ)). (2)

The corresponding contributions to the Fisher information F(ϑ) = −E(∂s(ϑ | Y)/∂ϑ⊤) are

F(ϑ | y) = a′(y) a′(y)⊤/(a′(y)⊤ϑ)² − a(y) a(y)⊤ {fZ″(a(y)⊤ϑ)/fZ(a(y)⊤ϑ) − (fZ′(a(y)⊤ϑ)/fZ(a(y)⊤ϑ))²} (3)

for 'exact continuous' observations and

F(ϑ | (y_, ȳ]) = s(ϑ | (y_, ȳ]) s(ϑ | (y_, ȳ])⊤ − (fZ′(a(ȳ)⊤ϑ) a(ȳ) a(ȳ)⊤ − fZ′(a(y_)⊤ϑ) a(y_) a(y_)⊤)/(FZ(a(ȳ)⊤ϑ) − FZ(a(y_)⊤ϑ)) (4)

for interval-censored or discrete observations. For a truncated observation, the Fisher information is given by F(ϑ | Y ∈ (y_, ȳ]) − F(ϑ | Y ∈ (yl, yr]).
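The score (1) is easily verified numerically. The following base R sketch, with the standard logistic FZ (for which fZ′/fZ = 1 − 2FZ) and a toy linear basis a(y) = (1, y)⊤, compares (1) against central finite differences; all names are our own.

```r
## Numerical sanity check of the 'exact continuous' score (1):
a      <- function(y) c(1, y)
aprime <- function(y) c(0, 1)
logf <- function(theta, y)          ## log f_Y(y | theta), cf. Corollary 2
  dlogis(sum(a(y) * theta), log = TRUE) + log(sum(aprime(y) * theta))
score <- function(theta, y) {       ## equation (1); f_Z'/f_Z = 1 - 2 F_Z here
  z <- sum(a(y) * theta)
  a(y) * (1 - 2 * plogis(z)) + aprime(y) / sum(aprime(y) * theta)
}
theta <- c(-1, 2); y <- 0.3
fd <- sapply(1:2, function(j) {     ## central finite differences
  e <- replace(numeric(2), j, 1e-6)
  (logf(theta + e, y) - logf(theta - e, y)) / 2e-6
})
all.equal(fd, score(theta, y), tolerance = 1e-6)   ## TRUE
```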
We will first discuss the asymptotic properties of the maximum likelihood estimator ϑ̂N in the parametric setting with fixed parameters ϑ in both the discrete and continuous case. For continuous variables Y and a transformation function parameterized using a Bernstein polynomial, results for sieve maximum likelihood estimation, where the number of parameters increases with N, are then discussed in Section 3.2.
3.1 Parametric inference
Conditions on the densities of the error distribution fZ and the basis functions a ensuring consistency and asymptotic normality of the sequence of maximum likelihood estimators ϑ̂N and an estimator of their asymptotic covariance matrix are given in the following three theorems. Because of the full parameterization of the model, the proofs are simple standard results for likelihood asymptotics, and a more complex analysis (as required for estimation equations in the presence of a nuisance parameter hY, e.g. in Cheng et al., 1995) is not necessary. We will restrict ourselves to absolutely continuous or discrete random variables Y, where the likelihood is given in terms of the density fY(y|ϑ). Furthermore, we will only study the case of a correctly specified transformation h = a⊤ϑ and refer the reader to Hothorn et al. (2014), where consistency results for arbitrary h are given.
Theorem 1. For ϑ̂N := argmax_{ϑ ∈ Θ} Σ_{i=1}^N log L(a⊤ϑ | Y ∈ Ci) and under the assumptions (A1) the parameter space Θ is compact and (A2) E supϑ∈Θ |log fY(Y | ϑ)| < ∞, where ϑ0 is well separated,

sup_{ϑ : ‖ϑ − ϑ0‖ ⩾ ε} E log fY(Y | ϑ) < E log fY(Y | ϑ0) for all ε > 0,

the sequence ϑ̂N converges to ϑ0 in probability, ϑ̂N →P ϑ0, as N → ∞.
Proof. The log-likelihood is continuous in ϑ, and because of (A2), each log-likelihood contribution is dominated by an integrable function. Thus, the result follows from van der Vaart (1998, Theorem 5.8 with Example 19.7; see the note at the bottom of page 46).
Remark 1. Assumption (A1) is made for convenience, and relaxations of such a condition are given in van de Geer (2000) or van der Vaart (1998). The assumptions in (A2) are rather weak: the first one holds if the functions a are not arbitrarily ill posed, and the second one holds if the function ϑ ↦ −E log fY(Y | ϑ) is strictly convex in ϑ (if the assumption did not hold, we would still have convergence to the set argmax_{ϑ ∈ Θ} E log fY(Y | ϑ)).
Theorem 2. Under the assumptions of Theorem 1 and in addition (A3) ϑ0 is an inner point of Θ, (A4) the matrices E(a(Y) a(Y)⊤) and (for the absolutely continuous case μ = μL only) E(a′(Y) a′(Y)⊤) are non-singular, and (A5) E‖a(Y)‖² < ∞ and E‖a′(Y)‖² < ∞, the sequence √N(ϑ̂N − ϑ0) is asymptotically normal with mean zero and covariance matrix Σϑ0 = F(ϑ0)⁻¹.
Proof. Because the map ϑ ↦ √fY(y | ϑ) is continuously differentiable in ϑ for all y in both the discrete and absolutely continuous case and the matrix F(ϑ) = E(s(ϑ | Y) s(ϑ | Y)⊤) is well defined and continuous in ϑ, the model is differentiable in quadratic mean with Lemma 7.6 in van der Vaart (1998). Furthermore, assumptions (A4) and (A5) ensure that the expected Fisher information matrix is non-singular at ϑ0. With the consistency and (A3), the result follows from Theorem 5.39 in van der Vaart (1998).
Remark 2. Assumption (A4) is valid for the densities fZ of the normal, logistic and MEV distributions. The Fisher information (3) and (4) evaluated at the maximum likelihood estimator ϑ̂N can be used to estimate the covariance matrix Σϑ0.
Theorem 3. Under the assumptions of Theorem 2 and assuming ϑ̂N →P ϑ0, a consistent estimator for Σϑ0 is given by

Σ̂N := (N⁻¹ Σ_{i=1}^N F(ϑ̂N | Ci))⁻¹.

Proof. With the law of large numbers, we have N⁻¹ Σ_{i=1}^N F(ϑ0 | Ci) →P F(ϑ0); together with the consistency of ϑ̂N and the continuity of F in ϑ, this yields Σ̂N →P F(ϑ0)⁻¹ = Σϑ0.
Based on Theorems 1–3, we can perform standard likelihood inference on the model parameters ϑ. In particular, we can construct confidence intervals and confidence bands for the conditional distribution function from confidence intervals and bands for the linear functions a⊤ϑ. We complete this part by formally defining the class of transformation models.
Definition 4. (Transformation model) The triple (FZ, a, ϑ) is called a transformation model.

The transformation model (FZ, a, ϑ) fully defines the distribution of Y via FY = FZ ∘ a⊤ϑ and thus the corresponding likelihood L(a⊤ϑ | Y ∈ C). Our definition of transformation models as (FZ, a, ϑ) is strongly tied to the idea of structural inference (Fraser, 1968) and group models (Bickel et al., 1993). Fraser (1968) described a measurement model PY for Y by an error distribution PZ and a structural equation Y = g ∘ Z, where g is a linear function, thereby extending the location-scale family Y = α + σZ. Group models consist of distributions generated by possibly non-linear g. The main difference to these classical approaches is that we parameterize h instead of g = h⁻¹. By extending the linear transformation functions g dealt with by Fraser (1968) to non-linear transformations, we approximate the potentially non-linear transformation functions h ∈ H by a⊤ϑ, with subsequent estimation of the parameters ϑ. For given parameters ϑ, a sample from PY,ϑ can be drawn by the probability integral transform: Zi ∼ PZ, i = 1,…,N, is drawn and then Yi := inf{y ∈ Ξ | a(y)⊤ϑ ⩾ Zi}.
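This sampling scheme is straightforward to implement. The following base R sketch, with hypothetical helper rY and a grid-based evaluation of the infimum, is one way to do it.

```r
## Sampling from P_{Y,theta} via the probability integral transform:
## draw Z_i ~ P_Z and set Y_i = inf{y : a(y)'theta >= Z_i}, here on a grid.
rY <- function(n, grid, h, qZ = qnorm) {
  Z   <- qZ(runif(n))                         ## Z_i ~ P_Z
  hy  <- h(grid)                              ## a(grid)'theta, increasing in y
  idx <- findInterval(Z, hy, left.open = TRUE) + 1
  grid[pmin(idx, length(grid))]               ## smallest grid point with h >= Z
}
set.seed(29)
y <- rY(1000, grid = seq(0.01, 20, by = 0.01), h = function(y) 2 * log(y))
## h(Y) = 2 log(Y) should be approximately standard normal:
c(mean(2 * log(y)), sd(2 * log(y)))
```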
3.2 Non‐parametric inference
For continuous responses Y, any unknown transformation h can be approximated by Bernstein polynomials of increasing order (Farouki, 2012). For uncensored and right-censored responses and under the same conditions for FZ as stated in Section 3.1, McLain & Ghosh (2013) showed that the non-parametric sieve maximum likelihood estimator is consistent with rate of convergence N^{2/5} for h with continuous bounded second derivatives in unconditional and linear transformation models (Section 4.3). In the latter class, the linear shift parameters β are asymptotically normal and semi-parametrically efficient. Numerical approximations to the observed Fisher information were shown to lead to appropriate standard errors of the estimated shift parameters β̂N by McLain & Ghosh (2013). Hothorn et al. (2014) established the consistency of boosted non-parametric conditional transformation models (Section 4.2). For sieve maximum likelihood estimation in the class of conditional transformation models, the techniques employed by McLain & Ghosh (2013) require minor technical extensions, which are omitted here.
In summary, the same limiting distribution arises under both the parametric and the non-parametric paradigm for transformation functions parameterized or approximated using Bernstein polynomials, respectively. In the latter case, the target is the best approximating transformation function hN in the Bernstein polynomial family (where the index N indicates that we use a more complex approximation when N increases). If the approximation error h − hN is of smaller order than the convergence rate of the estimator, the estimator's target becomes the true underlying transformation function h; otherwise, a bias for estimating h remains.
4 Applications
The definition of transformation models tailored for specific situations 'only' requires the definition of a suitable basis function a and a choice of FZ. In this section, we will discuss specific transformation models for unconditional and conditional distributions of ordered categorical, discrete and continuous responses Y. Note that the likelihood function L(h | Y ∈ C) allows all these models to be fitted to arbitrarily censored or truncated responses; for brevity, we will not elaborate on the details.
4.1 Unconditional transformation models
Finite sample space
For a finite sample space Ξ = {y1,…,yK}, the transformation function h is a step function parameterized in terms of the basis a(yk) = eK−1(k), the unit vector of length K−1 with its kth element being one, such that h(yk) = a(yk)⊤ϑ = ϑk, and the unconditional distribution function of Y is FY(yk) = FZ(ϑk). This parameterization underlies the common proportional odds and proportional hazards models for ordered categorical data (Tutz, 2012). Note that monotonicity of h is guaranteed by the K−2 linear constraints ϑ2 − ϑ1 > 0,…,ϑK−1 − ϑK−2 > 0 when constrained optimization is performed. In the absence of censoring or truncation and with ECDFN denoting the empirical cumulative distribution function, the maximum likelihood estimator ϑ̂N maximizes the equivalent multinomial (or empirical) log-likelihood Σ_{i=1}^N Σ_{k=1}^K 1(Yi = yk) log(FZ(ϑk) − FZ(ϑk−1)), and we can rewrite this estimator as ϑ̂k = F⁻¹Z(ECDFN(yk)), k = 1,…,K−1. The resulting estimator F̂Y,N(yk) = FZ(ϑ̂k) = ECDFN(yk) is invariant with respect to FZ.
Assumption (A4) is valid for these basis functions because we have E(a(Y) a(Y)⊤) = diag(ℙ(Y = y1),…,ℙ(Y = yK−1)), which is non-singular, for Y ∼ PY,ϑ0.
If we define the sample space Ξ as the set of unique observed values and the probability measure as the empirical cumulative distribution function (ECDF), putting mass N⁻¹ on each observation, we see that this particular parameterization is equivalent to an empirical likelihood approach, and we get F̂Y,N = ECDFN, as the code below illustrates. Note that although the transformation function depends on the choice of FZ, the estimated distribution function F̂Y,N does not and is simply the non-parametric empirical maximum likelihood estimator. A smoothed version of this estimator for continuous responses is discussed in the next paragraph.
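The invariance of F̂Y,N with respect to FZ is easy to verify numerically; the following base R sketch, with our own toy data, recovers the ECDF for two different choices of FZ.

```r
## The discrete MLE reproduces the ECDF for every choice of F_Z:
y <- factor(c(1, 1, 2, 3, 3, 3), levels = 1:3)   ## Xi = {y_1, y_2, y_3}
ecdf_k <- cumsum(table(y)) / length(y)           ## ECDF at y_1, y_2, y_3
for (Z in list(c(pnorm, qnorm), c(plogis, qlogis))) {
  theta <- Z[[2]](ecdf_k[1:2])                   ## theta_k = F_Z^{-1}(ECDF(y_k))
  print(Z[[1]](theta))                           ## = ECDF(y_k), whatever F_Z
}
```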
Infinite sample space
For continuous responses Y, the transformation function h, and thus the distribution function FY = FZ ∘ h, should be smooth in y; therefore, any polynomial or spline basis is a suitable choice for a. For the empirical experiments in Section 5, we applied Bernstein polynomials (for an overview, see Farouki, 2012) of order M (P = M + 1) defined on the interval [l, u] with

aBs,M(y)⊤ϑ = Σ_{m=0}^{M} ϑm fBe(m+1, M−m+1)(ỹ)/(M + 1),

where ỹ = (y − l)/(u − l) ∈ [0,1] and fBe(m,M) is the density of the Beta distribution with parameters m and M. This choice is computationally attractive because strict monotonicity can be formulated as a set of M linear constraints on the parameters, ϑm < ϑm+1 for all m = 0,…,M−1 (Curtis & Ghosh, 2011). Therefore, application of constrained optimization guarantees monotonic estimates ĥN. The basis contains an intercept. We obtain smooth plug-in estimators for the distribution, density, hazard and cumulative hazard functions as F̂Y,N = FZ ∘ aBs,M⊤ϑ̂N, f̂Y,N = fZ(aBs,M⊤ϑ̂N)(a′Bs,M⊤ϑ̂N), λ̂Y,N = f̂Y,N/(1 − F̂Y,N) and Λ̂Y,N = −log(1 − F̂Y,N). The estimator ĥN = aBs,M⊤ϑ̂N must not be confused with the estimator for Y ∈ [0,1] obtained from the smoothed empirical distribution function with coefficients corresponding to probabilities evaluated at the quantiles m/M for m = 0,…,M (Babu et al., 2002).
The question arises of how the degree of the polynomial affects the estimated distribution function. On the one hand, the model (Φ, aBs,1, ϑ) only allows linear transformation functions of a standard normal, and FY is restricted to the normal family. On the other hand, (Φ, aBs,N−1, ϑ) has one parameter for each observation, and F̂Y,N is the non-parametric maximum likelihood estimator ECDFN, which, by the Glivenko–Cantelli lemma, converges to FY. In this sense, we cannot choose a 'too large' value for M. This is a consequence of the monotonicity constraint on the estimator ĥN, which, in this extreme case, just interpolates the step function F⁻¹Z ∘ ECDFN. Empirical evidence for the insensitivity of results when M is large can be found in Hothorn (2017b) and in the discussion. A self-contained sketch of this constrained estimation approach follows below.
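To make the estimation procedure of this section concrete, the following base R sketch fits the unconditional model (Φ, aBs,M, ϑ) to 'exact continuous' data by maximizing the approximate log-likelihood under the monotonicity constraints DM+1ϑ > 0; all helper names are ours, and constrOptim() stands in for the augmented Lagrangian optimizer used in the reference implementation (Section 6.1).

```r
## Unconditional most likely transformation with a Bernstein basis, a sketch:
bern <- function(y, M, l, u) {              ## a_Bs,M(y) as an N x (M+1) matrix
  x <- (y - l) / (u - l)
  outer(x, 0:M, function(x, m) choose(M, m) * x^m * (1 - x)^(M - m))
}
bern_prime <- function(y, M, l, u) {        ## elementwise derivative wrt y
  x <- (y - l) / (u - l)
  outer(x, 0:M, function(x, m) choose(M, m) *
        (m * x^(m - 1) * (1 - x)^(M - m) -
         (M - m) * x^m * (1 - x)^(M - m - 1))) / (u - l)
}
mlt_unconditional <- function(y, M = 8) {
  l <- min(y) - 0.1 * diff(range(y))        ## support slightly beyond the data
  u <- max(y) + 0.1 * diff(range(y))
  A <- bern(y, M, l, u); Ad <- bern_prime(y, M, l, u)
  nll <- function(theta)                    ## -log-likelihood, Lebesgue density
    -sum(dnorm(A %*% theta, log = TRUE) + log(Ad %*% theta))
  ui <- diff(diag(M + 1))                   ## D_{M+1} theta > 0 constraints
  theta0 <- qnorm((1:(M + 1)) / (M + 2))    ## strictly increasing start
  opt <- constrOptim(theta0, nll, grad = NULL, ui = ui, ci = rep(0, M))
  list(theta = opt$par,
       FY = function(q) pnorm(bern(q, M, l, u) %*% opt$par))
}
set.seed(1)
fit <- mlt_unconditional(rnorm(100, mean = 2, sd = 0.5))
fit$FY(c(1, 2, 3))                          ## smooth distribution function
```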
4.2 Conditional transformation models
In the following, we will discuss a cascade of increasingly complex transformation models where the transformation function h may depend on explanatory variables X ∈ χ. We are interested in estimating the conditional distribution of Y given X = x. The corresponding distribution function FY|X=x can be written as FY|X=x(y) = FZ(h(y|x)). The transformation function h(·|x) is said to be conditional on x. Following the arguments presented in the proof of Corollary 1, it is easy to see that for each x, there exists a strictly monotonic transformation function h(·|x) such that FY|X=x(y) = FZ(h(y|x)). Because this class of conditional transformation models and suitable parameterizations was introduced by Hothorn et al. (2014), we will only sketch the most important aspects here.
Let b : χ → ℝ^Q denote a basis transformation of the explanatory variables. The joint basis for both y and x is called c : Ξ × χ → ℝ^{d(P,Q)}; its dimension d(P,Q) depends on the way the two basis functions a and b are combined (e.g. d(P,Q) = P + Q for c = (a⊤, b⊤)⊤ or d(P,Q) = PQ for c = a⊤ ⊗ b⊤). The conditional transformation function is now parameterized as h(y|x) = c(y,x)⊤ϑ. One important special case is the simple transformation function h(y|x) = hY(y) + hx(x), where the explanatory variables only contribute a shift hx(x) to the conditional transformation function. Often this shift is assumed to be linear in x; therefore, we use the function m(x) = x̃⊤β to denote linear shifts. Here, x̃ is one row of the design matrix without intercept. These simple models correspond to the joint basis c(y,x) = (a(y)⊤, −x̃⊤)⊤, with parameters ϑ = (ϑ1⊤, β⊤)⊤ and transformation function h(y|x) = a(y)⊤ϑ1 − x̃⊤β. The results presented in Section 3, including Theorems 1–3, carry over in the fixed design case when a is replaced by c.
In the rest of this section, we will present classical models that can be embedded in the larger class of conditional transformation models and some novel models that can be implemented in this general framework.
4.3 Classical transformation models
Linear model
The normal linear regression model Y ∼ N(α + m(x), σ²) with conditional distribution function FY|X=x(y) = Φ(σ⁻¹(y − α − m(x))) can be understood as a transformation model with transformation function h(y|x) = y/σ − α/σ − m(x)/σ, parameterized via the basis functions a(y) = (y, 1)⊤ and b(x) = x̃, c = (a⊤, b⊤)⊤, with parameters ϑ = (σ⁻¹, −σ⁻¹α, −σ⁻¹β⊤)⊤ under the constraint σ > 0, or in more compact notation (Φ, c, ϑ). The parameters of the model are the inverse standard deviation and the inverse negative coefficient of variation instead of the mean and variance of the original normal distribution. For 'exact continuous' observations, maximization of the likelihood is equivalent to least squares, which can be carried out with respect to α and β without taking σ into account. This is not possible for censored or truncated observations, where we need to evaluate the conditional distribution function that depends on all parameters; this model is called the Type I Tobit model (only the likelihood changes under censoring and truncation; the model does not). Using an alternative basis function c would allow arbitrary non-normal conditional distributions of Y, and the simple shift model FY|X=x(y) = FZ(a(y)⊤ϑ1 − m(x)) is then a generalization of additive models and leads to the interpretation of m(x) as a shift on the scale of the transformed response. The choice a(y) = (log(y), 1)⊤ implements the log-normal model for Y > 0. Implementation of a Bernstein basis a = aBs,M allows arbitrarily shaped distributions, that is, a transition from the normal family to the transformation family, and thus likelihood inference on ϑ2 without strict assumptions on the distribution of Y. The transformation aBs,M(y)⊤ϑ1 must increase monotonically in y. Maximization of the log-likelihood under the linear inequality constraint DM+1ϑ1 > 0, with DM+1 representing first-order differences, implements this requirement.
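A minimal numerical illustration of this reparameterization, with simulated data and our own helper names: maximizing the 'exact continuous' transformation likelihood recovers the least-squares estimates after back-transforming ϑ.

```r
## The normal linear model as a transformation model, a sketch:
set.seed(42)
n <- 200; x <- runif(n); y <- 1 + 2 * x + rnorm(n, sd = 0.5)
nll <- function(theta)    ## h(y|x) = theta1 y + theta2 + theta3 x, F_Z = Phi
  -sum(dnorm(theta[1] * y + theta[2] + theta[3] * x, log = TRUE) +
       log(theta[1]))     ## log h'(y|x) = log(theta1)
opt <- optim(c(1, 0, 0), nll, method = "L-BFGS-B",
             lower = c(1e-6, -Inf, -Inf))          ## theta1 = 1/sigma > 0
sigma <- 1 / opt$par[1]
c(alpha = -opt$par[2] * sigma, beta = -opt$par[3] * sigma, sigma = sigma)
## compare with coef(lm(y ~ x)) and the ML variance estimate RSS/n
```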
Continuous ‘survival time’ models
For a continuous response Y > 0, the model FY|X=x(y) = FZ(σ⁻¹ log(y) − α − m(x)) with basis functions a(y) = (1, log(y))⊤ and b(x) = x̃ and parameters ϑ = (−α, σ⁻¹, −β⊤)⊤ under the constraint σ > 0 is called the accelerated failure time (AFT) model. The model (FMEV, a, ϑ) with σ ≡ 1 (and thus fixed transformation function hY(y) = log(y) − α) is the exponential AFT model because it implies an exponential distribution of Y. When the parameter σ > 0 is estimated from the data, the model with FZ = FMEV is called the Weibull model, the model with FZ = FSL is the log-logistic AFT model and the model with FZ = Φ is the log-normal AFT model. For a continuous (not necessarily positive) response Y, the model FY|X=x(y) = FMEV(hY(y) − m(x)) is called the proportional hazards, relative risk or Cox model. The transformation function hY equals the log-cumulative baseline hazard and is treated as a nuisance parameter in the partial likelihood framework, where only the regression coefficients β are estimated. Given β̂, non-parametric maximum likelihood estimators are typically applied to obtain ĥY. Here, we parameterize this function as hY = a⊤ϑ1 (e.g. using a = aBs,M) and fit all parameters in the model FY|X=x(y) = FMEV(a(y)⊤ϑ1 − m(x)) simultaneously. The model is highly popular because m(x) is the log-hazard ratio to m(0). For the special case of right-censored survival times, this parameterization of the Cox model was studied theoretically and empirically by McLain & Ghosh (2013). Changing the distribution function in the Cox model from FMEV to FSL results in the proportional odds model FY|X=x(y) = FSL(hY(y) − m(x)); its name comes from the interpretation of m(x) as the constant log-odds ratio of the odds OY(y|X=x) and OY(y|X=0). An additive hazards model, in which the explanatory variables shift the conditional cumulative hazard function additively, results from the choice FZ(z) = 1 − exp(−z) (Aranda-Ordaz, 1983) under the additional constraint λY(y|X=x) > 0. In this case, the function a(y)⊤ϑ1 is the positive baseline cumulative hazard function ΛY(y|X=0).
Discrete models
For ordered categorical responses y1 < ⋯ < yK, the conditional distribution function FY|X=x(yk) = FZ(hY(yk) − m(x)) is a transformation model with a(yk) = eK−1(k). The model with FZ = FSL is called the discrete proportional odds model, and with FZ = FMEV, it is the discrete proportional hazards model. Here, m(x) is the log-odds ratio or log-hazard ratio to m(0) independent of k; details are given in Tutz (2012). For the special case of a binary response (K = 2), the transformation model with FZ = FSL is the logistic regression model, with FZ = Φ it is the probit model, and with FZ = FMEV it is called the complementary log–log model. Note that the transformation function hY is given by the basis function a(y1) = 1, that is, ϑ1 is just the intercept. The connection between standard binary regression models and transformation models is explained in more detail by Doksum & Gasko (1990).
Linear transformation model
The transformation model FY|X=x(y) = FZ(a(y)⊤ϑ1 − x̃⊤β) for any a and FZ is called the linear transformation model and contains all models discussed in this section. Note that the transformation of the response, hY(y) = a(y)⊤ϑ1, is non-linear in all models of interest (AFT, Cox etc.), and the term 'linear' only refers to a linear shift m(x) of the explanatory variables. Partially linear or additive transformation models allow non-linear shifts as part of a partially smooth basis b, that is, in the form of an additive model. The number of constraints only depends on the basis a but not on the explanatory variables.
4.4 Extension of classical transformation models
A common property of all classical transformation models is the additivity of the response transformation and the shift, that is, the decomposition h(y|x)=hY(y)+hx(x) of the conditional transformation function. This assumption is relaxed by the following extensions of the classical models. Allowing for deviations from this simple model is also the key aspect for the development of novel transformation models in the rest of this section.
Discrete non‐proportional odds and hazards models
For ordered categorical responses, the model FY|X=x(yk) = FZ(hY(yk) − mk(x)) allows a category-specific shift mk(x) = x̃⊤βk; with FSL, this cumulative model is called the non-proportional odds model, and with FMEV, it is the non-proportional hazards model. Both models can be cast into the transformation model framework by defining the joint basis c(yk, x) = (a(yk)⊤, a(yk)⊤ ⊗ b(x)⊤)⊤ as the Kronecker product of the two simple basis functions a(yk) = eK−1(k) and b(x) = x̃ (assuming that b does not contain an intercept term). Note that the conditional transformation function h(y|x) includes an interaction term between y and x.
Time‐varying effects
One often-studied extension of the Cox model is FY|X=x(y) = FMEV(hY(y) − x̃⊤β(y)), where the regression coefficients β(y) may change with time y. The Cox model is included with β(y) ≡ β, and the model is often applied to check the proportional hazards assumption. With a smooth parameterization of time y, for example, via a = aBs,M, and linear basis b(x) = x̃, the transformation model with joint basis c(y, x) = (a(y)⊤, a(y)⊤ ⊗ x̃⊤)⊤ implements this Cox model with time-varying (linear) effects, as sketched in the code below. This model (with arbitrary FZ) has also been presented in Foresi & Peracchi (1995) and is called distribution regression in Chernozhukov et al. (2013).
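The Kronecker-type bases of this and the previous paragraph are easily constructed explicitly; the following base R sketch, with a toy polynomial basis of our own choosing, evaluates a conditional transformation function with a response-varying coefficient.

```r
## Response-varying effects via the Kronecker basis, a sketch: a row of
## c(y, x) = (a(y)', a(y)' %x% b(x)')' evaluates
## h(y|x) = a(y)'theta1 + x * (a(y)'theta2), i.e. an x-coefficient that
## itself changes with y.
a <- function(y) c(1, y, y^2)               ## toy smooth basis in y (P = 3)
b <- function(x) x                          ## single explanatory variable
cyx <- function(y, x) c(a(y), kronecker(a(y), b(x)))
theta <- c(0, 1, 0.1,                       ## theta1: baseline transformation
           0.5, -0.2, 0)                    ## theta2: x-coefficient 0.5 - 0.2 y
h <- function(y, x) sum(cyx(y, x) * theta)
h(1, 2)                                     ## conditional transformation at y = 1, x = 2
```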
4.5 Novel transformation models
Because of the broadness of the transformation family, it is straightforward to set up new models for interesting situations by allowing more complex transformation functions h(y|x). We will illustrate this possibility for two simple cases: the independent two-sample situation and regression models for count data. The generic and most complex transformation model is called the conditional transformation model and is explained at the end of this section.
Beyond shift effects
Assume we observe samples from two groups A and B and want to model the conditional distribution functions FY|X=A(y) and FY|X=B(y) of the response Y in the two groups. Based on this model, it is often interesting to infer whether the two distributions are equivalent and, if this is not the case, to characterize how they differ. Using an appropriate basis function a and the group indicator b(x) = 1(x = B), the model with joint basis c(y, x) = (a(y)⊤, 1(x = B) a(y)⊤)⊤ parameterizes the conditional transformation function as h(y | A) = a(y)⊤ϑ1 in group A and h(y | B) = a(y)⊤(ϑ1 + ϑ2) in group B. Clearly, the second term is constant zero (hB−A(y) := a(y)⊤ϑ2 ≡ 0) iff the two distributions are equivalent (FY|X=A(y) = FY|X=B(y) for all y). For the deviation function hB−A, we can apply standard likelihood inference procedures for ϑ̂2 to construct a confidence band or use a test statistic such as supy |a(y)⊤ϑ̂2| to assess deviations from zero. If there is evidence for a group effect, we can use the model to check whether the deviation function is constant, that is, hB−A(y) ≡ c ≠ 0. In this case, the simpler model FY|X=x(y) = FZ(a(y)⊤ϑ1 − 1(x = B)β) with shift β = −ϑ2 might be easier to interpret. This model actually corresponds to a normal analysis of variance model with FZ = Φ and a(y)⊤ϑ1 = ϑ11 + ϑ12 y, or to the Cox proportional hazards model with FZ = FMEV.
Count regression ‘without tears’
Count transformation models of the form FY|X=x(y) = FZ(hY(y) − m(x)), y ∈ ℕ, are not affected by over-dispersion or under-dispersion because higher moments are handled by hY independently of the effects of the explanatory variables m(x). If there are excess zeros, we can set up a joint transformation model such that we have a two-component mixture model consisting of the count distribution FY|X=x(y) = FZ(hY(y) − m(x)) for y ∈ Ξ and the probability of an excess zero. Hence, the transformation analogue to a hurdle model with hurdle at zero is the transformation model that combines a binary transformation model for the event Y = 0 with a transformation model for the zero-truncated counts Y | Y > 0.
Conditional transformation models
In the most general case, the transformation function is the sum of J partial transformation functions, h(y|x) = Σ_{j=1}^{J} (aj(y)⊤ ⊗ bj(x)⊤) ϑj; models of this form are called conditional transformation models and include all special cases discussed in this section. It is convenient to assume monotonicity for each of the partial transformation functions; thus, the linear constraints for aj are repeated for each basis function in bj (detailed descriptions of linear constraints for different models in this class are available in Hothorn, 2017b). Hothorn et al. (2014) introduced this general model class and proposed a boosting algorithm for the estimation of transformation functions h for 'exact continuous' responses Y. In the likelihood framework presented here, conditional transformation models can be fitted under arbitrary schemes of censoring and truncation, and classical likelihood inference for the model parameters ϑ becomes feasible. Of course, unlike in the boosting context, the number of model terms J and their complexity are limited in the likelihood world because the likelihood does not contain any penalty terms that induce smoothness in the x-direction.
A systematic overview of linear transformation models with potentially response-varying effects is given in Table 1. Model nomenclature and the interpretation of the corresponding model parameters are mapped to specific transformation functions h and distribution functions FZ. To the best of our knowledge, models without names have not yet been discussed in the literature, and their specific properties await closer investigation.
Table 1. Linear transformation models with potentially response-varying effects: model names for different sample spaces Ξ and error distribution functions FZ ∈ {Φ, FSL, FExp, FMEV}, where FExp(z) = 1 − exp(−z); the corresponding transformation functions h follow the parameterizations introduced in this section. Throughout, the baseline transformation a(y)⊤ϑ1 can be read as the cumulative baseline hazard ΛY(y|X=0) under FExp, and the shift parameters β as log-odds ratios (log-OR) under FSL, additive hazard (AH) differences under FExp and log-hazard ratios (log-HR) under FMEV. Empty cells correspond to models without established names. BGLM, binary generalized linear model; PO, proportional odds; PH, proportional hazards; AFT, accelerated failure time; NLRM, normal linear regression model.

| Ξ | Model | Φ | FSL | FExp | FMEV |
|---|---|---|---|---|---|
| K = 2 | binary regression | probit BGLM | logistic BGLM | clog BGLM | cloglog BGLM |
| K > 2 | polytomous regression, constant shift | | discrete PO | | discrete PH |
| K > 2 | polytomous regression, response-varying shift | | non-PO | | non-PH |
| ℕ | count regression, constant shift | | | | |
| ℕ | count regression, response-varying shift | | | | |
| ℝ⁺ | survival analysis, σ ≡ 1 | | | | exponential AFT |
| ℝ⁺ | survival analysis, log-linear baseline | log-normal AFT | log-logistic AFT | | Weibull AFT |
| ℝ | continuous regression, linear baseline | NLRM | | | |
| ℝ | continuous regression, flexible baseline | | | Aalen AH | Cox PH |
| ℝ | response-varying effects | distribution regression | distribution regression | distribution regression | time-varying Cox |
5 Empirical evaluation
We will illustrate the range of possible applications of likelihood‐based conditional transformation models. In Section 5.2, we will present a small simulation experiment highlighting the possible advantage of indirectly modelling conditional distributions with transformation functions.
5.1 Illustrations
Density estimation: Old Faithful geyser
The duration of eruptions and the waiting time between eruptions of the Old Faithful geyser in the Yellowstone National Park became a standard benchmark for non-parametric density estimation. The nine parameters of the transformation model (Φ, aBs,8(waiting), ϑ) were fitted by maximization of the approximate log-likelihood (treating the waiting times as 'exact' observations) under the eight linear constraints D9ϑ > 0. The model depicted in Fig. 1A reproduces the classic bimodal unconditional density of waiting time along with a kernel density estimate. It is important to note that the transformation model was fitted by maximum likelihood, whereas the kernel density estimate relied on a cross-validated bandwidth. An unconditional density estimate for the duration of the eruptions needs to deal with censoring because exact duration times are only available for the daytime measurements. At night, the observations were left censored ('short' eruption), interval censored ('medium' eruption) or right censored ('long' eruption). This censoring was widely ignored in analyses of the Old Faithful data because most non-parametric kernel techniques cannot deal with censoring. We applied the transformation model (Φ, aBs,8(duration), ϑ) based on the exact log-likelihood function under eight linear constraints and obtained the unconditional density depicted in Fig. 1B. In Hothorn (2017b), results for M = 40 are computed, which led to almost identical estimates of the distribution function. A sketch of the waiting-time fit using the reference implementation is given below.
[Figure 1. Old Faithful geyser: estimated unconditional densities of (A) the waiting times and (B) the censored eruption durations.]
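The calls in the following sketch follow the variables/basefun/mlt interface documented in Hothorn (2017a); the support values are our assumption.

```r
## A sketch of the waiting-time fit with the reference implementation:
library("mlt")                       ## attaches basefun and variables
data("geyser", package = "MASS")
var_w <- numeric_var("waiting", support = c(40, 110))
B_w   <- Bernstein_basis(var_w, order = 8, ui = "increasing")  ## D_9 theta > 0
m_w   <- ctm(B_w, todistr = "Normal")        ## the model (Phi, a_Bs,8, theta)
f_w   <- mlt(m_w, data = geyser)             ## constrained maximum likelihood
## query the smooth density estimate on a grid (see Hothorn, 2017a, for the
## full predict() interface)
d <- predict(f_w, newdata = data.frame(waiting = 43:108), type = "density")
```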
Quantile regression: head circumference
The Fourth Dutch Growth Study is a cross-sectional study on growth and development of the Dutch population younger than 22 years. Stasinopoulos & Rigby (2007) fitted a growth curve to head circumferences (HCs) of 7040 boys using a generalized additive model for location, scale and shape (GAMLSS) with a Box–Cox t distribution describing the first four moments of HC conditionally on age. The model showed evidence of kurtosis, especially for older boys. We fitted the same growth curves by the conditional transformation model (Φ, aBs,3(HC)⊤ ⊗ bBs,3(age)⊤, ϑ) by maximization of the approximate log-likelihood under the 3 × 4 linear constraints (D4 ⊗ I4)ϑ > 0. Figure 2 shows the data overlaid with quantile curves obtained via inversion of the estimated conditional distributions. The figure very closely reproduces the growth curves presented in Fig. 16 of Stasinopoulos & Rigby (2007) and also indicates a certain asymmetry towards older boys.
[Figure 2. Head circumference growth: observed data for boys in the Fourth Dutch Growth Study overlaid with quantile curves obtained by inverting the estimated conditional distribution functions.]
Survival analysis: German Breast Cancer Study Group‐2 trial
This prospective, controlled clinical trial on the treatment of node‐positive breast cancer patients was conducted by the German Breast Cancer Study Group. Out of 686 women, 246 received hormonal therapy, whereas the control group of 440 women did not. Additional variables include age, menopausal status, tumour size, tumour grade, number of positive lymph nodes, progesterone receptor and oestrogen receptor. The right‐censored recurrence‐free survival time is the response variable of interest.
The transformation model (FMEV, (aBs,10(y)⊤, 1(hormonal therapy))⊤, ϑ) implements the transformation function h(y | treatment) = aBs,10(y)⊤ϑ1 + 1(hormonal therapy)β, where aBs,10(y)⊤ϑ1 is the log-cumulative baseline hazard function parameterized by a Bernstein polynomial and β is the log-hazard ratio of hormonal therapy. This is the classical Cox model with one treatment parameter β but with fully parameterized baseline transformation function, which was fitted by the exact log-likelihood under ten linear constraints. The model assumes proportional hazards, an assumption whose appropriateness we wanted to assess using the non-proportional hazards model with the transformation function

h(y | treatment) = aBs,10(y)⊤ϑ1 + 1(hormonal therapy) aBs,10(y)⊤ϑ2,

where the deviation function aBs,10(y)⊤ϑ2 is the time-varying difference of the log-hazard functions of women without and with hormonal therapy and can be interpreted as the deviation from a constant log-hazard ratio treatment effect of hormonal therapy. Under the null hypothesis of no treatment effect, we would expect ϑ2 ≡ 0. This monotonic deviation function adds ten linear constraints D11ϑ1 + D11ϑ2 > 0, which also ensure monotonicity of the transformation function for treated patients. We first compared the fitted survivor functions obtained from the model including a time-varying treatment effect with the Kaplan–Meier estimators in both treatment groups. Figure 3A shows a nicely smoothed version of the survivor functions obtained from this transformation model. Figure 3B shows the time-varying treatment effect aBs,10(y)⊤ϑ̂2, together with a 95% confidence band computed from the joint normal distribution of ϑ̂2 for a grid over time (a sketch of this construction follows below); the method is much simpler than other methods for inference on time-varying effects (e.g. Sun et al., 2009). The 95% confidence interval around the log-hazard ratio from the proportional hazards model is also plotted, and as the latter is fully covered by the confidence band for the time-varying treatment effect, there is no reason to question the treatment effect computed under the proportional hazards assumption.
[Figure 3. German Breast Cancer Study Group-2 trial: (A) fitted survivor functions in both treatment groups together with the Kaplan–Meier estimates; (B) the log-hazard ratio (dashed line) with 95% confidence interval (dark grey) is fully covered by a 95% confidence band for the time-varying treatment effect (light grey; the estimate is the solid line) computed from a non-proportional hazards model.]
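One way to compute such a band is to simulate from the estimated joint normal distribution of ϑ̂2; the following base R sketch assumes hypothetical inputs theta2hat and S2 (the estimate and its covariance, extracted from a fitted non-proportional hazards model) and a matrix A of basis evaluations over a time grid.

```r
## Simulation-based simultaneous confidence band for a(y)'theta2, a sketch:
library("MASS")                              ## for mvrnorm()
conf_band <- function(theta2hat, S2, A, level = 0.95, nsim = 1e4) {
  se <- sqrt(diag(A %*% S2 %*% t(A)))        ## pointwise standard errors
  Z  <- mvrnorm(nsim, mu = rep(0, length(theta2hat)), Sigma = S2)
  ## critical value for the maximum absolute standardized deviation
  cmax <- quantile(apply(abs(A %*% t(Z)) / se, 2, max), level)
  est <- drop(A %*% theta2hat)
  cbind(lower = est - cmax * se, estimate = est, upper = est + cmax * se)
}
## toy inputs of conformable dimensions
A  <- cbind(1, seq(0, 1, length.out = 50))
S2 <- diag(2) * 0.01
head(conf_band(c(0.2, -0.1), S2, A))
```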
In the second step, we allowed an age-varying treatment effect to be included in the model. For both treatment groups, we estimated a conditional transformation function of survival time y given age, parameterized as the tensor basis of two Bernstein bases. Each of the two basis functions comes with 10 × 3 linear constraints; therefore, the model was fitted under 60 linear constraints. Figure 4 allows an assessment of the prognostic and predictive properties of age. As the survivor functions were clearly larger for all patients treated with hormones, the positive treatment effect applied to all patients. However, the size of the treatment effect varied greatly. The effect was most pronounced for women younger than 30 years and levelled off a little for older patients. In general, the survival times were longest for women between 40 and 60 years old. Younger women suffered the highest risk; for women older than 60 years, the risk started to increase again. This effect was shifted towards younger women when hormonal treatment was applied.
[Figure 4. Prognostic and predictive properties of age: estimated conditional survivor functions of survival time given age for women with and without hormonal therapy.]
5.2 Simulation experiment
The transformation family includes linear as well as very flexible models, and we therefore illustrate the potential gain of modelling a transformation function h by comparing a very simple transformation model with a fully parametric approach and to a non‐parametric approach using a data‐generating process introduced by Hothorn et al. (2014).
The data were generated from model (5) of Hothorn et al. (2014); the transformation model, a GAMLSS model and a kernel-based estimator of the conditional distribution function were compared by the mean absolute deviation (MAD) between the true and the estimated conditional distribution functions, summarized for each replication by its minimum, median and maximum over the evaluation grid.
Figure 5 shows the empirical distributions of the minimum, median and maximum MAD for the three competitors. Except for the minimum MAD in the absence of any irrelevant explanatory variables (p=0), the conditional distributions fitted by the transformation models were closer to the true conditional distribution function by means of the MAD. This result was obtained because the transformation model only had to estimate a simple transformation function, whereas the other two procedures had a difficult time approximating this simple transformation model on another scale. However, the comparison illustrates the potential improvement one can achieve when fitting simple models for the transformation function instead of more complex models for the mean (GAMLSS) or distribution function (Kernel). The kernel estimator led to the largest median MAD values but seemed more robust than GAMLSS with respect to the maximum MAD. These results were remarkably robust in the presence of up to five non‐informative explanatory variables, although of course the MAD increased with the number of non‐informative variables p.
[Figure 5. Empirical distributions of the minimum, median and maximum MAD for the three competitors and varying numbers p of non-informative explanatory variables.]
6 Discussion
The contribution of a likelihood approach for the general class of conditional transformation models is interesting both from a theoretical and a practical perspective. With the range of simple to very complex transformation functions introduced in Section 4 and illustrated in Section 5, it becomes possible to understand classical parametric, semi-parametric and non-parametric models as special cases of the same model class. Thus, analytic comparisons between models of different complexity become possible. The transformation family PY,Θ, the corresponding likelihood function and the most likely transformation estimator are easy to understand. This makes the approach appealing also from a teaching perspective. Connections between standard parametric models (e.g. the normal linear model) and potentially complex models for survival or ordinal data can be outlined in very simple notation, placing emphasis on the modelling of (conditional) distributions instead of just modelling (conditional) means. Computationally, the log-likelihood Σ_{i=1}^N log L(h | Y ∈ Ci) is linear in the number of observations N and, for contributions of 'exact continuous' responses, only requires the evaluation of the derivative h′ = a′⊤ϑ of the transformation function h instead of integrals thereof.
Based on the general understanding of transformation models outlined in this paper, it will be interesting to study these models outside the strict likelihood world. A mixed transformation model for cluster data (Cai et al., 2002; Zeng et al., 2005; Choi & Huang, 2012) is often based on the transformation function h(y|x,i) = hY(y) + δi + hx(x) with random intercept (or 'frailty' term) δi for the ith observational unit. Conceptually, a more complex deviation from the global model could be formulated as h(y|x,i) = hY(y) + hY(y,i) + hx(x), that is, each observational unit is assigned its own 'baseline' transformation hY(y) + hY(y,i), where the second term is a deviation from hY that integrates to zero. For longitudinal data with possibly time-varying explanatory variables, the model h(y|x(t),t) = hY(y,t) + x(t)⊤β(t) (Ding et al., 2012; Wu & Tian, 2013) can also be understood as a mixed version of a conditional transformation model. The penalized log-likelihood Σ_{i=1}^N log L(a⊤ϑ | Y ∈ Ci) − pen(ϑ) for the linear transformation model FY|X=x(y) = FZ(a(y)⊤ϑ1 − x̃⊤β) leads to Ridge-type or Lasso-type regularized models, depending on the form of the penalty term pen(ϑ). Priors for all model parameters ϑ allow a fully Bayesian treatment of transformation models.
It is possible to relax the assumption that FZ is known. The simultaneous estimation of FZ in the model FY|X=x(y) = FZ(hY(y) − m(x)) was studied by Horowitz (1996) and later extended by Linton et al. (2008) to non-linear functions hx with parametric baseline transformation hY and kernel estimates for FZ and hx. For AFT models, Zhang & Davidian (2008) applied smooth approximations for the density fZ in an exact censored likelihood estimation procedure. In a similar set-up, Huang (2014) proposed a method to jointly estimate the mean function and the error distribution in a generalized linear model. The estimation of FZ is noteworthy in additive models of the form hY + hx because these models assume additivity of the contributions of y and x on the scale defined by F⁻¹Z. If this model assumption seems questionable, one can either allow an unknown FZ or move to a transformation model featuring a more complex transformation function.
From this point of view, the distribution function FZ in flexible transformation models is only a computational device mapping the unbounded transformation function h into the unit interval strictly monotonically, making the evaluation of the likelihood easy. Then, FZ has no further meaning or interpretation as an error distribution. A compromise could be the family of distributions FZ,ρ(z) = 1 − (1 + ρ exp(z))^{−1/ρ} for ρ > 0 (suggested by McLain & Ghosh, 2013), with simultaneous maximum likelihood estimation of ϑ and ρ for additive transformation functions h = hY + hx, as these models are flexible and still relatively easy to interpret.
In light of the empirical results discussed in this paper and the theoretical work of McLain & Ghosh (2013) on a Cox model with log‐cumulative baseline hazard function parameterized in terms of a Bernstein polynomial with increasing order M, one might ask where the boundaries among parametric, semi‐parametric and non‐parametric statistics lie. The question how the order M affects results practically has been repeatedly raised; therefore, we will close our discussion by looking at a Cox model with increasing M for the German Breast Cancer Study Group‐2 data. All eight baseline variables were included in the linear predictor, and we fitted the model with orders M=1,…,30,35,40,45,50 of the Bernstein polynomial parameterizing the log‐cumulative baseline hazard function. In Fig. 6A, the log‐cumulative baseline hazard functions start with a linear function (M=1) and quickly approach a function that is essentially a smoothed version of the Nelson‐Aalen‐Breslow estimator plotted in red. In Fig. 6B, the trajectories of the estimated regression coefficients become very similar to the partial likelihood estimates as M increased. For M⩾10, for instance, the results of the ‘semi‐parametric’ and the ‘fully parametric’ Cox models are practically equivalent. An extensive collection of such head‐to‐head comparisons of most likely transformations with their classical counterparts can be found in Hothorn (2017b). Our work for this paper and practical experience with its reference software implementation convinced us that rethinking classical models in terms of fully parametric transformations is intellectually and practically a fruitful exercise.
[Figure 6. German Breast Cancer Study Group-2 Cox models with increasing order M of the Bernstein polynomial: (A) log-cumulative baseline hazard functions, approaching a smoothed version of the Nelson–Aalen–Breslow estimator; (B) regression coefficient estimates obtained for varying M, represented as dots; the horizontal lines represent the partial likelihood estimates.]
6.1 Computational details
A reference implementation of most likely transformations is available in the mlt package (Hothorn, 2017a). All data analyses can be reproduced in the dynamic document Hothorn (2017b). Augmented Lagrangian minimization, implemented in the auglag() function of package alabama (Varadhan, 2015), was used for optimizing the log-likelihood. Package gamboostLSS (version 1.2-2; Hofner et al., 2016) was used to fit GAMLSS models, and kernel density and distribution estimation was performed using package np (version 0.60-2; Racine & Hayfield, 2014). All computations were performed using R version 3.4.0 (R Core Team, 2017). Additional applications are described in an extended version of this paper (Hothorn et al., 2017).
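The generic optimization pattern is sketched below on a toy problem; nll stands in for a negative log-likelihood and ui for a constraint matrix such as DM+1.

```r
## Handing a linearly constrained problem to alabama::auglag(), a toy sketch:
library("alabama")
nll <- function(theta) sum((theta - c(1, 2))^2)   ## stand-in for -logLik
ui  <- matrix(c(-1, 1), nrow = 1)                 ## theta_2 - theta_1 > 0
fit <- auglag(par = c(0, 0.1), fn = nll,
              hin = function(theta) drop(ui %*% theta))  ## hin(theta) > 0
fit$par                                           ## ~ c(1, 2); constraint inactive
```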
Acknowledgements
Torsten Hothorn received financial support by Deutsche Forschungsgemeinschaft under grant number HO 3242/4‐1. We thank Karen A. Brune for improving the language.