Most Likely Transformations

We propose and study properties of maximum likelihood estimators in the class of conditional transformation models. Based on a suitable explicit parameterisation of the unconditional or conditional transformation function, we establish a cascade of increasingly complex transformation models that can be estimated, compared and analysed in the maximum likelihood framework. Models for the unconditional or conditional distribution function of any univariate response variable can be set up and estimated in the same theoretical and computational framework simply by choosing an appropriate transformation function and parameterisation thereof. The ability to evaluate the distribution function directly allows us to estimate models based on the exact likelihood, especially in the presence of random censoring or truncation. For discrete and continuous responses, we establish the asymptotic normality of the proposed estimators. A reference software implementation of maximum likelihood-based estimation for conditional transformation models allowing the same flexibility as the theory developed here was employed to illustrate the wide range of possible applications.


Introduction
In a broad sense, we can understand all statistical models as models of distributions or certain characteristics thereof, especially the mean. All distributions $P_Y$ of an at least ordered response $Y$ can be characterized by their distribution, quantile, density, odds, hazard or cumulative hazard functions. In a fully parametric setting, all these functions are specified up to unknown parameters, and the ease of interpretation can guide us in looking at the appropriate function. In the semi-parametric and non-parametric contexts, however, the question arises of how we can obtain an estimate of one of these functions without assuming much about their shape. For the direct estimation of distribution functions, we deal with monotonic functions in the unit interval, whereas for densities, we need to make sure that the estimator integrates to one. The hazard function comes with a positivity constraint, and monotonicity is required for the positive cumulative hazard function. These computationally inconvenient restrictions disappear completely only when the log-hazard function is estimated, and this explains the plethora of research papers following this path. However, the lack of any structure in the log-hazard function comes at a price. Too-erratic behaviour of estimates of the log-hazard function has to be prevented by some smoothness constraint, which makes classical likelihood inference impossible. The novel characterization and subsequent estimation of distributions via their transformation function in a broad class of transformation models developed in this paper can be interpreted as a compromise between structure (monotonicity) and ease of parameterization, estimation and inference. This transformation approach to modelling and estimation allows standard likelihood inference in a large class of models that have so far commonly been dealt with by other inference procedures.
Since the introduction of transformation models based on non-linear transformations of some response variable by Box & Cox (1964), this attractive class of models has received much interest. In regression problems, transformation models can be understood as models for the conditional distribution function and are sometimes referred to as 'distribution regression', in contrast to their 'quantile regression' counterpart (Chernozhukov et al., 2013). Traditionally, the models were actively studied and applied in the analysis of ordered categorical or censored responses. Recently, transformation models for the direct estimation of conditional distribution functions for arbitrary responses received interest in the context of counterfactual distributions (Chernozhukov et al., 2013), probabilistic forecasting (Gneiting & Katzfuss, 2014), distribution and quantile regression (Leorato & Peracchi, 2015; Rothe & Wied, 2013), probabilistic index models (Thas et al., 2012) and conditional transformation models (Hothorn et al., 2014). The core idea of any transformation model is the application of a strictly monotonic transformation function $h$ for the reformulation of an unknown distribution function $P(Y \le y)$ as $P(h(Y) \le h(y))$, where the unknown transformation function $h$ is estimated from the data. Transformation models have received attention especially in situations where the likelihood contains terms involving the conditional distribution function $P(Y \le y \mid X = x) = F_Z(h(y \mid x))$ with inverse link function $F_Z$, most importantly for censored, truncated and ordered categorical responses. For partially linear transformation models with transformation function $h(y \mid x) = h_Y(y) + h_x(x)$, much emphasis has been given to estimation procedures treating the baseline transformation $h_Y$ (e.g. the log-cumulative baseline hazard function in the Cox model) as a high-dimensional nuisance parameter.
Prominent members of these estimation procedures are the partial likelihood estimator and approaches influenced by the estimation equations introduced by Cheng et al. (1995). Once an estimate for the shift $h_x$ is obtained, the baseline transformation $h_Y$ is then typically estimated by the non-parametric maximum likelihood estimator (see, e.g. Cheng et al., 1997). An overview of the extensive literature on the simultaneous non-parametric maximum likelihood estimation of $h_Y$ and $h_x$, that is, estimation procedures not requiring an explicit parameterization of $h_Y$, for censored continuous responses is given in Zeng & Lin (2007).
An explicit parameterization of $h_Y$ is common in models for ordinal responses (Tutz, 2012). For survival times, Kooperberg et al. (1995) introduced a cubic spline parameterization of the log-conditional hazard function with the possibility of response-varying effects and estimated the corresponding models by maximum likelihood. Crowther & Lambert (2014) followed up on this suggestion and used restricted cubic splines. Many authors studied penalized likelihood approaches for spline approximations of the baseline hazard function in a Cox model, for example, Ma et al. (2014). Less frequently, the transformation function $h_Y$ was modelled directly. Mallick & Walker (2003), Chang et al. (2005) and McLain & Ghosh (2013) used Bernstein polynomials for $h_Y$, and Royston & Parmar (2002) proposed a maximum likelihood approach using cubic splines for modelling $h_Y$ and also time-varying effects. The connection between these different transformation models is difficult to see because most authors present their models in the relatively narrow contexts of survival or ordinal data. The lack of a general understanding of transformation models made the development of novel approaches in this model class burdensome. Hothorn et al. (2014) decoupled the parameterization of the conditional transformation function $h(y \mid x)$ from the estimation procedure and showed that many interesting and novel models can be understood as transformation models. The boosting-based optimization of proper scoring rules, however, was only developed for uncensored and right-censored observations in the absence of truncation and requires the numerical approximation of the true target function. In a similar spirit, Chernozhukov et al. (2013) applied the connection

The likelihood of transformations
Let $(\Omega, \mathcal{A}, \mathbb{P})$ denote a probability space and $(\Xi, \mathcal{C})$ a measurable space with at least ordered sample space $\Xi$. We are interested in inference about the distribution $P_Y$ of a random variable $Y$, that is, the probability space $(\Xi, \mathcal{C}, P_Y)$ defined by the $\mathcal{A}$-$\mathcal{C}$ measurable function $Y : \Omega \to \Xi$. For the sake of notational simplicity, we present our results for the unconditional case first; regression models are discussed in Section 4.2. The distribution $P_Y = f_Y \mu$ is dominated by some measure $\mu$ and characterized by its density function $f_Y$ and distribution function $F_Y$. For notational convenience, we assume strict monotonicity of $F_Y$, that is, $F_Y(y_1) < F_Y(y_2)\ \forall y_1 < y_2 \in \Xi$. Our aim is to obtain an estimate $\hat F_{Y,N}$ of the distribution function $F_Y$ from a random sample $Y_1, \dots, Y_N \sim P_Y$, independent and identically distributed. In the following, we will show that one can always write this potentially complex distribution function $F_Y$ as the composition of a much simpler and a priori specified distribution function $F_Z$ and a strictly monotonic transformation function $h$. The task of estimating $F_Y$ is then reduced to obtaining an estimate $\hat h_N$. The latter exercise, as we will show in this paper, is technically and conceptually attractive.
Let $(\mathbb{R}, \mathcal{B})$ denote the Euclidean space with Borel $\sigma$-algebra and $Z : \Omega \to \mathbb{R}$ an $\mathcal{A}$-$\mathcal{B}$ measurable function such that the distribution $P_Z = f_Z \mu_L$ is absolutely continuous ($\mu_L$ denotes the Lebesgue measure) in the probability space $(\mathbb{R}, \mathcal{B}, P_Z)$. Let $F_Z$ and $F_Z^{-1}$ denote the corresponding distribution and quantile functions. We furthermore assume $0 < f_Z(z) < \infty\ \forall z \in \mathbb{R}$, $F_Z(-\infty) = 0$ and $F_Z(\infty) = 1$ for a log-concave density $f_Z$, as well as the existence of the first two derivatives of the density $f_Z(z)$ with respect to $z$; both derivatives shall be bounded. We do not allow any unknown parameters for this distribution. Possible choices include the standard normal, standard logistic (SL) and minimum extreme value (MEV) distributions with distribution functions $F_Z(z) = \Phi(z)$, $F_Z(z) = F_{\mathrm{SL}}(z) = (1 + \exp(-z))^{-1}$ and $F_Z(z) = F_{\mathrm{MEV}}(z) = 1 - \exp(-\exp(z))$, respectively. In the first step, we will show that there always exists a unique and strictly monotonic transformation $g$ such that the unknown and potentially complex distribution $P_Y$ that we are interested in can be generated from the simple and known distribution $P_Z$ via $P_Y = P_{g \circ Z}$. More formally, let $g : \mathbb{R} \to \Xi$ denote a $\mathcal{B}$-$\mathcal{C}$ measurable function. The composition $g \circ Z$ is a random variable on $(\Xi, \mathcal{C}, P_{g \circ Z})$. We can now formulate the existence and uniqueness of $g$ as a corollary to the probability integral transform.
Corollary 1. For all random variables $Y$ and $Z$, there exists a unique strictly monotonically increasing transformation $g$ such that $P_Y = P_{g \circ Z}$.
we get the uniqueness of $h$ and therefore $g$. The quantile function $F_Z^{-1}$ and the distribution function $F_Y$ exist by assumption and are both strictly monotonic and right continuous. Therefore, $h$ is strictly monotonic and right continuous, and so is $g$.
This result for absolutely continuous random variables $Y$ can be found in many textbooks (e.g. Lindsey, 1996); Corollary 1 also covers the discrete case.
Corollary 3. For the counting measure $\mu = \mu_C$, $h = F_Z^{-1} \circ F_Y$ is a right-continuous step function because $F_Y$ is a right-continuous step function with steps at $y \in \Xi$.
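The construction behind Corollary 1 is explicit: $h = F_Z^{-1} \circ F_Y$ can be written down directly whenever $F_Y$ is known. A minimal numerical sketch in Python (not part of the original development), assuming a standard logistic $F_Z$ and a standard exponential $F_Y$ purely for illustration:

```python
import math

def F_Z(z):              # standard logistic distribution function
    return 1.0 / (1.0 + math.exp(-z))

def F_Z_inv(p):          # its quantile function
    return math.log(p / (1.0 - p))

def F_Y(y):              # target distribution: standard exponential (illustrative)
    return 1.0 - math.exp(-y)

def h(y):                # transformation h = F_Z^{-1} o F_Y, maps Y to the Z scale
    return F_Z_inv(F_Y(y))

# h is strictly increasing, and composing back recovers F_Y exactly:
# F_Y(y) = F_Z(h(y)) for all y in the support.
for y in (0.1, 0.5, 1.0, 2.5):
    assert abs(F_Z(h(y)) - F_Y(y)) < 1e-12
```

The identity $F_Y = F_Z \circ h$ verified in the last loop is exactly what the likelihood-based estimation framework below exploits.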
We now characterize the distribution $F_Y$ by the corresponding transformation function $h$, set up the corresponding likelihood of such a transformation function and estimate the transformation function based on this likelihood. Let $\mathcal{H} = \{h : \Xi \to \mathbb{R} \mid \mathcal{C}\text{-}\mathcal{B} \text{ measurable}, h(y_1) < h(y_2)\ \forall y_1 < y_2 \in \Xi\}$ denote the space of all strictly monotonic transformation functions. With the transformation function $h$, we can evaluate $F_Y$ as $F_Y(y \mid h) = F_Z(h(y))\ \forall y \in \Xi$. Therefore, we only need to study the transformation function $h$; the inverse transformation $g = h^{-1}$ (used by Bickel et al., 1993, to define a 'group model') is not necessary in what follows. The density for absolutely continuous variables $Y$ ($\mu = \mu_L$) is now given by $f_Y(y \mid h) = f_Z(h(y)) h'(y)$. For discrete responses $Y$ ($\mu = \mu_C$) with finite sample space $\Xi = \{y_1, \dots, y_K\}$, the density is

$$f_Y(y_k \mid h) = \begin{cases} F_Z(h(y_1)) & k = 1 \\ F_Z(h(y_k)) - F_Z(h(y_{k-1})) & k = 2, \dots, K-1 \\ 1 - F_Z(h(y_{K-1})) & k = K, \end{cases}$$

and for countably infinite sample spaces $\Xi = \{y_1, y_2, y_3, \dots\}$, we get the density $f_Y(y_k \mid h) = F_Z(h(y_k)) - F_Z(h(y_{k-1}))$. With the conventions $F_Z(h(y_0)) := F_Z(h(-\infty)) := 0$ and $F_Z(h(y_K)) := F_Z(h(\infty)) := 1$, we use the more compact notation $f_Y(y_k \mid h) = F_Z(h(y_k)) - F_Z(h(y_{k-1}))$ in the sequel. For a given transformation function $h$, the likelihood contribution of a datum $C = (\underline{y}, \bar{y}] \in \mathcal{C}$ is defined in terms of the distribution function (Lindsey, 1996)

$$L(h \mid Y \in (\underline{y}, \bar{y}]) := \int_{(\underline{y}, \bar{y}]} f_Y(y \mid h)\, d\mu(y) = F_Z(h(\bar{y})) - F_Z(h(\underline{y})).$$

This 'exact' definition of the likelihood applies to most practical situations of interest and, in particular, allows discrete and (conceptually) continuous as well as censored or truncated observations $C$. For a discrete response $y_k$, we have $\bar{y} = y_k$ and $\underline{y} = y_{k-1}$, such that $L(h \mid Y \in (\underline{y}, \bar{y}]) = F_Z(h(\bar{y})) - F_Z(h(\underline{y}))$. For absolutely continuous random variables $Y$, we almost always observe an imprecise datum $(\underline{y}, \bar{y}] \subset \mathbb{R}$ and, for short intervals $(\underline{y}, \bar{y}]$, approximate the exact likelihood $L(h \mid Y \in (\underline{y}, \bar{y}])$ by the term $(\bar{y} - \underline{y}) f_Y(y \mid h)$ or simply $f_Y(y \mid h)$ with $y = (\underline{y} + \bar{y})/2$ (Lindsey, 1999).
This approximation only works for relatively precise measurements, that is, short intervals. If longer intervals are observed, one speaks of 'censoring' and relies on the exact definition of the likelihood contribution instead of using the aforementioned approximation (Klein & Moeschberger, 2003). In summary, the likelihood contribution of a conceptually 'exact continuous' or left-censored, right-censored or interval-censored continuous or discrete observation $(\underline{y}, \bar{y}]$ is given by

$$L(h \mid Y \in (\underline{y}, \bar{y}]) \begin{cases} \approx f_Z(h(y))\, h'(y) & y = (\underline{y} + \bar{y})/2 \text{ ('exact continuous')} \\ = 1 - F_Z(h(\underline{y})) & \text{(right censored)} \\ = F_Z(h(\bar{y})) & \text{(left censored)} \\ = F_Z(h(\bar{y})) - F_Z(h(\underline{y})) & \text{(interval censored)} \end{cases}$$

under the assumption of random censoring. The likelihood is more complex under dependent censoring (Klein & Moeschberger, 2003), but we will not elaborate on this issue. The likelihood contribution $L(h \mid Y \in (y_{k-1}, y_k])$ of an ordered factor in category $y_k$ is equivalent to the term $L(h \mid Y \in (\underline{y}, \bar{y}])$ contributed by an interval-censored observation $(\underline{y}, \bar{y}]$ when category $y_k$ is defined by the interval $(\underline{y}, \bar{y}]$. Thus, the expression $F_Z(h(\bar{y})) - F_Z(h(\underline{y}))$ for the likelihood contribution reflects the equivalence of interval censoring and categorization at corresponding cut-off points.
For truncated observations in the interval $(y_l, y_r] \subset \Xi$, the aforementioned likelihood contribution is defined in terms of the distribution function conditional on the truncation

$$F_Y(y \mid Y \in (y_l, y_r]) = F_Z(h(y) \mid Y \in (y_l, y_r]) = \frac{F_Z(h(y))}{F_Z(h(y_r)) - F_Z(h(y_l))} \quad \forall y \in (y_l, y_r],$$

and thus, the likelihood contribution changes to (Klein & Moeschberger, 2003)

$$\frac{L(h \mid Y \in (\underline{y}, \bar{y}])}{F_Z(h(y_r)) - F_Z(h(y_l))}.$$

It is important to note that the likelihood is always defined in terms of a distribution function (Lindsey, 1999), and it therefore makes sense to directly model the distribution function of interest. The ability to uniquely characterize this distribution function by the transformation function $h$ gives rise to the following definition of an estimator $\hat h_N$.
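The censored and truncated likelihood contributions above can be sketched in a few lines of Python; the standard logistic error distribution and the transformation $h(y) = \log(y)$ are hypothetical choices for illustration only:

```python
import math

# Likelihood contributions under the exact definition
# L(h | Y in (ylo, yhi]) = F_Z(h(yhi)) - F_Z(h(ylo)), with the conventions
# F_Z(h(-inf)) = 0 and F_Z(h(inf)) = 1 covering left/right censoring.

def F_Z(z):                          # standard logistic error distribution
    return 1.0 / (1.0 + math.exp(-z))

def lik_contrib(h, ylo, yhi):
    lower = 0.0 if ylo == -math.inf else F_Z(h(ylo))
    upper = 1.0 if yhi == math.inf else F_Z(h(yhi))
    return upper - lower

def lik_contrib_truncated(h, ylo, yhi, yl, yr):
    # truncation to (yl, yr]: divide by F_Z(h(yr)) - F_Z(h(yl))
    return lik_contrib(h, ylo, yhi) / (F_Z(h(yr)) - F_Z(h(yl)))

h = lambda y: math.log(y)            # hypothetical transformation, y > 0

L_interval = lik_contrib(h, 1.0, 2.0)                    # interval-censored (1, 2]
L_right = lik_contrib(h, 2.0, math.inf)                  # right-censored at 2
L_trunc = lik_contrib_truncated(h, 1.0, 2.0, 0.5, 4.0)   # truncated to (0.5, 4]
```

The same function covers discrete responses by setting $(\underline{y}, \bar{y}] = (y_{k-1}, y_k]$, mirroring the equivalence of interval censoring and categorization noted above.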
Definition 1 (Most likely transformation). Let $C_1, \dots, C_N$ denote an independent sample of possibly randomly censored or truncated observations from $P_Y$. The estimator

$$\hat h_N := \operatorname*{arg\,max}_{h \in \mathcal{H}} \sum_{i=1}^{N} \log(L(h \mid Y \in C_i))$$

is called the most likely transformation.
Log-concavity of $f_Z$ ensures concavity of the log-likelihood (except when all observations are right censored) and thus ensures the existence and uniqueness of $\hat h_N$. Many distributions are defined by a transformation function $h$, for example, the Box-Cox power exponential family (Stasinopoulos & Rigby, 2007), the sinh-arcsinh distributions (Jones & Pewsey, 2009) or the T-X family of distributions (Alzaatreh et al., 2013). In what follows, we do not assume any specific form of the transformation function but parameterize $h$ in terms of basis functions. We now introduce such a parameterization, a corresponding family of distributions, a maximum likelihood estimator and a large class of models for unconditional and conditional distributions.

Transformation analysis
We parameterize the transformation function $h(y)$ as a linear function of its basis-transformed argument $y$ using a basis function $a : \Xi \to \mathbb{R}^P$, such that $h(y) = a(y)^\top \vartheta$, $\vartheta \in \mathbb{R}^P$. The choice of the basis function $a$ is problem specific and will be discussed in Section 4. The likelihood $L$ only requires evaluation of $h$; only the approximation of the likelihood of an 'exact continuous' observation by the Lebesgue density makes the evaluation of the first derivative of $h(y)$ with respect to $y$ necessary. In this case, the derivative is given by $h'(y) = a'(y)^\top \vartheta$, and we assume that $a'$ is available. In the following, we will write $h = a^\top \vartheta$ and $h' = a'^\top \vartheta$ for the transformation function and its first derivative, omitting the argument $y$, and we assume that both functions are bounded away from $-\infty$ and $\infty$. For a specific choice of $F_Z$ and $a$, the transformation family of distributions consists of all distributions $P_Y$ whose distribution function $F_Y$ is given as the composition $F_Z \circ a^\top \vartheta$; this family can be formally defined as follows.
Definition 2 (Transformation family). The distribution family $P_{Y,\Theta} = \{F_Z \circ a^\top \vartheta \mid \vartheta \in \Theta\}$ with parameter space $\Theta = \{\vartheta \in \mathbb{R}^P \mid a^\top \vartheta \in \mathcal{H}\}$ is called the transformation family of distributions $P_{Y,\vartheta}$ with transformation functions $a^\top \vartheta \in \mathcal{H}$, $\mu$-densities $f_Y(y \mid \vartheta)$, $y \in \Xi$, and error distribution function $F_Z$.
The classical definition of a transformation family relies on the idea of invariant distributions, that is, only the parameters of a distribution are changed by a transformation function but the distribution itself is not changed. The normal family characterized by affine transformations is the most well-known example (e.g. Fraser, 1968; Lindsey, 1996). Here, we explicitly allow and encourage transformation functions that change the shape of the distribution. The transformation function $a^\top \vartheta$ is, at least in principle, flexible enough to generate any distribution function $F_Y = F_Z \circ a^\top \vartheta$ from the distribution function $F_Z$. We borrow the term 'error distribution' function for $F_Z$ from Fraser (1968), because $Z$ can be understood as an error term in some of the models discussed in Section 4. The problem of estimating the unknown transformation function $h$, and thus the unknown distribution function $F_Y$, reduces to the problem of estimating the parameter vector $\vartheta$ through maximization of the likelihood function. We assume that the basis function $a$ is such that the parameters $\vartheta$ are identifiable.
Based on the maximum likelihood estimator $\hat\vartheta_N$, we define plug-in estimators of the most likely transformation function and the corresponding estimator of our target distribution as $\hat h_N := a^\top \hat\vartheta_N$ and $\hat F_{Y,N} := F_Z \circ \hat h_N$. Because the problem of estimating an unknown distribution function is now embedded in the maximum likelihood framework, the asymptotic analysis benefits from standard results on the asymptotic behaviour of maximum likelihood estimators. We begin with deriving the score function and Fisher information. The score contribution of an 'exact continuous' observation $y = (\underline{y} + \bar{y})/2$ from an absolutely continuous distribution is approximated by the gradient of the log-density,

$$s(\vartheta \mid Y = y) \approx \frac{\partial \log f_Y(y \mid \vartheta)}{\partial \vartheta} = a(y)\, \frac{f_Z'(a(y)^\top \vartheta)}{f_Z(a(y)^\top \vartheta)} + \frac{a'(y)}{a'(y)^\top \vartheta}.$$

For an interval-censored or discrete observation $\underline{y}$ and $\bar{y}$ (the constant terms $F_Z(h(-\infty)) = 0$ and $F_Z(h(\infty)) = 1$ do not contribute to the gradient), the score contribution is

$$s(\vartheta \mid Y \in (\underline{y}, \bar{y}]) = \frac{f_Z(a(\bar{y})^\top \vartheta)\, a(\bar{y}) - f_Z(a(\underline{y})^\top \vartheta)\, a(\underline{y})}{F_Z(a(\bar{y})^\top \vartheta) - F_Z(a(\underline{y})^\top \vartheta)}.$$
For a truncated observation, the score function is $s(\vartheta \mid Y \in (\underline{y}, \bar{y}]) - s(\vartheta \mid Y \in (y_l, y_r])$. The contribution of an 'exact continuous' observation $y$ from an absolutely continuous distribution to the Fisher information is approximately

$$F(\vartheta \mid Y = y) \approx -\frac{\partial^2 \log f_Y(y \mid \vartheta)}{\partial \vartheta\, \partial \vartheta^\top} = -\left( a(y) a(y)^\top \left\{ \frac{f_Z''(a(y)^\top \vartheta)}{f_Z(a(y)^\top \vartheta)} - \left[ \frac{f_Z'(a(y)^\top \vartheta)}{f_Z(a(y)^\top \vartheta)} \right]^2 \right\} - \frac{a'(y) a'(y)^\top}{(a'(y)^\top \vartheta)^2} \right).$$

For a censored or discrete observation, we have the following contribution to the Fisher information:

$$F(\vartheta \mid Y \in (\underline{y}, \bar{y}]) = -\left( \frac{f_Z'(a(\bar{y})^\top \vartheta)\, a(\bar{y}) a(\bar{y})^\top - f_Z'(a(\underline{y})^\top \vartheta)\, a(\underline{y}) a(\underline{y})^\top}{F_Z(a(\bar{y})^\top \vartheta) - F_Z(a(\underline{y})^\top \vartheta)} - s(\vartheta \mid Y \in (\underline{y}, \bar{y}])\, s(\vartheta \mid Y \in (\underline{y}, \bar{y}])^\top \right).$$

For a truncated observation, the Fisher information is given by $F(\vartheta \mid Y \in (\underline{y}, \bar{y}]) - F(\vartheta \mid Y \in (y_l, y_r])$. We will first discuss the asymptotic properties of the maximum likelihood estimator $\hat\vartheta_N$ in the parametric setting with fixed parameters $\vartheta$ in both the discrete and continuous case. For continuous variables $Y$ and a transformation function parameterized using a Bernstein polynomial, results for sieve maximum likelihood estimation, where the number of parameters increases with $N$, are then discussed in Section 3.2.
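For the 'exact continuous' case, the score contribution can be checked against numerical differentiation. The sketch below assumes standard logistic errors (where $f_Z'/f_Z = 1 - 2F_Z$) and a hypothetical quadratic polynomial basis $a(y) = (1, y, y^2)^\top$; both choices are purely illustrative:

```python
import math

expit = lambda z: 1.0 / (1.0 + math.exp(-z))

def a(y):  return [1.0, y, y * y]        # basis a(y) (illustrative choice)
def da(y): return [0.0, 1.0, 2.0 * y]    # its derivative a'(y)

def loglik_exact(theta, y):
    # 'exact continuous' log-likelihood: log f_Z(h(y)) + log h'(y),
    # with f_SL(z) = F_SL(z) (1 - F_SL(z)) for standard logistic errors
    h = sum(ai * t for ai, t in zip(a(y), theta))
    dh = sum(ai * t for ai, t in zip(da(y), theta))
    return math.log(expit(h)) + math.log(1.0 - expit(h)) + math.log(dh)

def score_exact(theta, y):
    # analytic score: a(y) * f_Z'(h)/f_Z(h) + a'(y)/h', and
    # f_Z'/f_Z = 1 - 2 F_SL for the standard logistic
    h = sum(ai * t for ai, t in zip(a(y), theta))
    dh = sum(ai * t for ai, t in zip(da(y), theta))
    w = 1.0 - 2.0 * expit(h)
    return [ai * w + dai / dh for ai, dai in zip(a(y), da(y))]

# verify the analytic score against central finite differences
theta, y, eps = [0.1, 0.5, 0.2], 1.3, 1e-6
for j in range(3):
    tp = theta[:]; tp[j] += eps
    tm = theta[:]; tm[j] -= eps
    num = (loglik_exact(tp, y) - loglik_exact(tm, y)) / (2 * eps)
    assert abs(num - score_exact(theta, y)[j]) < 1e-6
```

The same finite-difference check applied to the score gives a numerical approximation of the Fisher information contribution.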

Parametric inference
Conditions on the densities of the error distribution $f_Z$ and the basis functions $a$ ensuring consistency and asymptotic normality of the sequence of maximum likelihood estimators $\hat\vartheta_N$ and an estimator of their asymptotic covariance matrix are given in the following three theorems. Because of the full parameterization of the model, the proofs are simple standard results for likelihood asymptotics, and a more complex analysis (as required for estimation equations in the presence of a nuisance parameter $h_Y$, e.g. in Cheng et al., 1995) is not necessary. We will restrict ourselves to absolutely continuous or discrete random variables $Y$, where the likelihood is given in terms of the density $f_Y(y \mid \vartheta)$. Furthermore, we will only study the case of a correctly specified transformation $h = a^\top \vartheta$ and refer the reader to Hothorn et al. (2014), where consistency results for arbitrary $h$ are given.
Proof. The log-likelihood is continuous in $\vartheta$, and because of (A2), each log-likelihood contribution is dominated by an integrable function. Thus, the result follows from van der Vaart (1998) (Theorem 5.8 with Example 19.7; see note at bottom of page 46).
Remark 1. Assumption (A1) is made for convenience, and relaxations of such a condition are given in van de Geer (2000) or van der Vaart (1998). The assumptions in (A2) are rather weak: the first one holds if the functions a are not arbitrarily ill posed, and the second one holds if the function

asymptotically normal with mean zero and covariance matrix $\Sigma_{\vartheta_0}$.
Proof. Because the map $\vartheta \mapsto \sqrt{f_Y(y \mid \vartheta)}$ is continuously differentiable in $\vartheta$ for all $y$ in both the discrete and absolutely continuous case and the Fisher information matrix is continuous in $\vartheta$ as given in (1) and (2), the transformation family $P_{Y,\Theta}$ is differentiable in quadratic mean by Lemma 7.6 in van der Vaart (1998). Furthermore, assumptions (A4) and (A5) ensure that the expected Fisher information matrix is non-singular at $\vartheta_0$. With the consistency and (A3), the result follows from Theorem 5.39 in van der Vaart (1998).
Remark 2. Assumption (A4) is valid for the densities $f_Z$ of the normal, logistic and MEV distributions. The Fisher information (3) and (4) evaluated at the maximum likelihood estimator $\hat\vartheta_N$ can be used to estimate the covariance matrix $\Sigma_{\vartheta_0}$.
Theorem 3. Under the assumptions of Theorem 2 and assuming $E_{\vartheta_0} |F(\vartheta_0 \mid Y)| < \infty$, a consistent estimator for $\Sigma_{\vartheta_0}$ is given by

$$\hat\Sigma_N := \left( N^{-1} \sum_{i=1}^{N} F(\hat\vartheta_N \mid Y_i) \right)^{-1}.$$

Proof. With the law of large numbers, we have $N^{-1} \sum_{i=1}^{N} F(\vartheta_0 \mid Y_i) \to E_{\vartheta_0} F(\vartheta_0 \mid Y)$ almost surely. Because the map $\vartheta \mapsto F(\vartheta \mid y)$ is continuous for all $y$ (as can be seen from (3) and (4)), the result follows with Theorem 1.
Based on Theorems 1-3, we can perform standard likelihood inference on the model parameters $\vartheta$. In particular, we can construct confidence intervals and confidence bands for the conditional distribution function from confidence intervals and bands for the linear functions $a^\top \vartheta$. We complete this part by formally defining the class of transformation models.
Our definition of transformation models as $(F_Z, a, \vartheta)$ is strongly tied to the idea of structural inference (Fraser, 1968) and group models (Bickel et al., 1993). Fraser (1968) described a measurement model $P_Y$ for $Y$ by an error distribution $P_Z$ and a structural equation $Y = g \circ Z$, where $g$ is a linear function, thereby extending the location-scale family $Y = \alpha + \sigma Z$. Group models consist of distributions generated by possibly non-linear $g$. The main difference to these classical approaches is that we parameterize $h$ instead of $g = h^{-1}$. By extending the linear transformation functions $g$ dealt with by Fraser (1968) to non-linear transformations, we approximate the potentially non-linear transformation function $h = g^{-1} = F_Z^{-1} \circ F_Y$ by $a^\top \vartheta$, with subsequent estimation of the parameters $\vartheta$. For given parameters $\vartheta$, a sample from $P_Y$ can be drawn by the probability integral transform, that is, $Z_1, \dots, Z_N \sim P_Z$ is drawn independently and then $Y_i = \inf\{y \in \Xi \mid a(y)^\top \vartheta \ge Z_i\}$.
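The probability integral transform sampler in the last sentence can be sketched as follows; the logistic error distribution, the transformation $h(y) = \log y$ and the finite grid standing in for $\Xi$ are all illustrative assumptions, not part of the theory above:

```python
import bisect
import math
import random

def F_Z_inv(p):                      # standard logistic quantile function
    return math.log(p / (1.0 - p))

def h(y):                            # a(y)^T theta; here simply log(y)
    return math.log(y)

grid = [0.01 * k for k in range(1, 10001)]   # grid approximating (0, 100]
h_grid = [h(y) for y in grid]                # h is increasing on the grid

def sample_y(n, seed=1):
    # Y_i = inf{y in grid : h(y) >= Z_i}, with Z_i ~ standard logistic
    rng = random.Random(seed)
    ys = []
    for _ in range(n):
        z = F_Z_inv(rng.random())
        idx = bisect.bisect_left(h_grid, z)  # smallest y with h(y) >= z
        ys.append(grid[min(idx, len(grid) - 1)])
    return ys

ys = sample_y(1000)
```

With $h = \log$ and logistic errors, the draws approximate a log-logistic distribution whose median is $\exp(0) = 1$; draws beyond the grid are clamped to its endpoints, which is a limitation of the finite stand-in for $\Xi$.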

Non-parametric inference
For continuous responses $Y$, any unknown transformation $h$ can be approximated by Bernstein polynomials of increasing order (Farouki, 2012). In summary, the same limiting distribution arises under both the parametric and the non-parametric paradigm for transformation functions parameterized or approximated using Bernstein polynomials, respectively. In the latter case, the target is the best approximating transformation function in the class of Bernstein polynomials, say $h^\star_N$ (where the index $N$ indicates that we use a more complex approximation when $N$ increases). If the approximation error $h^\star_N - h$ is of smaller order than the convergence rate of the estimator, the estimator's target becomes the true underlying transformation function $h$; otherwise, a bias for estimating $h$ remains.

Applications
The definition of transformation models tailored for specific situations 'only' requires the definition of a suitable basis function a and a choice of F Z . In this section, we will discuss specific transformation models for unconditional and conditional distributions of ordered categorical, discrete and continuous responses Y . Note that the likelihood function L allows all these models to be fitted to arbitrarily censored or truncated responses; for brevity, we will not elaborate on the details.

Unconditional transformation models
Finite sample space. For ordered categorical responses $Y$ from a finite sample space $\Xi = \{y_1, \dots, y_K\}$, we assign one parameter to each element of the sample space except $y_K$. This corresponds to the basis function $a(y_k) = e_{K-1}(k)$, where $e_{K-1}(k)$ is the unit vector of length $K-1$, with its $k$th element being one. The transformation function $h$ is

$$h(y_k) = e_{K-1}(k)^\top \vartheta = \vartheta_k, \quad k = 1, \dots, K-1, \qquad \vartheta_1 < \cdots < \vartheta_{K-1}.$$

This parameterization underlies the common proportional odds and proportional hazards models for ordered categorical data (Tutz, 2012). Note that monotonicity of $h$ is guaranteed by the $K-2$ linear constraints $\vartheta_2 - \vartheta_1 > 0, \dots, \vartheta_{K-1} - \vartheta_{K-2} > 0$ when constrained optimization is performed. In the absence of censoring or truncation and with $\vartheta_0 = -\infty$ and $\vartheta_K = \infty$, we obtain the maximum likelihood estimator for $\vartheta$ as follows. The vector of relative frequencies $\hat\pi_k = N^{-1} \sum_{i=1}^N 1(Y_i = y_k)$, $1 \le k < K$, maximizes the equivalent multinomial (or empirical) log-likelihood $\sum_{i=1}^N \log(\pi_{k(i)})$, and we can rewrite this estimator as $\hat\vartheta_k = F_Z^{-1}(\sum_{j \le k} \hat\pi_j)$. The estimated distribution function $\hat F_{Y,N} = F_Z \circ \hat h_N$ is invariant with respect to $F_Z$. Assumption (A4) is valid for these basis functions because we have $E_{\vartheta_0}(e_{K-1}(Y) e_{K-1}(Y)^\top) = \mathrm{diag}(P(Y = y_k))$, $1 \le k < K$, for $Y \sim P_{Y,\vartheta_0}$. If we define the sample space $\Xi$ as the set of unique observed values and the probability measure as the empirical cumulative distribution function (ECDF), putting mass $N^{-1}$ on each observation, we see that this particular parameterization is equivalent to an empirical likelihood approach, and we get $\hat h_N = F_Z^{-1} \circ \mathrm{ECDF}$. Note that although the transformation function depends on the choice of $F_Z$, the estimated distribution function $\hat F_{Y,N} = F_Z \circ \hat h_N = \mathrm{ECDF}$ does not and is simply the non-parametric empirical maximum likelihood estimator. A smoothed version of this estimator for continuous responses is discussed in the next paragraph.
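The invariance just described can be sketched numerically with a small hypothetical sample of $K = 3$ ordered categories: the estimated parameters $\hat\vartheta_k = F_Z^{-1}(\mathrm{ECDF}(y_k))$ depend on $F_Z$, but composing back with $F_Z$ recovers the ECDF for every choice:

```python
import math
from collections import Counter

obs = [1, 1, 2, 3, 3, 3, 2, 1, 2, 3]   # hypothetical ordered sample, y_1 < y_2 < y_3
K, N = 3, len(obs)

counts = Counter(obs)
ecdf, cum = [], 0
for k in range(1, K + 1):
    cum += counts[k]
    ecdf.append(cum / N)                # ECDF at y_1, ..., y_K

# two different error distributions and their quantile functions
F_SL = lambda z: 1.0 / (1.0 + math.exp(-z))
F_SL_inv = lambda p: math.log(p / (1.0 - p))
F_MEV = lambda z: 1.0 - math.exp(-math.exp(z))
F_MEV_inv = lambda p: math.log(-math.log(1.0 - p))

# hat h_N = F_Z^{-1} o ECDF differs between the two choices ...
theta_sl = [F_SL_inv(p) for p in ecdf[:-1]]
theta_mev = [F_MEV_inv(p) for p in ecdf[:-1]]

# ... but F_Z o hat h_N is the ECDF in both cases
back_sl = [F_SL(t) for t in theta_sl]
back_mev = [F_MEV(t) for t in theta_mev]
```

Both `back_sl` and `back_mev` equal the ECDF at $y_1, \dots, y_{K-1}$, which is the invariance stated above.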
Infinite sample space. For continuous responses $Y$, the parameterization $h(y) = a(y)^\top \vartheta$, and thus also $\hat F_{Y,N}$, should be smooth in $y$; therefore, any polynomial or spline basis is a suitable choice for $a$. For the empirical experiments in Section 5, we applied Bernstein polynomials of order $M$, $a_{\mathrm{Bs},M}$ (for an overview, see Farouki, 2012). The question arises how the degree of the polynomial affects the estimated distribution function. On the one hand, the model $(\Phi, a_{\mathrm{Bs},1}, \vartheta)$ only allows linear transformation functions of a standard normal, and $F_Y$ is restricted to the normal family. On the other hand, $(\Phi, a_{\mathrm{Bs},N-1}, \vartheta)$ has one parameter for each observation, and $\hat F_{Y,N}$ is the non-parametric maximum likelihood estimator ECDF, which, by the Glivenko-Cantelli lemma, converges to $F_Y$. In this sense, we cannot choose a 'too large' value for $M$. This is a consequence of the monotonicity constraint on the estimator $a^\top \hat\vartheta_N$, which, in this extreme case, just interpolates the step function $F_Z^{-1} \circ \mathrm{ECDF}$. Empirical evidence for the insensitivity of results when $M$ is large can be found in Hothorn (2017b) and in the discussion.
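A sketch of the Bernstein parameterization (the order, interval and coefficients below are hypothetical): the basis is evaluated on a rescaled argument, and non-decreasing coefficients are a convenient sufficient condition for monotonicity of $a_{\mathrm{Bs},M}(y)^\top \vartheta$:

```python
from math import comb

# Bernstein basis of order M on [lo, hi]: a_{Bs,M}(y) in R^{M+1}
def bernstein_basis(y, M, lo, hi):
    t = (y - lo) / (hi - lo)
    return [comb(M, m) * t**m * (1 - t)**(M - m) for m in range(M + 1)]

def h(y, theta, lo=0.0, hi=1.0):
    M = len(theta) - 1
    return sum(b * t for b, t in zip(bernstein_basis(y, M, lo, hi), theta))

# Non-decreasing coefficients imply a monotone transformation, because
# h'(y) is a non-negative mixture of the differences theta_{m+1} - theta_m.
theta = [-2.0, -0.5, 0.1, 1.5, 3.0]    # M = 4, increasing coefficients
ys = [0.1 * k for k in range(11)]
vals = [h(y, theta) for y in ys]
assert all(v1 <= v2 for v1, v2 in zip(vals, vals[1:]))
# endpoint property of the Bernstein basis: h(lo) = theta_0, h(hi) = theta_M
```

The endpoint property makes the coefficients directly interpretable as values of the transformation function at the boundary of the interval.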

Conditional transformation models
In the following, we will discuss a cascade of increasingly complex transformation models where the transformation function $h$ may depend on explanatory variables $X \in \chi$. We are interested in estimating the conditional distribution of $Y$ given $X = x$. The corresponding distribution function can be written as $F_{Y \mid X = x}(y) = F_Z(h(y \mid x))$. The transformation function $h(\cdot \mid x) : \Xi \to \mathbb{R}$ is said to be conditional on $x$. Following the arguments presented in the proof of Corollary 1, it is easy to see that for each $x$, there exists a strictly monotonic transformation function $h(\cdot \mid x)$ such that $F_{Y \mid X = x} = F_Z \circ h(\cdot \mid x)$. Because this class of conditional transformation models and suitable parameterizations was introduced by Hothorn et al. (2014), we will only sketch the most important aspects here. Let $b : \chi \to \mathbb{R}^Q$ denote a basis transformation of the explanatory variables. The joint basis for both $y$ and $x$ is called $c : \Xi \times \chi \to \mathbb{R}^{d(P,Q)}$; its dimension $d(P,Q)$ depends on the way the two basis functions $a$ and $b$ are combined (e.g. $c = (a^\top, b^\top)^\top \in \mathbb{R}^{P+Q}$ or $c = (a^\top \otimes b^\top)^\top \in \mathbb{R}^{PQ}$). The conditional transformation function is now parameterized as $h(y \mid x) = c(y, x)^\top \vartheta$. One important special case is the simple transformation function $h(y \mid x) = h_Y(y) + h_x(x)$, where the explanatory variables only contribute a shift $h_x(x)$ to the conditional transformation function. Often this shift is assumed to be linear in $x$; therefore, we use the function $m(x) = b(x)^\top \beta = \tilde{x}^\top \beta$ to denote linear shifts, where $b(x) = \tilde{x}$ is one row of the design matrix without intercept. These simple models correspond to the joint basis $c(y, x) = (a(y)^\top, \tilde{x}^\top)^\top$. The results presented in Section 3, including Theorems 1-3, carry over in the fixed design case when $a$ is replaced by $c$.
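The two ways of combining the marginal bases can be sketched directly; the evaluated basis vectors below are hypothetical placeholders for $a(y)$ and $b(x)$:

```python
# Two ways of combining a(y) in R^P and b(x) in R^Q into a joint basis
# c(y, x) for h(y | x) = c(y, x)^T theta:
#   shift model:  c = (a(y)^T, b(x)^T)^T       in R^{P+Q}
#   interaction:  c = (a(y)^T kron b(x)^T)^T   in R^{P*Q}

def c_shift(a, b):
    return a + b                      # list concatenation: (a^T, b^T)^T

def c_kron(a, b):
    return [ai * bj for ai in a for bj in b]

a = [1.0, 0.7, 0.49]                  # a(y) evaluated at some y (placeholder)
b = [2.0, -0.5]                       # b(x) = x-tilde, no intercept (placeholder)

# In the shift model h(y|x) = a(y)^T th1 + b(x)^T th2, x moves h(.|x) up or
# down as a whole; the Kronecker basis lets the effect of x vary with y.
th = [0.2, 1.0, 2.0, 0.0, -0.5]       # (th1^T, th2^T)^T for the shift model
h_yx = sum(ci * t for ci, t in zip(c_shift(a, b), th))
```

The Kronecker variant is the one used for the non-proportional odds and hazards models later in this section.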
In the rest of this section, we will present classical models that can be embedded in the larger class of conditional transformation models and some novel models that can be implemented in this general framework. The normal linear regression model $Y = \tilde\alpha + \tilde{x}^\top \tilde\beta + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$, for example, is the transformation model $(\Phi, (y, 1, \tilde{x}^\top)^\top, (\vartheta_1, -\alpha, -\beta^\top)^\top)$ under the constraint $\vartheta_1 > 0$, or in more compact notation $(\Phi, (y, 1, \tilde{x}^\top)^\top, \vartheta)$. The parameters of the model are the inverse standard deviation $\vartheta_1 = \sigma^{-1}$ and the inverse negative coefficient of variation instead of the mean and variance of the original normal distribution. For 'exact continuous' observations, the likelihood $L$ is equivalent to least squares, which can be maximized with respect to $\alpha$ and $\beta$ without taking $\vartheta_1$ into account. This is not possible for censored or truncated observations, where we need to evaluate the conditional distribution function that depends on all parameters; this model is called the Type I Tobit model (although only the likelihood changes under censoring and truncation, the model itself does not). Using an alternative basis function $c$ would allow arbitrary non-normal conditional distributions of $Y$, and the simple shift model $c(y, x)^\top \vartheta = a(y)^\top \vartheta_1 + b(x)^\top \vartheta_2$ is then a generalization of additive models and leads to the interpretation of $a(Y)^\top \vartheta_1$ as a transformed response whose conditional mean is shifted by $-b(x)^\top \vartheta_2$.

Classical transformation models
The choice $a = (1, \log)^\top$ implements the log-normal model for $Y > 0$. Implementation of a Bernstein basis $a = a_{\mathrm{Bs},M}$ allows arbitrarily shaped distributions, that is, a transition from the normal family to the transformation family, and thus likelihood inference on $\vartheta_2$ without strict assumptions on the distribution of $Y$. The transformation $a_{\mathrm{Bs},M}(y)^\top \vartheta_1$ must increase monotonically in $y$. Maximization of the log-likelihood under the linear inequality constraint $D_{M+1} \vartheta_1 > 0$, with $D_{M+1}$ representing first-order differences, implements this requirement. The model $(F_{\mathrm{MEV}}, (1, \log, \tilde{x}^\top)^\top, (\vartheta_1, 1, -\beta^\top)^\top)$ with the coefficient of $\log$ fixed at one (and thus fixed transformation function $\log$) is the exponential AFT model because it implies an exponential distribution of $Y$. When this parameter is estimated from the data, the model $(F_{\mathrm{MEV}}, (1, \log, \tilde{x}^\top)^\top, \vartheta)$ is called the Weibull model, $(F_{\mathrm{SL}}, (1, \log, \tilde{x}^\top)^\top, \vartheta)$ is the log-logistic AFT model and $(\Phi, (1, \log, \tilde{x}^\top)^\top, \vartheta)$ is the log-normal AFT model. For a continuous (not necessarily positive) response $Y$, the model $F_{Y \mid X = x}(y) = F_{\mathrm{MEV}}(h_Y(y) - m(x))$ is called the proportional hazards, relative risk or Cox model. The transformation function $h_Y$ equals the log-cumulative baseline hazard function and is treated as a nuisance parameter in the partial likelihood framework, where only the regression coefficients $\beta$ are estimated. Given $\hat\beta$, non-parametric maximum likelihood estimators are typically applied to obtain $\hat h_Y$. Here, we parameterize this function as $h_Y(y) = \log(\Lambda_Y(y)) = a(y)^\top \vartheta_1$ (e.g. using $a = a_{\mathrm{Bs},M}$) and fit all parameters in the model $(F_{\mathrm{MEV}}, (a^\top, \tilde{x}^\top)^\top, (\vartheta_1^\top, -\beta^\top)^\top)$ simultaneously. The model is highly popular because $m(x)$ is the log-hazard ratio relative to $m(0)$. For the special case of right-censored survival times, this parameterization of the Cox model was studied theoretically and empirically by McLain & Ghosh (2013).
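That the Weibull model is the transformation model $(F_{\mathrm{MEV}}, (1, \log)^\top, \vartheta)$ can be verified numerically; the parameter values below are hypothetical:

```python
import math

def F_MEV(z):
    return 1.0 - math.exp(-math.exp(z))

# transformation model (F_MEV, (1, log y)^T, theta): h(y) = th1 + th2*log(y)
th1, th2 = -1.0, 2.0   # hypothetical parameters; th2 > 0 ensures monotonicity

def F_Y(y):
    return F_MEV(th1 + th2 * math.log(y))

# the same distribution written as a Weibull cdf with shape k and scale lam:
# F_MEV(th1 + th2*log(y)) = 1 - exp(-exp(th1) * y^th2) = 1 - exp(-(y/lam)^k)
k = th2
lam = math.exp(-th1 / th2)
def weibull_cdf(y):
    return 1.0 - math.exp(-((y / lam) ** k))

for y in (0.5, 1.0, 2.0, 5.0):
    assert abs(F_Y(y) - weibull_cdf(y)) < 1e-12
```

Adding a linear shift $-\tilde{x}^\top \beta$ inside $F_{\mathrm{MEV}}$ gives the Weibull AFT model in the same way.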
Changing the distribution function in the Cox model from $F_{\mathrm{MEV}}$ to $F_{\mathrm{SL}}$ results in the proportional odds model $(F_{\mathrm{SL}}, (a^\top, \tilde{x}^\top)^\top, (\vartheta_1^\top, \beta^\top)^\top)$; its name comes from the interpretation of $m(x)$ as the constant log-odds ratio of the odds $O_Y(y \mid X = x)$ and $O_Y(y \mid X = 0)$. An additive hazards model with conditional hazard function $\lambda_Y(y \mid X = x)$ can also be formulated as a transformation model (Aranda-Ordaz, 1983) under the additional constraint $\lambda_Y(y \mid X = x) > 0$. In this case, the function $a(y)^\top\vartheta_1 > 0$ is the positive baseline cumulative hazard function $\Lambda_Y(y \mid X = 0)$.

Linear transformation model
The transformation model $(F_Z, (a^\top, \tilde{x}^\top)^\top, (\vartheta_1^\top, \beta^\top)^\top)$ for any $a$ and $F_Z$ is called the linear transformation model and contains all models discussed in this section. Note that the transformation of the response, $h_Y(y) = a(y)^\top\vartheta_1$, is non-linear in all models of interest (AFT, Cox etc.); the term 'linear' only refers to a linear shift $m(x)$ in the explanatory variables. Partially linear or additive transformation models allow non-linear shifts as part of a partially smooth basis $b$, that is, in the form of an additive model. The number of constraints only depends on the basis $a$ but not on the explanatory variables.
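To make the linear transformation model concrete, here is a minimal Python sketch (the baseline transformation and coefficients are hypothetical; the paper's actual implementation is the R package mlt). It shows the key computational point made in the abstract: likelihood contributions of censored observations reduce to differences of the distribution function, so no density needs to be integrated:

```python
import numpy as np

def F_SL(z):
    """Standard logistic distribution function (proportional odds models)."""
    return 1.0 / (1.0 + np.exp(-z))

def trafo_cdf(hY, beta, y, x):
    """Linear transformation model: P(Y <= y | X = x) = F_Z(h_Y(y) - x'beta)."""
    return F_SL(hY(y) - x @ beta)

hY = lambda y: np.log(y)      # a toy increasing baseline transformation
beta = np.array([0.5])        # hypothetical shift coefficient

def loglik_interval(y_left, y_right, x):
    """Interval-censored contribution: log(F(y_right | x) - F(y_left | x))."""
    return np.log(trafo_cdf(hY, beta, y_right, x) - trafo_cdf(hY, beta, y_left, x))

def loglik_right_censored(y, x):
    """Right-censored contribution: log(1 - F(y | x))."""
    return np.log(1.0 - trafo_cdf(hY, beta, y, x))

x = np.array([1.0])
assert 0.0 < trafo_cdf(hY, beta, 2.0, x) < 1.0
assert loglik_right_censored(2.0, x) < 0.0
assert loglik_interval(1.0, 3.0, x) < 0.0
```

Swapping `F_SL` for another distribution function (e.g. the standard normal or minimum extreme value distribution) changes the model family without changing any of the likelihood code.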

Extension of classical transformation models
A common property of all classical transformation models is the additivity of the response transformation and the shift, that is, the decomposition $h(y \mid x) = h_Y(y) + h_x(x)$ of the conditional transformation function. This assumption is relaxed by the following extensions of the classical models. Allowing for deviations from this simple model is also the key aspect for the development of novel transformation models in the rest of this section.

Discrete non-proportional odds and hazards models
For ordered categorical responses, the model $F_{Y \mid X = x}(y_k) = F_Z(h_Y(y_k) - m_k(x))$ allows a category-specific shift $m_k(x) = \tilde{x}^\top\beta_k$; with $F_{\mathrm{SL}}$, this cumulative model is called the non-proportional odds model, and with $F_{\mathrm{MEV}}$, it is the non-proportional hazards model. Both models can be cast into the transformation model framework by defining the joint basis $c(y_k, x) = (a(y_k)^\top, a(y_k)^\top \otimes b(x)^\top)^\top$ as the Kronecker product of the two simple basis functions $a(y_k) = e_{K-1}(k)$ and $b(x) = \tilde{x}$ (assuming that $b$ does not contain an intercept term). Note that the conditional transformation function $h(y \mid x)$ includes an interaction term between $y$ and $x$.
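A small numerical sketch of this cumulative model (Python; parameter values are invented for illustration, and in practice the cut points and category-specific effects must satisfy linear constraints ensuring a monotone distribution function):

```python
import numpy as np

def F_SL(z):
    """Standard logistic distribution function."""
    return 1.0 / (1.0 + np.exp(-z))

K = 4                                # ordered categories y_1 < ... < y_K
def basis_c(k, x):
    """c(y_k, x) = (a(y_k)', a(y_k)' (x) b(x)')' with a(y_k) = e_{K-1}(k)."""
    a = np.eye(K - 1)[k]             # unit vector, k = 0, ..., K-2
    return np.concatenate([a, np.kron(a, x)])

theta = np.array([-1.0, 0.0, 1.0,    # increasing cut points
                  0.3, 0.5, -0.2])   # category-specific effects beta_k
x = np.array([1.0])                  # b(x) without intercept term

# Conditional distribution function over categories; F(y_K | x) = 1.
cdf = np.array([F_SL(basis_c(k, x) @ theta) for k in range(K - 1)] + [1.0])
pmf = np.diff(np.concatenate([[0.0], cdf]))   # category probabilities
assert np.isclose(pmf.sum(), 1.0)
assert np.all(pmf > 0)
```

Because the shift differs across categories, the log-odds ratio between two covariate values is not constant over $k$, which is exactly the non-proportional odds property.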

Time-varying effects One often-studied extension of the Cox model is $F_{Y \mid X = x}(y) = F_{\mathrm{MEV}}(h_Y(y) - \tilde{x}^\top\beta(y))$, where the regression coefficients $\beta(y)$ may change with time $y$. The Cox model is included with $\beta(y) \equiv \beta$, and the model is often applied to check the proportional hazards assumption. With a smooth parameterization of time $y$, for example via $a = a_{\mathrm{Bs},M}$, and linear basis $b(x) = \tilde{x}$, the transformation model $(F_{\mathrm{MEV}}, (a^\top, a^\top \otimes b^\top)^\top, \vartheta)$ implements this Cox model with time-varying (linear) effects. This model (with arbitrary $F_Z$) has also been presented in Foresi & Peracchi (1995) and is called distribution regression in Chernozhukov et al. (2013).

Novel transformation models
Because of the broadness of the transformation family, it is straightforward to set up new models for interesting situations by allowing more complex transformation functions $h(y \mid x)$. We will illustrate this possibility for two simple cases: the independent two-sample situation and regression models for count data. The generic and most complex transformation model is called the conditional transformation model and is explained at the end of this section.
Beyond shift effects Assume we observe samples from two groups A and B and want to model the conditional distribution functions $F_{Y \mid X = A}(y)$ and $F_{Y \mid X = B}(y)$ of the response $Y$ in the two groups. Based on this model, it is often interesting to infer whether the two distributions are equivalent and, if this is not the case, to characterize how they differ. Using an appropriate basis function $a$ and the basis $b(x) = (1, 1(B))^\top$, the model $(F_Z, (a^\top \otimes b^\top)^\top, \vartheta)$ parameterizes the conditional transformation function as $h(y \mid A) = a(y)^\top\vartheta_1$ and $h(y \mid B) = h(y \mid A) + h_{B-A}(y) = a(y)^\top\vartheta_1 + 1(B)a(y)^\top\vartheta_2$. Clearly, the second term is constant zero ($h_{B-A}(y) \equiv 0$) if and only if the two distributions are equivalent ($F_{Y \mid X = A}(y) = F_{Y \mid X = B}(y)$ for all $y$). For the deviation function $h_{B-A}(y) = a(y)^\top\vartheta_2$, we can apply standard likelihood inference procedures for $\hat\vartheta_2$ to construct a confidence band or use a test statistic such as $\max(\hat\vartheta_2/\mathrm{se}(\hat\vartheta_2))$ to assess deviations from zero. If there is evidence for a group effect, we can use the model to check whether the deviation function is constant, that is, $h_{B-A}(y) \equiv c \ne 0$. In this case, the simpler model $(F_Z, (a^\top, 1(B))^\top, (\vartheta_1^\top, \beta)^\top)$ with scalar shift $\beta$ might be easier to interpret. This model actually corresponds to a normal analysis-of-variance model with $F_Z = \Phi$ and $a(y) = (1, y)^\top$ or to the Cox proportional hazards model $(F_{\mathrm{MEV}}, (a_{\mathrm{Bs},M}^\top, 1(B))^\top, (\vartheta_1^\top, \beta)^\top)$.
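The algebra of the Kronecker basis in the two-sample case can be checked directly: with $b(x) = (1, 1(B))^\top$, the difference between the two conditional transformation functions is exactly the deviation function $a(y)^\top\vartheta_2$. A Python sketch with made-up coefficients:

```python
import numpy as np
from math import comb

def bernstein_basis(y, M):
    """Bernstein basis of order M on the unit interval (scalar y)."""
    t = float(y)
    return np.array([comb(M, k) * t**k * (1 - t)**(M - k)
                     for k in range(M + 1)])

M = 5
rng = np.random.default_rng(0)
theta1 = np.cumsum(np.abs(rng.normal(size=M + 1)))  # group A transformation
theta2 = 0.1 * rng.normal(size=M + 1)               # deviation coefficients
# Basis a(y) (x) b(x) with b(x) = (1, 1(B)): interleave the coefficients.
theta = np.ravel(np.column_stack([theta1, theta2]))

def h(y, group_B):
    """Conditional transformation function h(y | group)."""
    c = np.kron(bernstein_basis(y, M), np.array([1.0, float(group_B)]))
    return c @ theta

# h(y | B) - h(y | A) equals the deviation function h_{B-A}(y) = a(y)' theta2.
for y in np.linspace(0.0, 1.0, 11):
    assert np.isclose(h(y, 1) - h(y, 0), bernstein_basis(y, M) @ theta2)
```

Setting all entries of `theta2` to zero recovers identical distributions in both groups, which is the null hypothesis discussed above.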

Count regression 'without tears'
Simple models for count data $y \in \mathbb{N}_0 = \{0, 1, 2, \ldots\}$ almost always suffer from over-dispersion or excess zeros. The linear transformation model $F_{Y \mid X = x}(y) = F_Z(h_Y(y) - m(x))$ can be implemented using the basis function $a(y) = a_{\mathrm{Bs},M}(\lfloor y \rfloor)$; the parameters of the transformation model $(F_Z, (a^\top, \tilde{x}^\top)^\top, \vartheta)$ are then not affected by over-dispersion or under-dispersion, because higher moments are handled by $h_Y$ independently of the effects $m(x)$ of the explanatory variables. If there are excess zeros, we can set up the joint transformation model $F_{Y \mid X = x}(y) = F_Z(h_Y(y) - m(x) + 1(y = 0)(\alpha_0 - m_0(x)))$ such that we have a two-component mixture model consisting of the count distribution $F_Z(h_Y(y) - m(x))$ for $y \in \mathbb{N}_0$ and the probability of an excess zero. Hence, the transformation analogue of a hurdle model with hurdle at zero is the transformation model $(F_Z, (a^\top, \tilde{x}^\top, 1(y = 0), 1(y = 0)\tilde{x}^\top)^\top, (\vartheta_1^\top, \beta^\top, \alpha_0, \beta_0^\top)^\top)$.

Conditional transformation models

The most complex models in this framework, conditional transformation models, feature a conditional transformation function composed of $J$ partial transformation functions $h_j(y \mid x) = (a_j(y)^\top \otimes b_j(x)^\top)\vartheta_j$ and include all special cases discussed in this section. It is convenient to assume monotonicity for each of the partial transformation functions; thus, the linear constraints for $a_j$ are repeated for each basis function in $b_j$ (detailed descriptions of linear constraints for different models in this class are available in Hothorn, 2017b). Hothorn et al. (2014) introduced this general model class and proposed a boosting algorithm for the estimation of transformation functions $h$ for 'exact continuous' responses $Y$. In the likelihood framework presented here, conditional transformation models can be fitted under arbitrary schemes of censoring and truncation, and classical likelihood inference for the model parameters $\vartheta$ becomes feasible. Of course, unlike in the boosting context, the number of model terms $J$ and their complexity are limited in the likelihood world, because the likelihood does not contain any penalty terms that induce smoothness in the $x$-direction.

A systematic overview of linear transformation models with potentially response-varying effects is given in Table 1. Model nomenclature and the interpretation of the corresponding model parameters are mapped to specific transformation functions $h$ and distribution functions $F_Z$. To the best of our knowledge, models without names have not yet been discussed in the literature, and their specific properties await closer investigation.
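A sketch of the count hurdle transformation model described in this section (Python; the baseline transformation and all parameter values are invented for illustration and would in practice have to respect the model's monotonicity constraints):

```python
import numpy as np

def F_SL(z):
    """Standard logistic distribution function."""
    return 1.0 / (1.0 + np.exp(-z))

def hY(y):
    """A toy increasing step transformation on the counts 0, 1, 2, ..."""
    return np.log1p(np.floor(y))

def cdf(y, x, beta, alpha0, beta0):
    """F(y | x) = F_Z(h_Y(y) - m(x) + 1(y = 0)(alpha_0 - m_0(x)))."""
    z = hY(y) - x @ beta + (alpha0 - x @ beta0 if y == 0 else 0.0)
    return F_SL(z)

x = np.array([0.5])
beta, alpha0, beta0 = np.array([1.0]), -0.5, np.array([0.2])

p_zero = cdf(0, x, beta, alpha0, beta0)   # zero component of the mixture
F_vals = [cdf(y, x, beta, alpha0, beta0) for y in range(11)]
assert all(a < b for a, b in zip(F_vals, F_vals[1:]))  # valid distribution function
assert 0.0 < p_zero < 1.0
```

The indicator term only affects the probability at zero, so the two mixture components (excess zeros and the count distribution) are governed by separate parameters, as in a classical hurdle model.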

Empirical evaluation
We will illustrate the range of possible applications of likelihood-based conditional transformation models. In Section 5.2, we will present a small simulation experiment highlighting the possible advantage of indirectly modelling conditional distributions with transformation functions.

Illustrations
Density estimation: Old Faithful geyser The duration of eruptions and the waiting time between eruptions of the Old Faithful geyser in Yellowstone National Park became a standard benchmark for non-parametric density estimation. The nine parameters of the transformation model $(\Phi, a_{\mathrm{Bs},8}(\text{waiting}), \vartheta)$ were fitted by maximization of the approximate log-likelihood (treating the waiting times as 'exact' observations) under the eight linear constraints $D_9\vartheta > 0$. The model depicted in Fig. 1A reproduces the classic bimodal unconditional density of waiting time along with a kernel density estimate. It is important to note that the transformation model was fitted by maximum likelihood, whereas the kernel density estimate relied on a cross-validated bandwidth. An unconditional density estimate for the duration of the eruptions needs to deal with censoring, because exact duration times are not available for all eruptions.

Quantile regression: head circumference The Fourth Dutch Growth Study is a cross-sectional study on growth and development of the Dutch population younger than 22 years. Stasinopoulos & Rigby (2007) fitted a growth curve to head circumferences (HCs) of 7040 boys using a generalized additive model for location, scale and shape (GAMLSS) with a Box-Cox $t$ distribution describing the first four moments of HC conditionally on age. The model showed evidence of kurtosis, especially for older boys. We fitted the same growth curves by the conditional transformation model $(\Phi, (a_{\mathrm{Bs},3}(\text{HC})^\top \otimes b_{\mathrm{Bs},3}(\text{age}^{1/3})^\top)^\top, \vartheta)$ by maximization of the approximate log-likelihood under the $3 \times 4$ linear constraints $(D_4 \otimes I_4)\vartheta > 0$. Figure 2 shows the data overlaid with quantile curves obtained via inversion of the estimated conditional distributions. The figure very closely reproduces the growth curves presented in Fig. 16 of Stasinopoulos & Rigby (2007) and also indicates a certain asymmetry for older boys.
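Quantile curves such as those in Fig. 2 are obtained by inverting the estimated conditional distribution function. Because the transformation function is monotone in $y$, this inversion is a one-dimensional root search; a Python sketch with a made-up transformation function:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def conditional_cdf(y, x):
    """Hypothetical conditional distribution F(y | x) = Phi(h(y | x));
    any transformation h increasing in y works the same way."""
    return Phi(2.0 * y - 1.0 - 0.5 * x)

def quantile(p, x, lo=-10.0, hi=10.0, tol=1e-8):
    """Invert y -> F(y | x) by bisection, exploiting monotonicity in y."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if conditional_cdf(mid, x) < p else (lo, mid)
    return 0.5 * (lo + hi)

# The conditional median solves 2y - 1 - 0.5x = 0, i.e. y = (1 + 0.5x)/2.
assert abs(quantile(0.5, 1.0) - 0.75) < 1e-6
```

Evaluating `quantile(p, x)` over a grid of covariate values traces out one quantile curve; repeating this for several `p` produces the whole fan of curves, with crossing-free curves guaranteed by the monotonicity of the fitted distribution functions.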

Survival analysis: German Breast Cancer Study Group-2 trial This prospective, controlled
clinical trial on the treatment of node-positive breast cancer patients was conducted by the German Breast Cancer Study Group. Out of 686 women, 246 received hormonal therapy, whereas the control group of 440 women did not. Additional variables include age, menopausal status, tumour size, tumour grade, number of positive lymph nodes, progesterone receptor and oestrogen receptor. The right-censored recurrence-free survival time is the response variable of interest.
The Cox model $(F_{\mathrm{MEV}}, (a_{\mathrm{Bs},10}^\top, 1(\text{hormonal therapy}))^\top, \vartheta)$ implements the transformation function $h(y \mid \text{treatment}) = a_{\mathrm{Bs},10}(y)^\top\vartheta_1 + 1(\text{hormonal therapy})\beta$, where $a_{\mathrm{Bs},10}(y)^\top\vartheta_1$ is the log-cumulative baseline hazard function parameterized by a Bernstein polynomial and $\beta \in \mathbb{R}$ is the log-hazard ratio of hormonal therapy. This is the classical Cox model with one treatment parameter $\beta$ but with a fully parameterized baseline transformation function; it was fitted by maximizing the exact log-likelihood under ten linear constraints. The model assumes proportional hazards, an assumption whose appropriateness we wanted to assess using the non-proportional hazards model $(F_{\mathrm{MEV}}, (a_{\mathrm{Bs},10}^\top \otimes (1, 1(\text{hormonal therapy})))^\top, \vartheta)$ with the transformation function $h(y \mid \text{treatment}) = a_{\mathrm{Bs},10}(y)^\top\vartheta_1 + 1(\text{hormonal therapy})a_{\mathrm{Bs},10}(y)^\top\vartheta_2$. The function $a_{\mathrm{Bs},10}(y)^\top\vartheta_2$ is the time-varying difference of the log-hazard functions of women without and with hormonal therapy and can be interpreted as the deviation from a constant log-hazard ratio treatment effect of hormonal therapy. Under the null hypothesis of no treatment effect, we would expect $\vartheta_2 \equiv 0$. This monotone deviation function adds ten linear constraints $D_{11}\vartheta_1 + D_{11}\vartheta_2 > 0$, which also ensure monotonicity of the transformation function for treated patients. We first compared the fitted survivor functions obtained from the model including a time-varying treatment effect with the Kaplan-Meier estimators in both treatment groups. Figure 3A shows a nicely smoothed version of the survivor functions obtained from this transformation model. Figure 3B shows the time-varying treatment effect $a_{\mathrm{Bs},10}(y)^\top\hat\vartheta_2$, together with a 95% confidence band computed from the joint normal distribution of $\hat\vartheta_2$ for a grid over time; this method is much simpler than other methods for inference on time-varying effects (e.g. Sun et al., 2009).
The 95% confidence interval around the log-hazard ratio $\hat\beta$ is also plotted, and as the latter is fully covered by the confidence band for the time-varying treatment effect, there is no reason to question the treatment effect computed under the proportional hazards assumption.
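The pointwise ingredient of such a band follows directly from the joint normality of $\hat\vartheta_2$: the variance of $a_{\mathrm{Bs},10}(y)^\top\hat\vartheta_2$ is $a(y)^\top\hat\Sigma a(y)$. A Python sketch with invented estimates and covariance matrix (the band in the paper is computed from the joint distribution over a grid, which additionally accounts for multiplicity; the sketch below is pointwise only):

```python
import numpy as np
from math import comb

def bernstein_basis(y, M, lo, hi):
    """Bernstein basis of order M after rescaling [lo, hi] to [0, 1]."""
    t = (np.asarray(y, dtype=float) - lo) / (hi - lo)
    return np.stack([comb(M, k) * t**k * (1 - t)**(M - k)
                     for k in range(M + 1)], axis=-1)

M = 10
rng = np.random.default_rng(42)
theta2_hat = rng.normal(scale=0.1, size=M + 1)   # made-up estimate
L = 0.05 * rng.normal(size=(M + 1, M + 1))
Sigma = L @ L.T + 0.01 * np.eye(M + 1)           # a valid covariance matrix

grid = np.linspace(0.0, 1.0, 50)                 # time grid (rescaled)
A = bernstein_basis(grid, M, 0.0, 1.0)
est = A @ theta2_hat
se = np.sqrt(np.einsum('ij,jk,ik->i', A, Sigma, A))  # diag(A Sigma A')
z = 1.96                                         # pointwise 95% level
lower, upper = est - z * se, est + z * se
assert np.all(se > 0)
assert np.all((lower <= est) & (est <= upper))
```

Checking whether a horizontal line (a constant log-hazard ratio) fits inside the band is exactly the graphical proportional hazards check described above.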
In the second step, we allowed an age-varying treatment effect to be included in the model $(F_{\mathrm{MEV}}, (a_{\mathrm{Bs},10}(y)^\top \otimes (1(\text{hormonal therapy}), 1 - 1(\text{hormonal therapy})) \otimes b_{\mathrm{Bs},3}(\text{age})^\top)^\top, \vartheta)$. For both treatment groups, we estimated a conditional transformation function of survival time $y$ given age, parameterized as the tensor basis of two Bernstein bases. Each of the two basis functions comes with $10 \times 3$ linear constraints; therefore, the model was fitted under 60 linear constraints. Figure 4 allows an assessment of the prognostic and predictive properties of age. As the survivor functions were clearly larger for all patients treated with hormones, the positive treatment effect applied to all patients. However, the size of the treatment effect varied greatly. The effect was most pronounced for women younger than 30 years and levelled off a little for older patients. In general, the survival times were longest for women between 40 and 60 years old. Younger women suffered the highest risk; for women older than 60 years, the risk started to increase again. This effect was shifted towards younger women when hormonal treatment was applied.

Simulation experiment
The transformation family includes linear as well as very flexible models. We therefore illustrate the potential gain of modelling a transformation function $h$ by comparing a very simple transformation model with a fully parametric approach and with a non-parametric approach, using a data-generating process introduced by Hothorn et al. (2014).
In the transformation model $(\Phi, ((1, y) \otimes (1, x^\top))^\top, \vartheta)$, two explanatory variables $x = (x_1, x_2)^\top$ influence both the conditional mean and the conditional variance of a normal response $Y$. Although the transformation function is linear in $y$, with three linear constraints, the mean and variance of $Y$ given $x$ depend on $x$ in a non-linear way. The choices $x_1 \sim U[0, 1]$, $x_2 \sim U[-2, 2]$ and $\vartheta = (0, 0, 1, 0.5, 1, 0)^\top$ lead to a heteroscedastic varying-coefficient model in which the variance of $Y$ ranges between 0.44 and 4, depending on $x_1$. This model can be fitted in the GAMLSS framework under the assumptions that the mean of the normal response depends on a smoothly varying regression coefficient $(x_1 + 0.5)^{-1}$ for $x_2$ and that the variance is a smooth function of $x_1$; this model is therefore fully parametric. As a non-parametric counterpart, we used a kernel estimator of the conditional distribution function of $Y$ as a function of the two explanatory variables. From the transformation model, the GAMLSS and kernel estimators, we obtained estimates of $F_{Y \mid X = x}(y)$ over a grid on $y, x_1, x_2$ and computed the mean absolute deviation (MAD) of the true and estimated probabilities for each pair of $x_1$ and $x_2$. Then, the minimum, the median and the maximum of the MAD values over all $x_1$ and $x_2$ were computed as summary statistics. The most likely transformation approach and its two competitors were estimated and evaluated for 100 random samples of size $N = 200$ drawn from model (5). Cross-validation was used to determine the bandwidths for the kernel-based estimators (function npcdist() in package np; for details, see Hayfield & Racine, 2008). We fitted the GAMLSS models by boosting; the number of boosting iterations was determined via sample splitting (Mayr et al., 2012). To investigate the stability of the three procedures under non-informative explanatory variables, we added $p = 1, \ldots, 5$ uniformly distributed variables without association to the response and included them as potential explanatory variables in the three models. The case $p = 0$ corresponds to model (5). Figure 5 shows the empirical distributions of the minimum, median and maximum MAD for the three competitors. Except for the minimum MAD in the absence of any irrelevant explanatory variables ($p = 0$), the conditional distributions fitted by the transformation models were closer, in terms of MAD, to the true conditional distribution function. This result was obtained because the transformation model only had to estimate a simple transformation function, whereas the other two procedures had a difficult time approximating this simple transformation model on another scale. However, the comparison illustrates the potential improvement one can achieve by fitting simple models for the transformation function instead of more complex models for the mean (GAMLSS) or the distribution function (kernel). The kernel estimator led to the largest median MAD values but seemed more robust than GAMLSS with respect to the maximum MAD.
These results were remarkably robust in the presence of up to five non-informative explanatory variables, although of course the MAD increased with the number of non-informative variables p.
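The MAD summary used in this experiment can be sketched as follows (Python; the two distribution functions here are toy stand-ins for the true and estimated conditional distributions):

```python
import numpy as np

def mad_summary(F_true, F_hat, y_grid, x_grid):
    """MAD between true and estimated conditional distribution functions,
    computed per covariate value and summarized by (min, median, max)."""
    mads = [np.mean(np.abs(F_true(y_grid, x) - F_hat(y_grid, x)))
            for x in x_grid]
    m = np.asarray(mads)
    return m.min(), np.median(m), m.max()

# Toy check with a deliberately biased estimator (bias grows with x).
F_true = lambda y, x: np.clip(y, 0.0, 1.0)
F_hat = lambda y, x: np.clip(y + 0.01 * x, 0.0, 1.0)
y = np.linspace(0.0, 1.0, 101)
lo, med, hi = mad_summary(F_true, F_hat, y, np.array([0.0, 0.5, 1.0]))
assert lo == 0.0            # no bias at x = 0
assert lo <= med <= hi
```

In the actual experiment, this summary was computed for each of the 100 simulated samples and each competitor, yielding the empirical distributions shown in Figure 5.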

Discussion
The contribution of a likelihood approach for the general class of conditional transformation models is interesting from both a theoretical and a practical perspective. With the range of simple to very complex transformation functions introduced in Section 4 and illustrated in Section 5, it becomes possible to understand classical parametric, semi-parametric and non-parametric models as special cases of the same model class. Thus, analytic comparisons between models of different complexity become possible. The transformation family $P_{Y,\Theta}$, the corresponding likelihood function and the most likely transformation estimator are easy to understand. This makes the approach appealing also from a teaching perspective. Connections between standard parametric models (e.g. the normal linear model) and potentially complex models for survival or ordinal data can be outlined in very simple notation, placing emphasis on the modelling of (conditional) distributions instead of just (conditional) means. Computationally, evaluation of the log-likelihood $\log L$ is linear in the number of observations $N$ and, for contributions of 'exact continuous' responses, only requires the derivative $h'$ of the transformation function $h$ instead of integrals thereof.
Based on the general understanding of transformation models outlined in this paper, it will be interesting to study these models outside the strict likelihood world. A mixed transformation model for clustered data (Cai et al., 2002; Zeng et al., 2005; Choi & Huang, 2012) is often based on the transformation function $h(y \mid x, i) = h_Y(y) + \delta_i + h_x(x)$ with a random intercept (or 'frailty' term) $\delta_i$ for the $i$th observational unit. Conceptually, a more complex deviation from the global model could be formulated as $h(y \mid x, i) = h_Y(y) + h_Y(y, i) + h_x(x)$, that is, each observational unit is assigned its own 'baseline' transformation $h_Y(y) + h_Y(y, i)$, where the second term is a deviation from $h_Y$ that integrates to zero. For longitudinal data with possibly time-varying explanatory variables, the model $h(y \mid x(t), t) = h_Y(y, t) + x(t)^\top\beta(t)$ (Ding et al., 2012; Wu & Tian, 2013) can also be understood as a mixed version of a conditional transformation model. The penalized log-likelihood $\log(L(h \mid y)) - \mathrm{pen}(\beta)$ for the linear transformation model $h(y \mid x) = h_Y(y) - \tilde{x}^\top\beta$ leads to ridge-type or lasso-type regularized models, depending on the form of the penalty term. Priors for all model parameters $\vartheta$ allow a fully Bayesian treatment of transformation models.
It is possible to relax the assumption that $F_Z$ is known. The simultaneous estimation of $F_Z$ in the model $P(Y \le y \mid X = x) = F_Z(h_Y(y) - \tilde{x}^\top\beta)$ was studied by Horowitz (1996) and later extended by Linton et al. (2008) to non-linear functions $h_x$ with a parametric baseline transformation $h_Y$ and kernel estimates of $F_Z$ and $h_x$. For AFT models, Zhang & Davidian (2008) applied smooth approximations of the density $f_Z$ in an exact censored likelihood estimation procedure. In a similar set-up, Huang (2014) proposed a method to jointly estimate the mean function and the error distribution in a generalized linear model. The estimation of $F_Z$ is noteworthy in additive models of the form $h_Y + h_x$ because these models assume additivity of the contributions of $y$ and $x$ on the scale of $F_Z^{-1}(P(Y \le y \mid X = x))$. If this model assumption seems questionable, one can either allow an unknown $F_Z$ or move to a transformation model featuring a more complex transformation function.
From this point of view, the distribution function $F_Z$ in flexible transformation models is only a computational device mapping the unbounded transformation function $h$ strictly monotonically into the unit interval, making the evaluation of the likelihood easy. Then, $F_Z$ has no further meaning or interpretation as an error distribution. A compromise could be the family of distributions $F_Z(z \mid \lambda) = 1 - (1 + \lambda\exp(z))^{-1/\lambda}$ for $\lambda > 0$ (suggested by McLain & Ghosh, 2013), with simultaneous maximum likelihood estimation of $\vartheta$ and $\lambda$ for additive transformation functions $h = h_Y + h_x$, as these models are flexible and still relatively easy to interpret.
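This family connects the proportional odds and proportional hazards models: for $\lambda = 1$, it is the standard logistic distribution $F_{\mathrm{SL}}$, and as $\lambda \to 0$, it approaches the minimum extreme value distribution $F_{\mathrm{MEV}}$. A quick numerical check (Python; the parameter symbol is our reconstruction of the garbled original):

```python
from math import exp, isclose

def F_Z(z, lam):
    """F_Z(z | lambda) = 1 - (1 + lambda * exp(z))^(-1/lambda), lambda > 0."""
    return 1.0 - (1.0 + lam * exp(z)) ** (-1.0 / lam)

z = 0.3
# lambda = 1 recovers the standard logistic distribution function.
assert isclose(F_Z(z, 1.0), exp(z) / (1.0 + exp(z)))
# lambda -> 0 approaches the minimum extreme value distribution.
assert isclose(F_Z(z, 1e-8), 1.0 - exp(-exp(z)), rel_tol=1e-5)
```

Estimating $\lambda$ jointly with $\vartheta$ therefore lets the data choose a point between the proportional odds and proportional hazards interpretations.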
In light of the empirical results discussed in this paper and the theoretical work of McLain & Ghosh (2013) on a Cox model with log-cumulative baseline hazard function parameterized in terms of a Bernstein polynomial of increasing order $M$, one might ask where the boundaries among parametric, semi-parametric and non-parametric statistics lie. The question of how the order $M$ affects results in practice has been raised repeatedly; therefore, we close our discussion by looking at a Cox model with increasing $M$ for the German Breast Cancer Study Group-2 data. All eight baseline variables were included in the linear predictor, and we fitted the model with orders $M = 1, \ldots, 30, 35, 40, 45, 50$ of the Bernstein polynomial parameterizing the log-cumulative baseline hazard function. In Fig. 6A, the log-cumulative baseline hazard functions start with a linear function ($M = 1$) and quickly approach a function that is essentially a smoothed version of the Nelson-Aalen-Breslow estimator plotted in red. In Fig. 6B, the trajectories of the estimated regression coefficients become very similar to the partial likelihood estimates as $M$ increases. For $M \ge 10$, for instance, the results of the 'semi-parametric' and the 'fully parametric' Cox models are practically equivalent. An extensive collection of such head-to-head comparisons of most likely transformations with their classical counterparts can be found in Hothorn (2017b). Our work on this paper and practical experience with its reference software implementation have convinced us that rethinking classical models in terms of fully parametric transformations is intellectually and practically a fruitful exercise.

Computational details
A reference implementation of most likely transformation models is available in the mlt package (Hothorn, 2017a). All data analyses can be reproduced with the dynamic document Hothorn (2017b). The augmented Lagrangian minimization implemented in the auglag() function of package alabama (Varadhan, 2015) was used for optimizing the log-likelihood. Package gamboostLSS (version 1.2-2, Hofner et al., 2016) was used to fit GAMLSS models, and kernel density and distribution estimation was performed using package np (version 0.60-2, Racine & Hayfield, 2014). All computations were performed using R version 3.4.0 (R Core Team, 2017). Additional applications are described in an extended version of this paper (Hothorn et al., 2017).