Volume 45, Issue 1
Original Article
Open Access

Most Likely Transformations

Torsten Hothorn

Corresponding Author

E-mail address: torsten.hothorn@uzh.ch

Institut für Epidemiologie, Biostatistik und Prävention, Universität Zürich


Lisa Möst

Institut für Statistik, Ludwig‐Maximilians‐Universität München

Peter Bühlmann

Seminar für Statistik, ETH Zürich

First published: 18 August 2017

Abstract

We propose and study properties of maximum likelihood estimators in the class of conditional transformation models. Based on a suitable explicit parameterization of the unconditional or conditional transformation function, we establish a cascade of increasingly complex transformation models that can be estimated, compared and analysed in the maximum likelihood framework. Models for the unconditional or conditional distribution function of any univariate response variable can be set up and estimated in the same theoretical and computational framework simply by choosing an appropriate transformation function and parameterization thereof. The ability to evaluate the distribution function directly allows us to estimate models based on the exact likelihood, especially in the presence of random censoring or truncation. For discrete and continuous responses, we establish the asymptotic normality of the proposed estimators. A reference software implementation of maximum likelihood‐based estimation for conditional transformation models that allows the same flexibility as the theory developed here was employed to illustrate the wide range of possible applications.

1 Introduction

In a broad sense, we can understand all statistical models as models of distributions or certain characteristics thereof, especially the mean. All distributions ℙ_Y of an at least ordered response Y can be characterized by their distribution, quantile, density, odds, hazard or cumulative hazard functions. In a fully parametric setting, all these functions have been specified up to unknown parameters, and the ease of interpretation can guide us in looking at the appropriate function. In the semi-parametric and non-parametric contexts, however, the question arises how we can obtain an estimate of one of these functions without assuming much about their shape. For the direct estimation of distribution functions, we deal with monotonic functions in the unit interval, whereas for densities, we need to make sure that the estimator integrates to one. The hazard function comes with a positivity constraint, and monotonicity is required for the positive cumulative hazard function. These computationally inconvenient restrictions disappear completely only when the log-hazard function is estimated, and this explains the plethora of research papers following this path. However, the lack of any structure in the log-hazard function comes at a price. A too-erratic behaviour of estimates of the log-hazard function has to be prevented by some smoothness constraint; this makes classical likelihood inference impossible. The novel characterization and subsequent estimation of distributions via their transformation function in a broad class of transformation models that are developed in this paper can be interpreted as a compromise between structure (monotonicity) and ease of parameterization, estimation and inference. This transformation approach to modelling and estimation allows standard likelihood inference in a large class of models that have so far commonly been dealt with by other inference procedures.

Since the introduction of transformation models based on non-linear transformations of some response variable by Box & Cox (1964), this attractive class of models has received much interest. In regression problems, transformation models can be understood as models for the conditional distribution function and are sometimes referred to as 'distribution regression', in contrast to their 'quantile regression' counterpart (Chernozhukov et al., 2013). Traditionally, the models were actively studied and applied in the analysis of ordered categorical or censored responses. Recently, transformation models for the direct estimation of conditional distribution functions for arbitrary responses received interest in the context of counterfactual distributions (Chernozhukov et al., 2013), probabilistic forecasting (Gneiting & Katzfuss, 2014), distribution and quantile regression (Leorato & Peracchi, 2015; Rothe & Wied, 2013), probabilistic index models (Thas et al., 2012) and conditional transformation models (Hothorn et al., 2014). The core idea of any transformation model is the application of a strictly monotonic transformation function h for the reformulation of an unknown distribution function F_Y as F_Y(y) = F_Z(h(y)), where the unknown transformation function h is estimated from the data. Transformation models have received attention especially in situations where the likelihood contains terms involving the conditional distribution function F_{Y|X=x}(y) = F_Z(h(y|x)) with inverse link function F_Z, most importantly for censored, truncated and ordered categorical responses. For partially linear transformation models with transformation function h(y|x) = h_Y(y) + h_x(x), much emphasis has been given to estimation procedures treating the baseline transformation h_Y (e.g. the log-cumulative baseline hazard function in the Cox model) as a high-dimensional nuisance parameter. Prominent members of these estimation procedures are the partial likelihood estimator and approaches influenced by the estimating equations introduced by Cheng et al. (1995). Once an estimate for the shift h_x is obtained, the baseline transformation h_Y is then typically estimated by the non-parametric maximum likelihood estimator (see, e.g. Cheng et al., 1997). An overview of the extensive literature on the simultaneous non-parametric maximum likelihood estimation of h_Y and h_x, that is, estimation procedures not requiring an explicit parameterization of h_Y, for censored continuous responses is given in Zeng & Lin (2007).

An explicit parameterization of h_Y is common in models of ordinal responses (Tutz, 2012). For survival times, Kooperberg et al. (1995) introduced a cubic spline parameterization of the log-conditional hazard function with the possibility of response-varying effects and estimated the corresponding models by maximum likelihood. Crowther & Lambert (2014) followed up on this suggestion and used restricted cubic splines. Many authors studied penalized likelihood approaches for spline approximations of the baseline hazard function in a Cox model, for example, Ma et al. (2014). Less frequently, the transformation function h_Y was modelled directly. Mallick & Walker (2003), Chang et al. (2005) and McLain & Ghosh (2013) used Bernstein polynomials for h_Y, and Royston & Parmar (2002) proposed a maximum likelihood approach using cubic splines for modelling h_Y and also time-varying effects. The connection between these different transformation models is difficult to see because most authors present their models in the relatively narrow contexts of survival or ordinal data. The lack of a general understanding of transformation models made the development of novel approaches in this model class burdensome. Hothorn et al. (2014) decoupled the parameterization of the conditional transformation function h(y|x) from the estimation procedure and showed that many interesting and novel models can be understood as transformation models. The boosting-based optimization of proper scoring rules, however, was only developed for uncensored and right-censored observations in the absence of truncation and requires the numerical approximation of the true target function. In a similar spirit, Chernozhukov et al. (2013) applied the connection F_{Y|X=x}(y) = E(1(Y ≤ y) | X = x) for estimation in the response-varying effects transformation model F_{Y|X=x}(y) = F_Z(h_Y(y) − x̃⊤β(y)); this approach can be traced back to Foresi & Peracchi (1995).

A drawback of all but the simplest transformation models is the lack of a likelihood estimation procedure. Furthermore, although important connections to other models have been known for some time (Doksum & Gasko, 1990), it is often not easy to see how broad and powerful the class of transformation models actually is. We address these issues and embed the estimation of unconditional and conditional distribution functions of arbitrary univariate random variables under all forms of random censoring and truncation into a common theoretical and computational likelihood-based framework. In a nutshell, we show in Section 2 that all distributions can be generated by a strictly monotonic transformation of some absolutely continuous random variable. The likelihood function of the transformed variable can then be characterized by this transformation function. The parameters of appropriate parameterizations of the transformation function, and thus the parameters of the conditional distribution function in which we are interested, can then be estimated by maximum likelihood under simple linear constraints that allow classical asymptotic likelihood inference, as will be shown in Section 3. Many classical and contemporary models are introduced as special cases of this framework. In particular, all transformation models sketched in this introduction can be understood and estimated in this novel likelihood-based framework. Extensions of classical and contemporary transformation models as well as some novel models are derived from our unified theoretical framework of transformation functions in Section 4, and their empirical performance is illustrated and evaluated in Section 5.

2 The likelihood of transformations

Let (Ω, 𝒜, ℙ) denote a probability space and (Ξ, 𝒞) a measurable space with at least ordered sample space Ξ. We are interested in inference about the distribution ℙ_Y of a random variable Y, that is, the probability space (Ξ, 𝒞, ℙ_Y) defined by the 𝒜–𝒞 measurable function Y : Ω → Ξ. For the sake of notational simplicity, we present our results for the unconditional case first; regression models are discussed in Section 4.2. The distribution ℙ_Y is dominated by some measure μ and characterized by its density function f_Y, distribution function F_Y, quantile function F_Y^{-1}, odds function O_Y(y) = F_Y(y)/(1 − F_Y(y)), hazard function λ_Y(y) = f_Y(y)/(1 − F_Y(y)) or cumulative hazard function Λ_Y(y) = −log(1 − F_Y(y)). For notational convenience, we assume strict monotonicity of F_Y, that is, F_Y(y_1) < F_Y(y_2) ∀ y_1 < y_2 ∈ Ξ. Our aim is to obtain an estimate F̂_Y of the distribution function F_Y from a random sample Y_1, …, Y_N. In the following, we will show that one can always write this potentially complex distribution function F_Y as the composition of a much simpler and a priori specified distribution function F_Z and a strictly monotonic transformation function h. The task of estimating F_Y is then reduced to obtaining an estimate ĥ. The latter exercise, as we will show in this paper, is technically and conceptually attractive.

Let (ℝ, ℬ) denote the Euclidean space with Borel σ-algebra and Z : Ω → ℝ an 𝒜–ℬ measurable function such that the distribution ℙ_Z is absolutely continuous (μ_L denotes the Lebesgue measure) in the probability space (ℝ, ℬ, ℙ_Z). Let F_Z and F_Z^{-1} denote the corresponding distribution and quantile functions. We furthermore assume 0 < F_Z(z) < 1 and f_Z(z) > 0 for all z ∈ ℝ for a log-concave density f_Z as well as the existence of the first two derivatives of the density f_Z(z) with respect to z; both derivatives shall be bounded. We do not allow any unknown parameters for this distribution. Possible choices include the standard normal, standard logistic (SL) and minimum extreme value (MEV) distribution with distribution functions F_Z(z) = Φ(z), F_SL(z) = (1 + exp(−z))^{-1} and F_MEV(z) = 1 − exp(−exp(z)), respectively. In the first step, we will show that there always exists a unique and strictly monotonic transformation g such that the unknown and potentially complex distribution ℙ_Y that we are interested in can be generated from the simple and known distribution ℙ_Z via g(Z) ∼ ℙ_Y. More formally, let g : ℝ → Ξ denote a ℬ–𝒞 measurable function. The composition g∘Z is a random variable on (Ξ, 𝒞, ℙ_Y). We can now formulate the existence and uniqueness of g as a corollary to the probability integral transform.

Corollary 1. For all random variables Y and Z, there exists a unique strictly monotonically increasing transformation g, such that ℙ_Y = ℙ_{g∘Z}.

Proof. Let Q_Y := F_Y^{-1} and g := Q_Y ∘ F_Z. Then U := F_Z(Z) ∼ U[0,1] and g(Z) = Q_Y(U) ∼ ℙ_Y by the probability integral transform. Let h := F_Z^{-1} ∘ F_Y, such that F_Y(y) = F_Z(h(y)). From the strict monotonicity of F_Z, we get the uniqueness of h and therefore g. The quantile function F_Z^{-1} and the distribution function F_Y exist by assumption and are both strictly monotonic and right continuous. Therefore, h is strictly monotonic and right continuous and so is g.

Corollary 2. For μ = μ_L, we have g = h^{-1} and f_Y(y) = f_Z(h(y)) h'(y).

This result for absolutely continuous random variables Y can be found in many textbooks (e.g. Lindsey, 1996); Corollary 1 also covers the discrete case.
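To make Corollaries 1 and 2 concrete, the following minimal sketch (our own illustration, not part of the original article; it assumes NumPy/SciPy and uses a chi-squared Y and standard normal Z) checks numerically that h = F_Z^{-1} ∘ F_Y satisfies F_Y(y) = F_Z(h(y)), that the density factorizes as f_Y(y) = f_Z(h(y)) h'(y) and that g = h^{-1} maps Z to the distribution of Y.

```python
# Numerical check of Corollaries 1 and 2 (a sketch, not from the paper):
# for Y ~ chi^2(3) and Z ~ N(0, 1), h = F_Z^{-1} o F_Y and g = F_Y^{-1} o F_Z.
import numpy as np
from scipy import stats

FY, FZ = stats.chi2(df=3), stats.norm()

def h(y):            # h(y) = F_Z^{-1}(F_Y(y))
    return FZ.ppf(FY.cdf(y))

def g(z):            # g(z) = h^{-1}(z) = F_Y^{-1}(F_Z(z))
    return FY.ppf(FZ.cdf(z))

y = np.linspace(0.1, 15, 50)
assert np.allclose(FZ.cdf(h(y)), FY.cdf(y))               # F_Y(y) = F_Z(h(y))

# Corollary 2: f_Y(y) = f_Z(h(y)) h'(y), with h' computed by finite differences.
eps = 1e-5
h_prime = (h(y + eps) - h(y - eps)) / (2 * eps)
assert np.allclose(FZ.pdf(h(y)) * h_prime, FY.pdf(y), rtol=1e-3)

# g(Z) reproduces the distribution of Y (Kolmogorov-Smirnov check).
rng = np.random.default_rng(1)
z = rng.standard_normal(10_000)
print(stats.kstest(g(z), FY.cdf).pvalue)                  # large p-value expected
```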

Corollary 3. For the counting measure μ = μ_C, the transformation h = F_Z^{-1} ∘ F_Y is a right-continuous step function because F_Y is a right-continuous step function with steps at y ∈ Ξ.

We now characterize the distribution F_Y by the corresponding transformation function h, set up the corresponding likelihood of such a transformation function and estimate the transformation function based on this likelihood. Let H denote the space of all strictly monotonic transformation functions h : Ξ → ℝ. With the transformation function h, we can evaluate F_Y as F_Y(y|h) = F_Z(h(y)) ∀ y ∈ Ξ. Therefore, we only need to study the transformation function h; the inverse transformation g = h^{-1} (used by Bickel et al., 1993, to define a 'group model') is not necessary in what follows. The density for absolutely continuous variables Y (μ = μ_L) is now given by f_Y(y|h) = f_Z(h(y)) h'(y). For discrete responses Y (μ = μ_C) with finite sample space Ξ = {y_1, …, y_K}, the density is
f_Y(y_k|h) = F_Z(h(y_1)) for k = 1, F_Z(h(y_k)) − F_Z(h(y_{k−1})) for k = 2, …, K − 1, and 1 − F_Z(h(y_{K−1})) for k = K,
and for countably infinite sample spaces Ξ = {y_1, y_2, y_3, …}, we get the density
f_Y(y_k|h) = F_Z(h(y_1)) for k = 1 and F_Z(h(y_k)) − F_Z(h(y_{k−1})) for k = 2, 3, ….
With the conventions F_Z(h(y_0)) := 0 and F_Z(h(y_K)) := 1, we use the more compact notation f_Y(y_k|h) = F_Z(h(y_k)) − F_Z(h(y_{k−1})) in the sequel.
For a given transformation function h, the likelihood contribution of a datum C = (y_, ȳ] ⊂ Ξ is defined in terms of the distribution function (Lindsey, 1996)
L(h | Y ∈ (y_, ȳ]) := ∫_{(y_, ȳ]} f_Y(y|h) dμ(y) = F_Z(h(ȳ)) − F_Z(h(y_)).
This 'exact' definition of the likelihood applies to most practical situations of interest and, in particular, allows discrete and (conceptually) continuous as well as censored or truncated observations C. For a discrete response y_k, we have ȳ = y_k and y_ = y_{k−1}, such that L(h | Y ∈ (y_, ȳ]) = F_Z(h(y_k)) − F_Z(h(y_{k−1})) = f_Y(y_k|h). For absolutely continuous random variables Y, we almost always observe an imprecise datum C = (y_, ȳ] ⊂ ℝ and, for short intervals (y_, ȳ], approximate the exact likelihood F_Z(h(ȳ)) − F_Z(h(y_)) by the term (ȳ − y_) f_Y(y|h) or simply f_Y(y|h) with y = (y_ + ȳ)/2 (Lindsey, 1999). This approximation only works for relatively precise measurements, that is, short intervals. If longer intervals are observed, one speaks of 'censoring' and relies on the exact definition of the likelihood contribution instead of using the aforementioned approximation (Klein & Moeschberger, 2003). In summary, the likelihood contribution of a conceptually 'exact continuous' or left-censored, right-censored or interval-censored continuous or discrete observation C is given by
L(h | C) ≈ f_Z(h(y)) h'(y)            for C = {y} ('exact continuous'),
L(h | C) = 1 − F_Z(h(y_))             for C = (y_, ∞) (right censored),
L(h | C) = F_Z(h(ȳ))                  for C = (−∞, ȳ] (left censored),
L(h | C) = F_Z(h(ȳ)) − F_Z(h(y_))     for C = (y_, ȳ] (interval censored),
under the assumption of random censoring. The likelihood is more complex under dependent censoring (Klein & Moeschberger, 2003), but we will not elaborate on this issue. The likelihood contribution L(h | Y = y_k) = F_Z(h(y_k)) − F_Z(h(y_{k−1})) of an ordered factor in category y_k is equivalent to the term F_Z(h(ȳ)) − F_Z(h(y_)) contributed by an interval-censored observation (y_, ȳ], when category y_k is defined by the interval (y_{k−1}, y_k]. Thus, the expression L(h | C) for the likelihood contribution reflects the equivalence of interval censoring and categorization at corresponding cut-off points.
For truncated observations in the interval (y_l, y_r] ⊂ Ξ, the aforementioned likelihood contribution is defined in terms of the distribution function conditional on the truncation
F_Y(y | Y ∈ (y_l, y_r]) = (F_Z(h(y)) − F_Z(h(y_l))) / (F_Z(h(y_r)) − F_Z(h(y_l))) ∀ y ∈ (y_l, y_r],
and thus, the likelihood contribution changes to (Klein & Moeschberger, 2003)
L(h | C) / (F_Z(h(y_r)) − F_Z(h(y_l))).
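The truncation adjustment is a one-line modification of the previous sketch (reusing lik_contribution, h and hp defined there; the truncation interval is an arbitrary choice of ours): the untruncated contribution is divided by F_Z(h(y_r)) − F_Z(h(y_l)).

```python
# Truncation-adjusted likelihood contribution (continues the previous sketch).
from scipy import stats

def lik_truncated(h, h_prime, y_l, y_r, FZ=stats.norm(), **datum):
    denom = FZ.cdf(h(y_r)) - FZ.cdf(h(y_l))
    return lik_contribution(h, h_prime, FZ=FZ, **datum) / denom

# P(1 < Y <= 2 | 0 < Y <= 5) in the toy model from the previous snippet.
print(lik_truncated(h, hp, y_l=0.0, y_r=5.0, interval=(1.0, 2.0)))
```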
It is important to note that the likelihood is always defined in terms of a distribution function (Lindsey, 1999) and it therefore makes sense to directly model the distribution function of interest. The ability to uniquely characterize this distribution function by the transformation function h gives rise to the following definition of an estimator ĥ_N.

Definition 1. (Most likely transformation) Let C_1, …, C_N denote an independent sample of possibly randomly censored or truncated observations from ℙ_Y. The estimator

ĥ_N := arg max_{h̃ ∈ H} Σ_{i=1}^{N} log(L(h̃ | C_i))

is called the most likely transformation.

Log-concavity of f_Z ensures concavity of the log-likelihood (except when all observations are right censored) and thus ensures the existence and uniqueness of ĥ_N.

Many distributions are defined by a transformation function h, for example, the Box–Cox power exponential family (Stasinopoulos & Rigby, 2007), the sinh‐arcsinh distributions (Jones & Pewsey, 2009) or the T‐X family of distributions (Alzaatreh et al., 2013). In what follows, we do not assume any specific form of the transformation function but parameterize h in terms of basis functions. We now introduce such a parameterization, a corresponding family of distributions, a maximum likelihood estimator and a large class of models for unconditional and conditional distributions.

3 Transformation analysis

We parameterize the transformation function h(y) as a linear function of its basis-transformed argument y using a basis function a : Ξ → ℝ^P, such that h(y) = a(y)⊤ϑ, ϑ ∈ ℝ^P. The choice of the basis function a is problem specific and will be discussed in Section 4. The likelihood L(a⊤ϑ | C) only requires evaluation of h, and only an approximation thereof using the Lebesgue density of 'exact continuous' observations makes the evaluation of the first derivative of h(y) with respect to y necessary. In this case, the derivative with respect to y is given by h'(y) = a'(y)⊤ϑ, and we assume that a' = ∂a(y)/∂y is available. In the following, we will write h = a⊤ϑ and h' = a'⊤ϑ for the transformation function and its first derivative, omitting the argument y, and we assume that both functions are bounded away from −∞ and ∞. For a specific choice of F_Z and a, the transformation family of distributions consists of all distributions ℙ_Y whose distribution function F_Y is given as the composition F_Z ∘ a⊤ϑ; this family can be formally defined as follows.

Definition 2. (Transformation family) The distribution family ℙ_{Y,Θ} := {ℙ_{Y,ϑ} : ϑ ∈ Θ} with parameter space Θ ⊆ ℝ^P is called transformation family of distributions ℙ_{Y,ϑ} with transformation functions a⊤ϑ ∈ H, μ-densities f_Y(y|ϑ), y ∈ Ξ, and error distribution function F_Z.

The classical definition of a transformation family relies on the idea of invariant distributions, that is, only the parameters of a distribution are changed by a transformation function but the distribution itself is not changed. The normal family characterized by affine transformations is the most well-known example (e.g. Fraser, 1968; Lindsey, 1996). Here, we explicitly allow and encourage transformation functions that change the shape of the distribution. The transformation function a⊤ϑ is, at least in principle, flexible enough to generate any distribution function F_Y = F_Z ∘ a⊤ϑ from the distribution function F_Z. We borrow the term 'error distribution' function for F_Z from Fraser (1968), because Z can be understood as an error term in some of the models discussed in Section 4. The problem of estimating the unknown transformation function h, and thus the unknown distribution function F_Y, reduces to the problem of estimating the parameter vector ϑ through maximization of the likelihood function. We assume that the basis function a is such that the parameters ϑ are identifiable.

Definition 3. (Maximum likelihood estimator) ϑ̂_N := arg max_{ϑ ∈ Θ} Σ_{i=1}^{N} log(L(a⊤ϑ | C_i))
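A minimal sketch of Definition 3 in code (our own toy example, assuming SciPy): with the simple basis a(y) = (1, y)⊤ and F_Z = Φ, the transformation family is the normal family, the single monotonicity constraint is ϑ_2 > 0, and the maximum likelihood estimator recovers mean and standard deviation via −ϑ_1/ϑ_2 and 1/ϑ_2.

```python
# Maximum likelihood estimation of a transformation model with a(y) = (1, y)
# and F_Z = Phi, for 'exact continuous' observations (a sketch, names are ours).
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(2)
y = rng.normal(loc=3.0, scale=2.0, size=500)              # observed responses

a = lambda y: np.column_stack([np.ones_like(y), y])       # rows a(y_i)'
a_prime = lambda y: np.column_stack([np.zeros_like(y), np.ones_like(y)])

def negloglik(theta):
    hy = a(y) @ theta                                     # h(y)  = a(y)' theta
    hp = a_prime(y) @ theta                               # h'(y) = a'(y)' theta
    return -np.sum(stats.norm.logpdf(hy) + np.log(hp))

# h'(y) = theta_2 > 0 is the single (here: box) monotonicity constraint.
res = optimize.minimize(negloglik, x0=np.array([0.0, 1.0]),
                        bounds=[(None, None), (1e-6, None)])
theta = res.x
print("mean, sd:", -theta[0] / theta[1], 1.0 / theta[1])  # approx. (3, 2)
```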

Based on the maximum likelihood estimator ϑ̂_N, we define plug-in estimators of the most likely transformation function and the corresponding estimator of our target distribution F_Y as ĥ_N := a⊤ϑ̂_N and F̂_{Y,N} := F_Z ∘ a⊤ϑ̂_N. Because the problem of estimating an unknown distribution function is now embedded in the maximum likelihood framework, the asymptotic analysis benefits from standard results on the asymptotic behaviour of maximum likelihood estimators. We begin with deriving the score function and Fisher information. The score contribution of an 'exact continuous' observation y = (y_ + ȳ)/2 from an absolutely continuous distribution is approximated by the gradient of the log-density
s(ϑ | C) ≈ ∂ log(f_Y(y | ϑ))/∂ϑ = a(y) f_Z'(a(y)⊤ϑ)/f_Z(a(y)⊤ϑ) + a'(y)/(a'(y)⊤ϑ).    (1)
For an interval-censored or discrete observation y_ and ȳ (the constant terms F_Z(a(−∞)⊤ϑ) = 0 and F_Z(a(∞)⊤ϑ) = 1 vanish), the score contribution is
s(ϑ | C) = ∂ log(L(a⊤ϑ | C))/∂ϑ = ( f_Z(a(ȳ)⊤ϑ) a(ȳ) − f_Z(a(y_)⊤ϑ) a(y_) ) / ( F_Z(a(ȳ)⊤ϑ) − F_Z(a(y_)⊤ϑ) ).    (2)
For a truncated observation, the score function is s(ϑ | C) − s(ϑ | Y ∈ (y_l, y_r]).
The contribution of an 'exact continuous' observation y from an absolutely continuous distribution to the Fisher information is approximately
F(ϑ | C) ≈ −∂² log(f_Y(y | ϑ))/(∂ϑ ∂ϑ⊤)
         = −[ a(y) a(y)⊤ { f_Z''(a(y)⊤ϑ)/f_Z(a(y)⊤ϑ) − ( f_Z'(a(y)⊤ϑ)/f_Z(a(y)⊤ϑ) )² } − a'(y) a'(y)⊤/(a'(y)⊤ϑ)² ].    (3)
For a censored or discrete observation, we have the following contribution to the Fisher information
F(ϑ | C) = −∂² log(L(a⊤ϑ | C))/(∂ϑ ∂ϑ⊤)
         = −[ ( f_Z'(a(ȳ)⊤ϑ) a(ȳ) a(ȳ)⊤ − f_Z'(a(y_)⊤ϑ) a(y_) a(y_)⊤ ) / ( F_Z(a(ȳ)⊤ϑ) − F_Z(a(y_)⊤ϑ) )
            − ( f_Z(a(ȳ)⊤ϑ) a(ȳ) − f_Z(a(y_)⊤ϑ) a(y_) ) ( f_Z(a(ȳ)⊤ϑ) a(ȳ) − f_Z(a(y_)⊤ϑ) a(y_) )⊤ / ( F_Z(a(ȳ)⊤ϑ) − F_Z(a(y_)⊤ϑ) )² ].    (4)

For a truncated observation, the Fisher information is given by F(ϑ | C) − F(ϑ | Y ∈ (y_l, y_r]).
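As a sanity check of the score expression (1) as reconstructed above, the sketch below (our own, with F_Z = Φ so that f_Z'(z)/f_Z(z) = −z) compares the analytic gradient with a numerical one for a single 'exact continuous' observation.

```python
# Numerical sanity check of the score (1) with F_Z = Phi.
import numpy as np
from scipy import stats, optimize

a_vec  = lambda y: np.array([1.0, y])        # a(y) = (1, y)'
ap_vec = lambda y: np.array([0.0, 1.0])      # a'(y)

def logdens(theta, y):                       # log f_Y(y | theta)
    return stats.norm.logpdf(a_vec(y) @ theta) + np.log(ap_vec(y) @ theta)

def score(theta, y):                         # analytic gradient, cf. (1)
    return -(a_vec(y) @ theta) * a_vec(y) + ap_vec(y) / (ap_vec(y) @ theta)

theta0, y0 = np.array([-1.5, 0.5]), 1.7
numerical = optimize.approx_fprime(theta0, logdens, 1e-6, y0)
print(np.allclose(score(theta0, y0), numerical, atol=1e-4))   # True
```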

We will first discuss the asymptotic properties of the maximum likelihood estimator ϑ̂_N in the parametric setting with fixed parameters ϑ in both the discrete and continuous case. For continuous variables Y and a transformation function parameterized using a Bernstein polynomial, results for sieve maximum likelihood estimation, where the number of parameters increases with N, are then discussed in Section 3.2.

3.1 Parametric inference

Conditions on the densities of the error distribution f_Z and the basis functions a ensuring consistency and asymptotic normality of the sequence of maximum likelihood estimators ϑ̂_N and an estimator of their asymptotic covariance matrix are given in the following three theorems. Because of the full parameterization of the model, the proofs are simple standard results for likelihood asymptotics, and a more complex analysis (as required for estimating equations in the presence of a nuisance parameter h_Y, e.g. in Cheng et al., 1995) is not necessary. We will restrict ourselves to absolutely continuous or discrete random variables Y, where the likelihood is given in terms of the density f_Y(y|ϑ). Furthermore, we will only study the case of a correctly specified transformation h = a⊤ϑ and refer the reader to Hothorn et al. (2014), where consistency results for arbitrary h are given.

Theorem 1. For Y ∼ ℙ_{Y,ϑ_0} with ϑ_0 ∈ Θ and under the assumptions (A1) the parameter space Θ is compact and (A2) E_{ϑ_0} sup_{ϑ ∈ Θ} |log(f_Y(Y | ϑ))| < ∞, where ϑ_0 is well separated:

sup_{ϑ : ‖ϑ − ϑ_0‖ ≥ ε} E_{ϑ_0} log(f_Y(Y | ϑ)) < E_{ϑ_0} log(f_Y(Y | ϑ_0)) for all ε > 0,

the sequence of estimators ϑ̂_N converges to ϑ_0 in probability, ϑ̂_N →_ℙ ϑ_0, as N → ∞.

Proof. The log-likelihood is continuous in ϑ, and because of (A2), each log-likelihood contribution is dominated by an integrable function. Thus, the result follows from van der Vaart (1998) (Theorem 5.8 with example 19.7; see note at bottom of page 46).

Remark 1. Assumption (A1) is made for convenience, and relaxations of such a condition are given in van de Geer (2000) or van der Vaart (1998). The assumptions in (A2) are rather weak: the first one holds if the functions a are not arbitrarily ill posed, and the second one holds if the function ϑ ↦ −E_{ϑ_0} log(f_Y(Y | ϑ)) is strictly convex in ϑ (if the assumption did not hold, we would still have convergence to the set arg max_{ϑ ∈ Θ} E_{ϑ_0} log(f_Y(Y | ϑ))).

Theorem 2. Under the assumptions of Theorem 1 and in addition (A3)

urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0111
(A4) urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0112 and (for the absolutely continuous case μ = μ_L only) urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0113 are non-singular, and (A5) urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0114 and urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0115, the sequence √N(ϑ̂_N − ϑ_0) is asymptotically normal with mean zero and covariance matrix
urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0117
as N → ∞.

Proof. Because the map ϑ ↦ √(f_Y(y | ϑ)) is continuously differentiable in ϑ for all y in both the discrete and absolutely continuous case and the matrix

urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0120
is continuous in ϑ as given in (1) and (2), the transformation family ℙ_{Y,Θ} is differentiable in quadratic mean with Lemma 7.6 in van der Vaart (1998). Furthermore, assumptions (A4) and (A5) ensure that the expected Fisher information matrix is non-singular at ϑ_0. With the consistency and (A3), the result follows from Theorem 5.39 in van der Vaart (1998).

Remark 2. Assumption (A4) is valid for the densities f_Z of the normal, logistic and MEV distribution. The Fisher information (3) and (4) evaluated at the maximum likelihood estimator ϑ̂_N can be used to estimate the covariance matrix Σ_{ϑ_0}.

Theorem 3. Under the assumptions of Theorem 2 and assuming urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0123, a consistent estimator for Σ_{ϑ_0} is given by

urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0124

Proof. With the law of large numbers, we have

urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0125
Because the map ϑ ↦ F(ϑ | y) is continuous for all y (as can be seen from (3) and (4)), the result follows with Theorem 1.
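To illustrate how Theorem 3 is used in practice, the following sketch continues the normal-family example fitted after Definition 3 (it reuses y and theta from that code; with F_Z = Φ, the contribution (3) simplifies because f_Z''/f_Z − (f_Z'/f_Z)² = −1) and derives Wald intervals for ϑ and a pointwise confidence band for F_Y.

```python
# Wald-type inference from the observed Fisher information (continues the
# normal-family sketch from Section 3; reuses y and theta from there).
import numpy as np
from scipy import stats

A = np.column_stack([np.ones_like(y), y])                   # rows a(y_i)'
Ap = np.column_stack([np.zeros_like(y), np.ones_like(y)])   # rows a'(y_i)'
# For F_Z = Phi, (3) reduces to a(y) a(y)' + a'(y) a'(y)'/(a'(y)' theta)^2,
# and a'(y)' theta = theta[1] for every y with this basis.
info = A.T @ A + (Ap.T @ Ap) / theta[1] ** 2
Sigma = np.linalg.inv(info)                                 # inverse observed information
se = np.sqrt(np.diag(Sigma))
print("Wald 95% CIs:", theta - 1.96 * se, theta + 1.96 * se)

# Pointwise 95% confidence band for F_Y(y) = Phi(a(y)' theta), obtained by
# transforming the band for the linear function a(y)' theta.
grid = np.linspace(y.min(), y.max(), 101)
Ag = np.column_stack([np.ones_like(grid), grid])
h_hat = Ag @ theta
se_h = np.sqrt(np.einsum("ij,jk,ik->i", Ag, Sigma, Ag))
band = stats.norm.cdf(np.column_stack([h_hat - 1.96 * se_h, h_hat + 1.96 * se_h]))
```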

Based on Theorems 1–3, we can perform standard likelihood inference on the model parameters ϑ. In particular, we can construct confidence intervals and confidence bands for the conditional distribution function from confidence intervals and bands for the linear functions a⊤ϑ. We complete this part by formally defining the class of transformation models.

Definition 4. (Transformation model) The triple (F_Z, a, ϑ) is called transformation model.

The transformation model (F_Z, a, ϑ) fully defines the distribution of Y via F_Y = F_Z ∘ a⊤ϑ and thus the corresponding likelihood L(a⊤ϑ | C). Our definition of transformation models as (F_Z, a, ϑ) is strongly tied to the idea of structural inference (Fraser, 1968) and group models (Bickel et al., 1993). Fraser (1968) described a measurement model for Y by an error distribution ℙ_Z and a structural equation Y = g(Z), where g is a linear function, thereby extending the location-scale family Y = α + σZ. Group models consist of distributions generated by possibly non-linear g. The main difference to these classical approaches is that we parameterize h instead of g = h^{-1}. By extending the linear transformation functions g dealt with by Fraser (1968) to non-linear transformations, we approximate the potentially non-linear transformation functions h ∈ H by a⊤ϑ, with subsequent estimation of the parameters ϑ. For given parameters ϑ, a sample from ℙ_{Y,ϑ} can be drawn by the probability integral transform, that is, Z_1, …, Z_N ∼ ℙ_Z is drawn and then Y_i := inf{y ∈ Ξ | a(y)⊤ϑ ≥ Z_i}.
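The probability integral transform sampler described above can be written in a few lines; the sketch below (our own example with K = 4 ordered categories and F_Z the standard logistic) draws Z_i ∼ F_Z and returns the smallest y_k with a(y_k)⊤ϑ ≥ Z_i.

```python
# Sampling from a transformation model (F_Z, a, theta) by the probability
# integral transform for an ordered response with 4 categories.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
theta = np.array([-1.0, 0.3, 1.2])           # h(y_k) = theta_k, k = 1, 2, 3
categories = np.array([1, 2, 3, 4])          # sample space {y_1, ..., y_4}

z = stats.logistic.rvs(size=10_000, random_state=rng)
# number of thresholds strictly below Z_i gives the sampled category index
idx = np.sum(z[:, None] > theta[None, :], axis=1)   # values in {0, 1, 2, 3}
y = categories[idx]

# check: P(Y <= y_k) should equal F_Z(theta_k) for k = 1, 2, 3
print(np.mean(y[:, None] <= categories[None, :3], axis=0))
print(stats.logistic.cdf(theta))
```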

3.2 Non‐parametric inference

For continuous responses Y, any unknown transformation h can be approximated by Bernstein polynomials of increasing order (Farouki, 2012). For uncensored and right-censored responses and under the same conditions for F_Z as stated in Section 3.1, McLain & Ghosh (2013) showed that the non-parametric sieve maximum likelihood estimator is consistent with rate of convergence N^{2/5} for h with continuous bounded second derivatives in unconditional and linear transformation models (Section 4.3). In the latter class, the linear shift parameters β are asymptotically normal and semi-parametrically efficient. Numerical approximations to the observed Fisher information were shown to lead to appropriate standard errors of β̂_N by McLain & Ghosh (2013). Hothorn et al. (2014) established the consistency of boosted non-parametric conditional transformation models (Section 4.2). For sieve maximum likelihood estimation in the class of conditional transformation models, the techniques employed by McLain & Ghosh (2013) require minor technical extensions, which are omitted here.

In summary, the same limiting distribution arises under both the parametric and the non-parametric paradigm for transformation functions parameterized or approximated using Bernstein polynomials, respectively. In the latter case, the target is then the transformation function best approximated by Bernstein polynomials, say h_N (where the index N indicates that we use a more complex approximation when N increases). If the approximation error h_N − h is of smaller order than the convergence rate of the estimator, the estimator's target becomes the true underlying transformation function h, and otherwise, a bias for estimating h remains.

4 Applications

The definition of transformation models tailored for specific situations 'only' requires the definition of a suitable basis function a and a choice of F_Z. In this section, we will discuss specific transformation models for unconditional and conditional distributions of ordered categorical, discrete and continuous responses Y. Note that the likelihood L(h | C) allows all these models to be fitted to arbitrarily censored or truncated responses; for brevity, we will not elaborate on the details.

4.1 Unconditional transformation models

Finite sample space

For ordered categorical responses Y from a finite sample space Ξ = {y_1, …, y_K}, we assign one parameter to each element of the sample space except y_K. This corresponds to the basis function a(y_k) = e_{K−1}(k), where e_{K−1}(k) is the unit vector of length K − 1, with its kth element being one. The transformation function h is
h(y_k) = e_{K−1}(k)⊤ϑ = ϑ_k, k = 1, …, K − 1,
with ϑ = (ϑ_1, …, ϑ_{K−1})⊤ ∈ ℝ^{K−1}, and the unconditional distribution function of F_Y is F_Y(y_k) = F_Z(ϑ_k). This parameterization underlies the common proportional odds and proportional hazards model for ordered categorical data (Tutz, 2012). Note that monotonicity of h is guaranteed by the K − 2 linear constraints ϑ_2 − ϑ_1 > 0, …, ϑ_{K−1} − ϑ_{K−2} > 0 when constrained optimization is performed. In the absence of censoring or truncation and with observations Y_1, …, Y_N, we obtain the maximum likelihood estimator for ϑ as
ϑ̂_N = arg max_{ϑ} Σ_{i=1}^{N} log(f_Y(Y_i | ϑ)),
because F_Z(ϑ̂_{N,k}) = N^{-1} Σ_{i=1}^{N} 1(Y_i ≤ y_k) maximizes the equivalent multinomial (or empirical) log-likelihood Σ_{k=1}^{K} N_k log(F_Z(ϑ_k) − F_Z(ϑ_{k−1})) (with N_k the number of observations equal to y_k), and we can rewrite this estimator as
ϑ̂_{N,k} = F_Z^{-1}( N^{-1} Σ_{i=1}^{N} 1(Y_i ≤ y_k) ), k = 1, …, K − 1.
The estimated distribution function F̂_{Y,N}(y_k) = F_Z(ϑ̂_{N,k}) is invariant with respect to F_Z.

Assumption (A4) is valid for these basis functions because we have urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0146 for Y ∼ ℙ_{Y,ϑ_0}.

If we define the sample space Ξ as the set of unique observed values and the probability measure as the empirical cumulative distribution function (ECDF), putting mass N^{-1} on each observation, we see that this particular parameterization is equivalent to an empirical likelihood approach, and we get F̂_{Y,N}(y) = N^{-1} Σ_{i=1}^{N} 1(Y_i ≤ y). Note that although the transformation function depends on the choice of F_Z, the estimated distribution function F̂_{Y,N} does not and is simply the non-parametric empirical maximum likelihood estimator. A smoothed version of this estimator for continuous responses is discussed in the next paragraph.
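The invariance of the fitted distribution function with respect to F_Z in this saturated parameterization is easy to verify numerically; in the sketch below (our own example with Poisson counts), ϑ̂_k = F_Z^{-1}(ECDF(y_k)) and F_Z(ϑ̂_k) reproduces the ECDF for any choice of F_Z.

```python
# One parameter per observed level: theta_hat_k = F_Z^{-1}(ECDF(y_k)),
# and F_Z(theta_hat_k) is the ECDF whatever F_Z is (a sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
y = rng.poisson(3.0, size=200)
levels = np.unique(y)                          # take Xi as the observed values
ecdf = np.array([(y <= k).mean() for k in levels])

for FZ in (stats.norm, stats.logistic):
    theta_hat = FZ.ppf(ecdf[:-1])              # last level: F_Y = 1, no parameter
    assert np.allclose(FZ.cdf(theta_hat), ecdf[:-1])
print(ecdf)
```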

Infinite sample space

For continuous responses Y, the parameterization h(y) = a(y)⊤ϑ, and thus also F̂_{Y,N} = F_Z ∘ a⊤ϑ̂_N, should be smooth in y; therefore, any polynomial or spline basis is a suitable choice for a. For the empirical experiments in Section 5, we applied Bernstein polynomials (for an overview, see Farouki, 2012) of order M (P = M + 1) defined on the interval [l, u] ⊂ ℝ with
a_{Bs,M}(y)⊤ϑ = Σ_{m=0}^{M} ϑ_m f_{Be(m+1, M−m+1)}(ỹ) / (M + 1),
where ỹ = (y − l)/(u − l) ∈ [0, 1] and f_{Be(m,M)} is the density of the Beta distribution with parameters m and M. This choice is computationally attractive because strict monotonicity can be formulated as a set of M linear constraints on the parameters, ϑ_m < ϑ_{m+1} for all m = 0, …, M − 1 (Curtis & Ghosh, 2011). Therefore, application of constrained optimization guarantees monotonic estimates ĥ_N. The basis contains an intercept. We obtain smooth plug-in estimators for the distribution, density, hazard and cumulative hazard functions as F̂_{Y,N} = F_Z ∘ a_{Bs,M}⊤ϑ̂_N, f̂_{Y,N} = f_Z(a_{Bs,M}⊤ϑ̂_N) (a'_{Bs,M}⊤ϑ̂_N), λ̂_{Y,N} = f̂_{Y,N}/(1 − F̂_{Y,N}) and Λ̂_{Y,N} = −log(1 − F̂_{Y,N}). The estimator ĥ_N must not be confused with the estimator for Y ∈ [0, 1] obtained from the smoothed empirical distribution function with coefficients ϑ̂_m corresponding to probabilities evaluated at the quantiles m/M for m = 0, …, M (Babu et al., 2002).
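A sketch of the Bernstein basis as defined above, built from Beta densities, together with the first-order difference matrix encoding the monotonicity constraints (function names, interval and order are our own choices):

```python
# Bernstein basis of order M on [l, u] from Beta densities, plus the
# first-order difference constraints that make a_Bs,M(y)' theta monotone.
import numpy as np
from scipy import stats

def bernstein_basis(y, M, l, u):
    """Rows a_Bs,M(y)' for y in [l, u]; columns m = 0, ..., M."""
    ytil = (np.asarray(y, dtype=float) - l) / (u - l)        # rescale to [0, 1]
    cols = [stats.beta.pdf(ytil, m + 1, M - m + 1) / (M + 1) for m in range(M + 1)]
    return np.column_stack(cols)

M, l, u = 8, 40.0, 100.0
grid = np.linspace(l, u, 7)
A = bernstein_basis(grid, M, l, u)
print(A.sum(axis=1))           # rows sum to one: the basis is a partition of unity

# Monotonicity of h = A @ theta is guaranteed by D theta > 0, with D the
# (M x (M+1)) first-order difference matrix.
D = np.diff(np.eye(M + 1), axis=0)
theta = np.sort(np.random.default_rng(5).normal(size=M + 1))
print((D @ theta > 0).all())   # sorted coefficients satisfy the constraints
```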

The question arises how the degree of the polynomial affects the estimated distribution function. On the one hand, the model (Φ, a_{Bs,1}, ϑ) only allows linear transformation functions of a standard normal, and F_Y is restricted to the normal family. On the other hand, (Φ, a_{Bs,N−1}, ϑ) has one parameter for each observation, and F̂_{Y,N} is the non-parametric maximum likelihood estimator ECDF, which, by the Glivenko–Cantelli lemma, converges to F_Y. In this sense, we cannot choose a 'too large' value for M. This is a consequence of the monotonicity constraint on the estimator a_{Bs,M}⊤ϑ̂_N, which, in this extreme case, just interpolates the step function F_Z^{-1} ∘ ECDF. Empirical evidence for the insensitivity of results when M is large can be found in Hothorn (2017b) and in the discussion.

4.2 Conditional transformation models

In the following, we will discuss a cascade of increasingly complex transformation models where the transformation function h may depend on explanatory variables X ∈ χ. We are interested in estimating the conditional distribution of Y given X = x. The corresponding distribution function F_{Y|X=x} can be written as F_{Y|X=x}(y) = F_Z(h(y|x)). The transformation function h(·|x) is said to be conditional on x. Following the arguments presented in the proof of Corollary 1, it is easy to see that for each x, there exists a strictly monotonic transformation function h(·|x) such that F_{Y|X=x}(y) = F_Z(h(y|x)). Because this class of conditional transformation models and suitable parameterizations was introduced by Hothorn et al. (2014), we will only sketch the most important aspects here.

Let b : χ → ℝ^Q denote a basis transformation of the explanatory variables. The joint basis for both y and x is called c : Ξ × χ → ℝ^{d(P,Q)}; its dimension d(P,Q) depends on the way the two basis functions a and b are combined (e.g. c(y, x) = (a(y)⊤, b(x)⊤)⊤ or c(y, x) = (a(y)⊤ ⊗ b(x)⊤)⊤). The conditional transformation function is now parameterized as h(y|x) = c(y, x)⊤ϑ. One important special case is the simple transformation function h(y|x) = h_Y(y) + h_x(x), where the explanatory variables only contribute a shift h_x(x) to the conditional transformation function. Often this shift is assumed to be linear in x; therefore, we use the function b_lin(x) = x̃ to denote linear shifts. Here, x̃⊤ is one row of the design matrix without intercept. These simple models correspond to the joint basis c(y, x) = (a(y)⊤, b_lin(x)⊤)⊤, with h_Y(y) = a(y)⊤ϑ_1 and h_x(x) = x̃⊤ϑ_2. The results presented in Section 3, including Theorems 1–3, carry over in the fixed design case when a is replaced by c.
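For the simple linear-shift case, the joint basis is just a concatenation of the response basis and the covariate row; the sketch below (our own helper, using the sign convention h(y|x) = a(y)⊤ϑ_1 − x̃⊤β that reappears in Section 4.3) makes this explicit.

```python
# Joint basis c(y, x) for a linear shift model, so that h(y | x) = c(y, x)' theta
# with theta = (theta_1', beta')' and h(y | x) = a(y)' theta_1 - x' beta.
import numpy as np

def joint_basis_shift(a_y, x):
    """Concatenate response basis rows a(y_i)' with negated covariate rows x_i'."""
    return np.column_stack([a_y, -np.asarray(x)])

a_y = np.array([[1.0, 0.2], [1.0, 1.4]])      # a(y_i)' for two observations
x = np.array([[0.0, 1.0], [1.0, 0.5]])        # covariate rows (no intercept)
theta = np.array([-0.5, 1.0, 0.3, -0.2])      # (theta_1, beta)
print(joint_basis_shift(a_y, x) @ theta)      # h(y_i | x_i), i = 1, 2
```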

In the rest of this section, we will present classical models that can be embedded in the larger class of conditional transformation models and some novel models that can be implemented in this general framework.

4.3 Classical transformation models

Linear model

The normal linear regression model Y ∼ N(α + m(x), σ²) with conditional distribution function F_{Y|X=x}(y) = Φ(σ^{-1}(y − (α + m(x)))) can be understood as a transformation model with transformation function h(y|x) = y/σ − α/σ − m(x)/σ parameterized via basis functions a(y) = (y, 1)⊤ and c = (a⊤, b_lin⊤)⊤ with parameters ϑ = (σ^{-1}, −σ^{-1}α, −σ^{-1}β⊤)⊤ under the constraint σ > 0, or in more compact notation (Φ, (y, 1, x̃⊤)⊤, ϑ). The parameters of the model are the inverse standard deviation and the inverse negative coefficient of variation instead of the mean and variance of the original normal distribution. For 'exact continuous' observations, the likelihood L(c⊤ϑ | C) is equivalent to least squares, which can be maximized with respect to α and β without taking σ into account. This is not possible for censored or truncated observations, where we need to evaluate the conditional distribution function that depends on all parameters; this model is called Type I Tobit model (although only the likelihood changes under censoring and truncation, but the model does not). Using an alternative basis function c would allow arbitrary non-normal conditional distributions of Y, and the simple shift model h(y|x) = h_Y(y) − x̃⊤β is then a generalization of additive models and leads to the interpretation of the linear shift x̃⊤β on the scale of the transformed response h_Y(y). The choice a(y) = (log(y), 1)⊤ implements the log-normal model for Y > 0. Implementation of a Bernstein basis a = a_{Bs,M} allows arbitrarily shaped distributions, that is, a transition from the normal family to the transformation family, and thus likelihood inference on ϑ_2 without strict assumptions on the distribution of Y. The transformation h_Y(y) = a_{Bs,M}(y)⊤ϑ_1 must increase monotonically in y. Maximization of the log-likelihood under the linear inequality constraint D_{M+1}ϑ_1 > 0, with D_{M+1} representing first-order differences, implements this requirement.
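As an illustration of the point about censoring, the sketch below (simulated data, our own parameterization following the display above) fits the normal linear transformation model by maximizing the exact likelihood with right-censored responses, that is, a Tobit-type fit, and recovers σ, α and β from ϑ.

```python
# Normal linear transformation model with right-censored responses.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(6)
n = 400
x = rng.normal(size=n)
y_star = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)  # latent Y ~ N(alpha + beta x, sigma^2)
cpoint = 3.0
y = np.minimum(y_star, cpoint)                          # right censoring at cpoint
censored = y_star > cpoint

def negloglik(par):
    # h(y | x) = t1*y + t2 - b*x with t1 = 1/sigma, t2 = -alpha/sigma, b = beta/sigma
    t1, t2, b = par
    h = t1 * y + t2 - b * x
    ll_exact = stats.norm.logpdf(h) + np.log(t1)          # 'exact continuous' contributions
    ll_cens = stats.norm.logsf(t1 * cpoint + t2 - b * x)  # right censored: 1 - Phi(h(cpoint | x))
    return -np.sum(np.where(censored, ll_cens, ll_exact))

res = optimize.minimize(negloglik, x0=np.array([1.0, 0.0, 0.0]),
                        bounds=[(1e-6, None), (None, None), (None, None)])
t1, t2, b = res.x
print("sigma, alpha, beta:", 1 / t1, -t2 / t1, b / t1)    # approx. (1.5, 1, 2)
```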

Continuous ‘survival time’ models

For a continuous response Y > 0, the model urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0180 with basis functions urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0181 and urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0182 and parameters ϑ = (−α, σ^{-1}, −β) under the constraint σ > 0 is called the accelerated failure time (AFT) model. The model with σ ≡ 1 (and thus fixed transformation function h_Y(y) = log(y)) is the exponential AFT model because it implies an exponential distribution of Y. When the parameter σ > 0 is estimated from the data, the model with F_Z = F_MEV is called the Weibull model, F_Z = F_SL gives the log-logistic AFT model and F_Z = Φ gives the log-normal AFT model. For a continuous (not necessarily positive) response Y, the model F_{Y|X=x}(y) = F_MEV(h_Y(y) − m(x)) is called the proportional hazards, relative risk or Cox model. The transformation function h_Y equals the log-cumulative baseline hazard and is treated as a nuisance parameter in the partial likelihood framework, where only the regression coefficients β are estimated. Given β̂_N, non-parametric maximum likelihood estimators are typically applied to obtain ĥ_Y. Here, we parameterize this function as h_Y(y) = a(y)⊤ϑ_1 (e.g. using a = a_{Bs,M}) and fit all parameters in the model F_{Y|X=x}(y) = F_MEV(a(y)⊤ϑ_1 − x̃⊤β) simultaneously. The model is highly popular because m(x) is the log-hazard ratio to m(0). For the special case of right-censored survival times, this parameterization of the Cox model was studied theoretically and empirically by McLain & Ghosh (2013). Changing the distribution function in the Cox model from F_MEV to F_SL results in the proportional odds model F_{Y|X=x}(y) = F_SL(h_Y(y) − m(x)); its name comes from the interpretation of m(x) as the constant log-odds ratio of the odds O_Y(y|X=x) and O_Y(y|x=0). An additive hazards model with the conditional hazard function urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0193 results from the choice F_Z(z) = 1 − exp(−z) (Aranda-Ordaz, 1983) under the additional constraint λ_Y(y|X=x) > 0. In this case, the function h_Y is the positive baseline cumulative hazard function Λ_Y(y|X=0).

Discrete models

For ordered categorical responses y_1 < ⋯ < y_K, the conditional distribution F_{Y|X=x}(y_k) = F_Z(h_Y(y_k) − m(x)) is a transformation model with a(y_k) = e_{K−1}(k). The model with F_Z = F_SL is called the discrete proportional odds model, and F_Z = F_MEV gives the discrete proportional hazards model. Here, m(x) is the log-odds ratio or log-hazard ratio to m(0) independent of k; details are given in Tutz (2012). For the special case of a binary response (K = 2), the transformation model with F_Z = F_SL is the logistic regression model, F_Z = Φ gives the probit model and F_Z = F_MEV is called the complementary log–log model. Note that the transformation function h_Y is given by the basis function a(y_1) = 1, that is, ϑ_1 is just the intercept. The connection between standard binary regression models and transformation models is explained in more detail by Doksum & Gasko (1990).

Linear transformation model

The transformation model (F_Z, (a⊤, b_lin⊤)⊤, ϑ) for any a and F_Z is called the linear transformation model and contains all models discussed in this section. Note that the transformation of the response, h_Y(y) = a(y)⊤ϑ_1, is non-linear in all models of interest (AFT, Cox etc.) and the term 'linear' only refers to a linear shift m(x) of the explanatory variables. Partially linear or additive transformation models allow non-linear shifts as part of a partially smooth basis b, that is, in the form of an additive model. The number of constraints only depends on the basis a but not on the explanatory variables.

4.4 Extension of classical transformation models

A common property of all classical transformation models is the additivity of the response transformation and the shift, that is, the decomposition h(y|x)=hY(y)+hx(x) of the conditional transformation function. This assumption is relaxed by the following extensions of the classical models. Allowing for deviations from this simple model is also the key aspect for the development of novel transformation models in the rest of this section.

Discrete non‐proportional odds and hazards models

For ordered categorical responses, the model F_{Y|X=x}(y_k) = F_Z(h_Y(y_k) − m_k(x)) allows a category-specific shift m_k(x) = x̃⊤β_k; with F_SL, this cumulative model is called the non-proportional odds model, and with F_MEV, it is the non-proportional hazards model. Both models can be cast into the transformation model framework by defining the joint basis c(y_k, x) = (a(y_k)⊤, a(y_k)⊤ ⊗ b(x)⊤)⊤ as the Kronecker product of the two simple basis functions a(y_k) = e_{K−1}(k) and b(x) = x̃ (assuming that b does not contain an intercept term). Note that the conditional transformation function h(y|x) includes an interaction term between y and x.

Time‐varying effects

One often studied extension of the Cox model is F_{Y|X=x}(y) = F_MEV(h_Y(y) − x̃⊤β(y)), where the regression coefficients β(y) may change with time y. The Cox model is included with β(y) ≡ β, and the model is often applied to check the proportional hazards assumption. With a smooth parameterization of time y, for example, via a = a_{Bs,M}, and linear basis b_lin(x) = x̃, the transformation model with joint basis c(y, x) = (a(y)⊤, a(y)⊤ ⊗ b_lin(x)⊤)⊤ implements this Cox model with time-varying (linear) effects. This model (with arbitrary F_Z) has also been presented in Foresi & Peracchi (1995) and is called distribution regression in Chernozhukov et al. (2013).

4.5 Novel transformation models

Because of the broadness of the transformation family, it is straightforward to set up new models for interesting situations by allowing more complex transformation functions h(y|x). We will illustrate this possibility for two simple cases: the independent two-sample situation and regression models for count data. The generic and most complex transformation model is called the conditional transformation model and is explained at the end of this section.

Beyond shift effects

Assume we observe samples from two groups A and B and want to model the conditional distribution functions F_{Y|X=A}(y) and F_{Y|X=B}(y) of the response Y in the two groups. Based on this model, it is often interesting to infer whether the two distributions are equivalent and, if this is not the case, to characterize how they differ. Using an appropriate basis function a and the basis b(x) = 1(x = B), the model (F_Z, (a⊤, a⊤b)⊤, ϑ) parameterizes the conditional transformation function as h(y | x = A) = a(y)⊤ϑ_1 and h(y | x = B) = a(y)⊤ϑ_1 + a(y)⊤ϑ_2. Clearly, the second term is constant zero (h_{B−A}(y) ≡ 0) iff the two distributions are equivalent (F_{Y|X=A}(y) = F_{Y|X=B}(y) for all y). For the deviation function h_{B−A}(y) = a(y)⊤ϑ_2, we can apply standard likelihood inference procedures for ϑ_2 to construct a confidence band or use a test statistic, for example the maximum absolute deviation of a(y)⊤ϑ̂_2 from zero, to assess deviations from zero. If there is evidence for a group effect, we can use the model to check whether the deviation function is constant, that is, h_{B−A}(y) ≡ c ≠ 0. In this case, the simpler model (F_Z, (a⊤, 1(x = B))⊤, ϑ) with shift β = −ϑ_2 might be easier to interpret. This model actually corresponds to a normal analysis of variance model with F_Z = Φ and a(y) = (1, y)⊤ or the Cox proportional hazards model with F_Z = F_MEV.
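A sketch of the two-group model (our own simulation; for numerical convenience each group gets its own coefficients, so the deviation function is h_{B−A}(y) = a(y)⊤(ϑ_B − ϑ_A) with a(y) = (1, y)⊤ and F_Z = Φ):

```python
# Two-sample transformation model with group-specific coefficients; the
# deviation h_{B-A}(y) = a(y)'(theta_B - theta_A) measures how the groups differ.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(7)
yA = rng.normal(0.0, 1.0, size=300)
yB = rng.normal(0.5, 2.0, size=300)            # shift and scale differ

def negloglik(par):
    tA, tB = par[:2], par[2:]                  # (intercept, slope) per group
    def ll(y, t):                              # h(y) = t[0] + t[1] * y, t[1] > 0
        return stats.norm.logpdf(t[0] + t[1] * y) + np.log(t[1])
    return -(np.sum(ll(yA, tA)) + np.sum(ll(yB, tB)))

res = optimize.minimize(negloglik, x0=np.array([0.0, 1.0, 0.0, 1.0]),
                        bounds=[(None, None), (1e-6, None)] * 2)
dev = res.x[2:] - res.x[:2]                    # coefficients of h_{B-A}(y)
print(dev)                                     # (0, 0) would mean F_{Y|A} = F_{Y|B}
```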

Count regression ‘without tears’

Simple models for count data Ξ = {0, 1, 2, …} almost always suffer from over-dispersion or excess zeros. The linear transformation model F_{Y|X=x}(y) = F_Z(h_Y(y) − m(x)) can be implemented using the basis function a(y) = a_{Bs,M}(⌊y⌋), and then the parameters of the transformation model urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0218 are not affected by over-dispersion or under-dispersion because higher moments are handled by h_Y independently of the effects of the explanatory variables m(x). If there are excess zeros, we can set up a joint transformation model urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0219 such that we have a two-component mixture model consisting of the count distribution F_{Y|X=x}(y) = F_Z(h_Y(y) − m(x)) for y ∈ Ξ and the probability of an excess zero
urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0220
when urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0221. Hence, the transformation analogue to a hurdle model with hurdle at zero is the transformation model urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0222.

Conditional transformation models

When the conditional transformation function is parameterized by multiple basis functions a_j(y), b_j(x), j = 1, …, J via the joint basis
c(y, x) = (a_1(y)⊤ ⊗ b_1(x)⊤, …, a_J(y)⊤ ⊗ b_J(x)⊤)⊤,
models of the class (·, c, ϑ) are called conditional transformation models with J partial transformation functions parameterized as h_j(y|x) = (a_j(y)⊤ ⊗ b_j(x)⊤)ϑ_j and include all special cases discussed in this section. It is convenient to assume monotonicity for each of the partial transformation functions; thus, the linear constraints for a_j are repeated for each basis function in b_j (detailed descriptions of linear constraints for different models in this class are available in Hothorn, 2017b). Hothorn et al. (2014) introduced this general model class and proposed a boosting algorithm for the estimation of transformation functions h for 'exact continuous' responses Y. In the likelihood framework presented here, conditional transformation models can be fitted under arbitrary schemes of censoring and truncation, and classical likelihood inference for the model parameters ϑ becomes feasible. Of course, unlike in the boosting context, the number of model terms J and their complexity are limited in the likelihood world because the likelihood does not contain any penalty terms that induce smoothness in the x-direction.

A systematic overview of linear transformation models with potentially response‐varying effects is given in Table 1. Model nomenclature and interpretation of the corresponding model parameters is mapped to specific transformation functions h and distribution functions FZ. To the best of our knowledge, models without names have not yet been discussed in the literature, and their specific properties await closer investigation.

Table 1. Non‐exhaustive overview of conditional transformation models. Abbreviations: proportional hazards (PH), proportional odds (PO), additive hazards (AH), odds ratio (OR), hazard ratio (HR), complementary log (clog), complementary log–log (cloglog), normal linear regression model (NLRM), binary generalized linear model (BGLM), accelerated failure time (AFT)
Error distributions F_Z: Φ (standard normal), F_SL (standard logistic), F_Exp (F_Z(z) = 1 − exp(−z)) and F_MEV (minimum extreme value). Rows are grouped by sample space Ξ; entries give the established model names for the corresponding choice of F_Z together with the interpretation of the linear shift parameters β.

Ξ = {y_1, y_2}: binary regression, linear shift h(y_1 | x) = ϑ_1 − x̃⊤β
  Φ: probit BGLM; F_SL: logistic BGLM; F_Exp: clog BGLM; F_MEV: cloglog BGLM
  β: log-OR (F_SL), AH (F_Exp), log-HR (F_MEV); baseline under F_Exp: Λ_Y(y | X = 0)

Ξ = {y_1, …, y_K}: polytomous regression
  linear shift h(y_k | x) = e_{K−1}(k)⊤ϑ_1 − x̃⊤β: discrete PO (F_SL); discrete PH (F_MEV)
    β: log-OR (F_SL), AH (F_Exp), log-HR (F_MEV); baseline under F_Exp: Λ_Y(y | X = 0)
  response-varying effects h(y_k | x) = e_{K−1}(k)⊤ϑ(x): non-PO (F_SL); non-PH (F_MEV)

Ξ = {0, 1, 2, …}: count regression
  linear shift h(y | x) = a_{Bs,M}(⌊y⌋)⊤ϑ_1 − x̃⊤β and response-varying effects h(y | x) = a_{Bs,M}(⌊y⌋)⊤ϑ(x): no established names
    β: log-OR (F_SL), AH (F_Exp), log-HR (F_MEV); baseline under F_Exp: Λ_Y(y | X = 0)

Ξ = ℝ+: survival analysis
  h(y | x) = log(y) − x̃⊤β: exponential AFT (F_MEV); β: log-OR (F_SL), log-HR (F_MEV)
  h(y | x) = ϑ_1 log(y) + ϑ_2 − x̃⊤β: log-normal AFT (Φ); log-logistic AFT (F_SL); Weibull AFT (F_MEV)
  smooth baseline h(y | x) = a_{Bs,M}(y)⊤ϑ_1 − x̃⊤β: no established names
    β: log-OR (F_SL), AH (F_Exp), log-HR (F_MEV); baseline under F_Exp: Λ_Y(y | X = 0)

Ξ = ℝ: continuous regression and survival analysis
  h(y | x) = ϑ_1 y + ϑ_2 − x̃⊤β: NLRM (Φ), with ϑ_1 determining the variance and the shift the mean
  smooth baseline h(y | x) = a_{Bs,M}(y)⊤ϑ_1 − x̃⊤β: Aalen AH (F_Exp); Cox PH (F_MEV)
    β: log-OR (F_SL), AH (F_Exp), log-HR (F_MEV); baseline under F_Exp: Λ_Y(y | X = 0)
  response-varying effects h(y | x) = a_{Bs,M}(y)⊤ϑ(x): distribution regression (any F_Z); time-varying Cox (F_MEV)

5 Empirical evaluation

We illustrate the range of possible applications of likelihood‐based conditional transformation models in Section 5.1. In Section 5.2, we present a small simulation experiment highlighting the possible advantage of modelling conditional distributions indirectly via transformation functions.

5.1 Illustrations

Density estimation: Old Faithful geyser

The duration of eruptions and the waiting time between eruptions of the Old Faithful geyser in Yellowstone National Park have become a standard benchmark for non‐parametric density estimation. The nine parameters of the transformation model (Φ,aBs,8(waiting),ϑ) were fitted by maximization of the approximate log‐likelihood (treating the waiting times as ‘exact’ observations) under the eight linear constraints D9ϑ>0. The model depicted in Fig. 1A reproduces the classic bimodal unconditional density of the waiting times, shown along with a kernel density estimate. Note that the transformation model was fitted by maximum likelihood, whereas the kernel density estimate relied on a cross‐validated bandwidth. An unconditional density estimate for the duration of the eruptions needs to deal with censoring because exact duration times are only available for the daytime measurements. At night, the observations were left censored (‘short’ eruption), interval censored (‘medium’ eruption) or right censored (‘long’ eruption). This censoring has been widely ignored in analyses of the Old Faithful data because most non‐parametric kernel techniques cannot deal with censoring. We applied the transformation model (Φ,aBs,8(duration),ϑ) based on the exact log‐likelihood function under eight linear constraints and obtained the unconditional density depicted in Fig. 1B. Hothorn (2017b) reports results for M=40, which lead to almost identical estimates of the distribution function.
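A minimal sketch of such an unconditional fit, assuming the waiting times in the geyser data from package MASS (which may differ slightly from the version analysed here and which ignores the censored durations), could look as follows:

    library("mlt")
    data("geyser", package = "MASS")
    w  <- numeric_var("waiting", support = range(geyser$waiting))
    Bw <- Bernstein_basis(w, order = 8, ui = "increasing")    # a_Bs,8(waiting), monotone
    m_w   <- ctm(response = Bw, todistr = "Normal")           # F_Z = Phi
    fit_w <- mlt(m_w, data = geyser)                          # constrained maximum likelihood
    nd   <- data.frame(waiting = 43:108)                      # illustrative grid
    dens <- predict(fit_w, newdata = nd, type = "density")    # estimated unconditional density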

Figure 1. Old Faithful geyser. Estimated density for waiting times between eruptions (A) and duration of eruptions (B) by the most likely transformation model (MLT) and kernel smoothing. Note that the kernel estimator was based on the imputed duration times 2, 3 and 4 for short, medium and long eruptions at night (as are the rugs in B).

Quantile regression: head circumference

The Fourth Dutch Growth Study is a cross‐sectional study on growth and development of the Dutch population younger than 22 years. Stasinopoulos & Rigby (2007) fitted a growth curve to the head circumferences (HCs) of 7040 boys using a generalized additive model for location, scale and shape (GAMLSS) with a Box–Cox t distribution describing the first four moments of HC conditionally on age. The model showed evidence of kurtosis, especially for older boys. We fitted the same growth curves by the conditional transformation model urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0263 by maximization of the approximate log‐likelihood under the 3×4 linear constraints (D4⊗I4)ϑ>0. Figure 2 shows the data overlaid with quantile curves obtained via inversion of the estimated conditional distributions. The figure very closely reproduces the growth curves presented in Fig. 16 of Stasinopoulos & Rigby (2007) and also indicates a certain asymmetry towards older boys.
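A sketch of such a conditional fit, assuming the Dutch boys data db from package gamlss.data and illustrative basis orders (the exact parameterization underlying Figure 2 may differ), is:

    library("mlt")
    data("db", package = "gamlss.data")               # head circumference and age of Dutch boys
    hc  <- numeric_var("head", support = range(db$head))
    age <- numeric_var("age",  support = range(db$age))
    B_hc  <- Bernstein_basis(hc,  order = 3, ui = "increasing")  # monotone in head circumference
    B_age <- Bernstein_basis(age, order = 3)                     # smooth in age
    m_hc   <- ctm(response = B_hc, interacting = B_age, todistr = "Normal")
    fit_hc <- mlt(m_hc, data = db)
    ## quantile curves by inverting the estimated conditional distribution functions
    q <- predict(fit_hc, newdata = data.frame(age = c(0.5, 2, 10, 20)),
                 type = "quantile", prob = c(0.1, 0.5, 0.9))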

Figure 2. Head circumference growth. Observed head circumference and age of 7040 boys with estimated quantile curves for p=0.004, 0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98, 0.996.

Survival analysis: German Breast Cancer Study Group‐2 trial

This prospective, controlled clinical trial on the treatment of node‐positive breast cancer patients was conducted by the German Breast Cancer Study Group. Out of 686 women, 246 received hormonal therapy, whereas the control group of 440 women did not. Additional variables include age, menopausal status, tumour size, tumour grade, number of positive lymph nodes, progesterone receptor and oestrogen receptor. The right‐censored recurrence‐free survival time is the response variable of interest.

The Cox model urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0264 implements the transformation function urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0265, where urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0266 is the log‐cumulative baseline hazard function parameterized by a Bernstein polynomial and urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0267 is the log‐hazard ratio of hormonal therapy. This is the classical Cox model with one treatment parameter β but with fully parameterized baseline transformation function, which was fitted by the exact log‐likelihood under ten linear constraints. The model assumes proportional hazards, an assumption whose appropriateness we wanted to assess using the non‐proportional hazards model urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0268 with the transformation function
urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0269
The function urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0270 is the time‐varying difference of the log‐hazard functions of women without and with hormonal therapy and can be interpreted as the deviation from a constant log‐hazard ratio treatment effect of hormonal therapy. Under the null hypothesis of no treatment effect, we would expect ϑ2 = 0. This monotonic deviation function adds ten linear constraints D11ϑ1+D11ϑ2>0, which also ensure monotonicity of the transformation function for treated patients. We first compared the fitted survivor functions obtained from the model including a time‐varying treatment effect with the Kaplan–Meier estimators in both treatment groups. Figure 3A shows a nicely smoothed version of the survivor functions obtained from this transformation model. Figure 3B shows the time‐varying treatment effect urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0271, together with a 95% confidence band computed from the joint normal distribution of urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0272 over a grid of time points; this approach is much simpler than other methods for inference on time‐varying effects (e.g. Sun et al., 2009). The 95% confidence interval around the log‐hazard ratio urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0273 is also plotted, and because the latter is fully covered by the confidence band for the time‐varying treatment effect, there is no reason to question the treatment effect computed under the proportional hazards assumption.
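For reference, a fully parameterised Cox‐type model of this kind can be sketched as follows, assuming the GBSG2 data from package TH.data and an illustrative order of the Bernstein polynomial; only the treatment indicator is included in the shift term here:

    library("mlt")
    library("survival")
    data("GBSG2", package = "TH.data")
    GBSG2$y <- with(GBSG2, Surv(time, cens))                   # right-censored response
    yv  <- numeric_var("y", support = c(0, max(GBSG2$time)), bounds = c(0, Inf))
    B_y <- Bernstein_basis(yv, order = 10, ui = "increasing")  # log-cumulative baseline hazard
    ## F_Z = minimum extreme value distribution gives a proportional hazards model
    m_cox   <- ctm(response = B_y, shifting = ~ horTh, data = GBSG2,
                   todistr = "MinExtrVal")
    fit_cox <- mlt(m_cox, data = GBSG2)                        # exact censored log-likelihood
    coef(fit_cox)                                              # coefficient of horThyes: log-hazard ratio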
Figure 3. German Breast Cancer Study Group‐2. Estimated survivor functions by the most likely transformation model (MLT) and the Kaplan–Meier (KM) estimator in the two treatment groups (A). Verification of proportional hazards (B): the log‐hazard ratio urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0274 (dashed line) with 95% confidence interval (dark grey) is fully covered by a 95% confidence band for the time‐varying treatment effect (the time‐varying log‐hazard ratio is in light grey; the estimate is the solid line) computed from a non‐proportional hazards model.

In the second step, we allowed an age‐varying treatment effect to be included in the model urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0275. For both treatment groups, we estimated a conditional transformation function of survival time y given age parameterized as the tensor basis of two Bernstein bases. Each of the two basis functions comes with 10×3 linear constraints; therefore, the model was fitted under 60 linear constraints. Figure 4 allows an assessment of the prognostic and predictive properties of age. As the survivor functions were clearly larger for all patients treated with hormones, the positive treatment effect applied to all patients. However, the size of the treatment effect varied greatly. The effect was most pronounced for women younger than 30 years and levelled off a little for older patients. In general, the survival times were longest for women between 40 and 60 years old. Younger women suffered the highest risk; for women older than 60 years, the risk started to increase again. This effect was shifted towards younger women when hormonal treatment was applied.

Figure 4. German Breast Cancer Study Group‐2. Prognostic and predictive effect of age. The contours depict the conditional survivor functions given treatment and age of the patient.

5.2 Simulation experiment

The transformation family includes linear as well as very flexible models. We therefore illustrate the potential gain of modelling a transformation function h by comparing a very simple transformation model with a fully parametric and with a non‐parametric approach, using a data‐generating process introduced by Hothorn et al. (2014).

In the transformation model (Φ,((1,y)⊗(1,x)),ϑ), two explanatory variables x=(x1,x2) influence both the conditional mean and the conditional variance of a normal response Y. Although the transformation function is linear in y with three linear constraints, the mean and variance of Y given x depend on x in a non‐linear way. The choices x1∼U[0,1], x2∼U[−2,2] with ϑ=(0,0,−1,0.5,1,0) lead to the heteroscedastic varying coefficient model
Y | X = x ∼ N(x2(x1+0.5)−1, (x1+0.5)−2),(5)
where the variance of Y ranges between 0.44 and 4 depending on x1. This model can be fitted in the GAMLSS framework under the assumptions that the mean of the normal response depends on a smoothly varying regression coefficient (x1+0.5)−1 for x2 and that the variance is a smooth function of x1. This model is therefore fully parametric. As a non‐parametric counterpart, we used a kernel estimator for estimating the conditional distribution function of Y as a function of the two explanatory variables.
From the transformation model, the GAMLSS and kernel estimators, we obtained estimates of FY|X=x(y) over a grid on y, x1, x2 and computed the mean absolute deviation (MAD) between the true and estimated probabilities, averaged over the grid of y values, for each pair of x1 and x2. Then, the minimum, the median and the maximum of the MAD values over all pairs (x1, x2) were computed as summary statistics. The most likely transformation approach and its two competitors were estimated and evaluated for 100 random samples of size N=200 drawn from model (5). Cross‐validation was used to determine the bandwidths for the kernel‐based estimators (function npcdist() in package np; for details, see Hayfield & Racine, 2008). We fitted the GAMLSS models by boosting; the number of boosting iterations was determined via sample splitting (Mayr et al., 2012). To investigate the stability of the three procedures under non‐informative explanatory variables, we added to the data p=1,…,5 uniformly distributed variables without association to the response and included them as potential explanatory variables in the three models. The case p=0 corresponds to model (5).
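A minimal base R sketch of the data-generating process (5) and of the MAD criterion is given below; the evaluation grid and the placeholder Fhat, which stands for any of the three fitted estimators, are illustrative assumptions.

    set.seed(29)
    N  <- 200
    x1 <- runif(N, min = 0, max = 1)
    x2 <- runif(N, min = -2, max = 2)
    y  <- rnorm(N, mean = x2 / (x1 + 0.5), sd = 1 / (x1 + 0.5))   # one sample from model (5)

    ## true conditional distribution function under model (5)
    pF <- function(y, x1, x2)
        pnorm(y, mean = x2 / (x1 + 0.5), sd = 1 / (x1 + 0.5))

    ## MAD between true and estimated probabilities for one pair (x1, x2);
    ## Fhat(y, x1, x2) is a placeholder for a fitted estimator, ygrid is illustrative
    mad_xy <- function(Fhat, x1, x2, ygrid = seq(-4, 4, length.out = 50))
        mean(abs(pF(ygrid, x1, x2) - Fhat(ygrid, x1, x2)))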

Figure 5 shows the empirical distributions of the minimum, median and maximum MAD for the three competitors. Except for the minimum MAD in the absence of any irrelevant explanatory variables (p=0), the conditional distributions fitted by the transformation models were closer to the true conditional distribution function in terms of the MAD. This result was obtained because the transformation model only had to estimate a simple transformation function, whereas the other two procedures had a difficult time approximating this simple transformation model on another scale. Nevertheless, the comparison illustrates the potential improvement one can achieve when fitting simple models for the transformation function instead of more complex models for the mean (GAMLSS) or the distribution function (Kernel). The kernel estimator led to the largest median MAD values but seemed more robust than GAMLSS with respect to the maximum MAD. These results were remarkably robust in the presence of up to five non‐informative explanatory variables, although the MAD of course increased with the number of non‐informative variables p.

Figure 5. Empirical evaluation. Minimum, median and maximum of the mean absolute deviation (MAD) between true and estimated probabilities for most likely transformation (MLT) models, non‐parametric kernel distribution function estimation (Kernel) and generalized additive models for location, scale and shape (GAMLSS) for 100 random samples. Values on the ordinate can be interpreted as absolute differences of probabilities. The grey horizontal lines correspond to the median of MLT.

6 Discussion

The contribution of a likelihood approach for the general class of conditional transformation models is interesting from both a theoretical and a practical perspective. With the range of simple to very complex transformation functions introduced in Section 4 and illustrated in Section 5, it becomes possible to understand classical parametric, semi‐parametric and non‐parametric models as special cases of the same model class. Thus, analytic comparisons between models of different complexity become possible. The transformation family PY, the corresponding likelihood function and the most likely transformation estimator are easy to understand. This makes the approach appealing also from a teaching perspective. Connections between standard parametric models (e.g. the normal linear model) and potentially complex models for survival or ordinal data can be outlined in very simple notation, placing emphasis on the modelling of (conditional) distributions instead of just (conditional) means. Computationally, the log‐likelihood is linear in the number of observations N and, for contributions of ‘exact continuous’ responses, only requires the evaluation of the derivative h′ of the transformation function h instead of integrals thereof.
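For instance, for an ‘exact continuous’ observation (yi, xi), the log‐likelihood contribution is simply log fZ(h(yi | xi)) + log h′(yi | xi), so that neither the distribution function FZ nor any integral of h has to be evaluated for such observations.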

Based on the general understanding of transformation models outlined in this paper, it will be interesting to study these models outside the strict likelihood world. A mixed transformation model for cluster data (Cai et al., 2002; Zeng et al., 2005; Choi & Huang, 2012) is often based on the transformation function h(y|x,i)=hY(y)+δi+hx(x) with a random intercept (or ‘frailty’ term) δi for the ith observational unit. Conceptually, a more complex deviation from the global model could be formulated as h(y|x,i)=hY(y)+hY(y,i)+hx(x), that is, each observational unit is assigned its own ‘baseline’ transformation hY(y)+hY(y,i), where the second term is a deviation from hY that integrates to zero. For longitudinal data with possibly time‐varying explanatory variables, the model h(y|x(t),t)=hY(y,t)+x(t)β(t) (Ding et al., 2012; Wu & Tian, 2013) can also be understood as a mixed version of a conditional transformation model. The penalized log‐likelihood urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0280 for the linear transformation model urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0281 leads to Ridge‐type or Lasso‐type regularized models, depending on the form of the penalty term. Priors for all model parameters ϑ allow a fully Bayesian treatment of transformation models.

It is possible to relax the assumption that FZ is known. The simultaneous estimation of FZ in the model urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0282 was studied by Horowitz (1996) and later extended by Linton et al. (2008) to non‐linear functions hx with parametric baseline transformation hY and kernel estimates for FZ and hx. For AFT models, Zhang & Davidian (2008) applied smooth approximations for the density fZ in an exact censored likelihood estimation procedure. In a similar set‐up, Huang (2014) proposed a method to jointly estimate the mean function and the error distribution in a generalized linear model. The estimation of FZ is noteworthy in additive models of the form hY+hx because these models assume additivity of the contributions of y and x on the scale of urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0283. If this model assumption seems questionable, one can either allow unknown FZ or move to a transformation model featuring a more complex transformation function.

From this point of view, the distribution function FZ in flexible transformation models is only a computational device that maps the unbounded transformation function h into the unit interval strictly monotonically, making the evaluation of the likelihood easy. Then, FZ has no further meaning or interpretation as an error distribution. A compromise could be the family of distributions urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0284 for ρ>0 (suggested by McLain & Ghosh, 2013) with simultaneous maximum likelihood estimation of ϑ and ρ for additive transformation functions h=hY+hx, as these models are flexible and still relatively easy to interpret.

In light of the empirical results discussed in this paper and the theoretical work of McLain & Ghosh (2013) on a Cox model with log‐cumulative baseline hazard function parameterized in terms of a Bernstein polynomial with increasing order M, one might ask where the boundaries between parametric, semi‐parametric and non‐parametric statistics lie. The question of how the order M affects the results in practice has been raised repeatedly; therefore, we close our discussion by looking at a Cox model with increasing M for the German Breast Cancer Study Group‐2 data. All eight baseline variables were included in the linear predictor, and we fitted the model with orders M=1,…,30,35,40,45,50 of the Bernstein polynomial parameterizing the log‐cumulative baseline hazard function. In Fig. 6A, the log‐cumulative baseline hazard functions start with a linear function (M=1) and quickly approach a function that is essentially a smoothed version of the Nelson‐Aalen‐Breslow estimator plotted in red. In Fig. 6B, the trajectories of the estimated regression coefficients become very similar to the partial likelihood estimates as M increases. For M⩾10, for instance, the results of the ‘semi‐parametric’ and the ‘fully parametric’ Cox models are practically equivalent. An extensive collection of such head‐to‐head comparisons of most likely transformations with their classical counterparts can be found in Hothorn (2017b). Our work for this paper and practical experience with its reference software implementation have convinced us that rethinking classical models in terms of fully parametric transformations is intellectually and practically a fruitful exercise.
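Such a comparison can be sketched along the following lines, again assuming the GBSG2 data from package TH.data; for brevity, only the treatment indicator (rather than all eight baseline variables) enters the linear predictor, and only a few orders M are shown:

    library("mlt")
    library("survival")
    data("GBSG2", package = "TH.data")
    GBSG2$y <- with(GBSG2, Surv(time, cens))
    cf <- sapply(c(1, 5, 10, 20, 30), function(M) {
        yv <- numeric_var("y", support = c(0, max(GBSG2$time)), bounds = c(0, Inf))
        m  <- ctm(response = Bernstein_basis(yv, order = M, ui = "increasing"),
                  shifting = ~ horTh, data = GBSG2, todistr = "MinExtrVal")
        coef(mlt(m, data = GBSG2))["horThyes"]        # coefficient name assumed
    })
    cf                                                # exact likelihood estimates for each M
    coef(coxph(Surv(time, cens) ~ horTh, data = GBSG2))  # partial likelihood estimate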

Figure 6. German Breast Cancer Study Group‐2. Comparison of exact and partial likelihood for order M=1,…,30,35,40,45,50 of the Bernstein polynomial approximating the log‐cumulative baseline hazard function hY. The estimated log‐cumulative baseline hazard functions for varying M are shown in grey, and the Nelson‐Aalen‐Breslow estimator is shown in red (A). The right panel (B) shows the trajectories of the regression coefficients urn:x-wiley:sjos:media:sjos12291:sjos12291-math-0285 obtained for varying M, which are represented as dots. The horizontal lines represent the partial likelihood estimates.

6.1 Computational details

A reference implementation of most likely transformation models is available in the mlt package (Hothorn, 2017a). All data analyses can be reproduced with the dynamic document Hothorn (2017b). Augmented Lagrangian minimization, as implemented in the auglag() function of package alabama (Varadhan, 2015), was used for optimizing the log-likelihood. Package gamboostLSS (version 1.2-2, Hofner et al., 2016) was used to fit the GAMLSS models, and kernel density and distribution estimation was performed using package np (version 0.60-2, Racine & Hayfield, 2014). All computations were performed using R version 3.4.0 (R Core Team, 2017). Additional applications are described in an extended version of this paper (Hothorn et al., 2017).
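The constrained optimisation step can be illustrated with a toy example; the objective below is merely a placeholder standing in for the negative log-likelihood, and the inequality constraint mimics the monotonicity restrictions Dϑ > 0 used throughout:

    library("alabama")
    target <- c(-1, 0, 0.5, 2, 3)                      # placeholder, not an mlt objective
    nll    <- function(theta) sum((theta - target)^2)  # stand-in for the negative log-likelihood
    mono   <- function(theta) diff(theta)              # inequality constraints: diff(theta) > 0
    opt <- auglag(par = seq(0.1, 0.5, by = 0.1), fn = nll, hin = mono)
    opt$par                                            # monotone (increasing) solution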

Acknowledgements

Torsten Hothorn received financial support from the Deutsche Forschungsgemeinschaft under grant number HO 3242/4‐1. We thank Karen A. Brune for improving the language.
