Beta regression for double‐bounded response with correlated high‐dimensional covariates

Continuous responses measured on a standard unit interval (0,1)$$ \left(0,1\right) $$ are ubiquitous in many scientific disciplines. Statistical models built upon a normal error structure do not generally work because they can produce biassed estimates or result in predictions outside either bound. In real‐life applications, data are often high‐dimensional, correlated and consist of a mixture of various data types. Little literature is available to address the unique data challenge. We propose a semiparametric approach to analyse the association between a double‐bounded response and high‐dimensional correlated covariates of mixed types. The proposed method makes full use of all available data through one or several linear combinations of the covariates without losing information from the data. The only assumption we make is that the response variable follows a Beta distribution; no additional assumption is required. The resulting estimators are consistent and efficient. We illustrate the proposed method in simulation studies and demonstrate it in a real‐life data application. The semiparametric approach contributes to the sufficient dimension reduction literature for its novelty in investigating double‐bounded response which is absent in the current literature. This work also provides a new tool for data practitioners to analyse the association between a popular unit interval response and mixed types of high‐dimensional correlated covariates.

regression and is still undergoing development.Relevant development in Beta regression model includes Ospina et al. (2006), Zhou and Huang (2022) and Muhammad Qasim and Kibria (2021).
However, all existing works mentioned above handle bounded response data with low-dimensional covariates which were often assumed to be independent of each other.For example, Zhou and Huang (2022) consider only two independent predictors, a uniform variable and a standard normal variable.Muhammad Qasim and Kibria (2021) introduce a Beta ridge regression estimator to remedy the instability problem when some covariates are linearly correlated.The studied dimensions are low, and all covariates are normally distributed.To the best of our knowledge, little work considers high-dimensional covariates with a mixture of both continuous and discrete variables.In real-life applications, data are often collected with a comprehensive spectrum that results in high-dimensional correlated covariates that consist of a mixture of various data types.It is often the case that one or several linear combinations of the covariates have explained a great majority of the response variables.For example, a measure of IQ score consists of some linear combinations of many measurable factors.In high-dimensional data analyses, dimension reduction is known as one of the predominated tools.Popular data reduction methods such as principal components analysis focus on maximising the variance of the covariates matrix which ignores the correlation between the response and the covariates.Inverse types of regression approaches, for example, sliced inverse regression (SIR) (Li, 1991), sliced average variance estimation (SAVE) (Cook & Weisberg, 1991), to dimension reduction are simple to use, yet both linearity and constant variance assumptions do not hold in general.Neither SIR nor SAVE provide insight on the dispersion parameter which characterises the Beta distribution.Ma and Zhu (2012) proposed a novel approach to dimension reduction problems through a semiparametric estimation that relaxes the linearity and constant variance assumptions.However, the challenge of analysing the response variable with double-bounded support was unaddressed in Ma and Zhu (2012) nor other semiparametric dimension reduction literature, for example, Liu (2022).More comprehensive reviews on dimension reduction literature can be found in Cook (2018) and Ma and Zhu (2013b).Motivated by the crucial role of Beta regression for bounded responses and the limitations of existing literature, we propose a semiparametric sufficient dimension reduction approach for Beta regression model in the presence of high-dimensional covariates.The proposed method is capable of handling highdimensional covariates that consist of a mixture of various data types and may be correlated.The only assumption we make is that the response variable follows a Beta distribution; no additional assumption is required.The resulting coefficient estimates are consistent and efficient.
The roadmap is as follows.In Section 2, we present the proposed model and the semiparametric estimation procedure.We investigate the large sample properties of the proposed method in Section 3. We further evaluate the semiparametric method through numerical studies with comparisons of existing dimension reduction methods in Section 4. In Section 5, we analyse the pathways to desistance data using the proposed method.Finally, we summarise the study in Section 6.

| DIMENSION REDUCTION APPROACH
Define the response variable as Y, Y ð0,1Þ.Let X R nÂp denote the p dimensional covariates with n observations.Following the parameterization in Muhammad Qasim and Kibria (2021), the response relates to the covariates through a Beta distribution with mean μ and precision ϕ, where 0 < μ < 1 and ϕ > 0. The advantage of the parameterization is that it directly reflects the mean and variance of the response.The variance of the response is then VarðYÞ ¼ μð1 À μÞ=ð1 þ ϕÞ which decreases as the precision is increased for fixed μ.For i ¼ 1, …,n, the density of Y i can be written as We consider a dimension reduction model where β is a p Â d matrix with d < p and d is the structural dimension of the reduction model.The goal of dimension reduction is to identify the unique subspace (Cook, 1998) spanned by the column vectors of β with the smallest d.Note that β is not unique as any d Â d full rank matrix generates the same subspace (Cook, 2018).For identification purposes, we adopt the parametrisation (Ma & Zhu, 2013a) that the upper block is a d Â d identity matrix, while the lower ðp À dÞ Â d block is an arbitrary matrix, that is, To estimate the dimension reduction model parameters, it is to estimate β l which has ðp À dÞd free parameters.Model (2) implies that the response variable Y relates to X only through some linear combinations β T X.In particular, we consider a mean model . Following the geometric approach (Bickel et al., 1998;Tsiatis, 2006), we derive the efficient score which is the residual after projecting the score vector onto the nuisance tangent space with respect to β.The nuisance tangent space is defined as the mean squared closure of all nuisance tangent spaces of all parametric submodels.
Therefore, the efficient score for β is where ψðzÞ ¼ d dz logΓðzÞ is the digamma function and μ 0 ðzÞ ¼ d dz μðzÞ.vecl stands for vectorization of the lower ðp À dÞ Â d block, that is, veclðβÞ ¼ ðβ dþ1,1 , …, β p,1 , …, β dþ1,d , …, β p,d Þ T .We solve the estimating equations jointly for β and ϕ.Most dimension reduction methods, for example, (Cook & Weisberg, 1991;Li, 1991), rely on a linearity condition; that is, If the linearity condition holds, then (3) becomes When the linearity condition is violated, we propose to estimate the conditional expectation nonparametrically using a Nadaraya-Watson kernel method, that is, , where h is a bandwidth and K is a multivariate kernel function, K h ðzÞ ¼ Kðz=hÞ=h d .We plug in the nonparametric estimator ÊðX i jβ T x i Þ to (3) and then solve for β.

| ASYMPTOTIC PROPERTIES
We investigate the asymptotic properties of the proposed estimator.We first list some regularity conditions that are often needed for theoretical development.
(C1) The univariate kernel function KðÁÞ is Lipschitz, symmetric and has compact support.It satisfies Here we abuse the notation and use the same K regardless of the dimension of its argument.
(C2) The density functions of X and β T X denoted, respectively, by f X ðxÞ and fðβ T xÞ are bounded from below and above.Each entry in the matrices E XX T jβ T x is locally Lipschitz-continuous and bounded from above as a function of β T x.
(C3) EðXjβ T xÞfðβ T xÞ is mth-order differentiable and their m-th derivatives, as well as fðx T βÞ, are locally Lipschitz-continuous.
We first summarise a useful lemma for the proof of Theorem 1.
Lemma 1.For the Nadaraya-Watson kernel estimator, be the probability density function of β T x.Then by the definition of expectation, we have Therefore, Theorem 1.Under the regularity conditions listed above, when n !∞, the estimator β from (3) satisfies in distribution, where , where a 2 aa T .
Proof.Since β solves Equation (3), then it satisfies where β * is on the line connecting β and β, and we used a Taylor expansion at β to obtain the third equality.The first term in the last equality is of order o p ð1Þ due to Lemma 1.Thus, we have where B is the expectation of the derivative of the efficient score which is its negative second moment since the score has mean zero.

| NUMERICAL EXPERIMENTS
We evaluate the proposed method through simulation studies, with comparisons to several other approaches.After ample experiments, we found that the betareg function in the betareg package is generally not stable or it often fails when there exists a moderate number of covariates.Therefore, we directly maximise the log-likelihood function of the Beta regression model using MaxLik (Henningsen & Toomet, 2011).The resulting estimators serve as initial values for solving (3) and (4).That is, will be the initial values for solving (3) and (4) when d ¼ 1.In all simulation experiments, we implement the initial estimator, SIR estimator (Li, 1991), SAVE estimator (Cook & Weisberg, 1991), linear estimator which assumes the linearity assumption and the semiparametric estimator that EðXjβ T XÞ is estimated nonparametrically.The linear estimator can be viewed as a benchmark.We consider various data-generating processes with small sample sizes and relatively larger sample sizes.In all simulation settings, we set ϕ ¼ 0:8 as the major interest is on β.The precision parameter estimation results are similar, while the semiparametric approach gives a smaller bias.We evaluate and compare the performances of β through various measures.Details are as follows.
We first consider a low structural dimension by setting the mean of the double-bounded response variable to where the true parameter β has dimension 11 Â 1 with values ð1:0, À 0:3,0:25,0:13, À 0:2,0:156,0:38,0:46,0:15,0:12, À 0:34Þ T .The covariates consist of both continuous variables from normal, uniform, Beta and Gamma distributions and discrete variables from binomial, Poisson and geometric distributions.We experiment with a small sample, as small as 50, and a moderate sample of 100, and a relatively large sample of 200.We calculate the mean of the estimators (Mean), bias (Bias), empirical standard error (ESE), asymptotic standard error (ASE), mean squared error (MSE) and 95% coverage rate (CR).We also compare the efficiency (Eff) of the proposed estimator to all other methods.We observe that our proposed method results in smaller bias, better coverage rates and is more efficient compared with the estimators produced by maximising the regression model without dimension reduction, by the SIR method and by the SAVE method.It is worth mentioning that the proposed estimator is as good as the one produced by assuming the linearity condition which is not required in the semiparametric approach.Due to space limits, the results for sample sizes 50 (Table S5) and 100 (Table S6) are moved to the supporting information.We only display the results for n ¼ 200 in the main manuscript; see Table 1.
Secondly, we consider a popular structural dimension by setting d ¼ 2. The mean model we consider is μ parameters are set in a way that β T 1 X can be viewed as an overall average of all covariates, while β T 2 X can be viewed as the contrast.We generate X 1 $ Unifð0,1Þ, X 2 $ Nð0,1Þ, X 3 $ Poisð1Þ, X 4 $ Bernð10:5Þ, X 5 $ UnifðÀ1,0Þ.
Then form , and X 11 ¼ X 2 X 3 so that the covariates are complex in the sense that they contain both continuous and discrete and they are correlated.The results of the parameter estimations are summarised in Table 2.As Table 2 shows, the proposed estimator has coverage rates close to the 95% nominal rate.
To further evaluate the proposed method, we consider a greater structural dimension where the true dimension d is nonreducible.In particular, we consider a complex model When analysing the data, we reduce the data dimension to one for all dimension reduction models while the true structural dimension is 11.We experiment with sample sizes 50 and 100 for 200 times.As Table 3 shows, the nonparametric approach produces the smallest mean residual estimates, as good as that from assuming linearity assumption.Although the SIR and the SAVE methods are straightforward to implement, both methods result in large variations in predicting the outcome response; see Figure 1.Such phenomena are observed in all simulations.

| RISK OF AGGRESSIVE VIOLENT OFFENDING AMONG ADOLESCENT OFFENDERS
The Pathways to Desistance study is a multisite study of serious adolescent offenders as they transition from adolescence into early adulthood.
Participants in the study were adjudicated youth who had been found guilty of committing a serious crime and were between the ages of 14 and 18 at the time of their involvement.The sample for the study was composed of 924 adolescents from Maricopa County, AZ, and from Philadelphia County, PA. Detailed information about the study's design, recruitment and methodology can be found in Mulvey et al. (2004) and Schubert et al. (2004).Respondents were asked to indicate whether they had engaged in a series of 11 offending behaviours that reflected violent offenses (e.g.set fire, killed or shot someone, in a fight, beat someone as part of gang).The risk of committing aggressive violence after 1 year of release from a juvenile detention centre was evaluated along with many other factors.A total of 924 respondents reported a mean risk of 16.44% with a standard deviation of 15.01%.
Previous research (Maschi, 2008;Tyler, 2015) shows that violence exposure (exposure), hard drugs (harddrug) and marijuana usage (marijuana) play an important role in the risk of committing aggressive violent offending among adolescent offenders.Other important factors include peer delinquency (peer), age, gender (male), lifetime exposure to violence as a witness (expwitever), psychopathic tendency (psychopathy), school attachment (schattach), mother-child relationship (mcrelation), moral disengagement (moral), alcohol use (alcohol), cigarettes use (alcohol), spirituality level (Spirituality), proportion of time spent on the streets (pst), socioeconomic status (sociostatus), single-parent family (single; respondents who live in a single-parent household), modified Emotionality Activity, Sociability and Impulsivity inventory (EASI), Global Severity Index (GSI), neighbourhood conditions (neighbourhood), head injury (headinjury) measured by the presence of a head injury which caused unconsciousness or needed medical attention and IQ score assessed by the Wechsler Abbreviated Scale of Intelligence.These variables contain continuous, binary and categorical measurements, and some are highly likely correlated.We apply the Beta dimension reduction model to analyse the association between the risk and these important variables.The resulting precision of the Beta regression is 13.19, and the regression coefficient estimates are reported in This paper studies responses that are measured continuously on a standard unit interval ð0,1Þ which are often encountered in many scientific disciplinary.The covariates that relate to the double-bounded response are a high-dimensional mixture consisting of both continuous and discrete variables and may be correlated.We propose a semiparametric approach to analyse the association between a double-bounded response and high-dimensional correlated covariates of mixed types.The semiparametric approach reduces the data dimension through a geometric approach by deriving the estimating equations.The proposed method makes full use of the available data through one or several linear combinations of the covariates without losing information from the data.The only assumption we make is that the response follows a Beta distribution; no other assumption is required.The estimation procedure is through solving estimating questions, and the resulting estimator is consistent and efficient.
We further investigate the performance of the proposed method via various simulation studies that mimic real-world data-generating processes.
Compared with several other popular dimension reduction methods, the proposed method results in smaller bias, better coverage rates and smallest mean residual estimates and is more efficient without assuming linearity condition or constant variance condition which are often required by other methods.The semiparametric approach proposed in the study contributes to the sufficient dimension reduction literature for its novelty in investigating double-bounded response measures which is absent in the current literature.It also provides a new tool for data practitioners to analyse the association between a popular unit interval response and mixed types of high-dimensional correlated covariates.

ACKNOWLEDGEMENTS
The author is grateful to the editor, an associate editor and two anonymous referees for their constructive comments that helped improve the paper substantially.
Abbreviations: ASE, asymptotic standard error; Bias, bias; CR, 95% coverage rate; Eff, efficiency of the nonparametric estimator with respect to the estimator; ESE, empirical standard error; Mean, mean of the estimators; MSE, mean squared error; SAVE, sliced average variance estimation; SIR, sliced inverse regression.

Table 4
. The mean prediction error is 0.006 which is substantially small, suggesting that the model and estimation are of satisfactory.F I G U R E 1 Comparison of mean prediction errors for n ¼ 50 (left) and n ¼ 100 (right) where the true structural dimension is nonreducible, p ¼ d ¼ 11.T A B L E 3 Structural dimension is nonreducible for 11 covariates.Note: The mean (Mean) and standard deviation (std) of the residuals for 200 simulations.Abbreviations: SAVE, sliced average variance estimation; SIR, sliced inverse regression.
Pathways to desistance study data analysis.