Inverse problem approach to regularized regression models with application to predicting recovery after stroke

Regression modelling is a powerful statistical tool often used in biomedical and clinical research. It can be formulated as an inverse problem that measures the discrepancy between the target outcome and the data produced by a representation of the modelled predictors. This approach can simultaneously perform variable selection and coefficient estimation. We focus particularly on the linear regression problem Y ∼ N(Xβ, σ²Iₙ), where β ∈ ℝᵖ is the parameter of interest and its components are the regression coefficients. The inverse problem finds an estimate for the parameter β, which is mapped by the linear operator L: β ⟼ Xβ to the observed outcome data Y = Xβ + ε. This problem can be recast as finding a solution in the affine subspace L⁻¹(Y). However, in the presence of collinearity, high-dimensional data or a high condition number of the related covariance matrix, the solution may not be unique, so prior information must be introduced to reduce the subset L⁻¹(Y) and regularize the inverse problem. Informed by Huber's robust statistics framework, we propose an optimal regularizer for the regression problem. We compare results of the proposed method and other penalized regression regularization methods (ridge, lasso, adaptive-lasso and elastic-net) under demanding conditions, such as a high condition number of the covariance matrix and high error amplitude, on both simulated data and real data from the South London Stroke Register. The proposed approach can be extended to mixed regression models. Our inverse problem framework, coupled with robust statistics methodology, offers new insights into statistical regression and learning and could open new research directions for model fitting and learning.

Regression analysis is used not only for fitting relationships between independent and dependent variables, but also to identify causal relationships between predictors and outcomes, contributing to the evidence on links between presumed causes or exposures and an observed effect in terms of the Bradford Hill criteria (Bradford Hill, 1965). Since Fisher's theory was published, there have been several developments in regression analysis, such as parametric and nonparametric regression (Pagan & Ullah, 1999), Bayesian regression (Broemeling, 1985) and regularized regression (Bühlmann & van de Geer, 2011). Nowadays, multivariate regression is an essential tool in clinical trials and epidemiological studies.
Gauss introduced the normal distribution as the error distribution for which ordinary least squares or maximum likelihood (OLS/ML) estimation is optimal (Aitken, 1936; Gauss & Davis, 1857). However, real-world data often fail to satisfy these classical assumptions (Baltagi, 2008), and OLS- or ML-based methods may fail to predict well or to give meaningful interpretations in certain scenarios. To overcome these limitations, many techniques have been developed, such as ridge regression (Hoerl & Kennard, 1970), the lasso (Tibshirani, 1994), the adaptive-lasso (Zou, 2006) and the elastic-net (Zou & Hastie, 2005).
From a mathematical perspective, the regression fitting problem can be formulated as the minimization of an inverse problem criterion that measures the discrepancy between the observed outcome and the data produced by a representation of the modelled predictors. The inverse problem approach can simultaneously perform variable selection and coefficient estimation.
In particular, for the general linear regression problem (McCullagh & Nelder, 1989), we assume that X ∈ M_{n,p}(ℝ) and, given Y ∈ ℝⁿ, we seek the parameter of interest β ∈ ℝᵖ such that E(Y) = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₚXₚ, where X₁, X₂, …, Xₚ, the columns of X, are a set of real-valued integrable random variables. In statistical terms, we say that the observation Y can be explained by the covariates X₁, X₂, …, Xₚ. Since the measurements are always accompanied by measurement errors and, in some situations, jump discontinuities, minimizing the distance between observed and modelled data alone gives unsatisfactory solutions; some prior information is therefore needed in order to regularize the solution.
The minimization problem can be stated as follows:

min_{β ∈ C}  −log( ∏_{i=1}^{n} f(yᵢ | xᵢ, β, σ) ) + λ Ψ(β),   (1)

where:
• λ > 0 is the regularization parameter that adjusts the trade-off between fidelity to the observations and the degree of regularity;
• C is a constraint set encoding the prior knowledge on β.
The fidelity term −log( ∏_{i=1}^{n} f(yᵢ | xᵢ, β, σ) ) can be regarded as a tolerant expression of the constraint equation.
Note that least squares (LS) can be recovered as a particular case of (1): for Gaussian errors the fidelity term reduces to ||Y − Xβ||², and the prior term can be expressed in the integral form Ψ(β) = ∫ ψ(|β(t)|) dt, where ψ: ℝ → ℝ.
Choosing the most adequate prior function for a given inverse problem is not simple, and needs to take into account two important practical considerations: the accuracy (error) of prediction on future data and the meaningful interpretation of the model. To illustrate the error linked to the regression fitting problem, we investigate a simple example below, which shows the impact of imperfect observation on the accuracy of the estimate, with

X = ( 10  7  8  7
       7  5  6  5
       8  6 10  9
       7  5  9 10 ).

We aim to revisit regularized regression modelling using an inverse problem framework, which represents a novel way to advance methodologies for statistical problems. We investigate the statistical inverse problem in the context of regression modelling, explaining the importance of regularization in analytical predictive models. The theoretical analysis is supported by a simulation study evaluating different regularization strategies in low- and high-dimensional settings. Building upon Huber's robust statistics (Huber, 1981) and the inverse problem framework, we characterize and propose a new regularization function for the regression problem. We evaluate the performance of the proposed method and other existing methods (lasso, adaptive-lasso, ridge, elastic-net) using simulations and real data from the South London Stroke Register (SLSR).
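The sensitivity of this example matrix can be reproduced numerically. The following sketch (our own illustration in Python/numpy, not code from the study) solves the system for exact and slightly perturbed observations and shows how the large condition number amplifies the perturbation.

```python
import numpy as np

# The ill-conditioned 4x4 matrix from the example above.
X = np.array([[10., 7., 8., 7.],
              [7., 5., 6., 5.],
              [8., 6., 10., 9.],
              [7., 5., 9., 10.]])

# Right-hand side chosen so the exact solution is beta = (1, 1, 1, 1).
y = X @ np.ones(4)                      # y = [32, 23, 33, 31]
beta = np.linalg.solve(X, y)            # recovers [1, 1, 1, 1]

# A tiny perturbation of the observations (|delta| = 0.1 per entry)...
y_pert = y + np.array([0.1, -0.1, 0.1, -0.1])
beta_pert = np.linalg.solve(X, y_pert)  # [9.2, -12.6, 4.5, -1.1]

print(np.linalg.cond(X))                # condition number ~ 2984
print(beta_pert)                        # ...produces wildly different coefficients
```

A relative change of about 0.3% in y changes the solution by more than an order of magnitude, which is exactly the instability that regularization is meant to control.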

Statistical inverse problem framework in regression context
An inverse problem consists in finding an estimate for β = (β₁, β₂, …, βₚ) which is mapped by an operator L: β ⟼ Xβ to fit the data Y (Vogel, 2002).
The measured data Ȳ approximate the perfect data Y within an error estimate. Referring to Hadamard's definition for inverse problems (Hadamard, 1902), we say that the problem is ill-posed if injectivity, surjectivity and stability cannot be assured.
We will study the influence of the regularity term such that the problem has a unique, stable, statistically meaningful and interpretable solution.
In regression modelling, we suppose that we observe an outcome Y ∈ ℝⁿ and a predictor matrix X ∈ M_{n,p}(ℝ), whose columns X₁, …, Xₚ ∈ ℝⁿ correspond to the predictor variables. Using Hadamard's definition in the context of regression modelling, we will establish surjectivity, then injectivity through the selection of a meaningful solution by introducing prior information, and finally formulate a generalization of the regularized inverse problem.
To formulate a generalization of the regularized regression problem, we use a heuristic analysis based on four cases. The simplest case is when X is bijective (n = p): the problem is analytically well-posed, and the solution of Xβ = Y may verify the Hadamard conditions and be unique, yet still be very sensitive to small perturbations. In particular, when X is ill-conditioned, X⁻¹ amplifies the error, which yields an irregular solution.
However, when X is not surjective (rank(X) < n, e.g. when p < n), the observation Y may not belong to Im(X), that is, Y ∉ Im(X). It is then natural to replace Y by its projection onto Im(X), that is, Ȳ = Proj_{Im(X)}(Y), to re-establish surjectivity.
The problem becomes β̄ = argmin_β ||Ȳ − Xβ||², where β̄ is the unique solution of the normal equation XᵀXβ = XᵀY, called the least-squares solution.
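The projection argument can be checked with a short numpy sketch (our own illustration): the least-squares coefficients solve the normal equations, and the residual Y − Xβ̄ is orthogonal to every column of X, confirming that Xβ̄ is the orthogonal projection of Y onto Im(X).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))            # n = 10 > p = 3: X is not surjective
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=10)   # Y generally not in Im(X)

# Least-squares solution of the normal equations X'X beta = X'Y.
beta_ls = np.linalg.solve(X.T @ X, X.T @ Y)

# X @ beta_ls is the orthogonal projection of Y onto Im(X):
# the residual is orthogonal to every column of X.
residual = Y - X @ beta_ls
print(X.T @ residual)                   # numerically the zero vector
```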
Moreover, when X is not injective (rank(X) < p), we have an infinity of solutions forming an affine variety of dimension p − rank(X); that is, all vectors in the affine subspace L⁻¹{Y} are solutions. We can then select a meaningful solution in the subspace L⁻¹{Y} by introducing a priori information, which involves the choice of some inference principle, such as smoothness of the solution or other restoration criteria, to reduce the subspace L⁻¹{Y} and regularize the inverse problem.
The problem becomes:

min_β ||Y − Xβ||² + λ Ψ(β),   (3)

where λ > 0 and Ψ(β) could be, for example, ||β||₂² (ridge), ||β||₁ (lasso) or ∑ⱼ ŵⱼ|βⱼ| with adaptive weights ŵⱼ. As suggested by Zou (2006), the weights can be built from an initial estimate β̂, which could be the OLS estimates or, in high-dimensional data, an ℓ₂-penalized estimator (adaptive-lasso).
Furthermore, when X is neither injective nor surjective, the combination of these strategies for solving the inverse problem leads to recovering a solution that optimizes a prior function Ψ(β) over the affine subspace L⁻¹{Ȳ}. This can be expressed as:

min_β Ψ(β) subject to Xβ = Ȳ,   (5)

where Ψ: F(ℝ, ℝᵖ) ⟶ ℝ and F(ℝ, ℝᵖ) is the space of functions defined from ℝ to ℝᵖ. A standard and equivalent strategy is to reformulate problem (5) as the minimization of the fidelity term ||Y − Xβ|| plus a prior term Ψ(β); using a Lagrange multiplier, this can be expressed as problem (3). From the inverse problem perspective, we aim to find a regularization function Ψ and a scale λ that optimize problem (3).

Link with the Bayesian framework
The Bayesian approach consists in finding the most probable β given the measured data Y and the forward problem model X. P(β) represents the prior probability of the model object (the coefficients), and P(Y | β) is the measurement model probability (likelihood) of generating a measure Y given an ideal model object β.
The maximum a posteriori (MAP) estimate formalizes the inverse problem as a statistical problem in which the posterior probability P(β | Y) is maximized with respect to β. Taking the logarithm, we see that we have to minimize J(β) = −log P(β) − log P(Y | β). To convert problem (3) into a prior distribution over the expected object, we use a Boltzmann (or Gibbs) distribution of the form P(β) = (1/Z) exp(−λ ∫ ψ(β(t)) dt) (Geman, 1988). This prior model is then combined with the measurement model under a specific noise, where Z and Z′ are normalization constants, also called partition functions. Hence the regularized Bayesian model is expressed as problem (3).
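The correspondence between MAP estimation and problem (3) can be verified numerically in the simplest case of a Gaussian (Gibbs) prior, where the MAP estimate has the well-known ridge closed form. The sketch below (illustrative only; the toy data and the value of λ are our own choices) minimizes the negative log-posterior directly and compares it with the closed-form solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
Y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=30)
lam = 2.0

# Negative log-posterior with Gaussian likelihood and Gaussian prior
# (constants dropped): J(beta) = ||Y - X beta||^2 + lam * ||beta||^2.
J = lambda b: np.sum((Y - X @ b) ** 2) + lam * np.sum(b ** 2)
beta_map = minimize(J, np.zeros(4)).x

# The MAP estimate coincides with the ridge closed-form solution.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ Y)
print(np.allclose(beta_map, beta_ridge, atol=1e-4))
```

Replacing the Gaussian prior by a Laplace prior would give the lasso in the same way, which is the usual Bayesian reading of the penalties in problem (3).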

Proposed method: New hybrid regularization function
We consider problem (5) and define Ψ(β) = ∫_Ω ψ(|β(t)|) dt, where ψ: ℝ → ℝ. To ensure that ∫_Ω ψ(|β(t)|) dt is convex, suitable hypotheses on ψ (in particular its convexity) are needed. Recall the problem: it consists in finding the optimal minimizer of the following functional over a convex domain Ω ⊂ ℝ:

J(β) = ||Y − Xβ||² + λ ∫_Ω ψ(|β(t)|) dt.   (7)

Suppose that J(β) has a minimum point β̂; then the problem satisfies the following proposition.
Proposition 2.1. At a minimum point β̂, the derivative satisfies the first-order optimality condition J′(β̂) = 0. Borrowing an idea from robust statistics, we propose the Huber function as the choice of regularizer ψ, which is optimal and verifies the general hypotheses:

ψ_δ(t) = t²/2 if |t| ≤ δ, and ψ_δ(t) = δ(|t| − δ/2) if |t| > δ.

Its subgradient is given by ψ′_δ(t) = t for |t| ≤ δ and ψ′_δ(t) = δ sign(t) for |t| > δ. The Huber norm is often used as a loss function and has been shown to be robust and competitive against the squared loss. In our study, the Huber function is used as a regularization term. This approach is frequently used in optical tomography (Douiri, Schweiger, Riley, & Arridge, 2005) and image super-resolution (Unger, Pock, Werlberger, & Bischof, 2010), where it has shown its efficiency. The Huber regularizer enjoys the properties of both the ℓ₁ and ℓ₂ norms on the model parameters: it is quadratic near the origin and linear far from the origin, so it penalizes outliers less severely. To this end, δ is specified to ensure a smooth transition from quadratic to linear behaviour. As recommended by Huber, the constant δ, which regulates the amount of robustness, lies in the range between 1 and 2 and is commonly chosen as δ = 1.5. Note that δ could be adaptive or chosen by the L-curve method (Hansen, 2001).
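For concreteness, the parametrization of the Huber function given above and its (sub)gradient can be written as a minimal numpy sketch (the function names are our own):

```python
import numpy as np

def huber(t, delta=1.5):
    """Huber function: quadratic near the origin, linear in the tails."""
    t = np.asarray(t, dtype=float)
    quad = np.abs(t) <= delta
    return np.where(quad, 0.5 * t ** 2, delta * (np.abs(t) - 0.5 * delta))

def huber_grad(t, delta=1.5):
    """(Sub)gradient of the Huber function: identity near zero, capped at +/- delta."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= delta, t, delta * np.sign(t))

print(huber(1.0))        # 0.5 = 1^2 / 2 (quadratic region, like an l2 penalty)
print(huber(10.0))       # 13.875 = 1.5 * (10 - 0.75) (linear region, like l1)
print(huber_grad(10.0))  # 1.5: the slope is capped, so large coefficients
                         # are penalized less severely than under the l2 norm
```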
To solve problem (7) using the proposed regularizer and the other regularization methods cited above, we used a convex optimization methodology implemented with the CVXR package (Fu, Narasimhan, & Boyd, 2017).
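The analyses in this article were carried out in R with CVXR. As an illustrative alternative, the sketch below solves a discrete version of problem (7) in Python with plain gradient descent, which suffices because the Huber-regularized objective is smooth and convex. The function names and the toy data are our own assumptions, not the study's code.

```python
import numpy as np

def huber_grad(t, delta):
    # Derivative of the Huber function: identity near zero, capped at +/- delta.
    return np.where(np.abs(t) <= delta, t, delta * np.sign(t))

def huber_regression(X, Y, lam=1.0, delta=1.5, n_iter=5000):
    """Minimize ||Y - X b||^2 + lam * sum_j huber_delta(b_j) by gradient descent."""
    n, p = X.shape
    # Step size from a Lipschitz bound on the gradient of the objective.
    lr = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2 + lam)
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (Y - X @ b) + lam * huber_grad(b, delta)
        b -= lr * grad
    return b

# Toy data (our own assumption, not the study's data).
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
Y = X @ beta_true + rng.normal(scale=0.5, size=50)
print(huber_regression(X, Y))   # coefficients close to beta_true
```

A general-purpose convex solver such as CVXR (or cvxpy in Python) would handle the nonsmooth penalties (lasso, elastic-net) in the same framework; gradient descent is shown here only because the Huber penalty is differentiable.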

Collinearity, conditioning of covariance matrices and Belsley, Kuh and Welsch test
The Belsley, Kuh and Welsch test (Belsley, Kuh, & Welsch, 1980) is a method that detects data affected by collinearity using the condition number of the related covariance matrix. We denote by η_k = σ_max/σ_k the condition indices of the covariance matrix, where σ₁ ≥ σ₂ ≥ ⋯ ≥ σ_p are its singular values; the largest index, κ = σ_max/σ_min, is the condition number.
In health data, the condition number of the covariance matrix is often larger than 100. This test gives a measure of collinearity and provides the number of associated collinearity relationships.
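A Belsley-style diagnostic can be sketched as follows (our own illustration; the usual rule of thumb flags condition indices above about 30 as evidence of a near-dependency among columns):

```python
import numpy as np

def condition_indices(X):
    """Condition indices eta_k = sigma_max / sigma_k of the design matrix,
    computed after scaling each column to unit length (Belsley-style)."""
    Xs = X / np.linalg.norm(X, axis=0)
    s = np.linalg.svd(Xs, compute_uv=False)   # singular values, descending
    return s[0] / s                            # ascending indices; last is kappa

rng = np.random.default_rng(3)
Z = rng.normal(size=(100, 3))
# A nearly collinear fourth column induces a large condition index.
X = np.column_stack([Z, Z[:, 0] + Z[:, 1] + 1e-3 * rng.normal(size=100)])
print(condition_indices(X))   # largest index far above the usual cutoff of 30
```

Each large condition index corresponds to one near-linear dependency, which is how the test counts the number of collinearity relationships.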

SIMULATIONS
The theoretical framework was applied by simulation to perform model selection and validation in regression models. We investigated five methods for penalized linear regression: ridge, lasso, adaptive-lasso, elastic-net and the proposed method (Huber regularizer). The statistical analysis was performed using the R statistical software. The simulated data consisted of a range of independent data sets in low- (n > p) and high-dimensional (p >> n) settings, under different error amplitudes and a high condition number (strong collinearity). Each data set was divided into training and test sets. The five methods were used to fit optimal models on each training set, and the fitted models were used to assess predictive performance on the corresponding validation sets. To assess accuracy, we computed the mean squared error (MSE) and standard error, and extracted the number of nonzero β̂-coefficients. The procedure was repeated 50 times per example. All predictor variables were continuous and multivariate normally distributed: the predictors were generated by sampling from a multivariate normal distribution N(μ, Σ) with mean vector μ and covariance matrix Σ, setting μ = 0 and Var(x) = 1 for all x. Through a singular value decomposition (SVD) of X, we set the condition number of X. Each predictor variable was assigned a predetermined β-value, giving a vector β of the corresponding values. In all examples, the simulated data set was divided into three partitions, two for the training set and one for the validation (test) set.

Case n > p:
• In example 1, we simulate 50 data sets, each with n = 20, from Y = Xβ + σε, where ε ∼ N(0, 1). We set β = (3, 1.5, 0, 0, 2, 0, 0, 0), σ = 3 and κ(X) = 100 (strong collinearity).
• In example 2, we simulate 50 data sets, each with n = 20, from Y = Xβ + σε, where ε ∼ N(0, 1). We set β = (0.80, 0.80, 0.80, 0.80, 0.80, 0.80, 0.80, 0.80), σ = 3 and κ(X) = 100.
• In example 3, we simulate 50 data sets, each with n = 100 and p = 40, from Y = Xβ + σε, where ε ∼ N(0, 1).
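The SVD-based construction of predictors with a prescribed condition number, used in the examples above, can be sketched as follows (an illustrative scheme; the exact generation procedure of the study may differ in details such as the spacing of the singular values):

```python
import numpy as np

def simulate_design(n, p, cond, rng):
    """Draw X with N(0, 1) entries, then rescale its singular values so that
    the condition number kappa(X) equals `cond` exactly."""
    X = rng.normal(size=(n, p))
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    # Log-spaced spectrum from 1 down to 1/cond.
    s_new = np.logspace(0, -np.log10(cond), p)
    return U @ np.diag(s_new) @ Vt

rng = np.random.default_rng(4)
X = simulate_design(n=20, p=8, cond=100, rng=rng)   # as in examples 1-2
print(np.linalg.cond(X))                            # 100, up to floating point
```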
Each example was considered separately. Ridge, lasso, adaptive-lasso, elastic-net and the proposed method (Huber) were fitted to the same data sets. The regularization parameter λ was set as a grid of values in the range [10⁻¹, 10], and the optimal λ was determined by searching through the grid for the value that optimizes the minimization problem. In examples 1-3, the simulations confirmed that the lasso and adaptive-lasso identified a small number of important predictors. The Huber method achieved high predictive accuracy; when the number of predictors was very large, its accuracy was higher than that of ridge, lasso, adaptive-lasso and elastic-net. This confirmed the superiority of the Huber method, significantly so in the case of strong collinearity with high error amplitude. Note that the proposed method benefits from the sparsity property in some cases, shrinking some coefficients to zero through its ℓ₁-like penalty. Ridge and elastic-net generally improved over the lasso. The lasso performed better than the adaptive-lasso; however, the adaptive-lasso tended to select more variables than the lasso. Finally, in the n > p case, the highest predictive accuracy was observed with the Huber method, followed by ridge regression, whereas elastic-net, lasso and adaptive-lasso incorporated variable selection more than the Huber method and produced more interpretable models. In examples 4-6, which represent the high-dimensional case, ridge regression and the Huber method outperformed all other methods in examples 4 and 6 in terms of prediction accuracy, while the elastic-net performed best in example 5. The lasso and adaptive-lasso tended to select more variables than the other methods. In the high-dimensional case, no procedure statistically outperformed all the others; this is expected, as different procedures suit different settings in practice.

Data and modelling approach
We used stroke data published by Douiri et al. (2017) to evaluate the proposed hybrid regularization method as well as the other published methods mentioned above. The data include 495 patients from the population-based SLSR between August 2002 and October 2004. All patients with cerebral infarctions (ICD-10 code I63) were included. The progression of functional recovery was evaluated using the Barthel Index (BI), with total possible scores ranging from 0 to 20 and lower scores indicating increased disability. BI was measured at weeks 1, 2, 3, 4, 6, 8, 12, 26 and 52 after stroke. A number of candidate predictors were considered in the model variable selection, including demographics (age, sex, ethnicity, premorbid disability, socioeconomic status), stroke characteristics (subtype based on the Oxford classification: lacunar infarcts (LACI), total anterior circulation infarcts (TACI), partial anterior circulation infarcts (PACI), posterior circulation infarcts (POCI) and intracerebral haemorrhagic (ICH) stroke; presence of cerebellar symptoms; baseline impairments) and case-mix variables (Glasgow Coma Score (GCS), National Institutes of Health Stroke Scale (NIHSS)). Potential predictors of recovery were screened for practicality based on their prevalence in the academic literature, resulting in seven candidate prognostic factors; details of these predictors can be found in Douiri, Grace and Sarker (2017). Note that all methods discussed are robust to collinearity. In relation to missing data, and based on the assumption that these were missing at random, we used a multiple imputation method based on Markov chain Monte Carlo (MCMC) (Gilks, Richardson, & Spiegelhalter, 1995). The seven predictors, namely sex (sex), ethnic group (ethgrp), Glasgow Coma Score (glas_cs), stroke subtype (subtype: TACI, PACI, LACI, POCI), patient age (age), first-week BI score (batotw1) and NIHSS score (nihtot), were used to predict BI outcomes at 12, 26 and 52 weeks.
Ridge, lasso, adaptive-lasso, elastic-net and Huber methods were applied to the stroke data. Model fitting and tuning-parameter selection, optimizing the minimization problem, were done on the training data. We then compared the performance of the aforementioned methods by computing their prediction mean squared error. Parameter estimates, average mean squared errors (MSE) and standard deviations (SD) were calculated by bootstrap with 500 replications. As mentioned earlier, the condition number of the covariance matrix for the stroke data (SLSR) was calculated and is higher than 100, which indicates strong collinearity.

[Table: coefficient estimates with standard deviations (SD) for the ridge, lasso, adaptive-lasso, elastic-net and Huber methods.]

Results from these analyses confirmed that the proposed method (Huber) is robust in terms of prediction accuracy. The ridge method performed better than elastic-net, lasso and adaptive-lasso. Compared to the proposed method, the prediction accuracy of both the lasso and adaptive-lasso was clearly affected, as expected, by the high condition number of the related covariance matrix, a problem which the proposed Huber method overcomes better.

DISCUSSION
We have proposed an inverse problem framework for the regression problem. This approach allowed us to characterize the regularization function that optimizes the regression model. A generalized penalty function was constructed, based upon robust statistics methodology, and we proposed a simple but efficient hybrid regularization function. The resulting regularized regression model showed good performance compared to other well-known methods, including ridge, lasso, adaptive-lasso and elastic-net. To the best of our knowledge, this is the first time a method has combined an inverse problem approach with robust statistics methodology to solve the regularized regression problem. In our simulations, we tested the proposed method under different scenarios: varying error amplitudes, strong collinearity affecting the conditioning of the related covariance matrix, and varying numbers of predictors. The proposed method fitted the regression model well and showed good performance and less bias compared to the other methods. This confirms that the inverse problem framework, coupled with robust statistics methodology, helped characterize the optimal regularization and was more robust than the other methods in these simulations. Note that the proposed method can be considered a nonlinear form of the elastic-net, compromising between ridge and lasso.
Using clinical data from a population-based study in London, which reflects a real-world application, the method performed well in constructing regression parameters that fit the observed data with minimal predictive error. The proposed method demonstrated added clinical value in practice and confirms the effectiveness of the inverse problem framework used. In clinical research, the developed predictive model could be applied as a tool for assessing the beneficial effects of evidence-based interventions and for supporting care settings. As a research tool, it could be used to test novel interventions or to identify enriched samples, reducing reliance on expensive and often impractical randomized controlled trials. This predictive enrichment strategy is important for designing future trials, as it enables the enrolment of the most suitable patients and thereby permits the use of a smaller study population. Another potential application could be to derive a set of preliminary cost weights on resource use, which could help build personalized patient care and funding models. The approach could be extended to generalized mixed regression models and high-dimensional reduction, and could also be used in statistical machine learning to optimize loss functions (Poggio & Rosasco, 2016). The method offers all the advantages that the lasso and elastic-net penalties provide for model fitting and variable selection or automated feature extraction, as well as an understanding of the limitations related to the data. As with the lasso, the proposed method also defines a continuous shrinkage operation that can produce coefficients that are numerically very small (so the related predictor can be opted out). The supportive results reported here suggest that the inverse problem approach could be useful in a wide variety of statistical estimation problems.
This approach is not yet well explored in the statistical community, and further study is needed to investigate these potentials. The inverse problem framework could provide a useful tool for clinical and public health research. Our work adds further evidence that penalized regression using an inverse problem approach is robust, with less bias, and can be used with confidence in practice. Furthermore, the proposed method offers an improved alternative to ridge, lasso, adaptive-lasso and elastic-net, which have already been shown to be valuable methods in previous studies.

ACKNOWLEDGEMENTS
AD and CW would like to acknowledge the support and funding from the National Institute for Health Research (NIHR) Collaboration for Leadership in Applied Health Research and Care South London at King's College Hospital NHS Foundation Trust and the Royal College of Physicians, as well as the support from the NIHR Biomedical Research Centre based at Guy's and St Thomas' NHS Foundation Trust and King's College London.
YH gratefully acknowledges the financial support of the International PhD Program for Modelling Complex Systems supported by both, Sorbonne University and IRD.

CONFLICT OF INTEREST
The authors have declared no conflict of interest.

DATA AVAILABILITY STATEMENT
The study and its consent procedure were approved by the ethics committees of Guy's and St Thomas' Hospital Trust, King's College Hospital, Queen's Square and Westminster Hospital. Consent for data sharing was not obtained from study participants. The research team will consider reasonable requests for sharing of anonymized patient-level data.

OPEN RESEARCH BADGES
This article has earned an Open Data badge for making publicly available the digitally-shareable data necessary to reproduce the reported results. The data is available in the Supporting Information section. This article has earned an open data badge "Reproducible Research" for making publicly available the code necessary to reproduce the reported results. The results reported in this article could fully be reproduced.

SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of the article.