Adaptive predictor‐set linear model: An imputation‐free method for linear regression prediction on data sets with missing values

Linear regression (LR) is widely used in data analysis for continuous outcomes in biomedicine and epidemiology. Despite its popularity, LR is incompatible with missing data, which frequently occur in health sciences. For parameter estimation, this shortcoming is usually resolved by complete‐case analysis or imputation. Both work‐arounds, however, are inadequate for prediction, since they either fail to predict on incomplete records or ignore missingness‐induced reduction in prediction accuracy and rely on (unrealistic) assumptions about the missing mechanism. Here, we derive the adaptive predictor‐set linear model (aps‐lm), capable of making predictions for incomplete data without the need for imputation. It is derived by using a predictor‐selection operation, the Moore–Penrose pseudoinverse, and the reduced QR decomposition. aps‐lm is an LR generalization that inherently handles missing values. It is fitted on a reference data set, where complete predictors and outcome are available, and yields a set of privacy‐preserving parameters. In a second stage, these are shared for making predictions of the outcome on external data sets with missing entries for predictors, without imputation. Moreover, aps‐lm computes prediction errors that account for the pattern of missing values even under extreme missingness. We benchmark aps‐lm in a simulation study: aps‐lm showed greater prediction accuracy and reduced bias compared to popular imputation strategies under a wide range of scenarios, including variation of sample size, goodness of fit, missing value type, and covariance structure. Finally, as a proof‐of‐principle, we apply aps‐lm in the context of epigenetic aging clocks, linear models that predict a person's biological age from epigenetic data with promising clinical applications.


INTRODUCTION
Due to its simplicity and well-known properties, linear regression (LR) is a frequently applied statistical model in biomedical sciences when performing statistical inference and prediction of a continuous outcome given a set of predictors. Even though LR requires strong assumptions such as normally distributed residuals, violations can in practice often be tolerated without compromising its validity, especially at large sample sizes (Schmidt & Finan, 2018). Two different statistical problems have to be differentiated when it comes to LR: On the one hand, there is parameter estimation of the regression coefficients. On the other hand, there is the aim to predict the outcome of the LR from new data of the predictors given the model with estimated parameters. Missing values are a common phenomenon in empirical sciences in the Age of Information (Martínez-Plumed et al., 2021), and can be a threat to the integrity of data analysis in life and health sciences (Desai et al., 2011; Janssen et al., 2009). Refined prediction models for disease risk or treatment response often incorporate a large number of clinical influence or risk variables, and thus make incomplete records likely. In the past, research has focused on how LR can be adapted to tolerate missing values in the context of parameter estimation (Little, 1992). Two commonly applied approaches are complete-case analysis (CCA) and imputation, in which incomplete observations are either removed or completed with imputed values, respectively. In contrast to this, adaptations of LR that can deal with missing values under the prediction paradigm have barely been explored (Hoogland et al., 2020; Mercaldo & Blume, 2020; Sperrin et al., 2020), which is especially problematic because both CCA and imputation are suboptimal solutions in this situation (McCaw et al., 2020): CCA implies that predictions can only be made for complete observations. Incomplete medical records are common, and CCA will result in the inability to draw predictive outcomes for certain patients, an especially heavy loss when the data were collected via invasive or expensive methodologies (Marshall et al., 2002). Moreover, CCA can have unintended consequences: for instance, minority groups are more reluctant to give information and are thus more likely to generate incomplete records. Failing to embrace the missingness in our modern data sets may thus reduce the fairness of our models (Martínez-Plumed et al., 2021). Alternatively, aiming to include incomplete observations, imputation is typically employed prior to calculating predictions (Wood et al., 2015). For this, however, additional assumptions are required, namely, missing completely at random (MCAR) or missing at random (MAR; Li, 2013), which are seldom realistic scenarios in practice. In fact, the missing data mechanism is rarely known and cannot be tested for, with the exception of MCAR (Little, 1988). Imputation may thus introduce sources of bias or increase the computational burden, especially in the case of multiple imputation (Mercaldo & Blume, 2020). Perhaps even more importantly, quantifying the uncertainty of predictions (so-called "pragmatic model performance") poses a challenge in the presence of imputation (Wood et al., 2015): To what extent can we trust a predicted outcome if certain predictors were missing? Finally, many imputation strategies simply fail under extreme missingness, a scenario often neglected in the literature (McCombe et al., 2022).
In this study, we will regard the following setup consisting of two stages: (i) In the parameter estimation stage, one set of researchers measures a complete (reference) data set with p ∈ ℕ predictors and n ∈ ℕ observations, captured in the n × (p + 1) design matrix X_ref (including the intercept) and the outcome vector y_ref ∈ ℝ^n. In this stage, parameter estimation for the LR model is performed. (ii) In the prediction stage, another set of researchers has an (application) data set with an incomplete predictor design matrix X_app and wants to predict the outcome y_app by applying the model of stage (i). Let us also assume that only summary statistics from the reference data set X_ref, y_ref are allowed to be passed from stage (i) to (ii) due to privacy constraints (which is typically the case when medical, genetic, or otherwise identifiable patient data are involved). Here, we present a novel method called adaptive predictor-set linear model (aps-lm) that is able to transfer parameters from the reference (i) to the application (ii) stage in order to make predictions for incomplete observations without having to rely on imputation and without compromising privacy in the reference data set. It is a CPU/storage-efficient analytical solution that does not introduce additional assumptions and was derived by using a predictor-selection matrix operation, the Moore-Penrose pseudoinverse, and the reduced QR decomposition. aps-lm is a generalization of LR that can inherently handle missing values. It is fitted to the reference data set of stage (i) and can be efficiently applied to the external data set of stage (ii) suffering from missing data by taking into account the pattern of missing values for each individual prediction. We derived closed-form expressions for the coefficient estimates, standard errors, confidence and prediction intervals, and give statistical interpretations of the parameters for both unregularized and ℓ2-regularized LR models.
aps-lm has a very compact representation: the required parameters are an upper triangular matrix R ∈ ℝ^((p+1)×(p+1)), the standard ordinary least squares (OLS) estimator θ̂_ols ∈ ℝ^(p+1), and the scalar y^⊤·y ∈ ℝ, which preserve the privacy of the reference data set. To investigate the properties of aps-lm compared to its alternatives, we benchmarked it in a simulation study against two popular imputation methods: mean imputation and Multivariate Imputation by Chained Equations (MICE; van Buuren & Groothuis-Oudshoorn, 2011). As a result, aps-lm showed greater prediction accuracy and reduced bias compared to the tested alternatives under a wide range of sample sizes, goodness-of-fit (GOF) values, missing value types, and covariance structures. Moreover, as a proof-of-principle application example, we explored the benefits of aps-lm in the context of epigenetic aging clocks; that is, statistical models that can estimate a person's biological age from epigenetic data, with promising prospects in the understanding of aging and the assessment of clinical interventions (Bell et al., 2019), as well as more specialized applications such as forensic genetics (Vidaki & Kayser, 2017). Ultimately, we envision aps-lm as a simple and flexible method with a wide range of applications that can smooth the transition of linear models from the controlled research setting to the disarray of the real world, with special interest for applications under privacy restrictions (such as medical data), data suffering from extreme missingness, or settings requiring a thorough prediction error model.
The manuscript is organized as follows: Section 2 introduces the setup, the problem of missing values, and the corresponding notation. In Section 3, our novel aps-lm is developed. Simulations comparing the performance of aps-lm to mean and MICE imputation are given in Section 4. Finally, an application example is shown using data for epigenetic age prediction.

Parameter estimation in standard LR
For n ∈ ℕ observations and p ∈ ℕ predictors with n > p, we define a standard linear model in the following form:

y = X·θ + ε,  ε ~ N(0⃗_n, σ²_ε·I_n),   (1)

where y ∈ ℝ^n is the vector containing the dependent variable, X ∈ ℝ^(n×(p+1)) is a full-rank design matrix with an intercept column such that rank(X) = p + 1, θ ∈ ℝ^(p+1) is the vector of linear coefficients, ε ∈ ℝ^n is the random error vector, 0⃗_n ∈ ℝ^n is the vector whose components are all equal to zero, σ²_ε ∈ ℝ is the variance of the error terms, and I_n ∈ ℝ^(n×n) is the identity matrix of order n. The parameter vector θ is usually estimated by the OLS estimate θ̂_ols ∈ ℝ^(p+1), which can be written for a full-rank design matrix X as

θ̂_ols = X†·y,   (2)

where † denotes the Moore-Penrose pseudoinverse, easily computed from the singular-value decomposition (Shinozaki et al., 1972). We use this expression in our derivation of the aps-lm model (at Equation 6) due to its compactness, rather than the more popular normal equations θ̂_ols = (X^⊤·X)^(−1)·X^⊤·y, where X^⊤ denotes the transpose of X. Note that these two representations are equivalent for full-rank design matrices, since then X† = (X^⊤·X)^(−1)·X^⊤.
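As a quick numerical sanity check (our own sketch, not code from the paper), the pseudoinverse form of the OLS estimator agrees with the normal equations for a full-rank design matrix:

```python
# Sketch: theta_ols = X^+ . y equals (X^T X)^{-1} X^T y for full-rank X.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
# Design matrix with an intercept column plus p random predictors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
theta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ theta_true + 0.1 * rng.normal(size=n)

theta_pinv = np.linalg.pinv(X) @ y              # pseudoinverse form (Equation 2)
theta_norm = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations

assert np.allclose(theta_pinv, theta_norm)
```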

Setup and notation for our problem
We will formulate our problem in two stages: (i) In the first stage, we have the well-known parameter estimation setup of a linear model, where a reference data set with outcome y_ref ∈ ℝ^(n_ref) and design matrix X_ref ∈ ℝ^(n_ref×(p+1)) is available and the aim is to estimate the parameter vector θ ∈ ℝ^(p+1) of the corresponding linear model. We assume no missing values in X_ref and y_ref in stage (i). (ii) In the second stage, we have the prediction setup, where an application data set with design matrix X_app ∈ ℝ^(n_app×(p+1)) is measured and the outcome y_app ∈ ℝ^(n_app) is to be estimated (sometimes the term "predicted" is also used).
Here, n_ref and n_app are the number of observations in the reference and application data set, respectively. We will further assume that the research groups of stages (i) and (ii) are independent from each other and that the reference and application data sets are hosted on different servers with access only for the respective group. Only summary statistics can be freely exchanged between research groups, as the data are considered sensitive and are hence to be protected because of data privacy (Malin et al., 2013). The objective in stage (ii) is to make predictions of the outcome y_app from X_app by employing a linear model without violating privacy restrictions of the individuals registered in the reference data set. In the absence of missing values in X_app, the problem is trivial and can be solved, for example, by computing θ̂_ols = X_ref†·y_ref in stage (i), transferring this privacy-preserving summary statistic between servers, and computing the linear predictions as

ŷ_app = X_app·θ̂_ols.   (3)

However, the above scheme fails if missing values are present in X_app in stage (ii), as the direct computation of ŷ_app is not possible in the presence of missing values. To tackle this issue, one possibility is to pretrain linear models for all combinations of available variables to get missing-value-dependent estimates of θ in stage (i). The downside is that the number of models and parameters to be stored scales unfavorably with p (Hoogland et al., 2020): with p predictors, up to 2^p distinct predictor subsets would have to be covered. Another approach is to train upon demand: For each observation in the application data set in stage (ii), note the missing values and exclude the corresponding variables in the reference data set of stage (i). Then, estimate linear coefficients in stage (i), pass the estimates over to stage (ii), and generate predictions using these sample-customized linear coefficients in stage (ii) (Mercaldo & Blume, 2020). This procedure, however, requires intricate multiparty computation and is thus usually highly impractical. As a result, the most convenient option is to simply perform imputation on X_app and employ the full model as in ŷ_app = X_app,imputed·θ̂_ols (Wood et al., 2015). Imputation has the drawback that it causes high computational costs, especially for multiple imputation, and introduces assumptions, typically MCAR or MAR, which in practice are often violated. To overcome the aforementioned limitations, we propose aps-lm, an extension to linear models which can inherently handle any combination of available predictors.
Note that this operation projects an n × m matrix down onto an n × (m − |Ω|) or an (n − |Ω|) × m matrix. The operation and its properties are formally defined in Suppl. Methods 1.
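A minimal sketch of such a selection operation (our own construction, using 0-based column indices; the paper's formal definition is in Suppl. Methods 1): right-multiplying by an identity matrix whose columns in Ω have been deleted removes exactly those columns.

```python
# Sketch: a column-selection matrix built from the identity.
import numpy as np

def selection_matrix(m, omega):
    """Return an m x (m - |omega|) matrix: identity with columns in omega removed."""
    keep = [j for j in range(m) if j not in omega]
    return np.eye(m)[:, keep]

X = np.arange(12.0).reshape(3, 4)      # a 3 x 4 matrix
S = selection_matrix(4, omega={1, 3})  # drop the 2nd and 4th columns (0-based)
X_reduced = X @ S                      # 3 x 2: only columns 0 and 2 remain

assert X_reduced.shape == (3, 2)
assert np.array_equal(X_reduced, X[:, [0, 2]])
```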

Definition of aps-lm and parameter estimation
Let Ω be a set of missing variables such that Ω ⊂ {2, …, (p + 1)} (the first column of X is reserved for the intercept). We can now describe the linear model in the presence of missing data in the following central equation for our novel aps-lm by combining Equations (1) and (4):

y = X·I^(·,Ω)_(p+1)·θ_Ω + ε_Ω,  ε_Ω ~ N(0⃗_n, σ²_ε,Ω·I_n),   (5)

where θ_Ω ∈ ℝ^(p+1−|Ω|) is the vector of remaining LR coefficients, ε_Ω ∈ ℝ^n is the vector of error terms, and σ²_ε,Ω ∈ ℝ is the variance of the error terms. Note that the new design matrix X·I^(·,Ω)_(p+1) is the original design matrix X but with the columns corresponding to the set Ω removed. The model of Equation (5) can be seen as a projection of the full saturated model of Equation (1) onto the submodel on the set of available predictors. For a later application in stage (ii), Ω will represent the missing variables for an individual as recorded in X_app. Thus, every set Ω corresponds to a different standard OLS linear submodel, with a number of coefficients equal to p + 1 − |Ω| and a different σ²_ε,Ω.
We will first derive an estimate for θ_Ω. For this, we will assume full rank and apply the normal equation in its pseudoinverse form (Equation 2):

θ̂_Ω = (X·I^(·,Ω)_(p+1))†·y.   (6)

This expression, however, depends on X and is thus neither privacy- nor storage-friendly. To solve this issue, we exploit some properties of the pseudoinverse. It has been shown that (A·B)† = B†·A† if any of the following properties applies: (i) A has orthonormal columns, A^⊤·A = I, (ii) B has orthonormal rows, B·B^⊤ = I, (iii) A and B have linearly independent columns and rows, respectively, (iv) B = A^⊤, or (v) B = A† (Greville, 1966). Though none of these conditions necessarily apply for X or I^(·,Ω)_(p+1), we can achieve condition (i) by performing a reduced QR decomposition of X. We can write X as

X = Q·R,   (7)

where Q ∈ ℝ^(n×(p+1)) has the same dimension as X, and R ∈ ℝ^((p+1)×(p+1)) is an upper triangular matrix. This differs from the more popular full QR decomposition, in which Q ∈ ℝ^(n×n) and R ∈ ℝ^(n×(p+1)). Reduced QR decomposition was initially developed in numerical linear algebra to improve implementation efficiency, but here we use it to compress the number of parameters stored in X. Introducing Equation (7) into (6), we get:

θ̂_Ω = (Q·R·I^(·,Ω)_(p+1))†·y = (R·I^(·,Ω)_(p+1))†·Q^⊤·y.   (8)

In the last step, we also used the result that the pseudoinverse of a matrix with orthonormal columns is its transpose. To obtain the final expression for θ̂_Ω, we want to exchange Q^⊤·y for a more interpretable term. We can do so by setting Ω to the empty set ∅:

θ̂_ols = θ̂_∅ = R†·Q^⊤·y.

By multiplying by R from the left on both sides, we get:

R·θ̂_ols = Q^⊤·y.   (9)

Finally, by substituting Equation (9) into (8), we arrive at the final expression for aps-lm coefficient estimates as a function of Ω, R, and θ̂_ols:

θ̂_Ω = (R·I^(·,Ω)_(p+1))†·R·θ̂_ols.   (10)

The matrix (R·I^(·,Ω)_(p+1))†·R transforms θ̂_ols, which corresponds to Ω = ∅, to θ̂_Ω for any set Ω ≠ ∅; θ̂_Ω thus corresponds to θ̂_ols mapped down to the reduced predictor space. The model parameters R and θ̂_ols do not depend on Ω. We are here in stage (i), where X corresponds to X_ref and y to y_ref; R and θ̂_ols can be readily calculated and transferred to stage (ii). As R ∈ ℝ^((p+1)×(p+1)) is upper triangular and θ̂_ols ∈ ℝ^(p+1), the total number of parameters stored in R and θ̂_ols, N_(R,θ̂_ols), is independent of n and equal to:

N_(R,θ̂_ols) = (p + 1)(p + 2)/2 + (p + 1) = (p + 1)(p + 4)/2.   (11)

This setup is satisfactory if n ≫ p, which is the typical case in LR analysis. In addition, neither X nor y can be retrieved from R and θ̂_ols; as a result, the privacy of the participants in the reference data set cannot be compromised. With respect to CPU efficiency, although aps-lm is best represented in its pseudoinverse form (Equation 10), the computation of the pseudoinverse has an unfavorable algorithmic time complexity (e.g., O((p + 1 − |Ω|)²·(p + 1)) with the Golub and Kahan (1965) algorithm) (Hamm & Huang, 2019). For the practical implementation of aps-lm, we can substitute the pseudoinverse with the more efficient sweep operator (Goodnight, 1979; Marshall et al., 2002) (more details in Suppl. Methods 5).
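The derivation above can be checked numerically. The following sketch (our own, with 0-based indices) computes θ̂_Ω from R and θ̂_ols alone via Equation (10) and confirms that it matches an OLS fit on the reduced design matrix:

```python
# Sketch of Equation (10): theta_Omega = (R . I^(.,Omega))^+ . R . theta_ols.
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -1.0, 2.0, 0.0]) + rng.normal(size=n)

# Stage (i): the only parameters that need to be shared are R and theta_ols.
Q, R = np.linalg.qr(X)              # reduced QR: Q is n x (p+1), R is (p+1) x (p+1)
theta_ols = np.linalg.pinv(X) @ y

# Stage (ii): predictors with (0-based) indices 2 and 4 are missing.
omega = {2, 4}
keep = [j for j in range(p + 1) if j not in omega]
S = np.eye(p + 1)[:, keep]          # column-selection matrix

theta_omega = np.linalg.pinv(R @ S) @ R @ theta_ols  # Equation (10): no X, no y

# Check: identical to refitting OLS directly on the reduced design matrix.
theta_direct = np.linalg.pinv(X[:, keep]) @ y
assert np.allclose(theta_omega, theta_direct)
```

The key point of the sketch is that the line computing `theta_omega` touches neither `X` nor `y`, mirroring the privacy argument in the text.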

Making predictions with θ̂_Ω on X_app
Let us now set X := X_ref and y := y_ref.
In stage (i), R and θ̂_ols are calculated and can be freely transferred to stage (ii). In stage (ii), for each observation i ∈ {1, …, n_app}, the set Ω_i of missing values is determined and a personalized set of linear parameters, θ̂_(Ω_i), can be estimated by Equation (10). An individual prediction is then simply computed by:

ŷ_app,i = X_app,i,∀j∈Ω_i^∁ · θ̂_(Ω_i),   (12)

where ŷ_app,i ∈ ℝ is the prediction for individual i and X_app,i,∀j∈Ω_i^∁ contains all nonmissing variables of individual i. This step can be trivially parallelized for improved computational performance.
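A possible implementation of this prediction loop (our own sketch; the function name and the NaN encoding of missing entries are our choices, not the paper's):

```python
# Sketch of stage (ii): per-row missing sets, per-row coefficients, per-row predictions.
import numpy as np

def aps_lm_predict(X_app, R, theta_ols):
    """Predict each row of X_app from its observed entries only (NaN = missing)."""
    m = R.shape[0]
    preds = np.empty(X_app.shape[0])
    for i, row in enumerate(X_app):
        keep = np.where(~np.isnan(row))[0]            # complement of Omega_i
        S = np.eye(m)[:, keep]
        theta_omega = np.linalg.pinv(R @ S) @ R @ theta_ols   # Equation (10)
        preds[i] = row[keep] @ theta_omega                    # Equation (12)
    return preds

rng = np.random.default_rng(2)
X_ref = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
y_ref = X_ref @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=100)
_, R = np.linalg.qr(X_ref)
theta_ols = np.linalg.pinv(X_ref) @ y_ref

X_app = np.array([[1.0, 0.2, np.nan, -0.3],   # one missing predictor
                  [1.0, np.nan, np.nan, 1.1]])  # two missing predictors
print(aps_lm_predict(X_app, R, theta_ols))      # one finite prediction per row
```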

Parameter interpretation
From the parameter matrix R, it is possible to retrieve the sample size n ∈ ℕ, the mean vector μ̂ ∈ ℝ^(p+1), and the covariance matrix Σ̂ ∈ ℝ^((p+1)×(p+1)), defined as

μ̂ = (1/n)·X^⊤·1⃗_n  and  Σ̂ = (1/n)·X^⊤·X − μ̂·μ̂^⊤,

where 1⃗_n ∈ ℝ^n denotes the vector whose components are all equal to 1. μ̂, Σ̂, and n can be easily recovered from the entries in the matrix R (derivation in Suppl. Methods 2.1), using the identity R^⊤·R = X^⊤·X: the (1,1) entry of R^⊤·R equals n, its first row divided by n equals μ̂, and Σ̂ = (1/n)·R^⊤·R − μ̂·μ̂^⊤. The relationship is bijective, meaning that R can also be computed from these same parameters:

R = chol(n·(Σ̂ + μ̂·μ̂^⊤)),

where for a matrix A the notation chol(A) stands for the matrix R from the Cholesky decomposition with R^⊤·R = A.
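The recovery of n, μ̂, and Σ̂ can be illustrated as follows (our own sketch; it relies on R^⊤·R = X^⊤·X and assumes the biased 1/n covariance normalization):

```python
# Sketch: the sample size, mean vector, and covariance matrix hidden in R.
import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(loc=[1.0, -2.0, 0.5], size=(n, p))])
_, R = np.linalg.qr(X)

G = R.T @ R                    # equals X^T X, but computed from R alone
n_rec = G[0, 0]                # (1,1) entry: squared norm of the all-ones column
mu_rec = G[0, :] / n_rec       # first row / n: column means (incl. intercept)
Sigma_rec = G / n_rec - np.outer(mu_rec, mu_rec)   # biased sample covariance

assert np.isclose(n_rec, n)
assert np.allclose(mu_rec, X.mean(axis=0))
assert np.allclose(Sigma_rec, np.cov(X, rowvar=False, bias=True))
```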

aps-lm Standard errors
We can obtain the standard errors for the coefficients in aps-lm following similar steps as in the derivation of the standard errors in the classical linear model (available in Suppl. Methods 3):

ŜE(θ̂_Ω) = sqrt(diag(σ̂²_ε,Ω·((R·I^(·,Ω)_(p+1))^⊤·(R·I^(·,Ω)_(p+1)))^(−1))),

where

σ̂²_ε,Ω = (y^⊤·y − θ̂_Ω^⊤·(R·I^(·,Ω)_(p+1))^⊤·(R·I^(·,Ω)_(p+1))·θ̂_Ω) / (n − (p + 1) + |Ω|).

Here, we see that one new parameter is introduced, y^⊤·y ∈ ℝ. Including y^⊤·y in our set of aps-lm parameters, we end up with a total of (p + 1)(p + 4)/2 + 1 parameters in order to compute linear coefficient estimates and standard errors for all missing value sets Ω (see Equation 11). Standard errors can be used to compute confidence intervals for the mean response as in:

ŷ_app,i ± t^(α/2)_(n−(p+1)+|Ω_i|) · ŜE(ŷ_app,i),

where t^(α/2)_(n−(p+1)+|Ω_i|) is the α/2-quantile of the t-distribution with n − (p + 1) + |Ω_i| degrees of freedom. We can also obtain prediction intervals as in:

ŷ_app,i ± t^(α/2)_(n−(p+1)+|Ω_i|) · sqrt(σ̂²_ε,Ω_i + ŜE(ŷ_app,i)²).

This expression allows the reporting of prediction errors that take Ω_i into account without the need to access the reference data set.
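The error model can be reproduced from the three shared parameters alone. In this sketch (our own; the t-quantile is hardcoded rather than looked up in a table), R, θ̂_ols, and y^⊤·y suffice to recover the submodel residual variance and a 95% prediction interval:

```python
# Sketch: residual variance and a prediction interval without touching X or y.
import numpy as np

rng = np.random.default_rng(4)
n, p = 300, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(size=n)

_, R = np.linalg.qr(X)              # stage (i) parameters: R, theta_ols,
theta_ols = np.linalg.pinv(X) @ y   # and the one extra scalar y^T y
yTy = y @ y

keep = [0, 1, 3]                    # observed variables; Omega = {2}, |Omega| = 1
S = np.eye(p + 1)[:, keep]
RS = R @ S
theta_om = np.linalg.pinv(RS) @ R @ theta_ols        # Equation (10)

dof = n - (p + 1) + 1               # n - (p+1) + |Omega| degrees of freedom
rss = yTy - theta_om @ (RS.T @ RS) @ theta_om        # residual sum of squares
sigma2 = rss / dof                  # estimate of sigma^2_{eps,Omega}

# Cross-check against the directly fitted submodel's residuals.
resid = y - X[:, keep] @ theta_om
assert np.isclose(rss, resid @ resid)

# 95% prediction interval for a new observation (observed variables only).
x_new = np.array([1.0, 0.5, -0.5])
se_pred = np.sqrt(sigma2 * (1.0 + x_new @ np.linalg.solve(RS.T @ RS, x_new)))
t975 = 1.968                        # approx. 0.975 t-quantile for dof = 297
lo, hi = x_new @ theta_om - t975 * se_pred, x_new @ theta_om + t975 * se_pred
assert lo < hi
```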

Adaptive predictor-set ridge regression
Regularization is a statistical methodology that modifies objective functions to include certain constraints, and is especially useful for solving ill-posed optimization problems. Ridge regression (also known as ℓ2-regression or Tikhonov regularization) is a regularized version of LR in which an ℓ2-penalty on the linear coefficients is added to the OLS LR objective function.
Expressing the objective function in its Lagrangian form, we have:

θ̂_ridge = argmin_θ ( ‖y − X·θ‖² + λ·‖I^({1},·)_(p+1)·θ‖² ),

where λ is the regularization parameter (equivalent to a Lagrange multiplier of the ℓ2-norm constraint of θ) and I^({1},·)_(p+1) is included in the regularization term to avoid penalizing the intercept variable. Ridge regression manages to reduce the standard errors of θ̂, but at the expense of introducing bias into the linear coefficient estimates. Conveniently, a closed-form solution is available for this problem (Hoerl & Kennard, 2000). Thus, we can also derive an adaptive predictor-set ridge regression model, aps-ridge. The calculations are similar to aps-lm and given in Suppl. Methods 4. The model parameters, which are derived in stage (i) and communicated to stage (ii), are now R_λ, θ̂_ridge, y^⊤·y, and λ. Note that R_λ^⊤·R_λ = R^⊤·R + λ·(I^({1},·)_(p+1))^⊤·I^({1},·)_(p+1) and R_λ = chol(R_λ^⊤·R_λ), such that it is possible to interchange between the regularized and unregularized form of the R matrix as long as the regularization parameter λ is available. The ridge linear coefficients estimate excluding the set of variables in Ω is given by:

θ̂_ridge,Ω = (R_λ·I^(·,Ω)_(p+1))†·R_λ·θ̂_ridge,

where θ̂_ridge = (R_λ^⊤·R_λ)^(−1)·R^⊤·R·θ̂_ols. This expression is parallel to that of aps-lm (Equation 10), but substituting R and θ̂_ols by R_λ and θ̂_ridge, respectively. Alternatively, θ̂_ridge,Ω can be efficiently calculated with the sweep operator (Suppl. Methods 5). We also derived standard errors for aps-ridge (Suppl. Methods 4); they parallel the aps-lm expressions, with the ordinary degrees of freedom replaced by the effective degrees of freedom df_λ,Ω (Janson et al., 2015), computed with the trace operator Tr from the ridge hat matrix of the submodel on the available predictors. Similar to aps-lm, we approximate confidence intervals for the mean response and prediction intervals by the corresponding aps-lm expressions with the ridge-based standard errors. This is only an approximation, since the distribution of the ridge estimator is unknown (Bae & Kim, 2014). Warnings against the use of confidence intervals in penalized regression are common due to the fact that the estimation of the linear coefficients is biased (Hoerl & Kennard, 2000) (and thus so are the aps-ridge coefficients). However, this does not mean that the predictions are biased. In fact, E[y − X·θ̂_ridge] = E[y − X·θ̂_ols] = 0 (Suppl. Methods 4.1); therefore, our reported confidence and prediction intervals for the response are unbiased.
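A sketch of aps-ridge (our own construction of the intercept-sparing penalty matrix; the reduced-model identity is verified numerically against a direct ridge fit):

```python
# Sketch: R_lambda = chol(R^T R + lambda * I_star), I_star = identity with a
# zero in the intercept position, so the intercept is not penalized.
import numpy as np

rng = np.random.default_rng(5)
n, p, lam = 150, 4, 2.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.3, -0.7, 0.0, 1.2]) + rng.normal(size=n)

_, R = np.linalg.qr(X)
I_star = np.eye(p + 1)
I_star[0, 0] = 0.0                                     # spare the intercept
R_lam = np.linalg.cholesky(R.T @ R + lam * I_star).T   # upper-triangular factor
theta_ridge = np.linalg.solve(R_lam.T @ R_lam, X.T @ y)

keep = [0, 2, 3, 4]                                    # drop predictor 1 (0-based)
S = np.eye(p + 1)[:, keep]
theta_ridge_om = np.linalg.pinv(R_lam @ S) @ R_lam @ theta_ridge

# Agrees with ridge fitted directly on the reduced design with the same lambda.
direct = np.linalg.solve(X[:, keep].T @ X[:, keep] + lam * (S.T @ I_star @ S),
                         X[:, keep].T @ y)
assert np.allclose(theta_ridge_om, direct)
```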

An approximation to adaptive predictor-set generalized linear models
In the study of Marshall et al. (2002), a similar approach to aps-lm is employed in the context of logistic regression. This is only one possibility of extending aps-lm to generalized linear models; other possibilities and related properties have to be investigated in future studies. We will give here only a rough description of their method and compare it to aps-lm. Written in aps-lm notation, the following generalized linear model is proposed:

g(E[y]) = X·I^(·,Ω)_(p+1)·θ_glm,Ω,

where g is a link function. Although g is originally defined on ℝ, we can extend this easily to ℝ^n by applying g to each component (in order to conform to our matrix notation). Marshall et al. (2002) particularly focus on logistic regression. Let θ̂_glm,Ω and θ̂_glm be the maximum likelihood estimators of θ_glm,Ω and θ_glm := θ_glm,∅. We then have:

ŷ_glm,Ω = g^(−1)(X·I^(·,Ω)_(p+1)·θ̂_glm,Ω).

θ̂_glm,Ω is computationally intensive and difficult to calculate. Therefore, Marshall et al. (2002) propose an estimator which is equivalent to the following approximation:

ŷ_glm,Ω ≈ ŷ_glm,∅.   (25)

As can be seen from Equation (25), the prediction based on the missing data set, ŷ_glm,Ω, is replaced by the prediction based on the whole data set, ŷ_glm,∅. In our notation, ŷ_glm,Ω corresponds to the desired stage (ii) prediction, whereas ŷ_glm,∅ refers to the stage (i) prediction of the full model.
To compare with aps-lm, let us set the link function to the identity function. The approximated coefficient estimate then reads

θ̂_glm,Ω ≈ (X·I^(·,Ω)_(p+1))†·X·θ̂_ols,   (26)

to be compared with θ̂_Ω, the aps-lm estimate of the linear coefficient vector. Curiously, in this scenario, the approximation of Marshall et al. (2002) becomes an equality. Employing Equation (7), we get:

(X·I^(·,Ω)_(p+1))†·X·θ̂_ols = (R·I^(·,Ω)_(p+1))†·Q^⊤·Q·R·θ̂_ols = (R·I^(·,Ω)_(p+1))†·R·θ̂_ols = θ̂_Ω.

Thus, with identity link function, Marshall's approximation is equal to aps-lm's θ̂_Ω. For a general link function g, note that Equation (26) needs access to the full stage (i) data. We can also extend Marshall's approximation in a way that protects the privacy of the individuals in the reference data set by expanding with Equation (7):

θ̂_glm,Ω ≈ (R·I^(·,Ω)_(p+1))†·R·θ̂_glm.

This last expression is equivalent to the standard aps-lm coefficient estimate expression (Equation 10), but replacing θ̂_ols by θ̂_glm. However, this approximation will worsen with increasing |Ω|. More research in this area is clearly necessary.

SIMULATION STUDY
To test the benefits of aps-lm over its alternatives, we performed an extensive simulation study, which was set up according to the Aims-Data generating mechanism-Estimands-Methods-Performance measures (ADEMP) guidelines (Morris et al., 2019). Four rounds of simulations were performed with different aims (Table 1): In rounds 1 and 2, we compared our novel aps-lm to (unconditional) mean imputation and to MICE multiple imputation (van Buuren & Groothuis-Oudshoorn, 2011).
For this, we first simulated the reference and application data sets X_ref, y_ref, X_app, and y_app of stages (i) and (ii), assuming that they follow the same linear model and without missing values. To create missing values for stage (ii), we removed entries from X_app. Then, we supposed that the true simulated outcome y = y_app of stage (ii) is unknown and estimated it with three methods: (unconditional) mean imputation, MICE multiple imputation, and aps-lm. For the two imputation methods, the estimated linear model from stage (i) was applied to the imputed data set X_app,imputed of stage (ii). For aps-lm, the parameters R and θ̂_ols were derived in stage (i) from X_ref and y_ref and applied to the unimputed X_app in stage (ii). Application of the estimated linear model from stage (i) to the full application data set without missing values served as a benchmark and gold standard. In round 1, a wide range of conditions was simulated (21 simulations), and in round 2 the influence of the missing value type was examined (six simulations). Round 3 compared the performance of aps-ridge to aps-lm, and finally, in round 4, the simulated coverage of the confidence and prediction intervals was compared to the predefined coverage. To judge the performance of the approaches in rounds 1-3, we compared the estimated ŷ with the true simulated value y. For this, we applied four measures: the squared Pearson correlation ρ²_y,ŷ (for convenience, since meaningful negative correlations are not expected) (rounds 1-2), the bias between y and ŷ (rounds 2-3), the concordance correlation (round 2), and the mean squared prediction error (rounds 2-3). For the ridge models, the regularization parameter λ had to be determined by cross-validation, which was based on the mean squared error (MSE) in the software we applied. Therefore, using the Pearson correlation in the comparison of ridge to other models would be detrimental to those models. We consequently used only the MSE in round 3.
Details on the implementation are available in Suppl. Methods 6. Briefly, we first simulated X_ref and X_app from a multivariate normal distribution by fixing the mean vector μ and covariance matrix Σ for p = 10 predictors and appending an intercept column. Second, to generate y_ref and y_app, we fixed the parameter vector θ, computed X_ref·θ and X_app·θ, and added white noise (ε) whose variance (σ²_ε) was tuned to adjust the GOF of the linear model. Three types of covariance matrices of predictors were used: independent, weakly dependent, and strongly dependent; their corresponding correlation matrices are visualized in Figure S1A-C. For round 2, an additional covariance matrix with ultra-strong dependence was investigated (Figure S1D). Three types of missing value schemes were employed (Rubin, 1976). We provide extensive definitions and descriptions of how the missing value types were implemented in this study in Suppl. Methods 6.1. Briefly, MCAR occurs when the probability of becoming missing does not depend on observed or unobserved variables but solely on a single parameter, the proportion of missing values in X_app, here denoted by π_NA, which we set equal for all variables in our simulations. For MAR, the probability of becoming missing does not depend on the variable in question but on other observed variables and an additional tuning parameter. For our simulations, this relationship is modeled with the use of link functions; four types of link functions are used in this study, all of which are based on the logistic function, such that a higher probability of becoming missing is given to extreme points to the left (L), to the right (R), both sides (LR), or values closer to the center of the distribution (C). Finally, the term missing not at random (MNAR) is used in all other cases, that is, when the probability of a variable becoming missing depends additionally on the value of the variable itself or on unobserved variables. We model this relationship again with the aforementioned link functions.
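The three mechanisms can be illustrated numerically. The parametrization below (a logistic link shifted by a quantile to target the missingness proportion) is our own illustration, not the paper's exact implementation:

```python
# Sketch: MCAR, MAR, and MNAR missingness injected into a variable x.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(6)
x = rng.normal(size=1000)    # the variable receiving missing values
w = rng.normal(size=1000)    # another, fully observed variable
pi_na = 0.3                  # target proportion of missing values

p_mcar = np.full(1000, pi_na)                       # constant probability
p_mar = logistic(w - np.quantile(w, 1 - pi_na))     # depends on observed w only
p_mnar = logistic(x - np.quantile(x, 1 - pi_na))    # depends on x itself (type "R")

x_mcar = np.where(rng.random(1000) < p_mcar, np.nan, x)
x_mar = np.where(rng.random(1000) < p_mar, np.nan, x)
x_mnar = np.where(rng.random(1000) < p_mnar, np.nan, x)

# Under MNAR, large values of x go missing preferentially, so the mean of the
# surviving values is biased low; under MCAR it is not.
print(np.nanmean(x_mcar), np.nanmean(x_mnar))
```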

Round 1: aps-lm under a wide range of conditions
In this round, we aimed to explore a wide simulation parameter space comprising the proportion of missing values (π_NA), linear model GOF (R²), sample size (n), missing value type, and covariance structure of predictors. We focused on the squared Pearson correlation measure in this section for convenience. The results of all 21 simulations are available in Figures 1-3 and S2-S19. By definition, aps-lm tends toward unconditional mean imputation when a diagonal covariance matrix is used. In general, this is apparent in the simulations (Figures S2, S4, S7, S9, S12, S14, and S17); however, at low n and high π_NA or under MNAR, aps-lm outperforms unconditional mean imputation in terms of MSE in spite of independence. The cause is probably that mean vector estimation is less reliable on the relatively small application data set when the sample size is low or many missing values occur, while aps-lm has access to the robust estimates from the reference data set without missing values (stored in the R matrix). When dependence is included in the covariance matrix, aps-lm systematically outperforms both unconditional mean and MICE imputation in the explored parameter space (Figures 1-3). In general, despite the much higher computational cost of MICE, this method performs worse than aps-lm or even unconditional mean imputation. Only at high sample size, low π_NA, and some degree of dependence does MICE perform slightly better than unconditional mean imputation, but never better than aps-lm.

Round 2: The influence of different types of missing values
In this round, to draw comparisons between different missing value schemes, we tested all missing value schemes in parallel on the same simulated data sets instead of injecting missing values into independent data sets (Figures 4 and S20-S21). In addition, we modified θ to see whether any of the observed tendencies were θ-dependent (Figures S22-S24).
In general, MNAR is more detrimental to performance than MCAR or MAR. In particular, the decrease of ρ²_y,ŷ with higher π_NA values is more pronounced for MNAR, and considerable bias is generated, the direction of which is θ- and missing-value-subtype-dependent. Though MICE and mean imputation seem to be very sensitive to MNAR-induced bias, aps-lm manages to quench its influence, especially in the presence of dependence. In this round, we also computed the MSE and the concordance correlation in addition to the Pearson correlation and the bias. These two measures showed very similar results (Figures S25-S27). In addition, we simulated a covariance matrix with ultra-strong dependence (Figure S1D). The differences between the methods are even more pronounced in this case (Figure S28). This round further highlights the potential of aps-lm to tolerate missing values, even under the MNAR mechanism.

Round 3: aps-lm versus aps-ridge
With the simulations in round 3, we aimed to compare the performance of ordinary LR versus ridge regression for our adaptive predictor-set methods. Similarly to round 2, we simulated all missing value scenarios on the same data sets. Because ridge regression is used here, we used the MSE instead of the Pearson correlation as performance measure, in addition to the bias. In our simulations, we see very similar performance in bias and MSE for the full ordinary linear and ridge models, as well as for aps-ridge and aps-lm. Note, however, that we only used 10 predictors; ridge is expected to perform better for higher numbers of variables. An overlay of the results of rounds 2 and 3 can be found in Figures S31-S33. Here, again, one can see that the full models perform best, while aps-lm and aps-ridge give similar results which are superior to mean and MICE imputation.

Round 4: Coverage of confidence and prediction intervals
In this round, we investigated whether the predefined coverage of the confidence and prediction intervals was achieved for our aps-lm model. We compared the coverage against a given 95% coverage level. The simulation procedure needed to be adapted because simulations now had to be made from the submodel of Equation (5). A detailed description can be found in Suppl. Methods 6.6. The simulation was performed for four different sets of missing values Ω (with 0, 1, 3, and 6 missing predictors out of 10), three linear model GOF levels (R² = 0.25, 0.50, 0.95), and three sample sizes (n = 50, 100, 200). Figures 6 and S34-35 show the results of the simulations. The predefined coverage is nicely reached for all the different scenarios. In addition to the GOF of the full model, R², the GOF of the submodel, R²Ω, is also displayed. For no missing values, that is, Ω0, those two of course agree. For Ω1, with only one missing predictor, the value of R²Ω is only slightly reduced. In contrast, one can see massively lower values of R²Ω compared to R² for Ω2 and Ω3. Note that R²Ω2 is smaller than R²Ω3, even though Ω2 contains three missing predictors and Ω3 six. This is due to the fact that predictor seven is included in Ω2 but not in Ω3. Predictor seven is the most relevant influence variable in our simulated model because it has by far the largest absolute regression coefficient (compare Equation (45) in Suppl. Methods).

APPLICATION TO EPIGENETIC AGING CLOCKS
Epigenetics is the branch of molecular biology that aims to understand how genes are regulated (i.e., if DNA is viewed as the hardware of genes, the epigenome can be considered its software). The best-characterized epigenetic mechanism is DNA methylation, which serves as a key signal in genomic regulation. Epigenetic aging clocks are statistical models that can predict the chronological age of healthy individuals from DNA methylation, notorious for an unprecedented accuracy unmet by any other biomarker so far. Namely, these statistical clocks employ specific sites in the human genome whose DNA methylation ratios change monotonically with age in a synchronized fashion across human individuals, for reasons not yet well understood. For example, Horvath's clock (Horvath, 2013) is based on an elastic net regression model and includes the methylation ratios of 353 sites. In an independent data set, it was able to predict age with a mean absolute error of 3.6 years within a wide range of tissues. Interestingly,
when applied to samples from nonhealthy individuals (Down syndrome, Parkinson's and Alzheimer's disease, and certain types of cancer), the predicted age is often higher than the true chronological age (epigenetic age acceleration) (Horvath & Raj, 2018). On the other hand, when applied to centenarians and their offspring, the predicted age is significantly lower than the chronological age (epigenetic age deceleration, i.e., association with extreme longevity) (Horvath et al., 2015). The prospect that epigenetic aging clocks may serve as a window into the aging process has raised strong biomedical interest, since it may open promising avenues toward the reduction of chronic diseases in the world's ever-aging population and a potential relief of the health care burden (Bell et al., 2019). Missing values are common for epigenetic aging clocks (Di Lena et al., 2019), especially in single-cell data (Bell et al., 2019), formalin-fixed and paraffin-embedded (FFPE) samples (Maden et al., 2021), cell-free DNA (cfDNA) from liquid biopsies (Kandimalla et al., 2021), and low-quality and low-quantity trace samples in forensics (Vidaki & Kayser, 2017) or paleoepigenetics (Zhur et al., 2021). The current state of the art is to perform imputation prior to linear model prediction (Di Lena et al., 2021), hence justifying aps-lm's potential in this application.
To assess the applicability of aps-lm in a real-world setting, we made use of the DNA methylation microarray data freely available at the EWAS Data Hub (Xiong et al., 2020), targeting 485,512 cytosines and consisting of 8374 samples from a wide range of tissues and ages. We restricted our analysis to the 353 markers included in Horvath's model and split the data set in a 9:1 ratio into a reference data set Xref, yref (n = 7536) belonging to stage (i) and an application data set Xapp, yapp (n = 838) belonging to stage (ii), where X refers to the methylation values and y to the chronological age. As in the simulations, yapp is treated as unknown and is estimated by the prediction models. Unlike the setup for aps-lm, missing values were now present in both Xref and Xapp, with a proportion of 4.2%. This required imputation for Xref. We tested aps-lm and additionally aps-ridge, since the original Horvath epigenetic aging model was developed by elastic net regression. Thus, we had a total of four scenarios (more details in Suppl. Methods 7): scenario 1: mean imputation on Xref followed by a standard linear model; scenario 2: MICE imputation on Xref followed by a standard linear model; scenario 3: mean imputation on Xref followed by a ridge model; and scenario 4: MICE imputation on Xref followed by a ridge model. For scenarios 3 and 4, the regularization parameter λ was based on 10-fold cross-validation (with the glmnet R-package (Friedman et al., 2010)). For all models, the response variable y (chronological age) was transformed by the piecewise Horvath transformation (Suppl. Methods 7). We now had a similar setup as in the simulations. As the computational costs of MICE for imputation of Xapp in stage (ii) were prohibitive, we solely benchmarked our adaptive predictor-set models against unconditional mean imputation. In the following, we use the term "training configuration" to denote the imputation of Xref in stage (i) (mean or MICE imputation). "Testing configuration" refers to the two tested prediction methods applied on Xapp in stage (ii) (mean imputation or aps-lm/aps-ridge).
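As a point of reference, the unconditional mean imputation testing configuration amounts to nothing more than the following (a minimal sketch with a hypothetical helper name; the column means come from the complete reference data set):

```python
import numpy as np

def mean_impute(X_app, ref_means):
    """Replace every NaN in X_app by the corresponding reference column mean."""
    X = X_app.copy()
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = ref_means[cols]
    return X

X_ref = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # complete reference set
X_app = np.array([[np.nan, 8.0], [9.0, np.nan]])        # incomplete application set
print(mean_impute(X_app, X_ref.mean(axis=0)))
# → [[3. 8.]
#    [9. 4.]]
```

The completed matrix is then fed to the fitted linear (or ridge) model as if it were fully observed, which is precisely why this baseline ignores any missingness-induced loss of accuracy.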
Model performance at the original background level of missing values (4.2%) was roughly similar across training and testing configurations (Figure 7). To test model performance beyond the original proportion of missing values, we artificially injected missing values into Xapp following the MCAR criterion, solely for simplicity. For an extended range of missing value proportions, the training configuration (mean or MICE imputation on Xref) had little influence on ρ²(y, ŷ). On the other hand, the adaptive predictor-set testing configuration was significantly more resilient to missing value injection than the mean imputation testing configuration in terms of squared correlation and bias; in view of our simulation results, this was to be expected because of the known dependence structure in DNA methylation data. Interestingly, L2-regularization had a significantly negative impact on the adaptive predictor-set testing configuration. This is due to a lower boundary on the regularization parameter λ under default settings in the R package glmnet. We used the default threshold unchanged to mimic the way practitioners would interact with glmnet, a widely popular implementation of ridge regression in R.
Because the ratio of the number of observations (n) and the number of predictors (p) is relatively large, with n = 7536 in the reference data set and p = 353, a value of λ equal to or very close to zero would be optimal. The lower boundary for λ is determined by the data and is higher for larger numbers of variables, which is exactly the case for our epigenetic clock example with 353 predictors. The lower boundary enforced a suboptimal value for λ, resulting in higher MSE values than if λ could be chosen freely. For an optimal choice of λ (i.e., very close to 0), the performance of aps-ridge would be virtually indistinguishable from that of aps-lm. The full potential of aps-ridge can only unfold in the case of a low n∕p ratio. Finally, we observe that in this application the adaptive predictor-set testing configuration gives rise to a nonzero bias (though lower than for the imputation testing configuration). There are two sources of bias: minor differences in means between the reference and the application data set, and the use of a nonlinear transformation (i.e., Horvath's piecewise function) on the dependent variable. The latter is relevant because we quantify performance on the original scale.
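That shrinking away from λ ≈ 0 only hurts at a large n/p ratio can be checked directly from the ridge normal equations (a generic sketch under assumed dimensions mimicking the clock example; not the glmnet fit itself):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 2000, 50                     # large n/p ratio, as in the clock application
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(size=n)
X_new = rng.normal(size=(500, p))   # independent application data
y_new = X_new @ beta + rng.normal(size=500)

def ridge_pred_mse(lam):
    """Prediction MSE on held-out data for the closed-form ridge estimator."""
    theta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return np.mean((y_new - X_new @ theta) ** 2)

# With n >> p, a forced lower bound on lambda cannot beat lambda -> 0
print(ridge_pred_mse(0.0) < ridge_pred_mse(1000.0))  # → True
```

This mirrors the observation above: the enforced lower boundary on λ pushes the ridge solution away from the (here near-optimal) OLS solution and inflates the prediction MSE.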

DISCUSSION
LR models are omnipresent in statistics. They are, however, incompatible with missing data, a colossal challenge in modern life and health sciences. CCA and imputation can aid in parameter estimation of LR models, but not in prediction on an independent data set, since they either fail to predict on incomplete records or are unable to model how prediction intervals change as a function of the set of missing predictors Ω. In this study, we derived aps-lm as a solution to this problem. It is a generalization of LR that inherently takes the possibility of missing data into account. aps-lm is applied to a reference data set Xref, yref in stage (i). Its parameters are then shared in stage (ii) and applied to an incomplete application data set Xapp to predict the outcome ŷapp without requiring imputation. In our approach, the full model is adapted to the available predictor sets via Equation (5). This submodel is assumed to be the true underlying model, and the corresponding OLS estimator θΩ is investigated, which depends on the set of missing variables Ω (Equation 6). We have shown that certain parameters, R and θols, can be calculated from the full model of Equation (1) and used to obtain θΩ. The full model is, however, never assumed to be the true underlying model but is only used on a computational basis.
The parameters from the full model are projected down to our submodel in order to obtain a good representation of θΩ. We also developed a version that is applicable to ridge regression, aps-ridge. We have shown that Ω-adapted prediction errors can easily be calculated for aps-lm. Second, aps-lm is CPU-efficient (especially if computed with the sweep operator, Suppl. Methods 5): the parameters of aps-lm are the result of a closed-form solution, and there is no need for numerical optimizers with their corresponding inaccuracies and computational burden. Third, aps-lm is storage-efficient: θΩ can be written as a function of R and θols, with a total of (p + 1) ⋅ (p + 4)∕2 parameters, where p is the number of predictors (Equation 11). This is a substantial compression, since it does not depend on the number of observations n. With the additional parameter y⊤y, we can also compute custom standard errors that incorporate the degree of missingness in a sample (Equation 16). Finally, aps-lm preserves the privacy of the reference data set, since neither Xref nor yref can be retrieved from our parameterization R, θols, and y⊤y. aps-lm and aps-ridge could only be derived by using the normal equations for the OLS estimator (for aps-lm) and the ridge objective function (for aps-ridge). The adaptive predictor-set model cannot trivially be extended to other regression models such as least absolute deviation, total least squares, lasso regression, elastic nets, robust regression, quantile regression, generalized linear models, and Cox regression. Nonetheless, there are possibilities for approximate solutions: as outlined in Section 3.6, a previous study by Marshall et al. (2002) employed a concept somewhat similar to aps-lm to approximate parameter estimation in generalized linear models. A reduction of predictors is also considered by Piironen and Vehtari in their "projection predictive method" (Piironen & Vehtari, 2017).
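The storage argument can be made concrete in a few lines: R and θols alone suffice to recover the submodel estimator for any missing set Ω, because R⊤R = X⊤X and R⊤R θols = X⊤y. The following is a simplified numerical sketch of this idea (intercept handling and the standard-error bookkeeping via y⊤y are omitted):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# Stage (i): fit on the complete reference data; keep only R and theta_ols
Q, R = np.linalg.qr(X)                    # reduced QR: X = QR, so X'X = R'R
theta_ols = np.linalg.solve(R, Q.T @ y)

# Stage (ii): predictors in Omega are missing; S is the available set
Omega = {1, 3}
S = [j for j in range(p) if j not in Omega]
G = R.T @ R                               # equals X'X; G @ theta_ols equals X'y
theta_Omega = np.linalg.solve(G[np.ix_(S, S)], (G @ theta_ols)[S])

# Identical to refitting OLS directly on the observed columns of X
direct = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
print(np.allclose(theta_Omega, direct))  # → True
```

Note that neither X nor y appears in stage (ii): only the compressed, privacy-preserving parameters R and θols cross the boundary between the two stages.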
Their equations for the projection of predictors onto subsets of predictors are similar to the usage of our matrix operators. They use this projection to obtain a set Ω that is as close as possible to the reference model (Ω = ∅) with respect to the Kullback-Leibler divergence.
Our simulations highlighted the power of aps-lm in the presence of correlation between predictors. The fact that our aps-lm model outperforms imputation methods might be surprising at first. The underlying reason is probably the following: although multiple imputation strategies like MICE consider multiple predictors as influence variables in their models, MICE performs predictor imputation one at a time, using only the predictor to be imputed as the outcome in the modeling equations. Our approach, on the other hand, takes the full multivariate distribution of all predictors into account via the R matrix from the QR decomposition of the design matrix X. Another advantage of aps-lm is that information from the whole reference data set can be utilized via the R matrix, which is computed from Xref. So even for independent predictors, aps-lm performs better than imputation methods for a high number of missing variables and a low sample size. The performance of our method does not depend on the missingness mechanism, and it even gives good results for MNAR. For imputed data, MNAR can introduce large bias and can be worse than CCA, because imputation methods like multiple imputation or maximum likelihood usually rely on distributional assumptions about the missingness procedure (Sterne et al., 2009). For aps-lm, we assume the existence of a complete reference data set. By fitting our parameters there, we bypass potential missingness biases occurring at the application stage. Although we attempted to evaluate aps-lm as comprehensively as possible, including extreme missingness in our simulations (a scenario often neglected in the literature), there are some limitations to be acknowledged. First, we only tested aps-lm for a limited number of continuous predictors, covariance structures, and missing value types. Applying aps-lm to more data sets in the future will provide the ultimate evidence for its efficacy. Throughout this study, MICE imputation displayed the least favorable performance in spite of its computationally expensive routines. These results are similar to those obtained in the context of Gaussian mixture models by McCaw et al. (2020), in which an imputation-free method displayed improved performance over MICE imputation. However, it is important to realize that MICE was not developed in the context of prediction, though many researchers use it for this purpose. Thus, no data shown here shall serve as evidence to discredit MICE's potential in the context of parameter estimation, for which it was originally derived. In addition, MICE is a very flexible method with a large number of possible settings, which makes it hard to benchmark fairly. For example, we initially observed that MICE was unable to impute when a high proportion of missing values occurred. This was because, by default, MICE drops collinear variables, which can only be changed by setting the argument remove.collinear to FALSE (which is not described anywhere in the software's documentation as of version "November 19, 2022"). We used the default for the maximum number of iterations (maxit = 5), which may not be sufficient for optimal convergence, and also applied the default predictive mean matching (pmm) method with its known danger of overduplication at a high proportion of missing values. Thus, it is possible that MICE performs more favorably under untested settings.
Furthermore, we applied aps-lm to epigenetic aging clock models as a proof of principle. Since our employed data set contained only 4.2% missing values, a proportion at which the investigated imputation strategies performed almost indistinguishably from aps-lm, it might seem that there is no need for more refined methods such as aps-lm. However, it has to be stressed that the data set employed has been curated and solely includes high-quality data. aps-lm, on the other hand, was derived to obtain predictions even when data quality is suboptimal (e.g., under extreme missingness), which enables applications in biological age prediction in FFPE, cfDNA, single-cell, or even forensic and paleoepigenetic samples.
The assumption of the existence of completely observed reference data in stage (i) is sometimes not satisfied, such that the data of both stage (i) and stage (ii) have missing values. Indeed, we had to tackle this very scenario in our application to epigenetic aging clocks. Nonetheless, having curated, (almost) complete reference data sets is a typical situation in biomedical data analysis. A reference data set is often created especially for model building and thus includes no or only limited missing data. A way around this is to impute missing values in the reference data set, as in our example. Imputation in the reference data set of stage (i) for model estimation is already well addressed in the literature, but not in the case of prediction (e.g., on the application data set of stage (ii)).
In summary, the novelty and advantages of our new method are as follows: (i) aps-lm enables the predictive analysis of data sets with missing values without the need for imputation. (ii) aps-lm can be applied regardless of the missing value distribution, including MNAR. (iii) aps-lm outperformed the investigated imputation approaches. (iv) aps-lm is an analytical method without the need for approximations. (v) aps-lm is computationally more efficient than multiple imputation methods. (vi) aps-lm preserves the privacy of sensitive reference data sets.
Finally, even though we chose for convenience an application whose data were publicly available, we envision our method's true potential for samples under data protection. Researchers who analyze data sets with privacy restrictions can now make their linear models openly available by sharing the easily computable parameters θols and R. Subsequently, these models can readily be applied for any combination of missing variables by other research groups or in clinical applications.

A C K N O W L E D G M E N T S
We thank Professor Michael Krawczak (Kiel University) for his helpful suggestions. We thank two anonymous reviewers and the associate editor for valuable comments for the improvement of the manuscript.

C O N F L I C T O F I N T E R E S T S TAT E M E N T
The authors declare no conflicts of interest.

D ATA AVA I L A B I L I T Y S TAT E M E N T
A reproducibility report that includes R-scripts, raw simulated data, and the aps-lm application data set and models for epigenetic aging prediction is available under an MIT license at GitHub (https://github.com/BenjaminPlanterose/aps-lm). The complete application data set is publicly available at the EWAS Data Hub (Xiong et al., 2020) (https://download.cncb.ac.cn/ewas/datahub/download/).

O P E N R E S E A R C H B A D G E S
This article has earned an Open Data badge for making publicly available the digitally shareable data necessary to reproduce the reported results. The data are available in the Supporting Information section. This article has also earned a "Reproducible Research" badge for making publicly available the code necessary to reproduce the reported results. The results reported in this article could be fully reproduced.

F I G U R E 1 Round 1-Simulation 2: Missing completely at random (MCAR) under weak dependence. Missing values were injected into the application data set following an MCAR scheme with a variable proportion of missing values (πNA). The squared Pearson correlation between the real and predicted dependent variable is plotted at varying sample size (n), R², and πNA for: (i) the estimated linear regression (LR) model of stage (i) applied to the application data set Xapp without missing data (full); applied to Xapp with missing data after performing (ii) unconditional mean (MI) or (iii) MICE imputation (MICE); and (iv) the adaptive predictor-set linear model (aps-lm) without applying any imputation. Simulation results are visualized as jitter plots, while overall tendencies in the mean are represented as LOESS curves.

F I G U R E 2 Round 1-Simulation 8: Missing at random linear regression (MAR-LR) under weak dependence. Missing values were injected into the application data set following a MAR-LR scheme that makes missing values more likely in both the left (L) and the right (R) tails. For further details, see the legend of Figure 1.

F I G U R E 3 Round 1-Simulation 14: Missing not at random (MNAR)-L under weak dependence. Missing values were injected into the application data set following an MNAR-L scheme that makes missing values more likely in the left tail (L). For further details, see the legend of Figure 1.

F I G U R E 4 Round 2-Simulation 2: All missing value types under weak dependence. Missing values were injected into the application data set following all implemented missing value schemes (missing completely at random [MCAR]; missing at random [MAR]-L, R, LR, C; and missing not at random [MNAR]-L, R, LR, C). (a) The squared Pearson correlation between the real and predicted dependent variable or (b) the prediction bias is plotted at varying missing value type, R², and πNA. For further details, see the legend of Figure 1.

F I G U R E 5 Round 3-Simulation 2: All missing value types under weak dependence employing Σ1. Missing values were injected into the application data set following all implemented missing value schemes (missing completely at random [MCAR]; missing at random [MAR]-L, R, LR, C; and missing not at random [MNAR]-L, R, LR, C). (a) The prediction mean squared error between the real and predicted dependent variable or (b) the prediction bias is plotted at varying missing value type, R², and πNA. This is shown for the estimated ordinary least squares and ridge linear regression (LR) models of stage (i) applied to the application data set Xapp without missing data, (i) full (ols) and (ii) full (ridge), respectively, and for (iii) aps-lm and (iv) aps-ridge without applying any imputation. The regularization parameter for full (ridge) and aps-ridge was set equally and obtained via cross-validation on the reference data set based on the mean squared prediction error metric. Note that the mean squared error axis is on a log scale. The aps-ridge and aps-lm curves overlay in panels (a) and (b); the full (ols) and full (ridge) curves overlay in panel (b).

F I G U R E 7 Application of the adaptive predictor-set linear model (aps-lm) to epigenetic aging clocks. The dependent variable consisted of chronological age (in years) on a transformed scale. To test model performance beyond the original proportion of missing values, π0, we artificially injected missing values following the missing completely at random (MCAR) scheme. A total of 10 Monte Carlo replicates were performed for each πNA = π0 + πextra. (a) Legend with color codes. The mean squared prediction error (b) or the prediction bias (c) between the real and predicted dependent variable is plotted for varying πNA, scenarios, or testing configurations (mean imputation or aps-lm/aps-ridge). Results are visualized as jitter plots, while overall tendencies in the mean are represented as LOESS curves. The dashed vertical line represents π0.
TA B L E 1