Stratified Learning: A General-Purpose Statistical Method for Improved Learning under Covariate Shift

We propose a simple, statistically principled, and theoretically justified method to improve supervised learning when the training set is not representative, a situation known as covariate shift. We build upon a well-established methodology in causal inference, and show that the effects of covariate shift can be reduced or eliminated by conditioning on propensity scores. In practice, this is achieved by fitting learners within strata constructed by partitioning the data based on the estimated propensity scores, leading to approximately balanced covariates and much-improved target prediction. We demonstrate the effectiveness of our general-purpose method on two contemporary research questions in cosmology, outperforming state-of-the-art importance weighting methods. We obtain the best reported AUC (0.958) on the updated "Supernovae photometric classification challenge", and we improve upon existing conditional density estimation of galaxy redshift from Sloan Digital Sky Survey (SDSS) data.


Introduction
In supervised learning, when the labeled training (source) data do not accurately represent the unlabeled (target) data, learning models may not generalize well, leading to unreliable target output predictions. Domain adaptation methods aim to obtain accurate target predictions under such domain shift (Pan and Yang, 2009). Domain adaptation scenarios are widespread, and a variety of methods to tackle them have been proposed. Following Kouw and Loog (2019), they can be organized into three categories: feature-based methods (e.g., Pan et al., 2010); inference-based methods (e.g., Liu and Ziebart, 2014); and sample-based approaches, with a focus on importance weighting (e.g., Shimodaira, 2000; Sugiyama et al., 2008; Cortes et al., 2010), mainly in the covariate shift framework, which is our focus. (Section S2 of the Supplement includes additional background.)

Covariate Shift:
Adapting a model trained on unrepresentative source data to accurately predict target data labels requires information about the target data distribution. We consider covariate shift, a common case of domain shift. Specifically, let X ⊂ R^d, d > 0, be the feature space and Y the label space, with K > 1 classes in classification, or a subset of R^q in multivariate regression with q dependent variables. Different domains are defined as different joint distributions p(x, y) over the same feature-label space X × Y (Kouw and Loog, 2019). We consider a transductive, unsupervised domain adaptation case, where source data S = {(x_i^s, y_i^s)}_{i=1}^{n_s}, with n_s labelled samples from the joint distribution p_s (source domain D_s), are available, as well as target data T = {x_i^t}_{i=1}^{n_t}, with n_t unlabelled samples from the joint distribution p_t (target domain D_t). To avoid the trivial case, we assume that p_s(x, y) ≠ p_t(x, y). For ease of notation, we implicitly condition on an indicator variable S, with p_s(x, y) := p(x, y | S = 1) representing source data (analogously p_t(x, y) := p(x, y | S = 0) for target data). Covariate shift is a particular domain adaptation problem, where the conditional distribution of the output variable given the predictive covariates is the same for source and target data, but the distributions of source and target covariates differ, i.e., p_s(y|x) = p_t(y|x) but p_s(x) ≠ p_t(x) (Moreno-Torres et al., 2012).
Covariate shift commonly occurs if training samples are not selected randomly, but biased in terms of certain covariates. For instance, brighter astronomical objects are more likely to be better observed and therefore included in the training set (Lima et al., 2008; Revsbech et al., 2018). Selection bias has been widely studied in the statistical literature (Heckman, 1979; Little and Rubin, 2019), e.g., when estimating treatment effects via observational studies. The transfer of causal inference techniques to domain adaptation has received more attention in recent years; both fields share the goal of obtaining accurate estimators under distribution shift (Magliacane et al., 2018). We build on this work in the present paper.

Propensity Scores:
The introduction of propensity scores (Rosenbaum and Rubin, 1983, hereafter RR83) was groundbreaking in causal inference for obtaining unbiased treatment effect estimates from confounded, observational data. RR83 define the propensity score as the probability of treatment assignment given the observed covariates. They show that, under certain assumptions, conditional on the propensity scores the treatment and control group have balanced covariates, which allows unbiased treatment effect estimation. Four main methods are used to condition on the propensity scores: inverse probability of treatment weighting (IPTW), using the propensity score for covariate adjustment, matching, and stratification on the propensity scores (RR83; Rosenbaum and Rubin, 1984). Extensive work has been done on best practice and generalization of propensity scores in causal inference, too much to list here, and we refer to Imbens and Rubin (2015) for an overview. Propensity scores have also found wide application in related areas, such as classification with imbalanced classes (Rivera et al., 2014), or fairness-aware machine learning (Calders et al., 2013), among others. In the covariate shift framework, estimated propensity scores are used only implicitly for importance weighting (e.g., Zadrozny, 2004; Kanamori et al., 2009), analogously to IPTW, and for matching to obtain validation data (Chan et al., 2020). There has been no effort, however, to transfer the general methodology.
In the causal inference literature, there is an ongoing debate between the use of weighted estimators and stratification/matching (e.g., Rubin, 2004; Lunceford and Davidian, 2004; Austin and Stuart, 2017). While, under correct specification of the propensity score model, weighting leads to consistent estimation of treatment effects, this might not hold for stratification due to potential residual confounding within strata (Lunceford and Davidian, 2004). However, the bias introduced by stratified estimators is traded for reduced variance compared to weighted estimators (Lunceford and Davidian, 2004). In addition, estimates of the propensity score are less variable than estimates of their reciprocal, or similarly a density ratio (to form weights). A small change in an estimated propensity score that is near zero can lead to a large difference in the computed inverse-propensity weight, causing massive variance in the estimates based on the weights (Lunceford and Davidian, 2004; Austin and Stuart, 2017).
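The instability of inverse-propensity weights near zero, and the robustness of stratum membership to the same perturbation, can be seen in a small numerical sketch (the numbers and cut points below are illustrative, not from the paper):

```python
# A small perturbation of a near-zero estimated propensity score causes a
# large change in the inverse-propensity weight, while a quintile-style
# stratum assignment is unaffected.
e_hat = 0.01             # estimated propensity score near zero
e_perturbed = 0.02       # small absolute change of 0.01

w = 1.0 / e_hat              # inverse-propensity weight: 100.0
w_perturbed = 1.0 / e_perturbed  # 50.0 -- the weight halves

def stratum(e, cutpoints=(0.2, 0.4, 0.6, 0.8)):
    """Assign a propensity score to one of five strata via fixed cut points."""
    return sum(e > c for c in cutpoints) + 1

print(w, w_perturbed)                    # 100.0 50.0
print(stratum(e_hat), stratum(e_perturbed))  # 1 1 -- same stratum
```

Stratification uses the estimated score only through its stratum, so the downstream fit does not inherit this sensitivity.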

Contribution:
We propose stratified learning (StratLearn), a simple and statistically principled framework to improve learning under covariate shift, based on propensity score stratification. We show that, theoretically, conditioning on estimated propensity scores eliminates the effects of covariate shift. In practice, we partition (stratify) the (source and target) data into subgroups based on the estimated propensity scores, improving covariate balance within strata. We show that supervised learning models can then be optimized within source strata without further adjustment for covariate shift, leading to reduced bias in the predictions for each stratum. StratLearn is general-purpose, meaning it is in principle applicable to any supervised regression or classification task under any model. We provide theoretical evidence for the effectiveness of StratLearn, and demonstrate it in a range of low- and high-dimensional applications. We show that the principled transfer of the propensity score methodology from causal inference to the covariate shift framework allows the statistical learning community to employ hard-won practical advice from causal inference, e.g., balance diagnostics, propensity score model assessment/selection, etc. (e.g., Rosenbaum and Rubin, 1984; Imai and van Dyk, 2004; Austin et al., 2007; Pirracchio et al., 2014). We stratify to condition on propensity scores instead of using importance weighting to avoid the massive variance associated with the latter. StratLearn (stratification) does not use individually estimated propensity score values except to form strata, leading to a more robust method (Rubin, 2004), as demonstrated in our numerical studies.
This article is organized as follows. In Section 2, we formally introduce the target risk minimization task, and summarize related literature on covariate shift, particularly on importance weighting methods. We conclude Section 2 with an overview of propensity scores in causal inference, before developing our new methodology in Section 3. In Section 4, we numerically evaluate our method. In Section 4.2, we apply StratLearn to Type Ia supernovae (SNIa) classification, which has attracted broad interest in recent years (Kessler et al., 2010; Boone, 2019). Improving upon Revsbech et al. (2018), StratLearn obtains the best reported AUC (0.958) on the updated "Supernovae photometric classification challenge" (SPCC) (Kessler et al., 2010). In Section 4.3, we improve upon non-parametric full conditional density estimation of galaxy photometric redshift (i.e., photo-z) (Izbicki et al., 2017), a key quantity in cosmology. Supplemental Materials (numbers appearing with a prepended 'S') provide details complementing the numerical results in Section 4 (Sections S5 and S8), additional numerical evidence using data from the UCI repository (Dua and Graff, 2017) in Section S7, and a variation of the SPCC data (Kessler et al., 2010) in Section S6. For additional background, we provide a bibliographic note (Section S2), details on related methods (Section S3), data and software used in this paper (Section S4), as well as complementary details of our proposed methodology StratLearn (Section S1). StratLearn is computationally efficient, easy to implement, and readily adaptable to various applications. Our investigations show that StratLearn is competitive with state-of-the-art importance weighting methods in lower dimensions and greatly beneficial for higher-dimensional applications.

Target Risk Minimization:
In a supervised learning task, let f : X → R^q be the training function, and ℓ : R^q × Y → [0, ∞) be the loss function comparing the output of f with the true outcome Y. This describes a general multivariate regression case; in a probabilistic classification task with K classes we usually have f : X → [0, 1]^K. The risk function associated with our supervised learning task is R(f) := E[ℓ(f(X), Y)]. We cannot generally compute R(f), since the exact joint distribution p(x, y) is unknown. However, an approximation of the risk can be obtained by computing the empirical risk, averaging the loss on the training sample. The objective is to minimize the target risk R_t(f) := E_{p_t(x,y)}[ℓ(f(X), Y)], via the labelled source data S and unlabelled target data T. Our task is to train a model function f that minimizes R_t(f), being able to compute only the source loss ℓ(f(x_s), y_s), but not the target loss ℓ(f(x_t), y_t). Section 2.2 reviews importance weighting methods to minimize the target risk under covariate shift.
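The gap between the empirical source risk and the target risk under covariate shift can be illustrated with a toy simulation (data, model, and loss below are made up for illustration; the shared conditional p(y | x) is what makes this covariate shift rather than general domain shift):

```python
# Toy illustration: a linear model fit by empirical risk minimization on
# source data alone underestimates the target risk when p_s(x) != p_t(x),
# even though p(y | x) is identical in both domains.
import numpy as np

rng = np.random.default_rng(3)

def label(x):
    # same conditional distribution p(y | x) in both domains
    return np.sin(x) + rng.normal(0.0, 0.1, size=x.shape)

x_s = rng.uniform(-3.0, 0.0, 500)   # source covariates cover [-3, 0]
y_s = label(x_s)
x_t = rng.uniform(-1.0, 3.0, 500)   # target covariates shifted to [-1, 3]
y_t = label(x_t)

coef = np.polyfit(x_s, y_s, deg=1)  # fit on source only (squared-error loss)
risk_s = np.mean((np.polyval(coef, x_s) - y_s) ** 2)  # empirical source risk
risk_t = np.mean((np.polyval(coef, x_t) - y_t) ** 2)  # target risk (much larger)
```

Here `risk_t` greatly exceeds `risk_s` because the fit is only accurate where the source density is high.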

Related Literature -Importance Weighting:
In an influential work, Shimodaira (2000) proposes a maximum weighted likelihood estimation (MWLE) and shows that this MWLE converges in probability to the minimizer of the target risk. Following Shimodaira (2000), assuming that the support of p_t(x) is contained in p_s(x), the expected loss (risk) w.r.t. D_t equals that w.r.t. D_s with weights w(x) := p_t(x)/p_s(x) for the loss incurred by each x:

R_t(f) = E_{p_s(x,y)}[w(X) ℓ(f(X), Y)]. (1)

In short, the target risk can be minimized by weighting the source domain loss by the ratio of the densities of target and source domain features. The importance weights w(x) are paramount in the covariate shift literature, and several approaches optimize the estimation of the weights. One approach estimates the densities p_t(x) and p_s(x) separately (Shimodaira, 2000), e.g., through kernel density estimators (Sugiyama and Müller, 2005). Others estimate the density ratio directly, e.g., via Kernel-Mean-Matching (Huang et al., 2007), Kullback-Leibler importance estimation (KLIEP) (Sugiyama et al., 2008), and variations of unconstrained least-squares importance fitting (uLSIF) (Kanamori et al., 2009). Given w(x), Sugiyama et al. (2007) propose importance weighted cross-validation (IWCV) and show that in theory this can deliver an almost unbiased estimate of the target risk. Zadrozny (2004) links covariate shift with selection bias, and shows that the target risk can be minimised by importance sampling of source domain data, employing the inverse probability of source set assignment as importance weights. This allows any probabilistic classifier to be used to obtain the weights, e.g., logistic regression (Bickel and Scheffer, 2007).
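The classifier-based route to the weights can be sketched as follows (a minimal sketch in the spirit of Zadrozny, 2004, and Bickel and Scheffer, 2007; the simulated data and the logistic-regression choice are illustrative assumptions, not the paper's setup):

```python
# Sketch of classifier-based importance-weight estimation: fit a logistic
# regression to discriminate source from target covariates, then derive
# weights from the predicted source-membership probability via Bayes' rule.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_source = rng.normal(0.0, 1.0, size=(500, 2))   # draws from p_s(x)
X_target = rng.normal(0.5, 1.0, size=(500, 2))   # draws from p_t(x), shifted mean

X = np.vstack([X_source, X_target])
s = np.concatenate([np.ones(500), np.zeros(500)])  # 1 = source, 0 = target

clf = LogisticRegression().fit(X, s)
p_source = clf.predict_proba(X_source)[:, 1]     # P(S = 1 | x) on source points

# By Bayes' rule, p_t(x)/p_s(x) is proportional to (1 - P(S=1|x)) / P(S=1|x).
w = (1.0 - p_source) / p_source
w *= len(w) / w.sum()   # normalize to mean 1 over the source sample
```

Note that source points lying deep in the target region receive the largest weights, which is exactly where the variance concerns discussed next arise.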
Although importance weighting in theory enables minimization of the target risk, there are challenges. Based on a measure of domain dissimilarity (e.g., Rényi divergence), Cortes et al. (2010) show that weighting leads to high generalization upper error bounds, making predictions unreliable, especially with large importance weights. In addition, Reddi et al. (2015) point out that while weighting can reduce bias, it can also greatly increase variance. Unfortunately, with increasing feature space dimension, the variance of the importance-weighted empirical risk estimates may increase sharply (Izbicki et al., 2014; Stojanov et al., 2019). This can be partly tackled by dimensionality reduction methods (Stojanov et al., 2019); see Kouw and Loog (2019) for a detailed discussion. We address these variance concerns via propensity scores.

Related Literature -Propensity scores in causal inference:
Given a set of observed covariates X and a binary indicator Z for treatment assignment (treatment vs control), RR83 introduce the propensity score as

e(x) := P(Z = 1 | X = x), (2)

and define treatment allocation Z as strongly ignorable if

(i) (Y_1, Y_0) ⊥ Z | X, and (ii) 0 < e(X) < 1. (3)

Condition (i) means that treatment assignment Z is conditionally independent of the potential outcomes (Y_1, Y_0), given the observed covariates. The potential outcomes are the possible outcomes for an object, depending on its treatment status, and at most one is observed (e.g., for a treated object the observed outcome is Y = Y_1). In practice, condition (i) means that no confounders (covariates that are associated with the treatment and outcome) are unmeasured. RR83 show that if (3) holds, the propensity score is a balancing score. That is, given the propensity score, the distributions of the covariates in treatment and control are the same, i.e., p(X | e(X), Z = 1) = p(X | e(X), Z = 0). Thus, conditional on the propensity score, unbiased average treatment effect estimates can be obtained, i.e., E[Y_1 | e(X), Z = 1] − E[Y_0 | e(X), Z = 0] = E[Y_1 − Y_0 | e(X)]. In practice, conditioning on the estimated (rather than true) propensity score can achieve better empirical balance, as this corrects for statistical fluctuations in the sample as well (RR83; Hirano et al., 2003). Below, we show how the propensity score methodology can be transferred to the covariate shift framework, for target risk minimization in supervised learning tasks.
With Proposition 1, we extend the basic causal inference theory to use propensity scores in the covariate shift framework. Conditioning on estimated propensity scores enables statistically principled minimization of the target risk based on source data. According to Proposition 1, if we were to condition on any single value of the propensity score, the distributions of X and Y in the source and target domains would be identical, and we could minimize the target risk using the source data alone. Because sample sizes with identical propensity scores are too small in practice for model fitting, we employ an approximation: StratLearn partitions the data into K strata of similar estimated propensity scores, where S_k indicates conditioning on assignment to the k-th source stratum (analogously T_k for the target). It follows that, for k ∈ {1, . . ., K}, the target risk within stratum k can be approximated by the source risk within stratum k. Thus, we can minimize the target risk within strata by minimizing the source risk within strata. In this way, we reduce the covariate shift problem to non-overlapping subgroups where the source and target domains are approximately the same, which in principle allows us to fit any supervised learner to S_k to predict the target objects in T_k. Figure 1 presents a flow chart illustrating the steps of our proposed StratLearn methodology.

StratLearn -Technical Details
In general, any probabilistic classifier could be used to estimate propensity scores (e.g., Pirracchio et al., 2014). Logistic regression is commonly used in causal inference, and we adopt it for the applications in this paper. In practice, the covariate shift assumption, p_s(y|x) = p_t(y|x), requires there be no unobserved confounding covariates. To meet this requirement, we include all potential confounders as main effects in the propensity score model. We use K = 5 strata, based on empirical evidence provided by Cochran (1968) showing that sub-grouping into five strata is enough to remove at least 90% of the bias for many continuous distributions (RR83). Given the stratified data, we fit a model f_k to source data S_k, and predict the respective target samples in T_k, for k ∈ {1, . . ., K}. Model hyperparameters for f_k can be selected through empirical risk minimization on source data S_k, for instance through cross-validation on S_k. The model functions f_k are trained independently and can be computed in parallel to reduce computational time. If the source distribution does not cover the target distribution well enough, some of the strata may contain too little source data to reliably train the model. In this case, we add source data from one or more adjacent strata to avoid highly variable predictions. There is a bias-variance trade-off here, in that this reduction in variance requires a relaxation of the approximation in (8), which inevitably increases bias somewhat. Although a general and precise criterion for combining the strata is elusive (more complex models require more data, and data sets of the same size may be more or less informative for the same model), we illustrate the combination of source strata in Section 4.2 (and Section S7), where one or more source strata have insufficient data.
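An end-to-end sketch of the procedure just described, on simulated data, may help fix ideas (the data, models, and the minimum-size threshold are illustrative assumptions, not the paper's exact configuration):

```python
# Minimal StratLearn-style sketch: estimate propensity scores, stratify by
# their quintiles, fit a learner per source stratum, and predict the
# matching target stratum; sparse source strata borrow adjacent strata.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_s = rng.normal(0.0, 1.0, size=(600, 3))   # source covariates
y_s = (X_s[:, 0] > 0).astype(int)           # labels observed on source only
X_t = rng.normal(0.7, 1.0, size=(600, 3))   # shifted target covariates

# Step 1: estimate propensity scores e(x) = P(S = 1 | x) by logistic regression.
X_all = np.vstack([X_s, X_t])
s = np.r_[np.ones(len(X_s)), np.zeros(len(X_t))]
e_hat = LogisticRegression().fit(X_all, s).predict_proba(X_all)[:, 1]

# Step 2: form K = 5 strata from quintiles of the estimated scores.
cuts = np.quantile(e_hat, [0.2, 0.4, 0.6, 0.8])
strata = np.searchsorted(cuts, e_hat)       # stratum index in {0, ..., 4}
strat_s, strat_t = strata[:len(X_s)], strata[len(X_s):]

# Step 3: fit a learner per source stratum; predict the matching target stratum.
y_t_pred = np.empty(len(X_t), dtype=int)
for k in range(5):
    m_s, m_t = strat_s == k, strat_t == k
    if m_s.sum() < 30:                      # too little source data: borrow neighbors
        m_s = np.abs(strat_s - k) <= 1
    model = RandomForestClassifier(random_state=0).fit(X_s[m_s], y_s[m_s])
    if m_t.any():
        y_t_pred[m_t] = model.predict(X_t[m_t])
```

Each per-stratum fit in Step 3 is an ordinary supervised learning problem, and the loop can run in parallel across strata.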

StratLearn -Balance Diagnostics
A key advantage of propensity scores derived in causal inference is their covariate balancing score property (RR83), that is, p_s(x | e(x)) = p_t(x | e(x)). In causal inference, this property is used to verify the propensity score model and/or the choice of covariates, X, e.g., by checking that X has the same within-stratum distribution in the treatment and control groups. Employing the balancing property in the derivation of Proposition 1 allows us to take advantage of such diagnostic tools in our framework. We refer to the large literature on this (e.g., Rosenbaum and Rubin, 1984; Imai and van Dyk, 2004; Austin et al., 2007; Austin, 2011), and provide an example of such a balance check in Section 4.2 (and in Sections S6 and S7).
In Remark 1, we detail how additional structure in the covariate shift setting can be exploited to justify a corollary model diagnostic.
Remark 1. In the propensity score framework of causal inference (Rosenbaum and Rubin, 1983) we have potential outcomes Y_0 and Y_1. In the covariate shift framework, the potential outcomes are identical (Y_0 ≡ Y_1). That is, there is no "treatment effect" from being assigned to the source or target set, though only the source data are observed (Y_1 ≡ Y). Now, given the propensity score e(X), with 0 < e(X) < 1, and the covariate shift condition p(y | x, S = 1) = p(y | x, S = 0), source data assignment is 'strongly ignorable' (using the terminology of RR83). It follows through Theorem 4 in RR83 that, conditional on the propensity score, source and target outcomes are the same in expectation.
In cases where labels are observed for (part of) the target group, we can use Remark 1 as a model diagnostic. Although in practice the labels are mostly unobserved in the target group, they are available in our real-world scientific/experimental settings described in Sections 4.2 and 4.3. In Section 4.2 (as well as Sections S6, S7 and S8), we use Remark 1 to demonstrate a reduction of within-strata covariate shift (i.e., by conditioning on the propensity score).
We further demonstrate the possibility of similarly using predicted labels instead of actual labels as a model diagnostic. While the actual target labels y_t are usually not available in real-world data applications, the distribution of the model-predicted outcome labels (ỹ = f(x)) can be evaluated for source, f(x_s), and target, f(x_t). With f being a measurable function of the covariates X, and by employing the balancing property of propensity scores, it holds that p_s(f(X) | e(X)) = p_t(f(X) | e(X)). Consequently, a discrepancy between the distributions of predicted source outcomes f(x_s) and predicted target outcomes f(x_t) is an indication of residual (covariate) shift in the source and target distribution.² An advantage of assessing the balance in the predicted outcome f(X) (in addition to covariate balance) is that f(X) is designed to approximate the outcome Y. Thus, a discrepancy between f(x_s) and f(x_t) indicates an imbalance in strongly predictive covariates, a straightforward sign of remaining, and likely concerning, confounding. We demonstrate the application of balance diagnostics via predicted labels in Section 4.2.

Comparison Methods
We compare StratLearn to a range of well-established importance weighting methods.
In Section 4.3, we incorporate the estimated weights as in the corresponding benchmark publication. Following Izbicki et al. (2017), the estimated weights are used for loss weighting as in (1). In Section 4.2, importance weighting has not previously been applied. We implement IWCV, importance sampling, and a combination of both, to demonstrate the advantage of StratLearn with respect to either; see Section S3.

Classification -SNIa Identification
Objective: Type Ia supernovae (SNIa) are invaluable for the study of the accelerated expansion history of the universe (e.g., Riess et al., 1998; Perlmutter et al., 1999). SNIa are exploding stars that can be seen at large distances, occurring due to a particular physical scenario which causes their intrinsic luminosities to be (nearly) the same. This "standard candle" property of SNIa makes it possible to measure their distance, which in turn depends on parameters that describe the expansion of the universe.
To take advantage of this, reliable identification of SNIa based on photometric light curve (LC) data is a major challenge in modern observational cosmology. Photometric LC data are easily collectable, consisting of measurements of an astronomical object's brightness (i.e., flux), filtered through different passbands (wavelength ranges), at a sequence of time points (as illustrated in Fig 2). Only a small subset of the objects are labelled via expensive spectroscopic observations. The labeled source data, S, are not representative of the photometric target data, T, as the selection of spectroscopic source samples is biased towards brighter and bluer objects. The automatic classification of unlabelled objects, based on biased spectroscopically confirmed source data, is the subject of much research, including public classification challenges (Kessler et al., 2010, 2019).

[Fig 2: light curves (flux vs. time in days) of a single supernova in the g-, r- and i-bands.]

² In practice, one has to ensure that f is not overfitted on the available training (source) data sample, a standard check in supervised learning, which can readily be assessed via standard validation tools, such as cross-validation or bootstrapping of source data.
³ KLIEP and uLSIF were implemented with the original authors' public-domain MATLAB code (link).
Leading SNIa classification approaches are based on data augmentation; they sample synthetic objects from Gaussian process (GP) fits of the LCs to overcome covariate shift (Revsbech et al., 2018; Boone, 2019). The method of Revsbech et al. (2018), STACCATO, can be viewed as a prototype of StratLearn, as it augments the source data separately in strata based on the estimated propensity scores. However, to optimize data augmentation within strata, Revsbech et al. (2018) require a sub-sample of labeled target data that is unavailable in practice. While effective in this particular case, GP data augmentation is not an option in most covariate shift tasks. We show that StratLearn makes augmentation unnecessary. We use target prediction AUC to compare performance to published results.

Data and Preprocessing:
We use data from the updated "Supernova photometric classification challenge" (SPCC) (Kessler et al., 2010), containing a total of 21,318 simulated SNIa and SNe of other types (Ib, Ic and II). For each supernova (SN), LCs are given in four color bands, {g, r, i, z}. The data include a source set S of 1102 spectroscopically confirmed SNe with known types and 20,216 SNe with unknown types (target set T). 51% of the source objects are SNIa, while only 23% of the target data are SNIa, a consequence of the strong covariate shift in the data.
We follow the approach in Revsbech et al. (2018), which was applied to an earlier release of the SPCC data (Kessler et al., 2010, discussed in Section S6), to extract a set of features from the LC data that can be used for classification. First, a GP with a squared exponential kernel is used to model the LCs. Then, a diffusion map (Coifman and Lafon, 2006) (as used in Richards et al. (2012)) is applied, resulting in a vector of 100 similarity measures between the LCs that we use as predictor variables. Combining these with redshift (a measure of cosmological distance, defined in Section 4.3) and a measure of overall brightness, we obtain 102 predictive covariates.
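The first preprocessing step, a GP fit with a squared exponential (RBF) kernel, can be sketched as follows (the time/flux values and kernel hyperparameters are invented for illustration; the paper's actual GP configuration follows Revsbech et al., 2018):

```python
# Sketch: smooth a noisy light curve with a Gaussian process regression
# using a squared exponential (RBF) kernel plus a white-noise term.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

t = np.array([0., 4., 9., 15., 22., 30., 41., 55.])       # days (made up)
flux = np.array([2., 10., 28., 35., 30., 21., 11., 5.])   # arbitrary flux units

kernel = 1.0 * RBF(length_scale=10.0) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(t.reshape(-1, 1), flux)

# Evaluate the smoothed curve on a regular grid, e.g. as input to a diffusion map.
t_grid = np.linspace(0.0, 55.0, 100).reshape(-1, 1)
flux_smooth, flux_sd = gp.predict(t_grid, return_std=True)
```

The per-band smoothed curves (one per passband) are what the diffusion map then compares across objects.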

Results:
To evaluate the impact of covariate shift on classification, we first consider a 'biased fit' by training a random forest classifier (as in Revsbech et al. (2018)) on the source covariates, ignoring covariate shift, resulting in an AUC of 0.902 on the target data (black ROC curve in Fig 3). Next, we obtain a 'gold standard' benchmark by randomly selecting 1102 objects from the target data as a representative source set. The same classification procedure with the unbiased 'gold standard' training data (unavailable in practice) yields an AUC of 0.972 on the remaining 19,114 target objects.
Given the biased source data, StratLearn is implemented as described in Section 3, including all 102 covariates in the logistic propensity score estimation model. After stratification, a random forest classifier is trained and optimized on source strata S_1 and S_2 separately to predict samples in target strata T_1 and T_2. We use repeated 10-fold cross-validation with a large hyperparameter grid to minimize the empirical risk of (9) within each stratum, employing log-loss as our loss function; details appear in Section S5. Source strata S_k for k ∈ {3, 4, 5} have small sample sizes (13, 7 and 4, respectively). Thus, source strata S_k for k ∈ {3, 4, 5} are merged with S_2 to train the random forest to predict T_k for k ∈ {3, 4, 5}. With StratLearn, we obtain an AUC of 0.958 on the target data (blue ROC curve in Fig 3), very near the optimal 'gold standard' benchmark. We also implemented IWCV using the same hyperparameter grid as for StratLearn, and a combination of IWCV and importance sampling, both of which led to lower AUC than the ones reported in Fig 3 and are thus not included in the results. Previous state-of-the-art methods report an AUC of 0.855 (Lochner et al., 2016) using boosted decision trees, 0.939 (Pasquet et al., 2019) using a framework of an autoencoder and a convolutional neural network, and 0.94 (Revsbech et al., 2018) using LC augmentation and target data leakage, all lower than StratLearn.

Balance Assessment on updated SPCC data:
To illustrate the balancing property of propensity scores (see Section 3.3) and its effect on predictive target performance, we assess the covariate balance in the updated SPCC data within strata, conditional on the estimated propensity scores, by means of two commonly used balance measures: absolute standardized mean differences (SMD) and the Kolmogorov-Smirnov test statistics (KS-stats) (Austin, 2011; Austin and Stuart, 2015).
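For a single covariate within one stratum, the two measures can be computed as below (simulated values; the pooled-variance form of the SMD is one common convention, and scipy's two-sample KS statistic is used for KS-stats):

```python
# Compute the two balance measures for one covariate in one stratum:
# absolute standardized mean difference (SMD) and the two-sample
# Kolmogorov-Smirnov statistic (maximum CDF distance).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
x_source = rng.normal(0.4, 1.0, 300)   # covariate values, source stratum
x_target = rng.normal(0.0, 1.0, 300)   # same covariate, target stratum

# SMD: absolute mean difference scaled by the pooled standard deviation.
pooled_sd = np.sqrt((x_source.var(ddof=1) + x_target.var(ddof=1)) / 2)
smd = abs(x_source.mean() - x_target.mean()) / pooled_sd

# KS statistic: supremum distance between the two empirical CDFs, in [0, 1].
ks_stat = ks_2samp(x_source, x_target).statistic
```

Repeating this per covariate, before and after stratification, yields comparisons of the kind summarized in Fig 4 and Table S2.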
Fig 4 provides a detailed covariate balance comparison, by plotting the "raw" SMD against the StratLearn SMD in stratum 1 (black) for each covariate. We remove two outliers (redshift and brightness) with very large "raw" SMD (1.1 and 1.7), because including them in the figure makes it more difficult to illustrate the balance of the bulk of the covariates; both are well balanced in stratum 1 using StratLearn (SMD of 0.12 and 0.17). Points below the diagonal line are better balanced in the stratum than in the "raw" non-stratified data. This is the case for the vast majority (71%) of black points in Fig 4. On average across the 102 covariates, StratLearn improves covariate balance compared to the "raw" non-stratified data, measured by SMD, by ∼70% in stratum 1 and ∼10% in stratum 2 (KS-stats: ∼70% in stratum 1 and ∼30% in stratum 2). It further improves upon STACCATO by ∼36% in stratum 1 and ∼46% in stratum 2 using SMD (KS-stats: ∼24% in stratum 1 and ∼36% in stratum 2). The remaining strata contain too few source data to assess covariate balance. Details are provided in Table S2.
The improved covariate balance (reduced covariate shift) directly translates into improved predictive performance. STACCATO (including data augmentation and target data leakage) yields a target AUC of 0.94, whereas with StratLearn we obtain a target AUC of 0.958 (without data augmentation and without target data leakage), a substantial improvement resulting from the improved covariate balance obtained by accounting for potentially confounding covariates. In general, we note that balance is particularly important for covariates that are strongly predictive of the outcome. Domain-specific expertise might be necessary to identify such covariates in individual cases in practice. In Section S7, we demonstrate how covariate balance can be improved by adjusting the propensity score model.
Table 1 presents the composition of the five StratLearn strata. Recall that, according to Remark 1, conditional on the propensity score, the marginal distributions of source and target outcome are the same in expectation. Table 1 shows that the proportions of SNIa in the source and target data (which in this case can be computed from knowledge of the true target labels in the simulation) align well for StratLearn in the first two strata, indicating the expected reduction in covariate shift. The source sample sizes in the remaining strata are too small for this assessment. In Table 2, we demonstrate how predicted outcomes can be employed for balance diagnostics, by assessing the predicted proportions of SNIa within strata obtained by STACCATO and by StratLearn. We compute the predicted outcomes by classifying objects as SNIa if the (random forest) predicted SNIa probabilities are above 0.5. While STACCATO leads to a strong discrepancy between predicted SNIa proportions in the first two strata (indicating remaining confounding), StratLearn leads to well-matched predicted SNIa proportions. We further quantify the discrepancy by performing a two-sided Fisher's exact test of independence, with the null hypothesis that there is no association between source/target set assignment and predicted SNIa proportion. Comparing different propensity score models, a higher p-value is an indicator of better balance in the predicted outcomes, and is thus desirable. StratLearn leads to much higher p-values than STACCATO (failing to reject the null hypothesis for reasonable significance levels), which implies a much weaker relation between source/target assignment and predicted outcomes.
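The test itself amounts to a Fisher's exact test on a 2×2 contingency table of source/target membership against predicted class within a stratum (the counts below are invented for illustration, not taken from Table 2):

```python
# Sketch of the predicted-label balance diagnostic: two-sided Fisher's exact
# test on the 2x2 table of source/target membership vs predicted SNIa class
# within one stratum. Counts are illustrative.
from scipy.stats import fisher_exact

#                  pred. SNIa   pred. non-Ia
table = [[140,  160],    # source stratum
         [1900, 2100]]   # target stratum

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
# A large p-value indicates no detected association between source/target
# membership and the predicted label, i.e. good predicted-outcome balance.
```

With these (well-matched) illustrative proportions, the p-value is large; a strong source/target discrepancy in predicted proportions would drive it toward zero.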
In this particular example, with StratLearn, we fail to reject the null hypothesis for most significance levels. This may not always be the case (e.g., Section S6). We recall that the strategy of conditioning on propensity scores via stratification leads to subgroups with similar (not identical) propensity scores, and thus to similar (not identical) joint distributions within strata (this is the approximation in (8)). This in turn might lead to differences in the distributions of the covariates and the (predicted) outcomes, even if we could condition on the true propensity scores. We thus employ the p-values of (predicted) outcomes as an additional tool to assess, and primarily to compare, propensity score models, to detect and reduce confounding of highly predictive and thus most relevant covariates.

Conditional Density Estimates - Photo-z Regression
Objective: The wavelength of light from extragalactic objects is stretched because of the expansion of the universe, a phenomenon called 'redshift'. This fractional shift towards the red end of the spectrum is denoted by z. A precise measurement of redshift allows cosmologists to estimate distances to astronomical sources, and its accurate quantification is essential for cosmological inference (e.g., redshift is a key component of the Big Bang theory). Because of instrumental limitations, redshift can be precisely measured for only a small fraction of the ~10^7 galaxies observed to date (set to grow to ~10^9 within a decade). These source data are subject to covariate shift relative to the set of galaxies with unknown redshift (target). Izbicki et al. (2017) employed importance weighting to adjust for covariate shift in x, a set of observed photometric magnitudes (a logarithmic measure of passband-filtered brightness), when estimating z. They obtain a non-parametric estimate of the full conditional density, f(z|x), to quantify the predictive uncertainty of redshift estimates. Proper quantification of predictive uncertainties is crucial to avoid systematic errors in downstream scientific analyses (Izbicki et al., 2017; Sheldon et al., 2012). Using the same setup and conditional density estimation models (hist-NN, ker-NN, Series and Comb, detailed in Izbicki et al. (2017)), we show that StratLearn leads to better overall predictive performance than importance weighting.
Assuming that source and target data follow the same distribution, under the L2-loss, conditional density estimators typically aim to minimize the generalized risk (generalized in that the underlying loss can be negative):

R(f) = ∫∫ ( f(z|x) - p(z|x) )^2 p_S(x) dx dz.   (10)

Izbicki et al. (2017) propose to adjust for covariate shift by adapting (10), optimizing weighted versions of the conditional density estimators (Izbicki et al., 2017, Sections 5.1-5.3) with respect to an importance weighted generalized risk,

R_w(f) = ∫∫ ( f(z|x) - p(z|x) )^2 ŵ(x) p_S(x) dx dz,   (11)

where the weights, ŵ(x) = p_T(x)/p_S(x), are estimated using the methods described in Section 4.1. As their best performing model for f(z|x), Izbicki et al. (2017) propose a combination of the importance weighted ker-NN and Series estimators,

f_Comb(z|x) = Σ_{k=1}^{K} α_k f_k(z|x),   (12)

with constraints (i): α_k ≥ 0, and (ii): Σ_k α_k = 1, referred to as 'Comb' (i.e., combination), where K = 2 and α is optimized to minimize (11).
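For a fixed estimator, the (weighted) generalized risk can be estimated empirically without knowing the true p(z|x): expanding the square shows that only the terms ∫ f(z|x)^2 dz and -2 f(z|x) depend on f, while the remaining term is a constant. A minimal numerical sketch of this empirical estimator (in Python; names are illustrative, and a simple Riemann sum on a uniform grid replaces exact integration):

```python
import numpy as np

def weighted_cde_risk(f_hat, x, z, z_grid, weights=None):
    """Empirical (importance weighted) generalized L2 risk of a conditional
    density estimator, up to an additive constant not depending on f_hat:
        R_w(f) ~ (1/n) sum_i w_i [ int f(z|x_i)^2 dz  -  2 f(z_i|x_i) ].
    f_hat(z_values, x_i) returns the estimated density at z_values given x_i;
    z_grid must be a uniform integration grid covering the support."""
    n = len(x)
    w = np.ones(n) if weights is None else np.asarray(weights)
    dz = z_grid[1] - z_grid[0]
    risk = 0.0
    for i in range(n):
        dens_grid = f_hat(z_grid, x[i])        # f(z|x_i) on the grid
        sq_int = np.sum(dens_grid ** 2) * dz   # Riemann sum of f(z|x_i)^2
        risk += w[i] * (sq_int - 2.0 * f_hat(np.array([z[i]]), x[i])[0])
    return risk / n
```

Because the constant term is dropped, the estimate can be negative; what matters is its ordering across candidate estimators, which is what both (10) and (11) exploit.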
With StratLearn, we optimize the unweighted conditional density estimators (hist-NN, ker-NN, Series) by minimizing (10) in each source stratum separately (accounting for covariate shift following Proposition 1). We also propose a StratLearn version of Comb by optimizing (12) on each source stratum separately (via the generalized risk in (11) with w(x) ≡ 1), combining ker-NN and Series (each optimized via StratLearn beforehand). StratLearn and the other methods are compared with a 'Biased' (unweighted) method that simply optimizes (10). We abbreviate the combination of each method (StratLearn, 'Biased', and each of the weighting methods in Section 4.1) with the models for f(z|x) (hist-NN, ker-NN, Series and Comb) as Method_Model.

Data:
We use the same data as Izbicki et al. (2017), consisting of 467,710 galaxies from Sheldon et al. (2012), each with spectroscopic redshift z (measured with negligible error) and five photometric covariates x. As in Izbicki et al. (2017), we use the r-band magnitude and the four colors (differences of magnitude values in adjacent photometric bands) as our covariates. We denote this spectroscopic source sample by S. To simulate realistic covariate shift, we follow Izbicki et al. (2017): starting from S, we use rejection sampling to simulate a photometric, unrepresentative target sample T, with the prescription P(S = 0|x) = β_(13,4)(x^(r)) / max_{x^(r)} β_(13,4)(x^(r)), where x^(r) is the r-band magnitude and β_(13,4) is a beta density with parameters (13, 4). Additionally, we investigated adding d ∈ {10, 50} i.i.d. standard normal covariates as potential predictors to the 5 photometric covariates. This simulates the realistic case where additional potentially confounding covariates are present. For comparability, we follow Izbicki et al. (2017) and use |S_train| = 2800 galaxies randomly sampled from S as training data, plus a validation set of |S_val| = 1200 galaxies. We assess the performance of each Method_Model pair using a random subset of T of size |T_test| = 6000.
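The rejection-sampling prescription above can be sketched in a few lines. The following Python illustration is not the authors' implementation: the rescaling of the r-band magnitude to the unit interval (where the beta density is defined) is our assumption for this sketch, and the function name is hypothetical:

```python
import numpy as np
from scipy.stats import beta

def sample_shifted_target(r_mag, rng=None):
    """Rejection-sample an unrepresentative target set from the full sample,
    keeping galaxy j with probability
        beta(13,4).pdf(u_j) / max_j beta(13,4).pdf(u_j),
    where u_j is the r-band magnitude rescaled to (0, 1) (our assumption).
    Returns a boolean mask of the kept (target) galaxies."""
    rng = np.random.default_rng(rng)
    r = np.asarray(r_mag, dtype=float)
    u = (r - r.min()) / (r.max() - r.min() + 1e-12)  # rescale to unit interval
    dens = beta(13, 4).pdf(u)
    keep = rng.random(len(r)) < dens / dens.max()
    return keep
```

Since the beta(13, 4) density concentrates near 0.8, the accepted sample is skewed towards larger rescaled magnitudes, producing the intended systematic difference between the source and target covariate distributions.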

Results:
For evaluation of f under each Method_Model pair, we use the (in our simulation) known target redshifts to compute the target risk, R_T(f), via a non-weighted version of (11) with the source samples (x_i^(S), z_i^(S)) replaced by the target samples (x_i^(T), z_i^(T)).

[Figure 5 appears here.] Fig. 5: Target risk (R_T) of the four photo-z estimation models (hist-NN, ker-NN, Series, Comb) under each method (Biased, uLSIF, NN, IPS, KLIEP, StratLearn; different colors), using different sets of predictors. Bars give the mean ± 2 bootstrap standard errors (from 400 bootstrap samples).
In the setups with higher dimensional covariates, the Series estimator degrades strongly under most methods. In these cases, StratLearn_Comb exploits the higher performance of StratLearn_ker-NN. In contrast, for the non-adjusted (Biased) and importance weighted methods (e.g., IPS), the combination of approaches (Comb) does not necessarily lead to improved performance (e.g., IPS_Comb exhibits a higher target risk than IPS_ker-NN on its own (Fig. 5, right panel)), indicating that the optimization in (12) fails due to remaining covariate shift in the data. More precisely, the weighted empirical source risk minimization ((11), as a form of (1)) does not lead to target risk minimization in these situations. In general, the improvement of StratLearn relative to weighting methods increases with the dimensionality of the covariate space, leading to a more robust regime.

Discussion
We provide a simple, yet statistically principled and theoretically justified, method for learning under covariate shift. We show that StratLearn outperforms a range of state-of-the-art importance weighting methods on two contemporary research questions in cosmology (and on toy covariate shift examples, Section S7), especially in the presence of a high-dimensional covariate space. The assumption of covariate shift is rather strong, requiring that there are no unmeasured confounding covariates, something that cannot be guaranteed in general. In Section S7, however, we demonstrate a certain robustness of our method against violation of this assumption. Further work is necessary to assess more fully the performance of StratLearn when this assumption is only approximately fulfilled. We emphasize that the covariate shift framework is best justified in the presence of a large number of covariates mitigating the risk of unmeasured confounders, in which case it is critical to adopt a method that, like StratLearn, can robustly handle many covariates. Our framework is entirely general and versatile, as illustrated with examples of regression, conditional density estimation and classification. Notably, our numerical demonstrations illustrate the advantage of using only a subset of the source data when formulating predictions for individual target objects, where the subset is chosen for its similarity to the target data in question (through stratification). This is a markedly different strategy from the widespread practice of including all available observations when fitting learning models. The novelty of our approach is grounded in the transfer of the well-established causal inference propensity score framework (Rosenbaum and Rubin, 1983) to the domain adaptation/covariate shift setting, by demonstrating that a method established to obtain unbiased treatment effect estimates can be adapted to optimize the target risk of a supervised learner under covariate shift. In future work,
this extension gives a chance to transfer hard-won practical advice from causal inference (e.g., balance diagnostics, estimation of propensity scores, choice of included covariates (Rosenbaum and Rubin, 1984; Austin et al., 2007; Pirracchio et al., 2014), etc.) to the covariate shift framework. We will also explore the possibility of taking advantage of Proposition 1 through a matching approach (Imbens and Rubin, 2015), which could prove more sensitive to the underlying propensity score distribution. We believe StratLearn may become a powerful alternative to importance weighting, with a myriad of possible extensions and applications.

Supplemental Materials S1 Additional Methodological Details for StratLearn
In this section, we provide additional information on our novel StratLearn methodology described in Section 3, formally deriving Proposition 1.
Step (17) follows from the balancing property of the propensity score (14). Thus, conditional on the propensity score, the source and target data have the same joint distribution. Equation (6) follows directly.

S2 Bibliographic Note
In this section, we provide additional background literature to supplement examples provided in the main paper.
The purpose of domain adaptation methods is to obtain accurate target (test) predictions in situations where the source data domain is not an accurate representation of the target data domain. Quiñonero-Candela et al. (2009) provide a comprehensive overview of such domain shift, or "dataset shift", in machine learning. Domain adaptation cases arise in a myriad of applications, such as medical imaging (Stonnington et al., 2008; Van Opbroek et al., 2014; Kamnitsas et al., 2017), where mechanical configurations may vary between medical centers; natural language processing (Jiang and Zhai, 2007), where annotated source data are often highly specialized and thus different from the target data; robotics and computer vision (Patel et al., 2015; Hoffman et al., 2016; Tai et al., 2017; Csurka, 2017), where simulated and observed data are combined to improve classification performance on the unlabeled target data; and astronomy (Gieseke et al., 2010; Vilalta et al., 2013; Freeman et al., 2017; Revsbech et al., 2018; Allam Jr et al., 2018; Möller and de Boissière, 2020), where brighter astronomical objects are more likely to be observed and therefore included in the source set.
The estimation of importance weights is central to the covariate shift framework, and a variety of approaches exist. Yu and Szepesvári (2012) and Baktashmotlagh et al. (2014) estimate the source and target densities separately, e.g., through kernel density estimators. Building on Huang et al. (2007), Wen et al. (2015) propose Kernel Mean Matching in a reproducing kernel Hilbert space, employing the Frank-Wolfe algorithm. Kanamori et al. (2012) describe variations of unconstrained least-squares importance fitting (uLSIF) (Kanamori et al., 2009). Tsuboi et al. (2009) propose a variation of KLIEP (Sugiyama et al., 2008), adopting a log-linear function to model the importance weights. To alleviate the poor performance of weighting in high-dimensional covariate spaces, Sugiyama et al. (2010) and Sugiyama et al. (2011) propose incorporating dimensionality reduction procedures when estimating density ratios. In theoretical work, Ben-David et al. (2007) and Ben-David et al. (2010) derive generalization bounds and conditions under which a classifier can learn from source data to perform well on target data. Cortes et al. (2008) theoretically analyze the effect of errors in the weight estimation on the outcome hypothesis of the learning algorithm. More recently, there have been efforts to transfer methods between the causal inference and domain adaptation frameworks (Rojas-Carulla et al., 2018; Magliacane et al., 2018; Hassanpour and Greiner, 2019).
In the causal inference framework, the introduction of propensity scores (Rosenbaum and Rubin, 1983) has been groundbreaking. There has been vast work on the generalization, best practice, and assessment of propensity scores in causal inference (e.g., Hirano et al., 2003; Imai and van Dyk, 2004; Lunceford and Davidian, 2004; Stuart et al., 2013; Franklin et al., 2014; Austin and Stuart, 2015; Griffin et al., 2017). While the general methodology has not yet been considered in the domain adaptation framework, it has been transferred to other areas, such as learning from positive and unlabeled data (Bekker et al., 2019), unbiased learning-to-rank (Joachims et al., 2017), and the evaluation of recommender systems (Schnabel et al., 2016), supplementing the examples provided in Section 1. Much work has been done to improve propensity score estimators to obtain unbiased treatment effects, with various proposed methods (e.g., McCaffrey et al., 2004; Setoguchi et al., 2008; Lee et al., 2010; Westreich et al., 2011; Mccaffrey et al., 2013; Pirracchio et al., 2014; Imai and Ratkovic, 2014; Pirracchio and Carone, 2018; Autenrieth et al., 2021). In the examples provided in the main paper, we show that StratLearn with a simple logistic regression propensity score estimator leads to strong predictive target performance, outperforming various importance weighting methods. The consideration of additional propensity score estimators, as well as optimizing the set of selected covariates, might further improve StratLearn in future work, e.g., by directly targeting covariate balance, similar to Griffin et al. (2017), Parast et al. (2017), and Pirracchio and Carone (2018).
We note that the additional literature described in this section is far from complete; rather, it serves as an additional overview of the domain adaptation/covariate shift framework and allows further placement of our work within it. We refer to surveys provided in the literature for a more thorough discussion of the topic (Pan and Yang, 2009; Kouw and Loog, 2019). The summary of propensity score methods serves as an overview of the extensive work that can potentially be transferred from causal inference to the covariate shift framework via the general propensity score methodology described in Section 3. We refer to Imbens and Rubin (2015) for a more comprehensive summary of propensity scores.

S3 Details of Comparison Models
Here we detail our implementation of the several methods listed in Section 4.1 that we compare with StratLearn in our numerical studies.

Importance Weighting Estimators:
Following Kanamori et al. (2009) and Bickel et al. (2009), by Bayes' theorem,

w(x) = p_T(x)/p_S(x) = [P(S = 1)/P(S = 0)] · [P(S = 0|x)/P(S = 1|x)] ≈ (n_S/n_T) · (1 - e(x))/e(x),   (18)

where e(x) = P(S = 1|x) denotes the propensity score. The importance-weighting-based estimators KLIEP, uLSIF and NN (described in Section 4.1) directly estimate w(x). IPS obtains an estimate of w(x) by plugging the estimated propensity score into the right-hand side of (18), which allows any probabilistic classifier to be used to obtain the weights, e.g., logistic regression (Bickel and Scheffer, 2007). The estimated weights in (18) can then be used for weighted empirical risk minimization (following Formula (1)); e.g., through (11) in Section 4.3, and through IWCV in Section 4.2.
To implement importance sampling, such as in Section 4.2, we employ the framework proposed by Zadrozny (2004). Specifically, Zadrozny (2004) shows that for any distribution D with feature-label-selection space X × Y × S and (x, y, s) examples drawn from D, for all classifiers f and loss functions ℓ = ℓ(f(x), y), if we assume that P(s = 1|x, y) = P(s = 1|x) ≠ 0, then resampling the selected (source) examples with probabilities proportional to 1/P(s = 1|x) recovers, in expectation, the risk under the unbiased distribution D (Formulas (19) and (20)). The target risk can thus be minimised by sampling from D, e.g., by importance sampling of source domain data. Formula (20) allows us to directly use the inverse of the estimated propensity scores, 1/ê(x). Moreover, rearranging the terms in the formulation of w(x) on the right-hand side of (18) yields

1/e(x) = 1 + w(x) · n_T/n_S.

Thus, following Formula (19) (Zadrozny, 2004), IPS can be used for importance sampling with probabilities proportional to 1/ê(x_i), and the direct-weight estimators KLIEP, uLSIF and NN can be used for importance sampling with probabilities proportional to 1 + ŵ(x_i) · n_T/n_S.

S4 Details of Data and Software
The data required to reproduce the results for supernova classification, as presented in Section 4.2 of the main paper and Sections S5 and S6 of this Supplement, can be found here: SPCC data link. The data used for photometric redshift regression, as presented in Section 4.3, were used by permission of the original authors of Sheldon et al. (2012). The data from the UCI repository (Dua and Graff, 2017) are publicly available (UCI Wine quality data (link), Parkinsons Telemonitoring Data (link)).
The StratLearn software that we provide is written in R (R Core Team, 2019). For the computation of the experiments, a 32-core CPU cluster was available to speed up the simulations. However, computation time and resources were not a significant issue for the numerical results in this paper. As a benchmark, using a MacBook Pro with a 2.3 GHz 8-core i9 processor, one scenario (e.g., medium covariate shift, high-dimensional covariate set) for conditional density estimation on SDSS data took less than 8 hours to run. All other experiments are less time-consuming. The supernova classification (Section 4.2) on preprocessed SPCC data (provided in the link above) took less than 2 hours on the same MacBook Pro.

S5 Additional Results for SNIa Identification
This section provides additional results and details complementing Section 4.2, which describes the classification of supernovae (SNe) into type Ia (SNIa) and non-SNIa based on photometric light curve data, given a strongly biased source data set from Kessler et al. (2010).

Random Forest Implementation and Hyperparameter Selection:
We use the "ranger" random forest implementation in caret (Kuhn, 2008) for SNe classification in this paper. Repeated 10-fold cross-validation (with five repetitions) is implemented with a large hyperparameter grid, containing four different numbers of covariates to possibly split on in each node (3, 5, 8, 10); five values for minimum node size (3, 5, 7, 10, 20); and two different splitting rules ("gini" and "extratrees"). To optimize the selected hyperparameter setting, we compute the average log-loss via cross-validation on source data. Using StratLearn, cross-validation for hyperparameter selection (optimizing the average log-loss) is done on each source group separately (with groups defined in Section 4.2). We employ the log-loss to compute the empirical risk by evaluating the loss function ℓ(f(x_i), y_i) on each sample separately, as described in Section 2. The AUC (used to assess target predictive performance in this example) does not involve evaluating a loss function that compares the true label y_i and prediction f(x_i) for each sample i separately. However, to demonstrate robustness w.r.t. the choice of hyperparameter optimization criterion, we also compute the AUC on source data (on each source group separately) via cross-validation, which leads to the same hyperparameter setting as using the log-loss.

Table S1: AUC results on the updated SPCC (photometric) target data (Kessler et al., 2010).
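A Python analogue of this per-stratum tuning procedure is sketched below. This is not our implementation (we use caret/"ranger" in R): scikit-learn's random forest has no "extratrees" splitting rule option within the same estimator, so the grid here covers only the number of split covariates and the minimum node size, and the function name is hypothetical:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

def fit_stratum_classifier(X, y, n_splits=10, n_repeats=5, n_estimators=500, seed=0):
    """Tune a random forest within one source stratum by repeated k-fold CV,
    selecting the hyperparameters that minimize the average log-loss
    (mirroring the caret/'ranger' grid of the paper, minus the splitting rule)."""
    grid = {"max_features": [3, 5, 8, 10],          # covariates tried per split
            "min_samples_leaf": [3, 5, 7, 10, 20]}  # minimum node size
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                 random_state=seed)
    search = GridSearchCV(
        RandomForestClassifier(n_estimators=n_estimators, random_state=seed),
        grid, scoring="neg_log_loss", cv=cv, n_jobs=-1)
    return search.fit(X, y)
```

Under StratLearn, this routine would be called once per source stratum, so each stratum can settle on its own hyperparameter configuration.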

Comparison of Importance Weighting Methods:
In addition to importance sampling (as presented in Figure 3), we investigate two further importance weighting approaches to obtain a comprehensive performance comparison of importance weighting and StratLearn. More precisely, we implement IWCV, and a combination of IWCV and importance sampling. Table S1 presents the resulting target AUC for all implemented weighting methods. In summary, none of the importance weighting methods is competitive with StratLearn, which obtained a target AUC of 0.958. In fact, using IWCV alone leads to a decrease in performance for NN (0.897) and IPS (0.897) relative to the 'Biased' fit (0.902). In order to reweight the loss of each object separately when implementing IWCV, we optimize with respect to the log-loss instead of the AUC. We do not expect this to be responsible for the decreased target performance. For example, using the log-loss as the hyperparameter selection criterion for StratLearn in each source group (instead of the AUC) leads to the same target AUC of 0.958. Importance sampling is the only weighting method for which we present the associated ROC curve in Figure 3, because it has the highest average AUC.
Covariate Balance on Updated SPCC Data: Table S2 presents the covariate balance on the updated SPCC data before stratification ("raw" data), and after stratification via STACCATO and StratLearn, respectively. On the non-stratified "raw" data, we measure an average (SD) SMD of 0.114 (0.202) across the 102 covariates, with an average (SD) KS-stats of 0.187 (0.084). StratLearn leads to increased balance within strata, strongly reducing the average (SD) SMD to 0.034 (0.028) in stratum 1, with KS-stats of 0.056 (0.048), and to an average (SD) SMD of 0.103 (0.161), KS-stats of 0.131 (0.08), in stratum 2. (The remaining strata contain too few source data to assess covariate balance.) On average (SD) across all covariates, STACCATO led to an SMD of 0.053 (0.039), KS-stats 0.074 (0.05), in stratum 1, and an average (SD) SMD of 0.189 (0.195), KS-stats 0.206 (0.093), in stratum 2, much higher than StratLearn.

Table S2: Average (SD) of SMD and KS-stats computed on the observed covariates from the updated SPCC data (diffusion map coordinates, redshift and brightness) on "raw" data, and in strata built by StratLearn and STACCATO, respectively. Strata 3-5 are not displayed due to a shortage of source data. Smaller values indicate better balance.
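Both balance measures used above are simple per-covariate statistics. A minimal Python sketch (illustrative only; the pooled-SD convention for the standardized mean difference is one common choice among several, and the function name is hypothetical):

```python
import numpy as np
from scipy.stats import ks_2samp

def balance_measures(X_source, X_target):
    """Per-covariate balance diagnostics between the source and target units
    of one stratum: absolute standardized mean difference (SMD) and the
    Kolmogorov-Smirnov test statistic. Smaller values indicate better balance."""
    Xs = np.asarray(X_source, dtype=float)
    Xt = np.asarray(X_target, dtype=float)
    # pooled-SD version of the SMD (one common convention)
    pooled_sd = np.sqrt((Xs.var(axis=0, ddof=1) + Xt.var(axis=0, ddof=1)) / 2.0)
    smd = np.abs(Xs.mean(axis=0) - Xt.mean(axis=0)) / (pooled_sd + 1e-12)
    ks = np.array([ks_2samp(Xs[:, j], Xt[:, j]).statistic
                   for j in range(Xs.shape[1])])
    return smd, ks
```

Averaging (with SDs) these per-covariate values within each stratum reproduces the kind of summary reported in Table S2.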

S6 StratLearn applied to original SPCC Data
This section provides additional numerical evidence for StratLearn, performing SNIa classification on the original "Supernova photometric classification challenge" (SPCC) data set provided by Kessler et al. (2010), an earlier version of the SPCC data described in Section 4.2.

Data:
The application of StratLearn to the original SPCC data is mainly for comparison with previous literature using this data set. The earlier version of the SPCC data features a total of 17,330 simulated SNe of types Ia (SNIa), Ib, Ic and II. The data set is divided into a source (training) set S of 1,217 spectroscopically confirmed SNe with known types, and a target set T of 16,113 SNe with unknown types and only photometric information. The simulation used to generate the original SPCC data suffered from a bug that made classification easier, leading to the updated SPCC data used in Section 4.2. As in Section 4.2, we applied GP fitting and diffusion maps (Revsbech et al., 2018; Richards et al., 2012) to obtain a set of 102 predictive covariates: 100 covariates from the diffusion map, plus redshift (defined in Section 4.3) and a measure of each object's brightness (Revsbech et al., 2018).

Results:
Table S3 presents the composition of the five strata obtained by conditioning on the estimated propensity scores, exhibiting a similar pattern to the strata composition on the updated SPCC data (Table 1), though with a higher proportion of SNIa in the first stratum and even less source data in strata 3-5. We thus follow the same strategy as described in Section 4.2. After stratification, a random forest classifier is trained and optimized on source strata S_1 and S_2 separately to classify the SNe in the corresponding target strata.

Table S3: Composition of the five strata on the original SPCC data (Kessler et al., 2010). The number of SNe, as well as the number and proportion of SNIa, are presented for each source and target stratum.
Table S4 presents the AUC results on the original SPCC (photometric) target data (Kessler et al., 2010). The previously best reported AUC on these data is 0.961 (Revsbech et al., 2018), obtained with STACCATO, which employs a data augmentation strategy and target data leakage. Note that the predictive target performance on the original SPCC data is generally higher than on the updated version, due to the aforementioned bug that made classification easier.
Most of the importance weighting methods (Table S4) lead to slight improvements over the (non-adjusted) 'Biased' model fit. The best performing weighting approaches (using estimated NN or IPS weights for importance sampling) obtain a target AUC of 0.942, substantially lower than StratLearn (AUC: 0.977). Importance sampling is the only weighting method for which we present the associated ROC curve in Figure S1, because it has the highest average AUC.

S6.1 Covariate Balance Check on Original SPCC Data
In this section, we illustrate the balancing property of propensity scores by assessing the covariate balance in the original SPCC data within strata constructed by conditioning on the estimated propensity scores, as described in the last paragraph of Section 3. In particular, we show that such balance assessment can help evaluate the suitability of the choice of covariates and/or the propensity score model.
As mentioned in Section 4.2, the method proposed by Revsbech et al. (2018) (STACCATO) can be viewed as a prototype of StratLearn, as it augments the spectroscopic source data separately in strata based on the estimated propensity scores. STACCATO employs two covariates ("redshift" and a measure of brightness) as main effects in a logistic regression model to estimate the propensity scores. In contrast, StratLearn includes all 100 diffusion map covariates (Section 4.2), in addition to "redshift" and "brightness", as main effects in a logistic regression model to estimate the propensity scores. We assess the balance achieved through stratification by means of two commonly used balance measures: absolute standardized mean differences and Kolmogorov-Smirnov test statistics (Austin, 2011; Mccaffrey et al., 2013), computed for each covariate within strata. Table S5 shows that, on average across the observed covariates, StratLearn leads to substantially reduced absolute standardized mean differences and Kolmogorov-Smirnov test statistics in strata one and two, compared to STACCATO. We only report balance measures in strata one and two because of a shortage of source data in the remaining strata. Thus, building the strata by conditioning on the StratLearn estimated propensity scores leads to improved covariate balance (i.e., reduced covariate shift) within strata, compared to using the STACCATO estimated propensity scores. Notably, comparing the predictive performance, STACCATO (without augmentation) yields a target AUC of 0.942, whereas StratLearn obtains an AUC of 0.977 on the target data, a substantial improvement resulting from the improved covariate balance achieved by accounting for potentially confounding covariates.
For illustration purposes, in Figure S2 we display a more detailed balance check, comparing the absolute standardized mean differences computed for each covariate in stratum 1 (obtained through StratLearn (black) and STACCATO (red)) with the absolute standardized mean differences of the covariates without stratification. In general, points below the diagonal line indicate better balance in the stratum compared to the balance on the "raw" non-stratified data. Even though Revsbech et al. (2018) (STACCATO) include only redshift and brightness in the propensity score estimation, both approaches balance the majority of the covariates. However, the black (StratLearn) standardized mean differences have visibly smaller values in the y-direction, indicating better balance compared to the approach of Revsbech et al. (2018) (red). Note that, for clarity of Figure S2, we removed one outlier (brightness) with a raw standardized mean difference of 1.5, which was very well balanced by both approaches, with a standardized mean difference of around 0.09 in stratum 1. Assessing the balance of individual covariates, as illustrated in Figure S2, might be particularly useful to evaluate (and improve) the resulting balance of covariates important for outcome prediction. Some domain-specific expertise might be necessary to identify such covariates. In Section S7, we show how covariate balance assessment can be employed to evaluate the choice of propensity score model.
Table S6 presents balance comparisons between STACCATO and StratLearn via predicted outcome labels (as described in Section 3.3). StratLearn leads to much better balance between predicted outcome proportions within strata one and two, and thus to much higher p-values than STACCATO, implying a much weaker relation between source/target assignment and predicted outcomes. The comparison of predicted outcome proportions and p-values (between StratLearn and STACCATO) provides a straightforward indication to employ the StratLearn propensity score model for stratification, which in turn leads to much improved predictive performance of the SNIa classification outcome model.

S7 StratLearn applied to Data from UCI Repository
In this section, we demonstrate the performance of StratLearn under violation of the covariate shift assumption. We apply StratLearn to two datasets from the publicly available UCI repository (Dua and Graff, 2017): the "wine quality data" (Cortez et al., 2009) and the "Parkinson Telemonitoring data" (Little et al., 2008). These datasets have previously been used to explore regression problems in the presence of covariate shift (Chen et al., 2016). Links to these data sets can be found in Section S4.

S7.1 Data Description
Wine Quality Data: The data set contains 6,497 samples, of which 4,898 are of type "white wine", used as the source data S, and 1,599 are of type "red wine", used as the target data T. The output is a wine "quality score" (between 0 and 10), which we aim to predict for the target data (red wine), given the scores of the source data (white wine). There are 11 physicochemical predictive covariates. We refer to Cortez et al. (2009) for detailed information.
Parkinson Data: The data comprise 5,875 instances and 26 attributes. Similar to

S7.2 Implementation and Results on UCI Data
Following Chen et al. (2016), our objective is to predict the wine "quality score" and the Parkinson's "UPDRS" score using linear regression models, specifically ordinary least squares (OLS). We assume that the source/target splitting covariates (wine type and age) are not observed for model fitting. Note that, because these are confounding covariates, this violates the covariate shift assumption. All other available covariates are used to estimate the propensity scores and the importance weights, and as predictor variables for the outcome models. To improve covariate balance on the wine quality data, we use gradient boosting machines to estimate the propensity scores, as commonly done in the propensity score causal inference literature (McCaffrey et al., 2004; Mccaffrey et al., 2013; Autenrieth et al., 2021). Covariate balance is assessed as described in the last paragraph of Section 3. On the wine quality data, we assess the covariate balance in stratum 4, which is the only stratum with both source and target samples (Table S7). Using logistic regression, the average (SD) absolute standardized mean difference across the observed covariate set is 0.667 (0.436), and the average (SD) Kolmogorov-Smirnov test statistic is 0.388 (0.154). Estimating the propensity scores using gradient boosting machines increases balance in stratum 4, reducing the average (SD) absolute standardized mean difference to 0.561 (0.292), and the average (SD) Kolmogorov-Smirnov test statistic to 0.288 (0.109). On the Parkinson data set, logistic regression is used to estimate the propensity scores because it leads to better balance in the strata than gradient boosting machines. Table S7 presents the composition of the StratLearn strata for the wine and Parkinson data, both conditioned on the estimated propensity scores. For the wine data set, all target samples are in strata 4 and 5, and only stratum 4 contains both source and target samples. Following our StratLearn strategy, we fit OLS on S_4 to predict T_4 and T_5. In the Parkinson data set there is enough source data in each stratum to fit OLS separately on each source stratum S_g to predict the data in the respective target stratum T_g, g ∈ {1, ..., 5}.
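The full stratify-then-fit workflow used here can be sketched compactly. The Python illustration below is not our R implementation: the fallback of predicting a target stratum from the nearest non-empty source stratum is our illustrative generalization of the S_4-to-T_5 strategy described above, and all names are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def stratlearn_ols(X_s, y_s, X_t, n_strata=5):
    """StratLearn sketch: estimate propensity scores e(x) = P(S = 1 | x) on the
    pooled covariates, cut into quantile strata of e-hat, and fit OLS within each
    stratum on its source units to predict its target units. Target strata
    without source data fall back to the nearest stratum that has some."""
    n_s, n_t = len(X_s), len(X_t)
    X = np.vstack([X_s, X_t])
    s = np.concatenate([np.ones(n_s), np.zeros(n_t)])
    e = LogisticRegression(max_iter=1000).fit(X, s).predict_proba(X)[:, 1]
    edges = np.quantile(e, np.linspace(0, 1, n_strata + 1))
    stratum = np.clip(np.searchsorted(edges, e, side="right") - 1, 0, n_strata - 1)
    g_s, g_t = stratum[:n_s], stratum[n_s:]
    y_pred = np.empty(n_t)
    for k in range(n_strata):
        if not np.any(g_t == k):
            continue                              # no target units to predict here
        src = g_s == k
        if not np.any(src):                       # fall back to nearest non-empty stratum
            have = np.unique(g_s)
            src = g_s == have[np.argmin(np.abs(have - k))]
        model = LinearRegression().fit(X_s[src], y_s[src])
        y_pred[g_t == k] = model.predict(X_t[g_t == k])
    return y_pred
```

Because each OLS fit only sees source units whose propensity scores resemble those of the target units it predicts, the covariates are approximately balanced within each fit, which is the core mechanism of StratLearn.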
Table S8 presents the MSE of the target predictions, comparing StratLearn with (non-adjusted) OLS ('Biased') and WLS, using the methods described in Section 4.1 to estimate the weights. Applying StratLearn results in a target MSE of 0.715 for the wine data and 114.97 for the Parkinson data, both improving upon the 'Biased' OLS fit. We also compare StratLearn with the method that Chen et al. (2016) develop and illustrate on these datasets. Specifically, Chen et al. (2016) propose a "robust bias-aware regression" based on the Kullback-Leibler divergence. This proposal is tailored specifically to regression and to loss functions that account for uncertainty in the prediction (e.g., empirical log-loss), whereas StratLearn is entirely general yet can be successfully applied to this task. Following Chen et al. (2016), we compute the target empirical log-loss as a performance measure. On the wine data, StratLearn yields an empirical log-loss of 1.271, which exactly matches the reported performance of the method proposed in Chen et al. (2016). On the Parkinson data we could not replicate the exact data split reported in Chen et al. (2016), and thus a meaningful comparison of empirical log-loss to the values reported by Chen et al. (2016) is not possible.
In summary, in these illustrative regression examples StratLearn demonstrates a certain robustness against violation of the covariate shift assumption, improving upon the (non-adjusted) 'Biased' fit and performing comparably to the best importance weighting approach.
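The importance weighting (WLS) baselines referenced above minimize a weighted source loss instead of stratifying. A minimal sketch, assuming the weights are derived from estimated propensity scores via the odds (1 − e)/e, as in the IPS variant; this is an illustration, not the exact implementation compared in Table S8:

```python
# Weighted least squares: minimize sum_i w_i * (y_i - [1, x_i] @ beta)^2,
# solved by rescaling rows with sqrt(w_i) and using ordinary lstsq.
import numpy as np

def wls(X, y, w):
    A = np.c_[np.ones(len(X)), X]      # add intercept column
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return beta

def ips_weights(e_src):
    """Importance weights for source samples from propensity scores e = P(source | x)."""
    return (1 - e_src) / e_src
```

With all weights equal, `wls` reduces to ordinary OLS; the IPS weights upweight source samples that look like target samples.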

S8 Additional Results for Redshift Regression
In the context of the photometric redshift regression example in Section 4.3, we present additional numerical results that demonstrate the robustness of StratLearn to varying degrees of covariate shift. Specifically, we reproduce the results of Section 4.3 under two additional covariate shift scenarios.

Summary -Conditional Density Estimation:
Figure S3 compares the resulting target risk R(f) across models and covariate sets for the weak covariate shift (top row) and the strong covariate shift (bottom row) scenarios. These results, combined with those from the medium covariate shift scenario in Figure 5, reinforce the advantage of using StratLearn, especially for higher-dimensional covariate sets. In Figures 5 and S3, StratLearn Series performs well in the setup with the fewest covariates (left column). In the weak covariate shift scenario, this leads to an additional boost in the performance of StratLearn Comb (which combines StratLearn ker-NN and StratLearn Series, following (12)). In the setups with higher-dimensional covariates, the Series methods (using either StratLearn or the weighting methods) exhibit relatively poor performance overall. In these cases, the StratLearn Comb predictions rely solely on StratLearn ker-NN, which again demonstrates the successful empirical risk minimization of (9), inasmuch as StratLearn Comb automatically selects the better target model (i.e., StratLearn ker-NN) in each stratum, based on the respective empirical source risk estimates. Overall, IPS and NN weighting are most competitive with StratLearn, in addition to KLIEP in the low-dimensional, strong covariate shift setup. Note that the performances of IPS Comb, KLIEP Comb, and uLSIF Comb are degraded in some cases (e.g., Figure S3, upper left panel) by the inclusion of the poorly performing Series predictions, illustrating that weighted empirical source risk minimization ((11), as a form of (1)) does not lead to target risk minimization in these situations. With an increasing number of covariates, none of the weighted models yield results competitive with StratLearn.

Table S9 presents the composition of the five StratLearn strata for the medium covariate shift scenario (described in Section 4.3) for each of the three sets of covariates. Each source stratum contains a sufficient sample size to fit the conditional density estimators for prediction on the respective target stratum. The overall redshift averages in the source (0.11) and target (0.28) data are very unequal, a consequence of the covariate shift. Within strata, the redshift averages are well aligned between source and target, indicating improved balance after stratification. Notably, the composition of the strata in the high-dimensional covariate setup is similar to that of the lower-dimensional setups, with well-balanced strata. This is an indication of the robustness of StratLearn with respect to high-dimensional (noisy) sets of covariates. Table S10 compares the five StratLearn strata for the weak and strong covariate shift scenarios, again with well-aligned average redshift within each stratum. Note that for the strong covariate shift scenario there are almost no galaxies in the first two target strata; thus, effectively only data in source strata 3 to 5 are used for target prediction in the respective strata. These examples illustrate the advantage of using only a small subset of the source data when formulating predictions for individuals in the target, where the subset is chosen for its similarity to the target data in question. This is a markedly different strategy from the widespread practice of including all available data when fitting machine learning models.

Fig. S3: Target risk (R) of the four photo-z estimation models (hist-NN, ker-NN, Series, Comb) under each method (Biased, uLSIF, NN, IPS, KLIEP, StratLearn; different colors), using different sets of predictors (panels left to right: 5, 15, and 55 covariates). Bars give the mean ± 2 bootstrap standard errors (from 400 bootstrap samples). Top row: weak covariate shift (following (24)); bottom row: strong covariate shift (following (25)).

Table S9: Composition of the five StratLearn strata for the medium covariate shift scenario (as described in Section 4.3) on SDSS data, using estimated propensity scores with different sets of predictors. The number of galaxies in each source and target stratum, as well as the mean outcome (redshift z), are presented.
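The per-stratum model selection performed by StratLearn Comb, which underlies the automatic choice between ker-NN and Series noted above, can be sketched as follows; `models` and `source_risk` are hypothetical containers holding, for each candidate estimator, its precomputed empirical source risk per stratum:

```python
# Hedged sketch: within each stratum, select the candidate conditional
# density estimator with the smallest empirical source risk, and use it
# for the target galaxies in that stratum.
def combine_per_stratum(models, source_risk, strata):
    """Return {stratum k: argmin over models of source_risk[model][k]}."""
    return {k: min(models, key=lambda m: source_risk[m][k]) for k in strata}
```

Because the selection is made separately in each stratum, a model that is only locally competitive (e.g., Series in the low-dimensional setup) can still contribute where it performs best.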

Fig. 1 :
Fig. 1: StratLearn flow chart. (*Covariate balance and outcome balance are assessed as described in Section 3.3, with a numerical example given in Section 4.2.)

Fig. 3 :
Fig. 3: Comparison of ROC curves for SNIa classification using the updated SPCC data. Here, Biased and uLSIF are identical. Bootstrap AUC standard errors (from 400 bootstrap samples) are given in parentheses.

Fig. 4 :
Fig. 4: Absolute standardized mean differences (SMD) between source and target data, illustrating the balance improvement achieved with StratLearn.

Fig. 5 :
Fig 5 compares the resulting target risk R across models and covariate sets, showing that StratLearn Comb gives the best performance in all three covariate setups. For small covariate space dimension (Fig 5, left panel), StratLearn Comb improves upon StratLearn ker-NN and StratLearn Series, optimizing the source risk in each stratum separately and combining their predictions. In the presence of potential additional confounding covariates (Fig 5, middle and right panels), the performance of the Series methods deteriorates.

Acknowledgements: David van Dyk acknowledges partial support from the UK Engineering and Physical Sciences Research Council [EP/W015080/1]; Roberto Trotta's work was partially supported by STFC in the UK [ST/P000762/1, ST/T000791/1]; and David Stenning acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) [RGPIN-2021-03985]. Finally, van Dyk, Stenning, and Autenrieth acknowledge support from the Marie Skłodowska-Curie RISE Grant [H2020-MSCA-RISE-2019-873089] provided by the European Commission. The authors report there are no competing interests to declare.

Table 1 :
Strata composition on the updated SPCC data (Section 4.2), applying STACCATO (left) and StratLearn (right). The number of SNe, as well as the number and proportion of SNIa, are presented in source and target strata 1 and 2. For conciseness, we present the combined strata 3 to 5, which contain too little source data for a meaningful comparison of the SNIa proportions.

Fig 4 also plots (red) the SMD achieved by STACCATO (Revsbech et al., 2018), which uses two covariates (redshift and brightness, as opposed to the 102 used by StratLearn) in the logistic regression to estimate propensity scores. While STACCATO improves the balance of the majority (69%) of the covariates, most (66%) of the black (StratLearn) SMDs have smaller vertical values, indicating better balance than STACCATO (red).

Table 2 :
Outcome balance diagnostics via predicted labels on the updated SPCC data (Section 4.2), applying STACCATO (left) and StratLearn (right). The number and proportion of predicted SNIa are presented in source and target strata 1 and 2. P-values are computed via Fisher's exact test of independence between predicted SNIa target and source proportions within strata.
, using various importance weighting approaches to adjust for covariate shift.
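The outcome balance diagnostic in Table 2 tests, within each stratum, whether the proportion of predicted SNIa is independent of the domain (source vs. target) using Fisher's exact test. A dependency-free sketch of the two-sided test on a 2x2 table, summing hypergeometric probabilities no larger than the observed one (scipy.stats.fisher_exact computes the same quantity):

```python
# Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]],
# e.g. rows = (source, target), columns = (predicted SNIa, predicted non-Ia).
from math import comb

def fisher_exact_p(a, b, c, d):
    n, r1, c1 = a + b + c + d, a + b, a + c
    def pr(x):  # hypergeometric probability of a table with cell (1,1) = x
        return comb(r1, x) * comb(n - r1, c1 - x) / comb(n, c1)
    p_obs = pr(a)
    lo, hi = max(0, c1 - (n - r1)), min(r1, c1)
    # Sum over all tables at most as probable as the observed one.
    return sum(pr(x) for x in range(lo, hi + 1) if pr(x) <= p_obs * (1 + 1e-12))
```

A large p-value indicates that the predicted-label proportions are balanced between source and target within the stratum.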

Table S5 :
Average (SD) of SMD and KS statistics computed on the observed covariates from the original SPCC data (diffusion map coordinates, redshift, and brightness) on the "raw" data, and in strata built by StratLearn and STACCATO, respectively. Strata 3-5 are not displayed due to a shortage of source data. Smaller values indicate better balance.
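The SMD and KS balance diagnostics summarized in Table S5 can be computed per covariate as sketched below. This is an illustrative numpy implementation; scipy.stats.ks_2samp would give the same KS statistic:

```python
# Per-covariate balance diagnostics between source and target samples.
import numpy as np

def abs_smd(x_src, x_tgt):
    """Absolute standardized mean difference with a pooled SD."""
    pooled_sd = np.sqrt((x_src.var(ddof=1) + x_tgt.var(ddof=1)) / 2)
    return abs(x_src.mean() - x_tgt.mean()) / pooled_sd

def ks_stat(x_src, x_tgt):
    """Two-sample Kolmogorov-Smirnov statistic: sup |F_src - F_tgt|."""
    grid = np.sort(np.r_[x_src, x_tgt])
    F = np.searchsorted(np.sort(x_src), grid, side="right") / len(x_src)
    G = np.searchsorted(np.sort(x_tgt), grid, side="right") / len(x_tgt)
    return np.max(np.abs(F - G))

def balance_summary(X_src, X_tgt):
    """Average (SD) of per-covariate SMD and KS stats, as reported in Table S5."""
    smds = [abs_smd(X_src[:, j], X_tgt[:, j]) for j in range(X_src.shape[1])]
    kss = [ks_stat(X_src[:, j], X_tgt[:, j]) for j in range(X_src.shape[1])]
    return (np.mean(smds), np.std(smds)), (np.mean(kss), np.std(kss))
```

Applied within each stratum, smaller values indicate better source/target balance.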

Table S6 :
Outcome balance diagnostics via predicted labels on the original SPCC data (Section S6), applying STACCATO (left) and StratLearn (right). The number and proportion of predicted SNIa are presented in source and target strata 1 and 2. P-values are computed via Fisher's exact test of independence between predicted SNIa target and source proportions within strata.

Absolute standardized mean differences between source and target data of stratum 1, plotted against raw-data absolute standardized mean differences, for both propensity score (PS) estimation approaches (StratLearn and STACCATO). For visual clarity, one covariate that would appear at the coordinates (1.5, 0.09) is omitted from the corresponding figure.
Following Little et al. (2008), we take 17 of the attributes to be predictive covariates, with the aim of predicting the UPDRS Parkinson's disease symptom score (output). The source data contain all instances with age below 60, giving a source subset of size 1877. Instances with age of at least 60 and below 70 are used as target samples, giving 2127 target instances. As in Chen et al. (2016), instances with age above 70 are not considered. We refer to Little et al. (2008) for further information on the data.
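The age-based source/target split described above can be sketched as follows; `X`, `y`, and `age` are hypothetical arrays holding the features, the UPDRS outcome, and the splitting covariate:

```python
# Construct the covariate-shifted source/target split by age; the age
# covariate itself is then withheld from model fitting, as described above.
import numpy as np

def age_split(X, y, age):
    src = age < 60                      # source: age below 60
    tgt = (age >= 60) & (age < 70)      # target: age in [60, 70)
    return (X[src], y[src]), (X[tgt], y[tgt])  # ages >= 70 are dropped
```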

Table S7 :
Composition of the five StratLearn strata for the UCI Wine and UCI Parkinson data. The number of samples/subjects in each source and target stratum, as well as the mean outcome ("quality score" and "UPDRS score"), are presented.

StratLearn further provides better results than WLS:uLSIF, WLS:KLIEP, and WLS:NN on both data sets; only WLS:IPS leads to slightly better results than StratLearn.

Table S10 :
Composition of the five StratLearn strata for the weak covariate shift (24) and strong covariate shift (25) scenarios on SDSS data, using a set of five predictors for propensity score estimation. The number of galaxies in each source and target stratum, as well as the mean outcome (redshift z), are presented.