Cox regression with linked data

Record linkage is increasingly used, especially in medical studies, to combine data from different databases that refer to the same entities. The linked data can bring analysts novel and valuable knowledge that is impossible to obtain from a single database. However, linkage errors are usually unavoidable, regardless of the record linkage method, and ignoring these errors may lead to biased estimates. While different methods have been developed to deal with linkage errors in the generalized linear model, the Cox regression model has received little attention, although it is one of the most important statistical models in clinical and epidemiological research. In this work, we propose an adjusted estimating equation for secondary Cox regression analysis, where linked data have been prepared by a third‐party operator, and no information on matching variables is available to the analyst. Through a Monte Carlo simulation study, the proposed method is shown to substantially reduce the bias caused by false links in the estimation of the Cox model parameters. An asymptotically unbiased variance estimator for the adjusted estimators of the Cox regression coefficients is also proposed. Finally, the proposed method is applied to a linked database from the Brest stroke registry in France.


INTRODUCTION
Record linkage, also known as data matching, is a process of combining data from different sources that refer to the same individuals or entities. Nowadays, data are collected everywhere by different sectors, and the ability to combine information from several databases can lead to novel knowledge for analysts. For example, record linkage is widely used in epidemiology and medical studies to enrich data on clinical performance and other health-related information. [1,2] In national censuses, population data files obtained at different times can be linked to create longitudinal data sets. [3] Record linkage may also be applied early in a survey to link the sampling frame and administrative data. [4] The linked data allow for statistical analyses (eg, Cox regression) that would not be possible with data collected solely by means of the survey.
The record linkage process is straightforward if unique identifiers (eg, Social Security Number) are available and free of error in both databases. However, this information is often not available, or sometimes cannot be used for ethical reasons. In such cases, record linkage methods may only use partial identifying information shared between databases, such as name, address, and gender. The variables used for comparison are called matching variables. Over the last decades, several methods have been developed to link data efficiently, [5,6] such as the frequentist approach [7-9] and the Bayesian approach. [10,11] However, because the matching variables are not unique and are likely to contain inaccuracies, linkage errors are unavoidable. The two kinds of record linkage errors are false links (false positives, ie, a non-matched pair predicted as a link) and missed links (false negatives, ie, a matched pair that is not predicted as a link). Ignoring these errors may cause substantial bias in the analysis model, [12] leading to misleading inference. It is therefore important to account for linkage errors in statistical analysis.
In the published literature, two positions are usually considered to account for linkage errors in statistical analysis. Under the primary analysis framework, the data analyst is supposed to be granted access to the full linkage process, including knowledge of the matching variables. From this perspective, Scheuren and Winkler [13] made use of the two highest matching weights of each record pair to reduce the bias of ordinary least squares estimators under a linear regression model. However, the proposed estimators are not unbiased in full generality. Lahiri and Larsen [14] discussed this problem and proposed unbiased estimators in the same context, using the posterior matching probabilities obtained from the Fellegi-Sunter record linkage model. Hof and Zwinderman [15] extended the method of Lahiri and Larsen [14] to multiple links, and also proposed alternative estimators based on weighted least squares methods, both for linear and logistic regression models. Recently, Han and Lahiri [16] adapted the approach of Lahiri and Larsen [14] to provide a system of estimating equations, which may lead to unbiased estimators under a generalized linear model.
In some applications, the analysis step is separated from the record linkage, for example, when the matching variables contain confidential information. This is the secondary analysis framework, in which the data analyst is only provided access to the final linked data, whereas the (unknown) record linkage process has been performed by a third-party operator. [17] Starting from this perspective, Chambers [18] proposed the exchangeable linkage error (ELE) model, and bias-corrected estimating equations for both linear and logistic regression modeling. Under the ELE model, it is assumed that linked records may be split into distinct blocks inside which the probability of correct linkage and the probability of incorrect linkage are constant. Following this work, several authors [19-22] developed methods for secondary analysis of linked data. Recently, Zhang and Tuoto [23] proposed a pseudo ordinary least squares method for secondary linkage-data linear regression analysis, which can accommodate heterogeneous linkage errors and incomplete match space problems.
Although the Cox proportional hazard model [24] is of routine use for survival analysis, comparatively few papers have focused on accounting for record linkage errors in this context. Reference 25 reports a simulation study emphasizing the impact of missing matches on the parameter estimation of the Cox model, but does not propose any solution to obtain unbiased estimators of the model parameters. Hof et al [26] proposed a joint model for survival analysis and probabilistic record linkage. However, this analysis model is developed from a primary analysis viewpoint, while in many applications a secondary analysis is more likely. In this work, we reason from the secondary analysis position. We propose a model to account for record linkage errors, and an estimation method to correct for the bias caused by false links in the Cox regression model.
The article is organised as follows. In Section 2, we propose a new estimating equation, which leads to an approximately unbiased estimation of the parameters of the Cox model with linked data. A variance estimator is also proposed. In Section 3, we evaluate the proposed estimator and the associated variance estimator through simulation studies. In Section 4, an application to a real dataset is presented. Finally, possible further research is discussed in Section 5.

Cox regression model
The Cox proportional hazard model [24] is the most popular method to assess the effect of covariates $X$ on a survival time, and is therefore one of the most important models in medical research. Suppose that a random sample of $n$ units is available. For each unit $i = 1, \ldots, n$, we let $\tilde T_i$ be a non-negative random variable, which denotes the duration between a time origin and the time of occurrence of some event of interest. We suppose that $\tilde T_i$ is right censored, which means that the event is observed only if it occurs before the censoring time $C_i$. For units $i = 1, \ldots, n$, we therefore observe $T_i = \min(\tilde T_i, C_i)$.
We let $\delta_i = 1\{\tilde T_i \le C_i\}$ denote the variable indicating whether the duration time is observed prior to censoring. The vector of covariates is denoted as $X_i = (X_{i1}, \ldots, X_{ip})^{\top}$. In this section, we first suppose that $X_i$ is observed for any unit in the sample.
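According to the Cox model, the hazard function of an event at time $t$ is given by

$$\lambda(t \mid X_i) = \lambda_0(t) \exp(\beta_0^{\top} X_i), \qquad (1)$$

where $\beta_0 = (\beta_{01}, \ldots, \beta_{0p})^{\top}$ is a p-vector of unknown parameters and $\lambda_0(t)$ is a common baseline hazard function. Assuming that the survival times are observed on a finite interval, and that $C$ is independent of $\tilde T$ conditionally on $X$, a consistent estimator $\hat\beta$ of $\beta_0$ may be obtained by solving the estimating equation:

$$H_0(\beta) \equiv \frac{1}{n} \sum_{i=1}^{n} \delta_i \left\{ X_i - \frac{\sum_{j=1}^{n} Y_j(T_i) \exp(\beta^{\top} X_j) X_j}{\sum_{j=1}^{n} Y_j(T_i) \exp(\beta^{\top} X_j)} \right\} = 0, \qquad (2)$$

where $Y_j(t) = 1(T_j \ge t)$ is an at-risk indicator. [27] We call (2) the theoretical estimating equation. Its solution is also the maximum partial likelihood (mpl) estimator. Under some mild assumptions, a consistent estimator of the covariance matrix of $\hat\beta$, denoted $\hat V_{mpl}(\hat\beta)$, is given in Reference 27; we refer to it as Equation (3).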
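As an illustration of the theoretical estimating equation (2), the following minimal Python sketch evaluates the partial-likelihood score for a given value of the parameter. It is not the authors' R code; the function name `cox_score` and its argument names are hypothetical, ties in the observed times are not given any special treatment, and the constant factor 1/n is omitted since it does not change the root of the equation.

```python
import numpy as np

def cox_score(beta, time, event, X):
    """Partial-likelihood score of the Cox model (up to a constant factor).

    For every observed event i, adds X_i minus the weighted mean of the
    covariates over the units still at risk at time T_i, with weights
    exp(beta^T X_j).
    """
    beta = np.asarray(beta, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    X = np.asarray(X, dtype=float)

    w = np.exp(X @ beta)                      # exp(beta^T X_j), shape (n,)
    # at_risk[i, j] = 1 if unit j is still at risk at time T_i (T_j >= T_i)
    at_risk = (time[None, :] >= time[:, None]).astype(float)
    s0 = at_risk @ w                          # sum_j Y_j(T_i) exp(beta^T X_j)
    s1 = at_risk @ (w[:, None] * X)           # sum_j Y_j(T_i) exp(beta^T X_j) X_j
    return (event[:, None] * (X - s1 / s0[:, None])).sum(axis=0)
```

Replacing the true covariates by linked covariates in the call gives the score of a naive analysis that ignores linkage errors.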

Linkage error model
Suppose that we have a dataset A of $n_A$ time-to-event records. If the covariates $X_i$ were known for any unit $i \in A$, the parameter of the Cox model would be estimated by solving the theoretical estimating Equation (2). However, if the covariates are not known in database A, Equation (2) cannot be solved in practice.
In order to obtain the needed covariates, a linkage is performed with a dataset B of size $n_B \ge n_A$, containing in particular the auxiliary variables $X_i$. For any unit $i$ in A, we write $Z_i$ for the vector of auxiliary values resulting from the linkage process. Reasoning from the secondary analysis perspective, we do not have access to the matching variables and do not know the actual linkage process.
We assume that the linkage error is non-informative of the regression model, that is, it may depend on the errors in the matching process, but not on the model covariates or on the survival time. [28] This is the key assumption of most secondary analysis approaches in the literature, for which Zhang and Tuoto [23] have proposed a diagnostic test. Adopting the modelling approach of Copas and Hilton, [29] we suppose that both databases are partitioned into blocks $A_v$ and $B_v$, $v = 1, \ldots, V$, and that the record linkage is performed independently in these blocks. Also, we suppose that for any entity $i \in A_v$, we have:

$$Z_i = \begin{cases} X_i & \text{with probability } \alpha_v, \\ X_{(j)} & \text{with probability } 1 - \alpha_v, \end{cases} \qquad (4)$$

where $(j)$ stands for some unit randomly selected in database $B_v$. In other words, it is supposed that for any $i \in A_v$, the correct entity is linked to $i$ with probability $\alpha_v$; otherwise the unit $j$ linked to $i$ is randomly selected in $B_v$. In practice, the correct linkage probability may differ between record pairs within a block. However, most relevant approaches for secondary analysis seem to be robust to the failure of this assumption when the linkage errors are non-informative. [23]

It should be noted that we implicitly assume that A is a subset of B, and that all entities in A can therefore have a matching record in B. Also, we assume that there is at most one link for each record of both databases. In practice, there will often be some entities of A which remain unlinked after the linkage process. This may be due to errors in the matching variables, or to the fact that they are not sufficiently discriminating for identifying links. Such incomplete record linkage can be problematic for further analysis if the missed links are not at random. [25] This incomplete matching space problem has been discussed in the literature, [19,23,30] but it is out of the scope of our work. We therefore assume that the linkage is complete, or alternatively that any missing links are independent of the time of event and of the model covariates.
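The hit-miss model (4) can be mimicked in a few lines. The sketch below is a hypothetical illustration for a single block, under the assumptions that the units of A are the first rows of the covariate matrix of B and that the random unit is drawn uniformly over the block; the function name `hit_miss_link` is not from the paper. As a simplification, the one-to-one constraint on links is not enforced, and a random draw may select the correct row by chance.

```python
import numpy as np

def hit_miss_link(X_B, n_A, alpha, rng):
    """Simulate linked covariates Z for the first n_A units of one block.

    With probability alpha the true row of X_B is linked (a "hit");
    otherwise a row of the block is drawn uniformly at random (a "miss").
    Returns the linked covariates and the boolean hit indicator.
    """
    X_B = np.asarray(X_B, dtype=float)
    n_B = X_B.shape[0]
    hit = rng.random(n_A) < alpha                 # True when the link is correct
    random_rows = rng.integers(0, n_B, size=n_A)  # candidate rows for the misses
    rows = np.where(hit, np.arange(n_A), random_rows)
    return X_B[rows], hit
```

The returned indicator would not be available to a secondary analyst; it is kept here only so that a simulation can build an audit sample.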

Adjusted estimating equation
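By naively treating the linked covariates $Z_i$ as if they were the true covariates $X_i$ for the units $i \in A$, an estimator of $\beta_0$ may be obtained by solving the following equation:

$$\frac{1}{n_A} \sum_{i \in A} \delta_i \left\{ Z_i - \frac{\sum_{j \in A} Y_j(T_i) \exp(\beta^{\top} Z_j) Z_j}{\sum_{j \in A} Y_j(T_i) \exp(\beta^{\top} Z_j)} \right\} = 0. \qquad (5)$$

We call (5) the naive estimating equation. Since some units are incorrectly linked, it may lead to biased estimates; see the simulation results in Section 3.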
We propose a bias-corrected estimating equation, accounting for the fact that, under the hit-miss model (4), the covariates may be incorrectly linked. We first introduce some notation. Let $\bar X_{B_v}$, $\bar g_{B_v}(\beta)$ and $\bar h_{B_v}(\beta)$ denote the means of $X_i$, $g(\beta, X_i)$ and $h(\beta, X_i)$ over $B_v$, respectively. The linkage-error adjusted estimating equation (AEE) is given by Equation (6), $H(\beta) = 0$, in which the adjusted terms are defined for any $i \in A_v$ in Equation (7). We prove in Appendix A that $H(\beta)$ is an (approximately) conditionally unbiased estimator for the function $H_0(\beta)$ involved in the theoretical estimating equation. Solving the proposed AEE therefore leads to a consistent estimator of $\beta_0$; see the simulation results in Section 3.
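To see why a correction based on block means is possible, it may help to look at the covariate term alone. The short derivation below is a sketch under the hit-miss model (4), with the added assumption that the random unit $(j)$ is drawn uniformly over $B_v$; it illustrates the principle and is not claimed to be the exact form of the adjustment terms in (7).

```latex
% Conditional expectation of a linked covariate under the hit-miss model (4),
% assuming the random unit (j) is drawn uniformly over B_v
E\left( Z_i \mid X \right)
  = \alpha_v X_i + (1 - \alpha_v) \bar X_{B_v},
  \qquad i \in A_v .

% Solving for X_i gives a conditionally unbiased substitute for the
% unobserved true covariate
E\left\{ \frac{Z_i - (1 - \alpha_v) \bar X_{B_v}}{\alpha_v} \,\Big|\, X \right\}
  = X_i .
```

The same argument applies to any function of the linked covariates, for instance with $g(\beta, Z_i)$ and the block mean $\bar g_{B_v}(\beta)$ in place of $Z_i$ and $\bar X_{B_v}$.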
Since there is no closed-form solution for the estimating equations considered above, an iterative method like the Newton-Raphson algorithm is commonly used in practice. Also, the probabilities $\alpha_v$ may be (somewhat arbitrarily) specified by the record linkage practitioner, or estimated from a validation sample [18,23] if their true values are unknown.
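A minimal sketch of such a Newton-Raphson iteration is given below for the standard Cox score equation, that is, for the theoretical or naive estimating equation rather than the AEE. The function name `cox_newton_raphson`, the tolerance and the convergence flag are choices of this illustration, not the authors' implementation; ties in the observed times are not handled specifically.

```python
import numpy as np

def cox_newton_raphson(time, event, X, max_iter=20, tol=1e-8):
    """Solve the Cox score equation by Newton-Raphson, starting from beta = 0.

    Returns the final parameter value and a flag telling whether the
    step size fell below `tol` within `max_iter` iterations.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    # at_risk[i, j] = 1 if unit j is still at risk at time T_i
    at_risk = (time[None, :] >= time[:, None]).astype(float)
    beta = np.zeros(p)
    for _ in range(max_iter):
        w = np.exp(X @ beta)
        s0 = at_risk @ w
        s1 = at_risk @ (w[:, None] * X)
        xbar = s1 / s0[:, None]                      # risk-set weighted means
        score = (event[:, None] * (X - xbar)).sum(axis=0)
        # s2[i] = sum_j Y_j(T_i) exp(beta^T X_j) X_j X_j^T
        s2 = np.einsum("ij,j,jk,jl->ikl", at_risk, w, X, X)
        outer = np.einsum("ik,il->ikl", xbar, xbar)
        info = (event[:, None, None] * (s2 / s0[:, None, None] - outer)).sum(axis=0)
        step = np.linalg.solve(info, score)          # Newton step
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            return beta, True
    return beta, False
```

Returning the convergence flag makes it straightforward to count the runs in which the algorithm fails to converge.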

Variance estimator
In this section, we discuss variance estimation for the estimator of the parameter $\beta_0$ obtained by solving the AEE given in (6). We first note that several sources of variance need to be accounted for: (a) the (usual) variability associated with solving a sample-based estimating equation, (b) the variability associated with the linkage process, and (c) the variability associated with the estimation of the probabilities $\alpha_v$, $v = 1, \ldots, V$. Using the variance estimator given in (3) fails to account for all these sources of variability, and therefore leads to an underestimation of the variance; see the simulation results in Section 3.
We propose a sandwich-like variance estimator $\hat V(\hat\beta)$ with two components, see Equation (9). The first component $\hat V_1\{H(\beta_0)\}$ in (9) accounts for the variability in (c). Under the assumption that the validation samples $S_v$ used for such estimation are selected in the datasets $A_v$ through simple random sampling without replacement, this variance estimator depends on $n_{S_v}$, the sample size of the validation set $S_v$.
The second component $\hat V_2\{H(\beta_0)\}$ in (9) accounts for both the variability in (a) and (b). The derivation of this variance estimator is explained in detail in Appendix B. It is evaluated empirically in the next section through a simulation study.

A SIMULATION STUDY
In this section, we evaluate the performance of the proposed estimator for the parameter of the Cox model, and the associated variance estimator. The data generation process is first presented in Section 3.1. The estimation methods that we evaluate are presented in Section 3.2, along with the performance indicators. The simulation results are given in Section 3.3. To facilitate interpretation and to study the influence of different simulation parameters, we first consider in Section 3.3.1 scenarios with a single block. Scenarios with multiple blocks and different levels of linkage quality are considered in Section 3.3.2.

Data generation
Assume that there are two datasets: A with $n_A$ individuals, and B with $n_B \ge n_A$ individuals. We first generate the $n_B$ units in database B with $p = 2$ covariates, including a continuous variable $X_1 \sim \mathcal{N}(0, 1)$ and a binary variable $X_2 \sim \text{Bernoulli}(0.7)$. Given the p-vector of coefficients $\beta = (\beta_1, \beta_2)^{\top} = (0.5, -0.5)^{\top}$, the true survival time $\tilde T^B$ is generated as

$$\tilde T^B = -\frac{\log(U)}{\lambda \exp(\beta^{\top} X)}, \qquad (10)$$

where $U$ follows a standard uniform distribution, [31] and $\lambda$ is fixed as equal to 1 for simplicity. A constant censoring time is chosen (from 100 000 independent data generation runs) to yield a censoring rate of approximately 0.25 over all the simulation runs. Without loss of generality, we suppose that the units in dataset A are the first $n_A$ units in dataset B. In other words, a pair of individuals $(a_i, b_j)$ for $i \in A$ and $j \in B$ is a match if $i = j = 1, \ldots, n_A$. The survival times $T_i^A$ for $i \in A$ are therefore obtained as $T_i^A = T_i^B$, $i = 1, \ldots, n_A$. Given the value of $\alpha$, the linked values $Z$ for covariates in database A are obtained according to the linkage error model (4).
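The data generation step can be sketched as follows, assuming an exponential baseline hazard with rate `lam` and inverse-transform sampling of the survival times. The function name `generate_block` is hypothetical, and the constant censoring time is taken here as an in-sample quantile of the generated times, which is a simplification of the calibration over 100 000 independent runs described above.

```python
import numpy as np

def generate_block(n_B, beta, lam=1.0, censor_rate=0.25, rng=None):
    """Generate observed times, event indicators and covariates for one block.

    Covariates: X1 ~ N(0, 1) and X2 ~ Bernoulli(0.7).
    True times: T = -log(U) / (lam * exp(beta^T X)) with U uniform.
    Censoring: a single constant time, set to the (1 - censor_rate)
    quantile of the generated true times (simplifying assumption).
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.column_stack([rng.normal(0.0, 1.0, n_B),
                         rng.binomial(1, 0.7, n_B)])
    U = 1.0 - rng.random(n_B)                       # uniform on (0, 1]
    T_true = -np.log(U) / (lam * np.exp(X @ np.asarray(beta, dtype=float)))
    c = np.quantile(T_true, 1.0 - censor_rate)      # constant censoring time
    T_obs = np.minimum(T_true, c)
    delta = (T_true <= c).astype(int)               # 1 if the event is observed
    return T_obs, delta, X
```

The first rows of the output can then play the role of dataset A, while all rows of the covariate matrix belong to dataset B.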
If there are multiple blocks, the data for each block are generated independently as follows. Firstly, for each block $v$, we generate $n_{B_v}$ observations $(T, \delta, X)$ from the Cox model described in Equation (10). Note that the value of the true parameter $\beta$ and the distribution of $X$ are the same over blocks $v$. Then, we randomly choose $n_{A_v} \le n_{B_v}$ survival times $(T, \delta)$ for block $A_v$. All generated $n_{B_v}$ values of $X$ are placed in block $B_v$. Secondly, given the value of $\alpha_v$ for block $v$, $n_{A_v}$ linked values $Z$ for block $A_v$ are obtained by the linkage error model (4). Inside each block $A_v$, an audit sample of 10% of the units is selected by simple random sampling without replacement, and the proportion of correct links in the audit sample is used as the estimator $\hat\alpha_v$.
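The audit step can be sketched as below. The function name `estimate_alpha_from_audit` is hypothetical. The variance returned is the textbook estimator for a proportion under simple random sampling without replacement; it is the variance of the estimated link probability itself, not the component of the variance estimator for the Cox coefficients.

```python
import numpy as np

def estimate_alpha_from_audit(is_correct_link, audit_frac=0.10, rng=None):
    """Estimate the correct-link probability of one block from an audit sample.

    is_correct_link: boolean array over the units of the block; only the
    entries selected in the audit sample are used.
    """
    rng = np.random.default_rng() if rng is None else rng
    is_correct_link = np.asarray(is_correct_link, dtype=bool)
    n_A = is_correct_link.size
    n_S = max(2, int(round(audit_frac * n_A)))
    # simple random sampling without replacement
    audit_idx = rng.choice(n_A, size=n_S, replace=False)
    alpha_hat = is_correct_link[audit_idx].mean()
    # standard variance estimator of a proportion, with finite population correction
    var_hat = (1.0 - n_S / n_A) * alpha_hat * (1.0 - alpha_hat) / (n_S - 1)
    return alpha_hat, var_hat, audit_idx
```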

Methods and performance indicators
For each scenario, we consider the following estimation methods. The Theoretical estimator is obtained by solving the theoretical estimating Equation (2) with the true values of the covariates $X$. This is a benchmark estimation strategy, since it cannot be applied to linked data in practice. The Naive estimator is obtained by solving the naive estimating Equation (5) with linked data. The Validation estimator is obtained by solving the theoretical estimating Equation (2) with only the correctly linked pairs in the validation set. Note that, contrary to Theoretical, this method may be used in practice if an audit sample is available. For each of these three methods, the variance of the estimator of the parameter in the Cox model is estimated by using the variance estimator $\hat V_{mpl}(\hat\beta)$ in Equation (3), implemented by means of the R survival package.
For each scenario, we also consider estimation methods making use of the proposed approach. The TAEE (theoretical adjusted estimating equation) estimator is obtained by solving the proposed estimating Equation (6) with the theoretical value of $\alpha_v$. The AEE (adjusted estimating equation) estimator is obtained by solving the proposed estimating Equation (6), where $\alpha_v$ is estimated by taking the proportion of correct links in the audit sample. For each method, the Newton-Raphson algorithm is applied with a maximum of 20 iterations and an initial parameter value $\beta = (0, 0)^{\top}$. We also report the number of times (Fails) the Newton-Raphson algorithm does not converge. For AEE, the variance is estimated by using $\hat V(\hat\beta)$ in Equation (B18). For TAEE, the variance is estimated by setting $\hat V_1\{H(\beta_0)\} = 0$ in $\hat V(\hat\beta)$. For both TAEE and AEE, we also compare to the variance estimator $\hat V_{mpl}(\hat\beta)$ in Equation (3).
The data generation and the estimation process are repeated R = 1,000 times. Over these simulations, we compare the estimation methods in terms of the Monte Carlo bias

$$B_{MC}(\hat\beta) = \frac{1}{R} \sum_{r=1}^{R} \hat\beta^{(r)} - \beta,$$

with $\hat\beta^{(r)}$ the estimator computed on the r-th sample. We also compute the Monte Carlo standard deviation $Sd_{MC}(\hat\beta)$ of the estimates $\hat\beta^{(r)}$. For the variance estimation methods, we compute the Monte Carlo estimate of the standard deviation from the variance estimators computed on each of the R samples. This Monte Carlo estimate of the standard deviation is compared to the true standard deviation $Sd(\hat\beta)$, approximated by $Sd_{MC}(\hat\beta)$.
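These performance indicators can be computed as in the following sketch. The function name `monte_carlo_summary` is hypothetical, and the divisor R - 1 used for the standard deviation is an assumption of this illustration.

```python
import numpy as np

def monte_carlo_summary(estimates, true_beta):
    """Monte Carlo bias and standard deviation of replicated estimates.

    estimates: array of shape (R, p), one row per simulated sample.
    true_beta: array of shape (p,), the value used to generate the data.
    """
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean(axis=0) - np.asarray(true_beta, dtype=float)
    sd = estimates.std(axis=0, ddof=1)   # divisor R - 1 (assumption)
    return bias, sd
```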

Simulation results

One block situation
In this section, we consider the situation when the data sets are generated as presented in Section 3.1, with V = 1 block only. We consider two cases. In the first one, the sample sizes $n_A = 1,000$ and $n_B = 2,000$ are held fixed, and we let the probability of correct linkage $\alpha$ vary. In the second one, the sample size varies. The simulation results obtained in Case 1 are presented in Table 1. As expected, the Theoretical method leads to an unbiased estimation of the parameters. The Naive method leads to severely biased estimators, especially with the smaller value $\alpha = 0.75$. The bias ranges from 0.029 to 0.147, corresponding to an absolute relative bias between 5.8% and 29.0%. This bias decreases as the probability of correct link increases, as expected. The proposed methods TAEE and AEE lead to approximately unbiased estimation of the parameters, with a larger variability for AEE as expected. The bias under AEE ranges from 0.000 to 0.015, corresponding to a reduction of the relative bias (as compared to Naive) ranging between 5.0% and 27.6%. We note that the variability under both TAEE and AEE is only moderately increased, as compared to Theoretical. The Validation method also leads to unbiased estimators of the Cox regression coefficients, but with a larger variability than both TAEE and AEE.
We now turn to the variance estimators. The variance estimator $\hat V_{mpl}(\hat\beta)$ in (3) performs well for Theoretical, Naive and Validation, but underestimates the variability of the estimators obtained under TAEE and AEE. This is due to the fact that this variance estimator only accounts for the variability of the sample-based estimating equation. On the other hand, the proposed variance estimator performs well, except for $\beta_1$ when $\alpha = 0.75$, in which case the variance is underestimated. We have also computed coverage probabilities (CP) for normality-based confidence intervals with a nominal coverage of 95%. We note that the coverage probability is far below the nominal level for Naive, even in situations when the bias is moderate.
The simulation results obtained in Case 2 are presented in Table 2. We observe no qualitative difference compared to Case 1. The TAEE and AEE lead to almost unbiased estimations of the regression coefficients, and the proposed variance estimator performs well for both methods. The bias obtained under the Naive method does not decrease as the sample size increases. As could be expected, the variability obtained under any estimation method decreases as the sample size increases.
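The coverage probability of normality-based confidence intervals can be computed over the simulation runs as in this minimal sketch; the function name `coverage_probability` and its arguments are hypothetical.

```python
import numpy as np

def coverage_probability(estimates, variances, true_value, z=1.959964):
    """Share of runs whose interval estimate +/- z * se contains the true value.

    estimates, variances: arrays of length R for one coefficient.
    z: standard normal quantile; the default corresponds to a 95% interval.
    """
    est = np.asarray(estimates, dtype=float)
    se = np.sqrt(np.asarray(variances, dtype=float))
    covered = (est - z * se <= true_value) & (true_value <= est + z * se)
    return covered.mean()
```

A biased estimator shifts every interval in the same direction, which is why coverage can fall well below the nominal level even when the variance is estimated correctly.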

Multiple blocks
In this section, we consider the situation when the data sets are generated as presented in Section 3.1, with V = 3 blocks. We take block sizes (500, 1000, 500). Also, we consider a first scenario where $(\alpha_1, \alpha_2, \alpha_3) = (0.6, 0.7, 0.8)$; a second scenario where $(\alpha_1, \alpha_2, \alpha_3) = (0.7, 0.8, 0.9)$; a third scenario where $(\alpha_1, \alpha_2, \alpha_3) = (0.8, 0.9, 1.0)$. Let $\bar\alpha$ be the average of $\alpha_1, \ldots, \alpha_V$ weighted by the block sizes. This leads to a percentage of correct links approximately equal to $\bar\alpha = 70\%$ in Scenario 1, $\bar\alpha = 80\%$ in Scenario 2 and $\bar\alpha = 90\%$ in Scenario 3. In this context, we also consider two additional versions of our proposed methods, for the case where the value $\alpha_v$ of each block is not accessible and only the weighted average is available: TAEE-$\bar\alpha$, where the TAEE is used with V = 1 and the true value of $\bar\alpha$, and AEE-$\bar\alpha$, where the AEE is used with V = 1 and an estimated value of $\bar\alpha$.
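The overall link rates of the three scenarios can be checked with a short computation, assuming that the weights are proportional to the block sizes (500, 1000, 500); the function name `weighted_alpha` is hypothetical.

```python
def weighted_alpha(block_sizes, alphas):
    """Average of the block-specific link probabilities, weighted by block size."""
    total = sum(block_sizes)
    return sum(n * a for n, a in zip(block_sizes, alphas)) / total


sizes = [500, 1000, 500]
scenario_1 = weighted_alpha(sizes, [0.6, 0.7, 0.8])   # about 0.70
scenario_2 = weighted_alpha(sizes, [0.7, 0.8, 0.9])   # about 0.80
scenario_3 = weighted_alpha(sizes, [0.8, 0.9, 1.0])   # about 0.90
```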
The simulation results are presented in Table 3, and confirm the good results of the proposed methods observed in the situation of one block. Scenarios 1 and 2 are the cases when the behaviour of the Naive method is particularly poor, with a very large bias due to a larger number of false links, and very poor coverage for the confidence intervals. On the other hand, AEE performs well in reducing the estimation bias even in this situation. The proposed variance estimator also performs well in these cases. The standard errors of the TAEE and AEE estimators decrease as $\bar\alpha$ increases, that is, by going from Scenario 1 to Scenario 3 in Table 3. As explained in Section 2.4, there are three sources of variance in the estimation process: (a) the variability associated with solving a sample-based estimating equation, (b) the variability associated with the linkage process, and (c) the variability associated with the estimation of the probabilities $\alpha_v$. Since the sample size is kept constant, the term (a) is likely not affected by the value of $\alpha_v$. The term (b) decreases as $\alpha_v$ increases, as the variance in the hit-miss model (4) does so. The term (c) also decreases as $\alpha_v$ increases, as illustrated by the fact that $\hat V_1$ depends on $(1 - \alpha)/\alpha^3$, which is decreasing as $\alpha \to 1$. Concerning the coverage probability of normality-based confidence intervals, we note that the nominal level is well respected under the proposed methods, although the confidence intervals are slightly conservative when the variance estimators are so.
When the block-specific true link rate is not correlated with the block-specific distribution of $T$ and $X$, as in this multiple-block simulation setup, a single-$\bar\alpha$ adjustment (TAEE-$\bar\alpha$ and AEE-$\bar\alpha$) can still perform well. The main result in Table 3 concerning AEE and AEE-$\bar\alpha$ is that they both lead to virtually unbiased estimators. The bias is indeed always smaller with AEE-$\bar\alpha$, but the difference is no greater than 0.009, which is very small as compared to the value of the parameters ($\beta_1 = 0.5$ and $\beta_2 = -0.5$). Closeness between TAEE and TAEE-$\bar\alpha$ confirms the somewhat favourable simulation setup of non-informative linkage error. The reduced bias of AEE-$\bar\alpha$ compared to AEE-$\alpha_v$ (the block-specific AEE) may be due to the non-linearity of the adjustment, such that the additional variance of the AEE-$\alpha_v$ adjustment is manifested in terms of the bias of the adjustment. Moreover, a single-$\bar\alpha$ adjustment can provide a smaller variance. In practice, this is very helpful when the analyst cannot conduct auditing, and when the linker can only provide a single overall estimate of $\alpha$. However, the linkage error may be informative, such as when $\alpha$ and $\beta$ vary across the blocks in a correlated manner. Block-specific adjustment would then be clearly more helpful at reducing the bias than adjustment by a single $\bar\alpha$.

Some additional simulation results are presented in the supplementary material. In particular, we have studied the situation when the non-informative assumption is not true. The simulation results in Tables S1 and S2 indicate that the Cox parameters estimated under TAEE and AEE will be more biased when $\alpha$ is dependent on variables from the Cox model. This is especially true when $\alpha$ is small and dependent on $T$ (see Table S2). This emphasizes the importance of the non-informative linkage error assumption. We note, however, that the proposed methods still perform better in this case than the Naive method. We have also performed a sensitivity analysis, evaluating the performance of TAEE with incorrect values for the parameter $\alpha$. The results are presented in Table S3. As could be expected, the bias in the estimated parameters increases with the error in $\alpha$, but the estimator remains less biased than with the Naive method if the error is moderate.

Data description
The proposed model is fitted to a linked dataset between a registry of strokes, denoted by AVC ("Accident Vasculaire Cérébral"), and an extraction of the national health information system of France, denoted by SNDS ("Système national des données de santé"). The AVC recorded all stroke cases of patients aged 15 years and older who have lived in the Brest area. The SNDS is an extraction from the French health information system, and contains patients for whom at least one medical service or hospitalization has been recorded since 2008 while they were living in the Brest area. Due to the limited information in the registry, there is a demand for linking AVC and SNDS to enrich the registry for further analyses.
The linkage was performed by a separate team, and due to confidentiality restrictions, we were not allowed to access the matching data and have limited knowledge about the linkage. A deterministic record linkage method was used. This is the simpler linkage approach, which ideally requires agreement on all matching variables, or otherwise on a (large) subset of these variables. In the linkage process, there are nine matching variables, and the linkage is implemented sequentially. In the first step, it is required that the nine matching variables agree for a pair to be viewed as a link. The corresponding pairs are then removed, and among the remaining ones it is required that eight matching variables agree for a pair to be viewed as a link. The procedure continues similarly. The process is summarized in Table 4.
After performing the linkage process, a dataset of 3,535 patients has been obtained. It contains the survival time, the censoring indicator and three covariates (age, gender, type of stroke). We suppose that these covariates were obtained from SNDS by the linkage process, and may therefore be affected by linkage errors. A description of the dataset is presented in Table 5. In this application, we are interested in comparing the risk of death after the first stroke between males and females, taking into account the age and the type of stroke.
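A sequential deterministic linkage of this kind can be sketched as follows. This is a toy illustration, not the procedure actually used by the linkage team: the function name `sequential_deterministic_link` is hypothetical, records are plain tuples of matching variables, and when several candidates qualify the first one in file order is taken.

```python
def sequential_deterministic_link(records_a, records_b, min_agree):
    """Link two files step by step, relaxing the agreement requirement.

    records_a, records_b: lists of equal-length tuples of matching variables.
    Step 1 requires agreement on all variables; each later step requires
    one variable fewer, down to `min_agree`. Linked records are removed
    before the next step, so each record is linked at most once.
    Returns a list of (index in A, index in B, number of agreeing variables).
    """
    n_vars = len(records_a[0])
    free_a = set(range(len(records_a)))
    free_b = set(range(len(records_b)))
    links = []
    for required in range(n_vars, min_agree - 1, -1):
        for i in sorted(free_a):
            for j in sorted(free_b):
                agree = sum(x == y for x, y in zip(records_a[i], records_b[j]))
                if agree >= required:
                    links.append((i, j, agree))
                    free_a.discard(i)
                    free_b.discard(j)
                    break   # record i is linked; move to the next record of A
    return links
```

The number of agreeing variables stored with each link, divided by the number of matching variables, gives a simple agreement rate for each pair.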

Cox regression analysis
In this application, we use the Cox regression model (1) to model the relationship between the survival time and three explanatory variables (age, gender, type of stroke). We consider AVC as database A and SNDS as database B in our proposed model. In the naive approach, we use the linked data as if they were directly observed. However, the simulation results in Section 3.3 show that linkage errors lead to biased estimators of the regression coefficients. Therefore, we also use the adjusted estimating Equation (6).
For the record pairs obtained at each step, the percentage of matching variables which are in agreement is seen as a proxy of the probability that the matching is correct. For example, for the 1,500 pairs obtained at step 4, the probability that the matching is correct is estimated as 6∕9 = 0.667. We suppose that the linked dataset is comprised of two blocks, so as to avoid the possibility of dependency between the linkage processes performed in the different blocks. The estimates of $\alpha_v$ for each block $v$ are obtained as follows:
• Block 2: 1,743 remaining record pairs, with α2 = 170

Besides, because the covariates are not available for every unit in the SNDS, the adjustment terms in (7) cannot be computed, since the proposed approach requires full access to the set of covariates in database B. We therefore use the proxy solution suggested in Equation (C1), which requires that the covariates are known on the linked dataset only. Simulations in Appendix C show that if the database A may be seen as a random sample from the database B, or when the sampling leading to A is independent of the covariates, this method leads to results comparable to those of the method proposed in Section 2.3.

TABLE 6 Estimated coefficients (coef), estimated standard deviation of the estimated coefficients (SD), and the hazard ratio (hr = exp(coef)) of the naive method and the AEE method from linked data.
In Table 6, we present the estimates arising from both the Naive and the AEE methods. The two methods lead to clearly different estimates. If the Naive method is used, the hazard ratio of sex is 0.887, which means that, given the same age and the same type of stroke, a female's risk of death after the first stroke is 0.887 times that of a male. By contrast, this ratio under the adjusted estimating equation approach is 0.865.

DISCUSSION
In this work, our simulations showed that the naive use of linked data may lead to substantial bias in a Cox regression model. Therefore, under the secondary analysis position where the analyst can access the linked data only, we have proposed an adjusted estimating equation for linked data, which can correct the bias of the naive estimating equation.
A variance estimator which can capture three sources of variability has also been proposed. However, proving the asymptotic normality of the resulting estimators remains challenging. Through various simulation scenarios with one block and also multiple blocks, the proposed adjusted estimating equation is shown to lead to substantial bias reductions as compared to the naive estimating equation. Additional simulations studying the non-informative linkage assumption, as well as a sensitivity analysis with respect to $\alpha$, are also presented in the Supplementary material.
In our modeling, it is assumed that the probability of correct linkage is identical within blocks.The assumption may not be completely realistic, since the linkage errors vary over different individuals.However, the detrimental effect may be limited as long as the linkage errors are non-informative, as shown both by simulation and with real-life linkage data in Reference 23.It would be desirable to study modelling approaches that allow for individual linkage errors, as well as to carry out empirical validations of adjustment methods derived under the simplified assumption of linkage errors.A key to both would be the possibility to work with the data linkers directly on such problems.
We have also proposed different variants of the approach for scenarios where information is limited.For example, when the block-specific linkage rate  v is not available for each block, our method still works well by using the average true link rate .If the analysts are not able to fully access the covariates in database B, we proposed to use the adjustments in (C1) in the Appendix, which still maintain the good performance of the AEE if A is a random sample from B. Detailed simulation results are presented in Table S4 of the Supplementary material.In addition, a linear approximated estimating equation (LAEE), which can provide better estimation than AEE with small sample size, is given in Table S5 of the Supplementary material.
Although the proposed method improves on naive estimation, several extensions remain to be developed. In this work, we assumed that observations on survival time are already available and that all explanatory variables are obtained from another database. In practice, some of the covariates may already be available in A, with only the remainder acquired from B by linkage. In addition, the covariates may be obtained from several sources with different linkage processes. The proposed model should be extended to handle these cases.
We also supposed that the survival time and the censoring indicator are observed in database A, while the explanatory variables are obtained from database B by a linkage process. However, the opposite situation may occur in practice: the covariates may be available for the units in A, while the survival time needs to be obtained from another database B by a linkage process. The proposed adjustment in Equation (6) only accounts for the error associated with Z_i. If T_i and the censoring indicator are linked from dataset B, they are prone to linkage errors, which need to be accounted for in modifying the estimating equation. This requires a different adjustment approach.

DATA AVAILABILITY STATEMENT
Our R programs for the simulation results are available on GitHub, https://github.com/thanhhuanVO/Cox-regression-withlinked-data. The data used in the application of the methods may be obtained from a third party and are not publicly available. For all interested researchers, data are available via SNDS, https://www.snds.gouv.fr/SNDS/Accueil, subject to the authorization of CNIL (National Commission on Informatics and Liberty of France). The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.

ORCID
Thanh Huan Vo https://orcid.org/0000-0001-8764-9256

denote the information related to the duration times and censoring indicators for the units in A, and to the true values of covariates for all the units in B. We have

For each i ∈ A_v and j ∈ B_v, let l_ij be an indicator equal to 1 if unit i and j are linked, and to 0 otherwise. Then for each i ∈ A_v, we have

Under the non-informative assumption for the linkage process, we obtain from the hit-miss model (4) that

From Equation (7) and under the non-informative linkage assumption, we have

By using a first order Taylor approximation, we have up to negligible factors of order O_p(n_A^{-1}):

where

Similarly:

Therefore,

By plugging (A3) and (A5) into (A2), we obtain

APPENDIX B. VARIANCE ESTIMATION FOR THE PROPOSED ADJUSTED ESTIMATOR
In this appendix, the derivation of the variance estimator is explained. For simplicity, we focus on the case V = 1, where a single block is used. The extension to multiple blocks is straightforward.
We first recall the main notation. A database B of size n_B is first obtained, and the covariates X_i are observed for all the units in B. We use the notations

We also write X_B ≡ {X_i}_{i∈B} for the set of auxiliary variables in B.
A subsample A of size n_A is then selected in B, and the variable T_i is obtained for any unit i ∈ A. We write T_A ≡ {T_i}_{i∈A} for the set of outcome values in A. The auxiliary variables are obtained in A by using record linkage, leading to the pseudo auxiliary variables Z_i for any unit i ∈ A. We write Z_A ≡ {Z_i}_{i∈A} for the set of pseudo values in A.
Finally, a validation sample V of size n_V is selected in A by simple random sampling, and the true auxiliary variables X_i are obtained for the units i ∈ V. By comparing the pseudo values Z_i and the true values X_i in V, we obtain an unbiased estimator α̂ of the parameter α.
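The comparison of pseudo and true values in the validation sample can be sketched as follows. This is a minimal Python illustration and not the authors' code: the function name `estimate_alpha` is hypothetical, a link is treated as correct exactly when the linked covariate vector equals the true one, and the standard error uses the simple binomial formula without a finite-population correction. Each of these is an assumption of the sketch.

```python
import math


def estimate_alpha(z_valid, x_valid):
    """Share of validation units whose linked covariates equal the true ones."""
    if len(z_valid) != len(x_valid) or len(z_valid) == 0:
        raise ValueError("validation samples must be non-empty and of equal length")
    n_v = len(z_valid)
    # Assumption: equal covariate vectors indicate a correct link.
    hits = sum(1 for z, x in zip(z_valid, x_valid) if z == x)
    alpha_hat = hits / n_v
    # Binomial standard error; ignores the finite-population correction.
    se = math.sqrt(alpha_hat * (1.0 - alpha_hat) / n_v)
    return alpha_hat, se


if __name__ == "__main__":
    z_v = [(63, 0), (71, 1), (58, 1), (80, 0)]  # hypothetical linked (age, sex) values
    x_v = [(63, 0), (71, 1), (49, 0), (80, 0)]  # hypothetical true (age, sex) values
    print(estimate_alpha(z_v, x_v))
```

With covariates that take few distinct values, a false link may still reproduce the true covariate vector by chance, so comparing record identifiers, where available, would be a safer check than comparing values.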

B.1 Global estimating equation
Using the unbiased estimator α̂ of the parameter α (see Equation 4), the global estimating equation is

where we may rewrite the quantities in (B2) as

After some algebra, this leads to:

where

with

R*_i(,  _0) = [∑_{j=1}^{n_A} Y_j(T_i) h*_j(,)] / [∑_{j=1}^{n_A} Y_j(T_i) g*_j(,)],

and with

Description of the linkage process.

TABLE 4
Variable | Description | Source
 | days) between the first stroke and death or end of follow-up (31/12/2018) | AVC
Censoring | If the patient died before 01/01/2019: 1 = Yes, 0 = No | AVC
Age | Age (in years) at the first stroke | SNDS
Gender | Sex: 0 = Male, 1 = Female | SNDS
Type AVC | Type of stroke (0 = Ischemic, 1 = Hemorrhagic) | SNDS

area from 2008 to the end of 2018.

TABLE 1 Simulation results in case 1 with three different values for the probability of correct link α ∈ {0.75, 0.85, 0.95}.

probability of correct link α vary in {0.75, 0.85, 0.95}. In the second one, the probability of correct link is held fixed, equal to 0.85. We let n_A vary in {500, 1000, 2000}, with n_B = 2n_A. The simulation results obtained in Case 1 are presented in Table 1.

TABLE 2 Simulation results in case 2 with three different values for the sample size n_A.

TABLE 3 Simulation results with three blocks with different linkage quality.