Probabilistic matching of records is widely used to create linked data sets for use in health science, epidemiological, economic, demographic and sociological research. Clearly, this type of matching can lead to linkage errors, which in turn can lead to bias and increased variability when standard statistical estimation techniques are used with the linked data. In this paper we develop unbiased regression parameter estimates to be used when fitting a linear model with nested errors to probabilistically linked data. Since estimation of variance components is typically an important objective when fitting such a model, we also develop appropriate modifications to standard methods of variance components estimation in order to account for linkage error. In particular, we focus on three widely used methods of variance components estimation: analysis of variance, maximum likelihood and restricted maximum likelihood. Simulation results show that our estimators perform reasonably well when compared to standard estimation methods that ignore linkage errors.
Linked data sets, created by probabilistic matching of records, are widely used for research in health, epidemiology, economics, demography, sociology and many other scientific areas. However, probabilistic matching can lead to linkage errors, which are a type of measurement error and can lead to biased inference unless appropriate steps are taken to control or adjust for this bias (Chambers, 2009). Unfortunately, these errors are typically ignored when linked data are analysed. Although a number of statistical methods have been developed for efficient linkage (see Herzog et al., 2007), there has been comparatively little methodological research on the impact of linkage errors on the analysis of linked data.
An early reference is Neter et al. (1965), who found that relatively small amounts of linkage error can lead to a substantial bias when estimating a regression relationship. Scheuren & Winkler (1993, 1997) investigated the effect of linkage errors on the bias of ordinary least squares estimators in a standard linear regression model and proposed a method of adjusting for the bias. However, their estimator is not unbiased in general. Subsequently, Lahiri & Larsen (2005) proposed an alternative unbiased estimator, based on a regression model with transformed covariates. In their simulations, they found that their approach performed very well across a range of situations.
A methodological framework for analysis of linked data was developed in Chambers (2009). Under this approach, appropriate modifications to standard statistical analysis methods are used to ensure that they remain unbiased when applied to probabilistically linked data. However, this development assumes that measurements are mutually independent. This is unrealistic when they correspond to observations from clusters of correlated statistical units, such as members of a family, patients in a hospital or students in a school. Nested error models are often used when analyzing such data. Consequently, in this paper we develop methods for efficient fitting of linear models with nested errors to probabilistically linked data.
The structure of the paper is as follows. In the following section we review the linkage error model used in Chambers (2009). In Section 'Estimation of regression coefficients' we then describe a framework for fitting a linear model with nested errors given linked data generated under this linkage error model, and obtain unbiased estimators of regression coefficients for this case. In Section 'Estimation of variance components' we describe three methods of variance components estimation using probabilistically linked data: analysis of variance, pseudo-maximum likelihood and pseudo-restricted maximum likelihood. Simulation results that compare the estimators defined in the preceding sections are presented in Section 'Simulation results'. Section 'Summary and further research' concludes the paper with a summary of its results and suggestions for further research.
2 The exchangeable linkage errors model
In this section we summarize the linkage error model underpinning the development in Chambers (2009). We assume that there is a population of $N$ units, indexed by $i = 1, \ldots, N$. For each unit in this population, there exists an observable value of a scalar random variable $y$ and a vector random variable $\mathbf{x}$. The aim is to model the relationship between $y$ and $\mathbf{x}$ in this population, and in particular to estimate the coefficients of a linear model for the regression of $y$ on $\mathbf{x}$. However, there is no single database that contains the joint population values of $y$ and $\mathbf{x}$. Instead, there are two population registers, which we denote by the $X$-register and the $Y$-register, that separately contain these values, i.e. the $X$-register contains the values of $\mathbf{x}$ and the $Y$-register contains the values of $y$. Both registers refer to the same population and have no duplicates, so each consists of $N$ records.
Given a unique identifier for each unit in the population, it is straightforward to link the records from the two individual registers to create one joint register. However, such an identifier usually does not exist. Instead, some form of probability-based matching is used to link records from the two registers. We assume that the resulting linkage is complete (i.e. all records are linked) and one to one between the $X$-register and the $Y$-register. However, since the linkage is probabilistic, the linked data set can contain linkage errors, i.e. records where the values of $y$ and $\mathbf{x}$ that ostensibly belong to the same population unit actually come from different population units.
In most cases, there are common auxiliary variables measured on both registers. These variables are typically used for probability matching, and allow us to assume that the linked records can be partitioned into distinct sets or blocks such that there is no possibility that linked records in different blocks contain data for the same population unit. We characterize this situation by defining a categorical variable such that different blocks correspond to different values of . In other words, if a record on one register does not have the same value of as the record on the other register, then the two records cannot correspond to the same unit in the population. An immediate consequence is that only linked records with the same value of can contain linkage errors, i.e. linkage errors can only occur within a block.
Without loss of generality, we assume that takes the distinct values 1,...,, and let block correspond to the population units with so . Let denote the -value from block on the register that is matched to the -value in block on the register. Thus there are linked data pairs () in block . We denote the vector of dimension of the linked values in block by and similarly let denote the matrix with rows defined by the values in the same block. Finally, we use to denote the unknown vector of the true values in block that are associated with . Note that if linkage is perfect, then for all so .
Since linkage is assumed to be complete and one to one between register and register , randomness in the outcome of the linkage process can be modeled via the identity
where is an unknown random permutation matrix of dimension . Given that linkage errors can only occur within blocks, it is natural to assume that and are independently distributed when . We further assume that linkage is non-informative at each level of in the sense that the distribution of is independent of given , and define
The distribution of linkage errors will depend on the characteristics of the probability-linking method actually used. In many cases, this information will not be available to the data analyst. Consequently, we follow Neter et al. (1965) and model the distribution of linkage errors using an exchangeable linkage errors (ELE) model. Under this model, for each value of
where is the identity matrix of order and denotes a vector of ones of length . Since and , and . That is, (4) implies
A major advantage of the ELE model is that it only requires one parameter () to completely specify the first order properties of the probability-linkage mechanism.
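The first order structure of an ELE model can be illustrated numerically. The sketch below assumes the usual statement of the model in the linkage-error literature: within a block, a record is correctly linked with probability `lam`, and each specific incorrect link has the same probability `gam = (1 - lam) / (block_size - 1)`. The names `lam`, `gam` and `ele_expected_matrix` are introduced here for illustration and are not taken from the text.

```python
import numpy as np

def ele_expected_matrix(block_size: int, lam: float) -> np.ndarray:
    """Expected value of the random permutation matrix for one block.

    Assumed notation: lam is the probability of a correct link, and every
    specific wrong link within the block has probability gam.
    """
    if block_size < 2:
        # A block with a single record cannot contain a linkage error.
        return np.eye(block_size)
    gam = (1.0 - lam) / (block_size - 1)
    ones = np.ones((block_size, 1))
    return (lam - gam) * np.eye(block_size) + gam * (ones @ ones.T)

T = ele_expected_matrix(block_size=5, lam=0.8)
print(T.round(3))
```

Because every realised linkage is a permutation, the rows and columns of this expected matrix each sum to one, which is why a single probability per block is enough to fix all of its entries.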
3 Estimation of regression coefficients
In this section we consider the situation where a two level linear model with nested errors is the focus of inference. We therefore introduce an auxiliary grouping variable which takes values , and let group correspond to the population units with such that and where is the number of population units in block and group . That is, we allow distinct units within the same group to be independently linked (correctly or incorrectly) in different blocks. We assume throughout that the values of in the linked data are correct, i.e. this variable is stored on register . The two level linear model for the regression of on in the population is then given by
where is the vector of true values of associated with the records on the register, is the matrix whose rows correspond to the values of on the register, and is the matrix that identifies the group to which each record in the register belongs. The vector is a vector of random group effects, while is a vector of random individual effects, with
where and The between-group variance and the within-group variance are the variance components of the linear mixed model, and the variance-covariance matrix of is of the form
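For a two level model with nested errors, the covariance matrix of the response is commonly written as the between-group variance times $ZZ^T$ plus the within-group variance times the identity. The following sketch builds the group indicator matrix and this covariance matrix for a toy data set; the function name, argument names and numerical values are hypothetical.

```python
import numpy as np

def nested_error_covariance(group_ids, sigma2_u, sigma2_e):
    """Return (Z, V) for a random-intercept (nested error) model.

    Z has one column per group; V = sigma2_u * Z Z' + sigma2_e * I.
    """
    group_ids = np.asarray(group_ids)
    groups = np.unique(group_ids)
    # Z[i, g] = 1 when record i belongs to group g, and 0 otherwise.
    Z = (group_ids[:, None] == groups[None, :]).astype(float)
    V = sigma2_u * (Z @ Z.T) + sigma2_e * np.eye(len(group_ids))
    return Z, V

Z, V = nested_error_covariance([0, 0, 1, 1, 1], sigma2_u=2.0, sigma2_e=1.0)
print(V)
```

Records in the same group share the covariance `sigma2_u`, while records in different groups are uncorrelated, so `V` is block diagonal once records are sorted by group.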
The values of and in each block then satisfy
where the subscript denotes conditioning on the value and is another block index. The naive linked data weighted least squares (WLS) estimator of is then
where and is the component of corresponding to block and block . It is straightforward to see that under the linkage error model (1), the naive WLS estimator (5) based on the linked data set is biased since
Given that and are known and that the inverse of in (6) exists, Chambers (2009) suggests an unbiased estimator using a ratio-type correction for the bias in the naive estimator of ,
so long as is of full rank.
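The effect of a ratio-type correction can be checked in a simplified, unweighted setting. The sketch below assumes an ELE-type expected linkage matrix `T` for each block, takes the linked response to be exactly equal to its conditional mean `T @ X @ beta`, and replaces the WLS weights by the identity. All names and numerical values (`ele_matrix`, `beta`, the block sizes and the linkage probabilities) are illustrative assumptions.

```python
import numpy as np

def ele_matrix(m, lam):
    """Expected permutation matrix for a block of size m (assumed ELE form)."""
    gam = (1.0 - lam) / (m - 1)
    return (lam - gam) * np.eye(m) + gam * np.ones((m, m))

rng = np.random.default_rng(0)
block_sizes, lams = [50, 50], [1.0, 0.7]
n = sum(block_sizes)

# Block-diagonal expected linkage matrix: errors only occur within blocks.
T = np.zeros((n, n))
start = 0
for m, lam in zip(block_sizes, lams):
    T[start:start + m, start:start + m] = ele_matrix(m, lam)
    start += m

X = np.column_stack([np.ones(n), rng.uniform(size=n)])
beta = np.array([1.0, 5.0])          # hypothetical true coefficients
y_star_mean = T @ X @ beta           # noise-free mean of the linked response

# Naive estimator ignores linkage error; ratio-type estimator replaces X'X by X'TX.
beta_naive = np.linalg.solve(X.T @ X, X.T @ y_star_mean)
beta_ratio = np.linalg.solve(X.T @ T @ X, X.T @ y_star_mean)
print(beta_naive, beta_ratio)
```

In this noise-free setting the ratio-type estimator recovers `beta` exactly, while the naive slope is attenuated towards zero.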
Alternatively, since , and and are independently distributed given it follows that
We see that the can also be modelled linearly, with regression coefficient but with a modified set of explanatory variables in block . Following Lahiri & Larsen (2005), an alternative estimator of in this case is therefore
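In unweighted form, an estimator of this transformed-covariate type amounts to ordinary least squares of the linked response on the covariates premultiplied by the expected linkage matrix. The sketch below is a minimal illustration for a single block under an assumed ELE structure; the function name `transformed_covariate_ols` and all numerical values are hypothetical.

```python
import numpy as np

def transformed_covariate_ols(X, T, y_star):
    """OLS of the linked response on the transformed covariates T @ X."""
    X_T = T @ X
    return np.linalg.solve(X_T.T @ X_T, X_T.T @ y_star)

rng = np.random.default_rng(2)
m, lam = 30, 0.8
gam = (1.0 - lam) / (m - 1)
T = (lam - gam) * np.eye(m) + gam * np.ones((m, m))   # assumed ELE form

X = np.column_stack([np.ones(m), rng.uniform(size=m)])
beta = np.array([1.0, 5.0])                           # hypothetical coefficients
y_star_mean = T @ X @ beta                            # noise-free linked response

beta_A = transformed_covariate_ols(X, T, y_star_mean)
print(beta_A)
```

With real data the linked response is noisy and its error variance differs between blocks, which is why an unweighted fit of this kind is unbiased but not efficient.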
This estimator is not optimal since the variances of the regression errors defined by the linked data vary between blocks. That is
where the denote components of and and denote the block averages of the components of and their squares respectively. A similar approximation of can be developed. Defining , this approximation is given by
where is the number of groups in block and is the number of population units in block and group . Also the covariance between and is then
Thus the best linear unbiased estimator (BLUE) for given these data can be approximated by
where and is the component of corresponding to block and block .
Variance estimators for the three bias-corrected estimators (7), (8) and (9) can be defined using first order approximations to solutions of estimating equations. These estimators are derived in Appendix C.
4 Estimation of variance components
We now develop appropriate modifications to three standard methods of variance components estimation in order to account for linkage error. These are the method of moments, typically referred to as the analysis of variance (ANOVA) method, the maximum likelihood (ML) method, and the restricted maximum likelihood (REML) method. The details of the modified version of each method are set out in Sections 'Analysis of variance', 'Pseudo maximum likelihood (pseudo-ML)' and 'Pseudo restricted maximum likelihood (Pseudo-REML)', respectively. Note that all population quantities referred to in this Section are ordered as in the register, so we drop the subscript.
4.1 Analysis of variance
Historically, ANOVA is the starting point for estimation of variance components (Searle et al., 2006). The method is based on equating the between groups sum of squares (SSA) and the within groups sum of squares (SSE) to their expected values under the nested error model of interest. The two sums of squares that are the basis of ANOVA for the linked data are
The expected values of these two sums of squares are derived in Appendix A. Equating these expected values to the observed values of SSA and SSE, we obtain the variance components estimators
Estimators of the large sample variances of the variance components estimators defined by (10) and (11) are derived in Appendix C.
Note that if linkage is perfect, i.e. and , where
The ANOVA estimates in (10) and (11) can be negative. Consequently, it is usually better to use a method of estimation that explicitly excludes the possibility of negative estimates. ML and REML are two such methods.
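The possibility of a negative estimate is easy to see in the textbook setting. The sketch below is not the linked-data estimator of this section: it assumes perfect linkage, an intercept-only model and balanced groups, and simply equates the between and within sums of squares to their standard expectations. The function name `anova_variance_components` is hypothetical.

```python
import numpy as np

def anova_variance_components(y):
    """Method-of-moments estimates for a balanced one-way random effects model.

    y is a (groups x units-per-group) array. Returns (sigma2_u_hat, sigma2_e_hat).
    """
    y = np.asarray(y, dtype=float)
    g, m = y.shape
    group_means = y.mean(axis=1)
    ssa = m * np.sum((group_means - y.mean()) ** 2)   # between groups
    sse = np.sum((y - group_means[:, None]) ** 2)     # within groups
    sigma2_e = sse / (g * m - g)
    sigma2_u = (ssa / (g - 1) - sigma2_e) / m
    return sigma2_u, sigma2_e

print(anova_variance_components([[1, 3], [5, 7]]))   # (7.0, 2.0)
print(anova_variance_components([[1, 3], [3, 1]]))   # (-1.0, 2.0)
```

In the second data set the group means coincide, so the between-group mean square falls below the within-group mean square and the between-group variance estimate is negative.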
4.2 Pseudo maximum likelihood (pseudo-ML)
Unlike the ANOVA method of estimation, a basic requirement of ML estimation is that the probability distribution of the data is known. We follow the usual convention of assuming multivariate normality. That is, we assume that . The log-likelihood function is then
In what follows we assume that is fixed. Differentiating (12) with respect to then yields
Similarly, differentiating (12) with respect to leads to
where Finally, differentiating (12) with respect to gives
The pseudo-ML estimators for and are defined by setting the derivatives (13), (14) and (15) to zero and solving for these parameters. Note that we refer to the resulting estimators as pseudo-ML because their estimating functions, which are defined by these derivatives, are based on the assumption that is a known matrix. However, in reality this matrix is a function of and , and so analytic solutions to these estimating equations do not exist. We therefore now describe how the method of scoring (Searle et al., 2006) can be used to solve them.
Let denote the vector of parameters to be estimated, i.e., . The method of scoring uses an iteration scheme defined by
where is the expected information matrix calculated at .
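The scoring iteration can be illustrated on a simpler problem than the one treated here. The sketch below assumes perfectly linked data with a known zero mean, so that only the two variance components are estimated, and uses the standard score and expected information for a Gaussian model whose covariance matrix is linear in the parameters. All function and variable names are hypothetical, and the pseudo-ML equations of this section involve additional terms that are not reproduced.

```python
import numpy as np

def score_and_information(theta, y, Z):
    """Score and expected information for theta = (sigma2_u, sigma2_e)
    when y ~ N(0, sigma2_u * Z Z' + sigma2_e * I)."""
    n = len(y)
    derivs = [Z @ Z.T, np.eye(n)]                 # dV/d sigma2_u, dV/d sigma2_e
    V = theta[0] * derivs[0] + theta[1] * derivs[1]
    V_inv = np.linalg.inv(V)
    Vy = V_inv @ y
    score = np.array([-0.5 * np.trace(V_inv @ D) + 0.5 * Vy @ D @ Vy
                      for D in derivs])
    info = np.array([[0.5 * np.trace(V_inv @ Dj @ V_inv @ Dk) for Dk in derivs]
                     for Dj in derivs])
    return score, info

def fisher_scoring(y, Z, theta0, max_iter=100, tol=1e-10):
    """Iterate theta <- theta + info^{-1} score until the step is negligible."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        score, info = score_and_information(theta, y, Z)
        step = np.linalg.solve(info, score)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta

rng = np.random.default_rng(3)
n_groups, group_size = 40, 5
Z = np.kron(np.eye(n_groups), np.ones((group_size, 1)))
y = Z @ rng.normal(size=n_groups) + rng.normal(size=n_groups * group_size)

theta_hat = fisher_scoring(y, Z, theta0=[1.0, 1.0])
final_score, _ = score_and_information(theta_hat, y, Z)
print(theta_hat, final_score)
```

At convergence the score vector is numerically zero, which is the defining property of a solution to the estimating equations.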
The expected information matrix is obtained by taking the expected values of the derivatives set out above, noting that and hence . Also, for non-stochastic It follows that
4.3 Pseudo restricted maximum likelihood (Pseudo-REML)
One criticism of the ML method is that in estimating variance components it takes no account of the degrees of freedom that are involved in estimating fixed effects (Searle et al., 2006). Also, the variance component estimators obtained by solving the likelihood equations are generally biased, unlike the ANOVA estimators (Harville, 1977; Searle et al., 2006).
The first criticism above is overcome by using REML (Searle et al., 2006). Rather than using directly, REML uses ML estimating equations based on a modified response variable defined by a linear combination of elements of , chosen in such a way that the distribution of this combination does not depend on the fixed effects in the model. In particular, the vector is chosen so that , i.e.
Note that strict application of the REML approach requires that the distribution of does not depend on . However, when we use linked data, the variance of is implicitly a function of this parameter. Consequently, we refer to this method as ‘pseudo-REML’ since it is based on application of standard REML arguments, ignoring the fact that the variance still depends on the fixed effects in the model.
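The error contrasts used by REML can be constructed numerically. The sketch below assumes a generic full-rank design matrix `X`, standing in for whichever matrix enters the mean of the response, and obtains a matrix `K` whose rows are orthogonal to the columns of `X` from a singular value decomposition. The function name `error_contrasts` is hypothetical.

```python
import numpy as np

def error_contrasts(X):
    """Return K with N - rank(X) orthonormal rows satisfying K @ X = 0."""
    U, s, _ = np.linalg.svd(X, full_matrices=True)
    rank = int(np.sum(s > max(X.shape) * np.finfo(float).eps * s.max()))
    # Columns of U beyond the rank span the orthogonal complement of col(X).
    return U[:, rank:].T

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(12), rng.uniform(size=12)])
K = error_contrasts(X)

# Residual projection matrix I - X (X'X)^{-1} X' for comparison.
M = np.eye(12) - X @ np.linalg.solve(X.T @ X, X.T)
print(K.shape)
```

The rows of `K` span the same space as the rows of the residual projection matrix `M`, so any full-rank set of rows taken from `M` defines the same restricted likelihood.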
When of order has rank , there are linearly independent vectors satisfying (17) (Searle et al., 2006). Using a set of such linearly independent vectors as rows of , we then can form where is a matrix whose rows are linearly independent rows of the matrix
With we have, for
Let be the log likelihood for the variance effects defined by . That is
The REML estimating function for is unchanged from the corresponding ML estimating function (13). However the ML estimating functions (14) and (15) for the variance components and are now replaced by alternative REML estimating functions obtained by differentiating . In this context, we note that
The REML estimating equations are defined by setting (13) and the REML estimating functions for and to zero. As before, we use the method of scoring to solve these equations. In order to define the expected information matrix in this case, we need the first and second derivatives of :
The expected values of these second derivatives of are developed in Appendix B. It follows that the component of the observed information matrix is then given by
Estimators of the large sample variances of either the pseudo-ML or pseudo-REML versions of and can be derived using standard large sample approximations based on the expected information i.e. where is the matrix of variances and covariances of the variance components estimates and is given in (16) for ML and (18) for REML.
5 Simulation results
This section contains results from a small scale simulation study that illustrates the comparative performances of the parameter estimators described in previous sections, given an ELE model. The simulations themselves are based on a simple balanced two level population structure. In particular, in each simulation we generated a population of size $N = 800$ made up of 50 equal-sized groups, so that each group consisted of 16 units. Population units were then randomly allocated to four equal-sized blocks, each of size 200, such that each group contained an equal number of units (4) from each block, thus ensuring that the distributions of $y$ and $x$ were the same in each block. We note that more complex distributions of block sizes and different covariate distributions within the different groups will usually prevail in realistic settings. However, the purpose here is to demonstrate the comparative bias-correcting properties of the different estimators rather than to evaluate their stability and efficiency under realistic population structures and linkage scenarios.
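The balanced design described above can be laid out explicitly. The sketch below assigns group and block labels deterministically, whereas the study allocated units at random subject to the same balance, and then checks the cell counts. Variable names are hypothetical.

```python
import numpy as np

n_groups, group_size, n_blocks = 50, 16, 4
per_cell = group_size // n_blocks                     # 4 units per (group, block)

# One label per population unit: 50 groups of 16 units each.
group = np.repeat(np.arange(n_groups), group_size)
# Within each group, 4 consecutive units go to each of the 4 blocks.
block = np.tile(np.repeat(np.arange(n_blocks), per_cell), n_groups)

# Cross-tabulate group by block membership.
counts = np.zeros((n_groups, n_blocks), dtype=int)
np.add.at(counts, (group, block), 1)
print(len(group), np.bincount(block), counts.min(), counts.max())
```

Every block then contains 200 units and every group contributes exactly four units to each block.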
Values of were independently drawn from the uniform distribution over [0,1] with corresponding values of given by
where the were independently drawn from the distribution and the were independently drawn from the distribution. The true data pairs were then randomly allocated to blocks and groups. Next, linked data pairs were generated by using the ELE model defined by (1)-(4) with correct linkage probabilities , , and . That is, all links for block 1 were assumed to be correct, while those for blocks 2, 3 and 4 were assumed to have some errors. This ELE model allows a record in a block with to be potentially matched to any record located in the same block irrespective of the group status of the record. That is, the group identifier is not a component of the blocking variable and hence not part of the linkage process. This ensures that the non-informative linking assumption holds in the simulations.
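One simple way to generate linkage errors within a block is sketched below. It is an assumption made for illustration, not the generator used in the study: each record is flagged as wrongly linked with probability `1 - lam`, and flagged records are then cyclically shifted in random order so that each receives another flagged record's value. The marginal probability of a correct link is therefore only approximately `lam` (a lone flagged record stays correct). The function name `link_block` is hypothetical.

```python
import numpy as np

def link_block(m, lam, rng):
    """Return a permutation of range(m); perm[i] is the record linked to i."""
    perm = np.arange(m)
    wrong = np.flatnonzero(rng.random(m) < 1.0 - lam)
    if wrong.size >= 2:
        shuffled = rng.permutation(wrong)
        # Each flagged record receives the value of a different flagged record.
        perm[shuffled] = np.roll(shuffled, 1)
    return perm

rng = np.random.default_rng(4)
perm_perfect = link_block(200, 1.0, rng)
perm_err = link_block(200, 0.8, rng)

y = rng.normal(size=200)        # true values in one block
y_star = y[perm_err]            # linked values for the same block
print(np.mean(perm_err == np.arange(200)))
```

Because the flags do not depend on group membership or on the data values, linkage generated in this way is non-informative in the sense used above.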
We present simulation results for two scenarios. The first corresponds to known linkage probabilities. In the second, these probabilities were estimated by taking random audit samples of linked pairs from each of blocks 2–4 and checking to see how many of these sampled links were correct. Following Chambers (2009), the estimate of was then calculated as
where is the proportion of correctly linked pairs identified in the audit sample in block . Variance estimators were adjusted for the extra variability induced by this estimation of using the approach described in Chambers (2009). The details of this approach are described in Appendix C.
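The audit step can be sketched with the plain sample proportion and its binomial variance. This is a simplification: it does not reproduce the adjusted estimate used in the study, and the function name `audit_estimate` and the audit counts are hypothetical.

```python
def audit_estimate(n_correct, n_audited):
    """Proportion of correct links in an audit sample and its binomial variance."""
    p = n_correct / n_audited
    var = p * (1.0 - p) / n_audited
    return p, var

p_hat, var_hat = audit_estimate(n_correct=18, n_audited=20)
print(p_hat, var_hat)
```

A variance of this kind is what feeds the extra term in the variance estimators when the linkage probabilities are estimated rather than known.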
A total of 800 independent simulations were sufficient to illustrate the different bias and variance properties of each estimator. Table 1 and Figure 1 show the relative biases and relative root mean squared errors of the regression coefficient estimators described in Section 'Estimation of regression coefficients'; and Table 2 and Figure 2 show the relative biases and relative root mean squared errors of the variance components estimators described in Section 'Estimation of variance components'. The WLS estimator TR based on perfectly linked data and the naive WLS estimator based on the actual linked data were obtained using the default settings of the lme function in the R software package. The estimators R, A and C denote the bias-corrected estimators (7), (8) and (9) respectively. Note that variance components estimators obtained using the ANOVA method are functions of , and so were evaluated using the bias-corrected options (R, A and C) for this parameter. These different ANOVA estimators are distinguished by R, A and C suffixes in Tables 2 and 3. The actual coverages of the nominal 95% confidence intervals for all the model parameters are shown in Table 3.
Table 1. Simulation results for estimators of the regression coefficients of the linear mixed model.
Scenario 2: Linkage probabilities estimated from audit sample
The results set out in Table 1 show that the naive WLS estimator that assumes the data are perfectly linked is clearly biased. Since linkage error is a particular type of measurement error, this bias attenuates the estimate of the slope parameter and exaggerates that of the intercept. On the other hand, all five of the adjusted estimators correct this bias, with the REML estimator being the most efficient. The results are unchanged under Scenario 2 where linkage probabilities were estimated by taking small audit samples.
Table 2. Simulation results for estimators of the variance components of the linear mixed model.
Scenario 2: Linkage probabilities estimated from audit sample
The results displayed in Table 2 show that the naive variance components estimators that treat the linkage as perfect are also biased. As expected, the estimator obtained using the ML approach is slightly biased. All of the remaining adjusted estimators are essentially unbiased, with REML being the most efficient. Again, the results under Scenario 2 are in the same direction as those under Scenario 1.
Table 3. Actual coverages of nominal 95% confidence intervals for the parameters of the linear mixed model.
Scenario 2: Linkage probabilities estimated from audit sample
Finally, we note that the results displayed in Table 3 show that variance estimators that allow for the extra variability induced by estimation of the correct linkage probabilities lead to confidence intervals with good coverage properties.
6 Summary and further research
In this paper we show how one can extend the inferential framework of Chambers (2009) to obtain unbiased estimators of the regression parameters when fitting a two level linear model to probabilistically linked data assuming an ELE model. We also show how three standard methods of estimation for the variance components of the linear mixed model (ANOVA, ML and REML) can be modified in order to make them approximately unbiased under this model. Our simulation results indicate that all the methods developed in this paper work reasonably well in terms of correcting biases induced by linkage errors. However, they also show evidence of increased variability due to the use of bias correction.
An important area of application of two level models using linked data is where registers are linked over time to create data sets suitable for fitting longitudinal models with random individual effects. Further research extending the methodology described in this paper to this situation is ongoing. An important aspect of this research is that it addresses the issue of linkage errors in the model grouping structure – something not considered in this paper. Another issue concerns the assumption of an ELE model. Although a convenient first approximation, most realistic linkage applications involve multiple linkage operations and so will possess a more complex error structure. Further research into correcting linkage error bias under alternative linkage error models is therefore necessary.
A limitation of the research reported in this paper is that the simulation results were based on a small sample size and a relatively simple two-level population structure. This was because the simulation study was designed to illustrate the performances of the different bias-correction methods, rather than to provide an extensive comparison for a variety of population structures and linkage error situations. A larger study is desirable, especially to test the robustness of our methods to failure of key assumptions (e.g. non-informative linkage), but will require considerable computational resources. These are issues that will be considered in our further research in this area.
The research described in this paper was financially supported by the Prince of Songkla University and a University Postgraduate Award from the University of Wollongong.
Appendix A: ANOVA estimation
We first develop an expression for . The corresponding expression for follows similarly. We have
Consider the first term on the right hand side above. This can be written
The second term on the right hand side of the expression for can be expanded similarly:
where is the covariance between and . It immediately follows that
Replacing and by SSA and SSE respectively in these two equations and solving for and then leads to the estimators
where and .
Appendix B: Expectations used in pseudo-REML information matrix
where the second equality follows because and the third equality follows because .
Appendix C: Variance estimation
Given the values of the variance components, the estimators , and can all be represented as the solutions to estimating equations. Consequently, we follow the approach described in Chambers (2009) in order to define large sample estimators of the variances of these regression parameter estimators. In particular, we note that each of these estimators is defined by solving an equation of the form where is a -dimensional unbiased estimating function for the regression parameter . Let denote the vector defined by the block-specific values of . The general form of the unbiased estimating function used by all three regression parameter estimators above is then
which is a function of both and . Let denote the partial differentiation operator with respect to . Then, using a first order Taylor series approximation,
where . Note that and denote the true values of these parameters with and denote their corresponding estimators. We can then approximate the variance of by
An ultimate cluster variance estimator can be used for . This is based on the representation
and Note that the are mutually uncorrelated. A variance estimator for can therefore be written in the form
It only remains to determine . If the estimates of the probabilities of correct linkage are obtained by checking a random audit sample of linked records in each block i.e. the number of correct linkages follows the binomial distribution, then . The required estimator of can be obtained by plugging in estimates for unknown quantities in the approximation above. That is, this estimator is given by
Next, we derive the variance estimators for the ANOVA-based variance components estimators. From (10) and Appendix A
Since , by using Theorem S4 of Searle et al. (2006) we have
The estimator of is obtained by substituting estimates for unknown quantities.