## 1 Introduction

Linked data sets, created by probabilistic matching of records, are widely used for research in health, epidemiology, economics, demography, sociology and many other scientific areas. However, probabilistic matching can lead to linkage errors, which are a type of measurement error and can bias inference unless appropriate steps are taken to control and/or adjust for this bias (Chambers, 2009). Unfortunately, these errors are typically ignored when linked data are analyzed. Although a number of statistical methods have been developed for efficient linkage (see Herzog *et al*., 2007), there has been comparatively little methodological research on the impact of linkage errors on the analysis of linked data.

An early reference is Neter *et al*. (1965), who found that relatively small amounts of linkage error can lead to a substantial bias when estimating a regression relationship. Scheuren & Winkler (1993, 1997) investigated the effect of linkage errors on the bias of ordinary least squares estimators in a standard linear regression model and proposed a method of adjusting for the bias. However, their estimator is not unbiased in general. Subsequently, Lahiri & Larsen (2005) proposed an alternative unbiased estimator, based on a regression model with transformed covariates. In their simulations, they found that their approach performed very well across a range of situations.
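The attenuation effect described above can be illustrated with a small simulation. The following is a minimal sketch under assumptions of our own, not a reproduction of any of the cited studies: a single covariate, a true slope of 2, standard normal errors, and a fixed fraction of records whose responses are exchanged at random among themselves. The function names `ols_slope` and `simulate` and all parameter values are hypothetical and chosen only for illustration.

```python
import random


def ols_slope(x, y):
    """Ordinary least squares slope of y on x (model with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n=20000, beta=2.0, error_rate=0.1, seed=0):
    """Return (slope from correctly linked data, slope from mislinked data).

    Assumed setup: y = beta * x + e with x and e standard normal.
    A fraction `error_rate` of records has its y values permuted
    within that subset, mimicking incorrect links.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [beta * xi + rng.gauss(0.0, 1.0) for xi in x]

    # Pick the records to mislink, then shuffle their responses among themselves.
    wrong = rng.sample(range(n), int(error_rate * n))
    donors = wrong[:]
    rng.shuffle(donors)

    y_linked = y[:]
    for i, j in zip(wrong, donors):
        y_linked[i] = y[j]

    return ols_slope(x, y), ols_slope(x, y_linked)


if __name__ == "__main__":
    slope_true, slope_linked = simulate()
    print(f"slope, correct links: {slope_true:.3f}")
    print(f"slope, 10% mislinked: {slope_linked:.3f}")
```

In this simplified setting the mislinked responses are essentially unrelated to the covariate, so the naive slope is expected to shrink towards zero by roughly the proportion of incorrect links. Real linkage errors are usually structured, for example confined to blocks of similar records, so the size of the bias in practice may differ from what this sketch suggests.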

A methodological framework for analysis of linked data was developed in Chambers (2009). Under this approach, appropriate modifications to standard statistical analysis methods are used to ensure that they remain unbiased when applied to probabilistically linked data. However, this development assumes that measurements are mutually independent. This is unrealistic when they correspond to observations from clusters of correlated statistical units, such as members of a family, patients in a hospital or students in a school. Nested error models are often used when analyzing such data. Consequently, in this paper we develop methods for efficient fitting of linear models with nested errors to probabilistically linked data.

The structure of the paper is as follows. In the following section we review the linkage error model used in Chambers (2009). In Section 'Estimation of regression coefficients' we describe a framework for fitting a linear model with nested errors to linked data generated under this linkage error model, and obtain unbiased estimators of the regression coefficients. In Section 'Estimation of variance components' we describe three methods of estimating variance components from probabilistically linked data: analysis of variance, pseudo-maximum likelihood and pseudo-restricted maximum likelihood. Simulation results that compare the estimators defined in the preceding sections are presented in Section 'Simulation results'. Section 'Summary and further research' concludes the paper with a summary of its results and suggestions for further research.