Multi‐source Statistics: Basic Situations and Methods

Many National Statistical Institutes (NSIs), especially in Europe, are moving from single‐source statistics to multi‐source statistics. By combining data sources, NSIs can produce more detailed and more timely statistics and respond more quickly to events in society. By combining survey data with already available administrative data and Big Data, NSIs can save data collection and processing costs and reduce the burden on respondents. However, multi‐source statistics come with new problems that need to be overcome before the resulting output quality is sufficiently high and before those statistics can be produced efficiently. What complicates the production of multi‐source statistics is that they come in many different varieties as data sets can be combined in many different ways. Given the rapidly increasing importance of producing multi‐source statistics in Official Statistics, there has been considerable research activity in this area over the last few years, and some frameworks have been developed for multi‐source statistics. Useful as these frameworks are, they generally do not give guidelines as to which method could be applied in a certain situation arising in practice. In this paper, we aim to fill that gap, structure the world of multi‐source statistics and its problems and provide some guidance on suitable methods for these problems.

For each dimension and for the aggregation level, one or more aspects are important, given in the tables below. Each aspect can have multiple 'states'; for instance, the aspect 'population' can have two states: either we know the population, for instance because that information is available from a population register (frame) or from a Census, or we do not know the population.
We present the representation dimension in Table 1, the measurement dimension in Table 2 and the time dimension in Table 3. Finally, for the aggregation level, we distinguish three different states (Table 4).

Table 2. Measurement dimension.
Completeness
1. Together the data sets contain all target variables
2. Part of the target variables need to be derived from the source variables
Variable distinctness
1. There are no overlapping variables in the data sets
2. There are no overlapping target variables in the data sets, but there are overlapping auxiliary variables
3. (Some of the) target variables in the data sets overlap, but the concepts are measured in different ways
4. (Some of the) target variables in the data sets overlap, and the concepts are measured in the same way
Relatedness
1. There are no logical relations between variables in different data sets
2. There are logical relations between variables in different data sets (hard or soft constraints)

... not contain all data for all units from first availability but becomes gradually available over time

Table 4. Aggregation level.
1. The data sets consist of only micro data
2. The data sets consist of a mix of micro data and aggregated data
3. The data sets consist of only aggregated data

Table 5. Characteristics of targeted output.
Type of output
1. The output concerns micro data sets
2. The output concerns population registers
3. The output concerns statistics
4. The output concerns metadata
Usage of data sets
1. Estimates are obtained by direct tabulation from micro data
2. Estimates are indirectly obtained by more complex estimation methods
Quality improvement of processing
1. Achieve relevant estimates
2. Achieve accurate and reliable estimates
3. Achieve timely and punctual estimates
4. Achieve coherent and comparable estimates
5. Achieve accessible and clear estimates

Characteristics of Targeted Output
We now turn towards the output characteristics. For the targeted output, three different aspects are important: the type of output, the usage of data sets and the main quality improvement that is intended by data processing. For each of those aspects, different states are relevant, given in Table 5. The states of the aspect 'quality improvement' refer to the five quality dimensions that are distinguished in Eurostat (2015, pp. 21-107).
In the present paper, we limit ourselves to descriptive statistics, such as totals and means, as output. In particular, we will assume that the main aim of multi-source statistics is to produce high-quality estimates at an aggregated level.
The total number of possible states when combining only two data sets is already very large. To give a first idea, assume that both data sets concern micro data on events. We would then obtain the following combinations of states: 'population' (2 states) × 'unit selections' (3 states) × 'coverage' (4 states) × 'unit distinctness' (2 states) × 'completeness' (2 states) × 'variable distinctness' (4 states) × 'relatedness' (2 states) × 'repeated measures' (2 states) × 'availability' (2 states) × 'progressiveness' (2 states) = 6 144 potential states. We have omitted multiplication with the two states of 'combined unit selections' because that partly follows from the unit selections in the individual data sets. We also omitted 'time reference', because event data are often only longitudinal. It is clear that we cannot describe all possible different situations. In the remainder of this paper, we have limited ourselves to eight often occurring 'basic' situations in combining data sets in official statistics. Besides being situations that often occur in practice, each of them also illustrates certain problems that can arise when combining data sets. That these eight situations indeed cover most situations occurring in official statistics is confirmed by feedback we received on presentations at various conferences (e.g. NTTS conference 2017; see De Waal, Van Delden and Scholtus, 2017b) and workshops. Table 6 provides an overview of the eight basic situations together with 'defining states' for each of these basic situations. An asterisk (*) in Table 6 denotes that for that basic situation, the characteristic is not a 'defining state'.
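As a quick check of the count of potential states quoted above, the product can be reproduced in a few lines. This is only an arithmetic sketch; the dictionary keys are the aspect names from the text, and the variable names are ours.

```python
from math import prod

# Number of states per aspect, as listed in the text for two micro data sets on events.
states = {
    "population": 2,
    "unit selections": 3,
    "coverage": 4,
    "unit distinctness": 2,
    "completeness": 2,
    "variable distinctness": 4,
    "relatedness": 2,
    "repeated measures": 2,
    "availability": 2,
    "progressiveness": 2,
}

# The number of combinations is the product of the numbers of states.
total_states = prod(states.values())
print(total_states)  # 6144
```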

General Issues when Combining Data
Two issues apply to many situations where data sets are combined: harmonisation and record linkage. Both units and variables in the various data sets may need to be harmonised before these data sets can be combined. An important reason for harmonisation is the so-called unit error problem. Unit errors occur when units are defined differently in one data set than in another data set, when the units in available data sets are not defined according to the official definition that one wants to use at the NSI or when units have to be constructed. In the Netherlands, for instance, administrative units for value added tax (VAT) data may differ from administrative units for profit and loss data. In turn, those administrative units may differ from the statistical units for which the target population is defined. A specific version of the unit problem occurs when data are available at different levels of aggregation only. For instance, we may want to combine data on bankruptcies (available at the level of legal persons) with data on the number of jobs of employees. The latter are available at the level of enterprises, where an enterprise may be a combination of legal persons. For more details on the unit error problem, we refer to Zhang (2011, 2012) and Van Delden et al. (2018a).
Target variables in the data sets may also need to be harmonised. For example, in the Netherlands, quarterly turnover of enterprises available from administrative data obtained from the tax office often differs from quarterly turnover available from a survey. An important special case that requires harmonisation of variables occurs when we have a subset of the variables in one data set (say administrative data) and other variables in a second data set (say sample survey data) and the sets contain overlapping units, but the reference periods of the two sets are different. For many variables, values differ for different reference periods.
A closely related harmonisation issue is timeliness of data sets. Different data sets may be available at different moments, and the quality of the data sets may vary over time. In particular, the progressiveness of administrative data, that is, the fact that administrative data sets generally contain more and/or higher quality data as time passes, often presents a problem for early estimates (see also Zhang, 2014). The problem that part of the data are at first missing may in some cases be solved by means of weighting (see, e.g. Särndal et al., 1992) or imputation techniques (see, e.g. Little and Rubin, 2002). A complicating aspect is that the initially observed data may be from a selective part of the population (see, e.g. Ouwehand and Schouten, 2014, for assessing the representativeness of data). A further issue is that corrections on originally reported data may become available after a long time. Measurement error in earlier versions of the data may in some cases be treated by measurement error correction methods, for instance, methods as discussed in Section 4.4.
Micro-integration is often a first step to harmonise units and variables (see, e.g. Bakker, 2011a, 2011b). In micro-integration, for instance, rules may be used to derive the target variables from those present in the input data sets. Micro-integration cannot solve all the harmonisation problems that arise in the context of multi-source statistics, and more advanced methods are often required; see Sections 4.2 and 4.4.
The second common issue is record linkage. We need a record linkage step to link the units in the data sets to the population register or to each other. When unique unit identifiers, such as unique personal identification numbers, are present in the data sets, deterministic linkage can be used (see, e.g. Chapter 8 in Herzog et al., 2007). When the same non-unique identifier variables, such as names and addresses, are present in both sets, probabilistic linkage (see, e.g. Fellegi and Sunter, 1969) or machine-learning based record linkage (see, e.g. Christen, 2012) might be used.
Misspelling and variation of formats of, for instance, names and addresses, can severely complicate the record linkage process. As a result, correct matches may be missed in the record linkage process ('false negatives'), and incorrect matches may be made ('false positives'). Such 'false negatives' and 'false positives' may lead to bias in estimates based on linked data and may hamper the analysis of linked data. Some methods have been proposed that aim to correct these biases for record linkage error. For more details on the issues of record linkage, the effects of record linkage error on estimates and the analysis of linked data, and on methods to correct for record linkage error, we refer to Harron et al. (2016), especially Chapters 1, 4, 5 and 6.
Record linkage becomes even more problematic in the case of unit errors, which emphasises the important role of harmonisation.
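To make the idea of probabilistic linkage more concrete, the following minimal sketch computes a match weight for a pair of records in the spirit of Fellegi and Sunter (1969): each comparison field contributes a log likelihood ratio based on its m-probability (agreement among true matches) and u-probability (agreement among non-matches). The field names, the m- and u-values and the two thresholds are hypothetical illustration values, not estimates from any real data set; in practice they would have to be estimated.

```python
import math

# Hypothetical m- and u-probabilities per comparison field:
# m = P(field agrees | records belong to the same unit)
# u = P(field agrees | records belong to different units)
FIELD_PARAMS = {
    "surname": (0.95, 0.01),
    "postcode": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}


def match_weight(agreement):
    """Sum the log2 likelihood ratios over all comparison fields."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = FIELD_PARAMS[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight, lower=0.0, upper=8.0):
    """Classify a record pair using two thresholds (hypothetical values)."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


all_agree = {"surname": True, "postcode": True, "birth_year": True}
year_differs = {"surname": True, "postcode": True, "birth_year": False}
none_agree = {"surname": False, "postcode": False, "birth_year": False}

for pattern in (all_agree, year_differs, none_agree):
    print(classify(match_weight(pattern)))
```

Pairs classified as 'possible link' would typically be sent to clerical review; missed matches and wrong links correspond to the 'false negatives' and 'false positives' discussed above.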

Basic Situations and Their Methods
In this section, we present eight basic situations that we consider to be the most important ones in practice (see also Table 6). We propose and elaborate these basic situations with respect to the aspects mentioned in Section 2. Many practical situations can be built on these basic situations.
We use figures to illustrate the eight basic situations. Concerning these illustrations, we note that the white rectangle to the left represents the population frame with units; the two light grey colours represent different input data sets and the dark grey colour represents derived output statistics; blocks with horizontal line patterns represent aggregated data and blocks without a filling pattern represent micro data. The arrow refers to the complete process to go from input data to output statistics. In some basic situations, specific methodology is needed as part of this process, and in those cases, the methodology is mentioned in the corresponding section. The target variables in the data sets, denoted by $Y_1, \ldots, Y_p$, are observed for units $1, \ldots, N$ in the case of a full enumeration of the population, or observed for units $1, \ldots, n$ with $n < N$ in the case of a sample. The general notation for the corresponding target parameters is $\hat{\theta}_1, \ldots, \hat{\theta}_p$. In practice, these will often be estimated for a set of domains $h = 1, \ldots, H$ within the population. For clarity of presentation, those domains are omitted in most of the figures. Further, background variables, denoted by $Z = (Z_1, \ldots, Z_k)'$, may play a role in the methodology to link the data sets. Background variables are omitted from the figures unless they are a crucial part in the estimation procedure of the target parameters. In some figures, specific symbols are used, which are explained in the corresponding basic situation.

Data Sets with Full Population Coverage and Complementary Target Variables
The first basic situation concerns multiple cross-sectional micro data sets covering the target population where the different data sets contain complementary target variables (see Figure 1). We refer to this as the 'split-variable' case. Provided that the data are error-free, the data can simply be linked to produce output statistics. Figure 1 illustrates the situation that we are interested in: estimating a set of p target parameters based on variables that are observed for all N units of the population or for a probability sample of size n < N. The sampling case may be less common for Situation 1 than for the other situations, but it may occur when linking a sample survey to register data.
In this situation, record linkage is an important issue. We assume that the data sets also contain a set of background variables Z, for instance, variables that are used to link the data sets to the population register.
An example of Situation 1 is the integration of different administrative data sets on economic performance of businesses. For instance, in the Netherlands, administrative data on profit and loss are sometimes combined with administrative data on personnel costs.
An example of unit type differences and linkage issues occurred in the integration of various administrative data sets at Statistics Netherlands to compute energy use per square metre for dwellings and for businesses or institutions. The central data concern administrative client energy data sets (CAD) obtained from gas and electricity distributors, which consist of the complete volume of energy delivery in the Netherlands. The CAD is linked to a central register on addresses and buildings (Kadaster), which contains building/dwelling type and their area. It is also linked to a general business register (GBR) to identify business activities and to find the economic activity. The unit type within the CAD is the 'energy connection point', identified by a unique energy connection point number (Dutch: EAN). The EAN is related to an address and client name. This address information is also found in the Kadaster data and in the GBR data.
The linkage by address is not always one-to-one. One address may contain multiple energy connection points, which can be solved by adding up the energy use of the different EANs. In addition, one may also have one EAN that is linked to a building that contains multiple activities/enterprises. In this case, one often assigns the energy use to the dominant activity in that building, which is not an ideal approach. One alternative is to introduce explicit categories expressing the mixture, such as 'building with multiple business activities'. Another alternative is to try to estimate the energy use per economic activity by using auxiliary variables. For instance, a linear regression model could be constructed with size class, economic activity per enterprise and floor surface per enterprise as explanatory variables. An example of a similar approach can be found in Enderer (2008). If the model predictions are reasonably accurate, we prefer the use of a model in this and similar situations.

Data Sets with Full Variable Coverage and Complementary Units
The second basic situation also concerns multiple cross-sectional micro data sets covering the target population, but in this case, the different data sets contain different units (see Figure 2). We refer to this as the 'split-population' case. Provided the data are in an ideal error-free state and the concepts are identical, the different data sets are complementary to each other in this case and, as in Situation 1, they can simply be 'added' to each other in order to produce output statistics. However, in practice, often a harmonisation step will be necessary to correct for differences in the conceptual definitions of the variables.
An example of Situation 2 is the estimation of quarterly turnover at Statistics Netherlands. The turnover data are available from a combination of census data and administrative VAT data, and both are linked to the GBR (Van Delden and De Wolf, 2013). The VAT data are available for fiscal administrative units, and they can be uniquely linked to the enterprises in the GBR only for the small and medium sized enterprises. The complementary group of large and complex enterprises receives a census survey. Statistics New Zealand (Chen et al., 2016) uses a very similar approach, where sub-annual sales data are obtained from administrative Goods and Service Tax data, complemented by survey data for the large and complex units.
A method to harmonise variables based on multiple data sets that relies on the assumption that one data set can be used as the 'gold standard' is given in Van Delden et al. (2016). They analysed the relation between the metadata and the data of annual survey turnover and VAT in 2009 and 2010, where survey turnover was considered to be the 'gold standard'. The relation was analysed for more than 300 domains of economic activity. They divided the domains into four groups. The Control group concerned domains where there are no conceptual differences in the definitions of survey and VAT turnover. These domains showed a linear relationship with an intercept close to 0 and a slope that was very close to 1. The Accept group concerned domains with conceptual differences but only small numerical differences. The Adjust group concerned domains with conceptual differences and systematic numerical differences. For the units in these domains, a correction factor can be applied to estimate the survey turnover values from the VAT turnover values. The final group, Reject, concerned domains with conceptual differences and large non-systematic numerical differences. For units in the Reject group, VAT data cannot be used, and we have to continue using survey data. For units in the Control, Accept and Adjust groups, the survey can be abolished. Examples of the relations between survey and VAT turnover can be found in Figure 2D-F in Van Delden et al. (2016).
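The idea of relating VAT turnover to survey turnover within a domain can be illustrated with a simple least-squares fit. The sketch below is not the procedure of Van Delden et al. (2016); it uses ordinary least squares on a handful of made-up paired observations (all numbers are hypothetical) and then uses the fitted line to predict survey turnover from VAT turnover, as one would for a domain with a systematic difference.

```python
import numpy as np

# Hypothetical paired observations for one domain (same units, same year).
vat_turnover = np.array([100.0, 250.0, 400.0, 800.0, 1200.0])
survey_turnover = np.array([92.0, 228.0, 371.0, 730.0, 1105.0])

# Ordinary least squares fit: survey = intercept + slope * VAT.
# np.polyfit returns the coefficients from the highest degree downwards.
slope, intercept = np.polyfit(vat_turnover, survey_turnover, deg=1)


def predict_survey_turnover(vat_values):
    """Predict survey turnover from VAT turnover with the fitted line."""
    return intercept + slope * np.asarray(vat_values, dtype=float)


print(round(float(slope), 3), round(float(intercept), 2))
print(predict_survey_turnover([500.0]))
```

In this toy domain the intercept is close to 0 and the slope is clearly below 1, which is the kind of systematic numerical difference for which a correction factor might be considered.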
Another example of Situation 2 is where national figures are composed from a large set of decentralised, autonomous administrations, for instance, national health care figures based on regional administrations. In such situations, hierarchical models may be of use where the mean of each decentralised system is modelled as a random effect and the individual records are nested within each separate source. The challenge there is to account for a bias component

Overlapping Variables but Non-overlapping Units
A slightly different situation occurs when, besides having non-overlapping units as in Situation 2, we also have a number of overlapping variables and some target variables that are available in only one of the data sets. We call this Situation 3 (see Figure 3). We still would like to join the target variables $Y_1$ in one of the data sets to target variables $Y_2$ in another data set and estimate the joint distribution of variables $Y_1$ and $Y_2$ (represented by the rectangle in Figure 3, where the estimates of both variables are divided into different classes). For this, statistical matching techniques are available.
In Italy, the main data sets available for estimating household income and expenditure are the Household Budget Survey conducted by the Italian National Institute of Statistics and the Survey on Household Income conducted by the National Bank of Italy. Unfortunately, there is no single data set available that contains data on both household income and expenditure. In order to examine the effects of policy changes on the relation between household income and expenditure, one therefore resorts to using statistical matching (see Conti et al., 2017).
Statistical matching differs fundamentally from record linkage. Whereas in record linkage one aims to link a record from a unit in one data set to a record from the same unit in another data set, in statistical matching, one essentially aims to match a record of a unit in one data set to a record from a similar, but generally not the same, unit in another data set.
Statistical matching can be carried out at the micro level or at the macro level. When statistical matching is carried out at the micro level, one combines data from individual units in the different data sets to construct synthetic records with information on all variables. In particular, when there are two data sets, information from one data set, the donor data, is used to estimate target values in the other data set, the recipient data. The records constructed are a mix of data from different units from different data sets.
When statistical matching is carried out at the macro level, one assumes a parametric model for all the data, for instance, a multivariate normal model for numerical data, and then estimates the parameters of this model. These parameters are subsequently used to estimate the population parameters one is interested in. For an overview of methods for statistical matching at both the macro level and the micro level, we refer to Chapters 2 and 3 in D'Orazio et al. (2006).
In Figure 3, we have two data sets. Data set 1 contains variables $Y_1$ and $Z$, and data set 2 contains $Y_2$ and again $Z$. Variables $Z$ are the common (background) variables that are used to statistically match the records. When statistical matching is carried out at the micro level, variables $Z$ are used to match individual units in data set 1 to individual units in data set 2. The fundamental issue of statistical matching is that the relationship between the target variables $Y_1$ and $Y_2$ cannot be estimated directly, but only indirectly. In order to do so, one has to rely on untestable assumptions, that is, untestable from the data sets themselves, about this relationship. The most common assumption is the conditional independence assumption (CIA), which says that conditional on the values of background variables $Z$, the target variables $Y_1$ and $Y_2$ are independent. In general, the joint relationship between $Y_1$ and $Y_2$ can be decomposed into a part which is explained by $Z$ and a remaining part which is unexplained by $Z$. In the simple case of a trivariate normal distribution, this can be written as
$$\rho_{Y_1 Y_2} = \rho_{Y_1 Z}\,\rho_{Y_2 Z} + \rho_{Y_1 Y_2 \mid Z}\sqrt{\left(1 - \rho_{Y_1 Z}^2\right)\left(1 - \rho_{Y_2 Z}^2\right)},$$
where $\rho_{Y_1 Y_2 \mid Z}$ denotes the partial correlation between $Y_1$ and $Y_2$ given $Z$ (Stuart and Ord, 1991, pp. 1010-1011). If the CIA holds, then $\rho_{Y_1 Y_2 \mid Z} = 0$.
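The consequence of not observing $Y_1$ and $Y_2$ jointly can be illustrated numerically. For a single background variable and a trivariate normal distribution, the standard partial-correlation identity says that the correlation between $Y_1$ and $Y_2$ equals the product of their correlations with $Z$ plus the partial correlation given $Z$ times $\sqrt{(1-\rho_{Y_1 Z}^2)(1-\rho_{Y_2 Z}^2)}$. Since the partial correlation is not identified from the two data sets and can lie anywhere between -1 and 1, only an interval for the correlation is identified; the CIA picks the midpoint. The sketch below computes this interval; the function name and the input values are hypothetical.

```python
import math


def correlation_bounds(rho_y1z, rho_y2z):
    """Return (lower, cia_value, upper) for corr(Y1, Y2).

    rho_y1z and rho_y2z are the correlations that can be estimated from
    data set 1 and data set 2, respectively. The bounds follow from letting
    the unidentified partial correlation range over [-1, 1].
    """
    cia_value = rho_y1z * rho_y2z  # partial correlation equal to 0
    half_width = math.sqrt((1 - rho_y1z ** 2) * (1 - rho_y2z ** 2))
    return cia_value - half_width, cia_value, cia_value + half_width


# Hypothetical observable correlations with the background variable Z.
lower, cia_value, upper = correlation_bounds(0.8, 0.6)
print(lower, cia_value, upper)  # roughly 0.0, 0.48 and 0.96
```

Even with fairly strong correlations with $Z$, the interval is wide, which illustrates why the untestable assumption matters so much for the output.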
As an alternative to the CIA, the so-called instrumental variable assumption has recently been proposed (see Kim et al., 2016). An instrumental variable is a variable that induces changes in the target variable of one data set but has no effect on the target variable of the other data set. In practice, it may be hard to find such a variable.
When the total output uncertainty based on the CIA or instrumental variable assumption is too large, one can make use of auxiliary data (Singh et al., 1993). One option is to link an administrative variable to both data sets. Van Delden et al. (2019) found that even when the administrative variable is strongly related to a target variable in one of the data sets, the resulting uncertainty is often too large to be useful in official statistics. Alternatively, one might use a third data set where the common variables and the target variables in the two data sets are observed. This third data set can be obtained from a population that is close to the target population (a proxy) or it can concern data from a small overlap of the two data sets. The use of such a third data set would lead to Situation 4, which is discussed in the next section.

Overlapping Variables and Overlapping Units
Situation 4 (see Figure 4) is characterised by a deviation from Situation 2, by which there exists an overlap concerning both units and measurements between the different data sets.
In this situation, at least for a subset of the units in the population, we have multiple measurements of the same target variable(s), coming from different data sets. Due to measurement and timing errors, these observed variables from different sets will usually not agree exactly for all units. An example of Situation 4 arises in education statistics in the Netherlands. There exist both administrative and survey data on the education level of Dutch people (Linder et al., 2011). Some persons can be found in both data sets, and the respective education level measurements do not always agree with each other as both sets may contain measurement errors.
When the same phenomenon is observed for the same units in multiple data sets, one can utilise the multiple observations to identify and correct residual errors. An approach that is often used in practice at NSIs is micro-integration (Bakker, 2011a). In addition to the harmonisation step described in Section 3, in the present situation, micro-integration also involves comparing the available observations for each overlapping unit to determine which of the data sets is most likely to contain the best approximation of the true value for that unit. Often, deterministic correction and derivation rules are used for this. In many applications, some form of micro-editing is also needed to obtain consistency between different target variables observed in different data sets (Di Zio and Luzi, 2014; De Waal et al., 2011).
Micro-integration is a rather crude and somewhat subjective technique. It can be used to harmonise the most important and most obvious inconsistencies between data sets, but not to harmonise more subtle inconsistencies. When such more subtle inconsistencies are caused by measurement error, it may in some cases be possible to find an appropriate statistical model for the measurement errors in the observed variables. Model-based estimates can then be obtained for the underlying true values of the target variable(s), either at the individual level or directly at the level of the target parameters. The true value itself is (usually) not observed; this is called a latent variable. The precise relation between the latent true value and the observed values depends on the type of model. In their basic form, most measurement error models assume that the errors are independent across observed variables, given the underlying true value; this is known as the local (or conditional) independence assumption.
To model measurement errors in numerical data, one may use a structural equation model (e.g. Bollen, 1989) or a finite mixture model (e.g. McLachlan and Peel, 2000). Recently, applications of structural equation modelling to multi-source statistics have been considered by Bakker (2012) and Scholtus et al. (2015). Finite mixture models have been developed by Meijer et al. (2012) and Guarnera and Varriale (2015, 2016). Under such a model, the population is supposed to consist of two or more components where each component has a different distribution of observed values, and each unit is supposed to belong to one of these components. Guarnera and Varriale explicitly consider the case that measurement errors are 'intermittent': part of the observed values in each data set are correct, and the remaining values contain errors.
For categorical data, models based on latent class (LC) analysis can be used (e.g. Hagenaars and McCutcheon, 2002). Applications of LC models to measurement errors in statistical data are considered by, among others, Biemer (2011), Si and Reiter (2013), Pavlopoulos and Vermunt (2015) and Oberski (2017). Boeschoten et al. (2017) also use an LC model to model the true value of a variable that is observed (with measurement error) in multiple sources. We sketch their approach. Let $Y = (Y_1, Y_2, \ldots, Y_s)'$ denote a vector of observed categorical variables that measure the same conceptual variable of interest (e.g. in $s$ different data sources). The true value with respect to the variable of interest is represented by a latent variable $X$. We assume that all variables $Y_j$ and $X$ have the same set of categories, say $1, \ldots, L$. Under the local independence assumption, the marginal probability $\Pr(Y = y)$ of observing the particular vector of values $y = (y_1, y_2, \ldots, y_s)'$ can be expressed as
$$\Pr(Y = y) = \sum_{x=1}^{L} \Pr(X = x) \prod_{j=1}^{s} \Pr(Y_j = y_j \mid X = x).$$
Estimating the LC model amounts to estimating the probabilities in the right-hand-side of this expression. The model can be used to estimate, for each unit in the data, the probability of belonging to a particular LC, given its vector of observed values:
$$\Pr(X = x \mid Y = y) = \frac{\Pr(X = x) \prod_{j=1}^{s} \Pr(Y_j = y_j \mid X = x)}{\Pr(Y = y)}.$$
These posterior probabilities are used to create multiple imputations of the true value for each unit, and the estimates based on the imputed data are pooled using the standard rules for multiple imputation (Rubin, 1987, p. 76). An essential aspect of these pooling rules is that an estimated variance of the pooled estimates is obtained.
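To illustrate how an LC model turns observed values into probabilities for the true value, the sketch below evaluates the model for given parameters: under local independence, the probability of an observed vector is the sum over latent classes of the class probability times the product of the conditional response probabilities, and the posterior class probabilities follow from Bayes' rule. This is not an estimation routine and not the implementation of Boeschoten et al. (2017); the parameter values are hypothetical, and categories are coded from 0 rather than from 1 for convenience.

```python
import numpy as np


def lc_marginal_and_posterior(class_probs, cond_probs, y):
    """Evaluate a latent class model with known parameters.

    class_probs: shape (L,), Pr(X = x).
    cond_probs:  shape (s, L, L), cond_probs[j, x, c] = Pr(Y_j = c | X = x).
    y:           observed categories (0-based), one per source, length s.
    Returns Pr(Y = y) and the vector of Pr(X = x | Y = y).
    """
    class_probs = np.asarray(class_probs, dtype=float)
    cond_probs = np.asarray(cond_probs, dtype=float)
    # joint[x] = Pr(X = x) * prod_j Pr(Y_j = y_j | X = x)
    joint = class_probs.copy()
    for j, y_j in enumerate(y):
        joint = joint * cond_probs[j, :, y_j]
    marginal = joint.sum()
    posterior = joint / marginal
    return marginal, posterior


# Hypothetical parameters: two categories, two sources.
class_probs = [0.7, 0.3]
cond_probs = [
    [[0.90, 0.10], [0.20, 0.80]],  # source 1: rows are true classes
    [[0.95, 0.05], [0.10, 0.90]],  # source 2
]

# The two sources disagree: source 1 reports category 0, source 2 category 1.
marginal, posterior = lc_marginal_and_posterior(class_probs, cond_probs, (0, 1))
print(marginal, posterior)
```

With these illustrative parameters, the second source is the more reliable one, so the posterior favours its reported category even though the prior favours the other class.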
The method is, besides the local independence assumption, based on two additional assumptions: that measurement errors are independent of the covariates and that covariates do not contain classification errors. When covariates do contain classification error, the method can lead to biased estimates.
Estimated relations between the target variable and covariates are only valid when these covariates are taken into account in the LC model and if there is not too much measurement error in the underlying data sets. If covariates are not taken into account in the LC model, either a new LC model needs to be estimated and applied or a correction method should be used (see, e.g. Boeschoten et al., 2018).
A related method for correcting for measurement error is multiple over-imputation, where data affected by measurement error are multiply imputed (see, e.g. Blackwell et al., 2017). In contrast to imputation, with over-imputation observed values may be replaced by imputed values. Van der Heijden et al. (2018) proposed an imputation approach for the case where the measurements of a target variable in one data set are considered to be of higher quality than the measurements of that variable in other data sets, and some values in the higher quality data set are missing.
Before applying a structural equation model, LC model or imputation model, large errors in the data usually need to be corrected by micro-integration or a form of micro-editing.

Undercoverage and Overcoverage
Situation 5 is characterised by a further deviation from Situation 4, by which the combined data entail undercoverage of the target population, even when the data are otherwise in an ideal error-free state (see Figure 5). In this situation, the total population size is not known.
Producers of official statistics are often interested in estimating the unknown size of a population. In particular, an important problem in a population census is to estimate the number of persons in the target population who were missed by all data sets used in the census. The so-called capture-recapture methods are often used to solve this problem (Fienberg, 1972; Chapter 6 in Bishop et al., 1975; International Working Group for Disease Monitoring and Forecasting, 1995).
The simplest application of the capture-recapture method is based on two independent samples from the target population. Consider a $2 \times 2$ contingency table with the observed counts of persons being included or excluded in the first and second sample. Let $n_{11}$ denote the observed number of persons in the overlap of the two samples, and let $n_{10}$ and $n_{01}$ denote the numbers of persons observed in the first sample but not the second sample and vice versa. By definition, one does not observe any persons that are not in either sample ($n_{00} = 0$). Let $m_{00}$ denote the expected number of persons in the population that are not observed in either sample. If the samples are independent, a consistent estimator for $m_{00}$ can be obtained from the observed counts as follows (e.g. Bishop et al., 1975, p. 232): $\hat{m}_{00} = n_{10} n_{01} / n_{11}$. An estimate for the total population size, including the part that was missed by both samples, is then given by $\hat{N} = n_{11} + n_{10} + n_{01} + \hat{m}_{00}$. Formally, the capture-recapture method can be derived from a loglinear model for the aforementioned contingency table (see Chapter 6 in Bishop et al., 1975). This approach is also referred to as dual system estimation (Ding & Fienberg, 1994).
An example of Situation 5 where the capture-recapture method can be applied concerns a population census followed by a post-enumeration survey (Wolter, 1986; Brown et al., 1999; Brown et al., 2006). Here, the post-enumeration survey is conducted with the specific aim of estimating the undercount in the original population census. The capture-recapture method can also be applied by NSIs that conduct a census based on administrative data (Van der Heijden et al., 2012; Baffour et al., 2013; Gerritse, 2016). In this case, data from at least two administrative sources are linked together, and each data set is considered as an independent sample from the population. Gerritse et al. (2016) applied a capture-recapture method to estimate the amount of undercoverage in the population size estimate of the 2011 Dutch census, which is a virtual census in the sense that it is mainly based on a number of administrative data sets, supplemented with sample survey data. The census itself was based on the Dutch population register. For the estimation of undercoverage, two additional registers were linked to the population register: an employment register and a crime suspects register. The census aims to count the number of 'usual residents', where persons are classified as usual residents if they have lived at least 12 months in the Netherlands or intend to do so at the time of the census. Gerritse et al. (2016) used probabilistic linkage to link the three registers. To handle missing values on the 'usual resident' status, two different approaches were used: maximum likelihood estimation and imputation by predictive mean matching. The latter approach was found to be more flexible and therefore preferred by the authors.
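As a minimal sketch of the two-sample estimator described above, the following Python function computes the estimated number of persons missed by both sources and the resulting population size estimate from the three observed counts. The function name and the counts are hypothetical and serve only as illustration; the sketch assumes that the two sources are independent and that linkage is free of errors.

```python
def dual_system_estimate(n11: int, n10: int, n01: int) -> tuple[float, float]:
    """Return (m00_hat, N_hat) for two linked sources.

    n11: persons observed in both sources
    n10: persons observed in the first source only
    n01: persons observed in the second source only
    """
    if n11 <= 0:
        # The estimator divides by the overlap, so it must be positive.
        raise ValueError("the overlap n11 must be positive")
    m00_hat = n10 * n01 / n11           # estimated number missed by both sources
    n_hat = n11 + n10 + n01 + m00_hat   # estimated total population size
    return m00_hat, n_hat


# Hypothetical counts, chosen only for illustration.
m00_hat, n_hat = dual_system_estimate(n11=600, n10=300, n01=200)
print(m00_hat, n_hat)
```

With these illustrative counts, the estimated number of missed persons is 100 and the estimated population size is 1200. The estimate is undefined when the overlap is empty, which is why the sketch rejects a non-positive overlap count.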
The capture-recapture method is based on five assumptions (Gerritse, 2016):
(a) The event of being observed in one data set should be independent of the event of being observed in the other data set. This assumption can be relaxed if there are three or more sources (see Chapter 6 in Bishop et al., 1975).
(b) The population is closed.
(c) All units in the target population have a positive probability of being observed in each of the data sets.
(d) Units in the different data sets can be linked without errors.
(e) The data sets do not contain units that do not belong to the target population ('erroneous captures'), nor do they contain duplicates.
These assumptions are rather strong. Research has shown that estimates of population size based on the capture-recapture method can be severely biased when some of these assumptions are violated (Brown et al., 2006; Van der Heijden et al., 2012; Gerritse, 2016).
There is ongoing research into generalisations of the capture-recapture method and alternative methods that require weaker assumptions. Assumptions (a) and (c) are often relaxed by adding covariates to the model. Here, a problem may be that some covariates are not available in all data sources. Incomplete covariates may be handled by maximum likelihood under a Missing At Random assumption; see Van der Heijden et al. (2018) for a recent discussion with applications. Lawless (2014, Chapter 17) discussed adaptations of the capture-recapture method to open populations (assumption (b)). Extensions that can account for linkage errors (assumption (d)) were developed by Fienberg (1994, 1996) and Di Consiglio and Tuoto (2015). De Wolf et al. (2018) provide a synthesis and further generalisation of these extensions. These methods work under probabilistic record linkage, by correcting the observed counts for bias due to erroneous and missed links.
Assumption (e) is violated in the presence of overcoverage in one or more data sets. Di Cecco et al. (2018) have developed an extended capture-recapture method that can account for overcoverage as well as data sets that contain certain specific subpopulations only (so that not all units in the target population have a positive probability of being observed in each of the data sets, and assumption (c) is violated). This approach is based on an LC model, with erroneous captures indicated by a latent variable. A practical drawback of this method is that it requires at least four linked data sets. An alternative approach for handling simultaneous undercoverage and overcoverage, which is not based on the capture-recapture method, was proposed by Zhang (2015).
Overcoverage is a wider problem that also occurs outside the context of capture-recapture methods. For instance, a population register may suffer from overcoverage due to delayed deregistration of inactive units. In practice, overcoverage and duplicated records are often handled by clerical review or by applying deterministic rules (Di Cecco et al., 2018). Assessing the amount of overcoverage and its effects on estimates may be difficult in some applications, in particular when overcoverage is caused by false positive linkage errors (Bakker, 2011b). In the context of a traditional census, the overcoverage rate is usually estimated from a post-enumeration survey. In a multi-source context, the overcoverage rate may be assessed by linking administrative or survey data from auxiliary sources to the main data set (UN/ECE, 2014, pp. 75-77).

Aggregated Data Only
Situation 6 (see Figure 6) is the macro data counterpart of Situation 4: in Situation 6, only aggregated data overlap with each other and need to be reconciled. An example of Situation 6 is provided by the National Accounts, where aggregated data from different data sets need to be reconciled with each other subject to both equality and inequality constraints.
To reconcile aggregated data, macro-integration can be used (see, e.g. Mushkudiani et al. 2012). When macro-integration is applied, only estimated figures at an aggregated level are adjusted. The goals of macro-integration are to obtain a more accurate, numerically consistent and complete set of estimates for the variables of interest.
Often, the starting point of macro-integration is a set of estimates in tabular form. The entries of the tables are adjusted so that all differences between tables are reconciled, and the entries with the highest variance are adjusted the most. In the macro-integration approach, often a constrained optimisation problem is constructed. A target function, for instance, a quadratic form of differences between the original and the adjusted values, is minimised, subject to the constraints that the adjusted common figures in different tables are equal to each other and additivity of the adjusted tables is maintained. Inequality constraints can be imposed on these quadratic optimisation problems. In the literature, Bayesian macro-integration methods have also been proposed. Several methods for macro-integration have been developed, see, for instance, Stone et al. (1942), Byron (1978), Sefton and Weale (1995), Magnus et al. (2000), Boonstra et al. (2011), Mushkudiani et al. (2012) and Daalmans (2015).
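For the special case of linear equality constraints only, the quadratic optimisation problem has a closed-form solution. The sketch below illustrates this special case in Python with NumPy. The function name `reconcile`, the assumption of uncorrelated estimates (a diagonal covariance matrix) and the numerical values are made up for illustration; inequality constraints would require a general quadratic programming solver instead.

```python
import numpy as np


def reconcile(x0, variances, A, b):
    """Adjust estimates x0 so that A @ x = b, minimising variance-weighted squared changes."""
    x0 = np.asarray(x0, dtype=float)
    V = np.diag(np.asarray(variances, dtype=float))  # assumed diagonal covariance
    A = np.atleast_2d(np.asarray(A, dtype=float))
    b = np.atleast_1d(np.asarray(b, dtype=float))
    discrepancy = b - A @ x0
    # Lagrange multipliers of the equality constraints.
    multipliers = np.linalg.solve(A @ V @ A.T, discrepancy)
    # Adjustments are scaled by the variances, so less reliable entries move more.
    return x0 + V @ A.T @ multipliers


# Hypothetical figures: two components and a separately estimated total.
initial = [40.0, 70.0, 100.0]
variances = [1.0, 3.0, 1.0]
A = [[1.0, 1.0, -1.0]]   # component 1 + component 2 - total = 0
b = [0.0]
adjusted = reconcile(initial, variances, A, b)
print(adjusted)
```

In this illustration the initial figures are inconsistent by 10. The second component, which has the largest variance, absorbs the largest part of that discrepancy, and the adjusted components add up to the adjusted total.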
Macro-integration can reconcile several tables simultaneously, as long as the number of variables or constraints does not become too large. With current software and computers, problems with several hundred thousand unknowns and constraints can be solved.
Macro-integration can only be applied to correct random errors, not systematic errors, because applying it to systematic errors is likely to lead to biased results. Systematic errors, especially large ones, have to be corrected by another approach, for example, by manual data editing, before macro-integration can be applied successfully.
When one wants to use macro-integration, it is important that (an approximation to) the variance of each entry in the tables to be reconciled is available, can be computed or can somehow be approximated. In some cases, one may have to rely on expert knowledge in order to approximate these variances (see, e.g. Xie et al., 2018).
In practice, results after macro-integration of large sets of tables, such as National Accounts, are checked manually for plausibility, for instance, by inspecting time series of reconciled figures. If needed, the reconciliation is repeated after removing some errors overlooked in the first instance.

Micro Data and Aggregated Data
Situation 7 (see Figure 7) is a variation on Situation 4 in which aggregated data are available in addition to micro data. There is still overlap between the data sets, from which the need arises to reconcile the statistics at some aggregated level. Of particular interest here is the case that the aggregated data are estimates themselves. Otherwise, the reconciliation can be achieved by means of calibration, which is a standard approach in survey sampling (see, e.g. Chapter 6 in Särndal et al., 1992). In Figure 7, the aggregated data are denoted by $\hat{T}_1, \ldots, \hat{T}_p$ to highlight that in practice, these are often estimated population totals.
We assume that several tables have to be estimated using the available micro data and aggregated data. An example of Situation 7 is the Dutch Population census, which is based on a mix of administrative data sets and sample survey data as mentioned before. Population totals, either known from an administrative data set or previously estimated, are imposed as benchmarks provided they overlap with an additional survey data set that is needed to produce new output statistics.
When micro data and aggregated data have to be reconciled, several methods are available, such as repeated weighting, repeated imputation, mass imputation and macro-integration (see also De Waal, 2016). In repeated weighting, population tables are estimated sequentially. Data from a data set covering the entire population can simply be counted. Data only available from surveys are weighted. A separate set of weights is assigned to survey units for each table of population totals to be estimated. When estimating a new table, all cell values and margins of this table that are known or have already been estimated for previous tables are kept fixed. This is achieved by using regression weighting to calibrate to these known or previously estimated values (Houbiers, 2004). This ensures numerical consistency of the cell values and margins of the new table and previous estimates, if calibration weights can be found. That such calibration weights can be found is not guaranteed, however. Repeated weighting is mainly applied to ensure numerical consistency between estimated tables. However, calibrating to totals based on large sample sizes generally leads to a reduction of the sample variance for tables based on smaller sample sizes (see, e.g. Houbiers, 2004).
A strong aspect of repeated weighting is that (statistical and logical) relationships between data items from a single data source are automatically maintained. The occurrence of empty cells in high-dimensional tables, that is, cells without any observations, complicates the use of repeated weighting as weighting empty cells leads to population estimates with value zero. In some cases, either very large or very small weights may then have to be given to other cells in order to preserve known or previously estimated values. In other cases, it may not even be possible to find suitable weights at all.
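A single calibration step of repeated weighting can be sketched as follows, assuming linear (regression) calibration with indicator variables for the cells whose totals are to be preserved. The function name, the toy sample and the benchmark totals are hypothetical, and the sketch is not meant to represent the production implementation of any NSI.

```python
import numpy as np


def calibrate_weights(design_weights, X, totals):
    """Linear calibration: w_i = d_i * (1 + x_i' lam) such that sum_i w_i x_i = totals."""
    d = np.asarray(design_weights, dtype=float)
    X = np.asarray(X, dtype=float)
    t = np.asarray(totals, dtype=float)
    gram = X.T @ (d[:, None] * X)               # sum_i d_i x_i x_i'
    lam = np.linalg.solve(gram, t - X.T @ d)    # gap between benchmarks and design estimates
    return d * (1.0 + X @ lam)


# Toy survey of six persons; columns of X indicate male and female.
X = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
design_weights = np.full(6, 10.0)
benchmarks = [36.0, 24.0]                       # previously fixed totals by sex
weights = calibrate_weights(design_weights, X, benchmarks)

# A new table (employment by sex) estimated with the calibrated weights.
employed = np.array([1, 1, 0, 1, 0, 0])
employed_by_sex = X.T @ (weights * employed)
print(weights, employed_by_sex)
```

After calibration, the weighted counts by sex reproduce the benchmark totals of 36 and 24, so a new table estimated with these weights is numerically consistent with that margin. If a benchmark cell contains no sample units, the corresponding column of `X` is zero and the linear system has no unique solution, which mirrors the empty-cell problem described above.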
Repeated imputation is similar to repeated weighting. Repeated imputation is again a sequential approach where tables are estimated one by one. For some variables in a table, estimates may have already been produced while estimating a previous table. These variables are then calibrated to the previously estimated values by applying an imputation method that preserves known or previously estimated values. For each new table to be estimated, a new imputation model is constructed.
The occurrence of empty cells is usually not a serious problem for these imputation methods. However, with repeated imputation it may be difficult to preserve relationships between variables, even for variables occurring in the same data set. The results of both repeated weighting and repeated imputation depend on the order in which tables are estimated.
A prerequisite for applying repeated imputation is an imputation method that succeeds in preserving the statistical aspects of the true data as well as possible and that is able to preserve previously estimated values. Preferably, the imputation method should also satisfy edit restrictions on the data. Such imputation methods have been developed by, for instance, Chambers and Ren (2004), Zhang (2008), Zhang and Nordbotten (2008), Pannekoek et al. (2013), Coutinho et al. (2013), Kim et al. (2014), Da Silva and Zhang (2014) and De Waal et al. (2017a). Which imputation method is most appropriate depends on the kind of data (e.g. numerical versus categorical data), the missing data mechanism and the aims one tries to fulfil (e.g. should logical rules, such as that males cannot be pregnant, be fulfilled at the micro level?).
When mass imputation is used, one imputes all fields for which no value was observed for all population units. Mass imputation hence leads to a data set with values for all variables and all units. After imputation, estimates for population totals can be obtained by simply counting or summing the values of the corresponding variables.
The major risk of mass imputation is that the mass-imputed data may be used to estimate or analyse aspects that were not accounted for in the imputation model. The results of such an estimation or analysis procedure are likely to be biased. It is generally impossible to capture all relevant variables and relations in the imputation model, simply because there are not enough observations to estimate all model parameters accurately, which implies that many relations found in the imputed data will not reflect the relations in the population. Note that this is not necessarily a problem for repeated imputation. In that case, a separate imputation model, involving a limited number of variables only, is constructed for each new table. Mass imputation has, for instance, been studied by Shlomo et al. (2009).
Macro-integration has already been described for Situation 6 and can be applied in Situation 7 too by first transforming the micro data to aggregated data themselves. As the transformation is usually carried out by means of weighting the data, empty cells may complicate the procedure, just like for repeated weighting. A (potential) drawback of the macro-integration approach in Situation 7 is that one cannot re-calculate the adjusted table figures from the underlying micro data directly. This problem may in some cases be overcome by deriving weights by means of the calibration estimator, using the reconciled macro-integrated figures to calibrate the results. Such weights do not necessarily exist, however.
An advantage of macro-integration over repeated weighting and repeated imputation is that all tables to be estimated can be produced simultaneously. So, whereas the results of repeated weighting and repeated imputation are order dependent, the results of macro-integration are not. Besides, the simultaneous estimation of all tables may lead to more accurate estimates. In summary, which method is most suitable for reconciling micro data with macro data depends on the properties of the data and on the targeted results. The answer depends on questions such as: is it important that the macro estimates can be directly (re-)calculated from the micro data, are there many empty cells, do logical relations play a role, and will the micro data be used by other researchers?

Longitudinal Data
Finally, longitudinal data are introduced in Situation 8. We limit ourselves to the issue of reconciling a time series of high frequency with one of a low frequency, as illustrated in Figure 8. The difference with the macro-integration in Situation 6 is that the data are now related to each other over time. The data of the low-frequency series are usually considered to be exogenous and are kept fixed, because these are usually based on the most comprehensive information.
When a high-frequency series is adjusted to have temporal consistency with a low-frequency series of the same variable, usually measured from a different data source, this is known as benchmarking (European Commission, 2018, p. 7). A related problem is that of disaggregation: a series of low frequency of a target variable is disaggregated by using an indicator series of high frequency for the target variable (European Commission, 2018, p. 7).
Situation 8 is for instance found at Statistics Netherlands where monthly turnover based on a sample survey of enterprises is used to compute turnover indices for the short-term statistics. These indices are computed for a number of publication cells. An example of the time series of the publication cell 'Manufacture of cutlery, tools and general hardware', from January 2010 till December 2011 is given in Figure 9. These sample survey data (labelled as 'source' in Figure 9) are benchmarked against quarterly turnover values. The horizontal lines in Figure 9 represent the average monthly index values per quarter of the source and the benchmark data. The quarterly benchmark turnover values are largely based on VAT data supplemented by survey data, which was explained already in the example for Situation 2. These quarterly data are kept fixed, because they cover nearly the complete population.
A wide range of methods is available for benchmarking. Perhaps the most basic method is to preserve the original levels with prorating. Prorating means that the level estimates are adjusted with the same relative factor. Another method to preserve the original levels is that by Chow and Lin (1971). It expresses the estimation of the high-frequency values as a linear regression on the low-frequency values and finds the solution by generalised least squares.
A disadvantage of prorating and of the Chow-Lin method is that they lead to the so-called step problem: when observing reconciliation adjustments of the changes between two successive high-frequency periods, disproportionately large adjustments may be observed in the transition from one low-frequency period to the next. For instance, in the turnover example, the monthly growth rate in January 2011 was 57.5% in the source data, and after applying prorating, it was adjusted to 16.1% due to the step problem (Figure 9). A similarly large adjustment can be seen in the growth rate of July 2011.
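The step problem can be made concrete with a small sketch of prorating. The series and the quarterly benchmarks below are invented for illustration (they are not the turnover data of Figure 9), and the benchmark constraint is taken to be that the three monthly values of a quarter sum to the quarterly value.

```python
def prorate(monthly, quarterly):
    """Scale the three months of each quarter by one factor so that they sum to the benchmark."""
    if len(monthly) != 3 * len(quarterly):
        raise ValueError("expected three monthly values per quarterly value")
    benchmarked = []
    for q, target in enumerate(quarterly):
        block = monthly[3 * q: 3 * q + 3]
        factor = target / sum(block)            # same relative adjustment within the quarter
        benchmarked.extend(value * factor for value in block)
    return benchmarked


def growth_rates(series):
    """Period-on-period growth rates of a series."""
    return [curr / prev - 1.0 for prev, curr in zip(series, series[1:])]


# Hypothetical monthly source series and quarterly benchmarks.
source = [100.0, 102.0, 104.0, 106.0, 108.0, 110.0]
quarterly = [306.0, 356.4]
prorated = prorate(source, quarterly)

for before, after in zip(growth_rates(source), growth_rates(prorated)):
    print(f"{before:7.2%} {after:7.2%}")
```

Within each quarter the growth rates of the source are unchanged, because all three months are scaled by the same factor. The whole correction therefore shows up in the growth rate at the quarter boundary, which is about 1.9% in the source and about 12.1% after prorating in this example.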
An alternative to level preservation is movement preservation (MP). MP methods aim to preserve the changes in the original high-frequency series. Examples of methods in this class are the ones by Denton (1971), their slightly modified variants by Cholette (1984) and the extensions of Chow-Lin by Fernández (1981).
In order to give a more formal presentation of benchmarking, let $x = (x_1, x_2, \ldots, x_n)'$ stand for the values of a monthly time series and let $b = (b_1, b_2, \ldots, b_m)'$ be the values of a quarterly time series, which is kept fixed. Denote the benchmarked values by $\tilde{x}$. After benchmarking, it should hold that $\sum_{j=3(q-1)+1}^{3q} \tilde{x}_j = b_q$ for each quarter $q$. The additive first-order Denton method finds benchmarked values by minimising the squared differences between adjusted and original first-order differences over the entire period of the series (Bikker et al., 2011), more formally stated by
$$\min_{\tilde{x}} \sum_{j=1}^{n} \left( \Delta \tilde{x}_j - \Delta x_j \right)^2 \quad \text{with } \Delta x_j = x_j - x_{j-1} \text{ and } \Delta x_1 = x_1 \text{ (and analogously for } \tilde{x}\text{)}.$$
Therefore, the benchmarked values are determined not only by the corresponding quarters but also by previous and next quarters. This way, a large shift in monthly changes just before and after the end of a quarter is avoided. In the turnover example, the monthly growth rate in January 2011 for the series benchmarked by the MP approach was 35%, which is closer to the growth rate of the source than was the case after benchmarking with prorating. Also, the growth rate adjustment in July 2011 was smaller after applying the MP approach than after prorating.
Benchmarking can also be applied to multiple time-related variables. The problem now is to deal with time constraints and with cross-sectional constraints between variables (Bikker et al., 2011). Di Fonzo and Marini (2003, 2005) and Bikker and Buijtenhek (2006) combined the Denton method for time constraints with the method of Stone et al. (1942) for handling cross-sectional constraints between the variables.
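The univariate additive first-order Denton problem described above is a quadratic programme with linear equality constraints, so it can be solved through its Karush-Kuhn-Tucker system. The following sketch does this with NumPy for a monthly series and quarterly benchmarks. The function name and the data are hypothetical, and the sketch follows the convention in the text that the first difference of the first month equals its level.

```python
import numpy as np


def denton_additive(monthly, quarterly):
    """Additive first-order Denton benchmarking of a monthly series to quarterly sums."""
    x = np.asarray(monthly, dtype=float)
    b = np.asarray(quarterly, dtype=float)
    n, m = x.size, b.size
    if n != 3 * m:
        raise ValueError("expected three monthly values per quarterly value")
    # First-difference operator; its first row implements delta_1 = x_1.
    D = np.eye(n) - np.eye(n, k=-1)
    # Aggregation matrix: row q sums the three months of quarter q.
    J = np.kron(np.eye(m), np.ones((1, 3)))
    # KKT system for: min ||D e||^2 subject to J e = b - J x, where e = x_tilde - x.
    kkt = np.block([[2.0 * D.T @ D, J.T], [J, np.zeros((m, m))]])
    rhs = np.concatenate([np.zeros(n), b - J @ x])
    solution = np.linalg.solve(kkt, rhs)
    return x + solution[:n]


# Hypothetical monthly source series and quarterly benchmarks.
source = [100.0, 102.0, 104.0, 106.0, 108.0, 110.0]
quarterly = [306.0, 356.4]
benchmarked = denton_additive(source, quarterly)
print(np.round(benchmarked, 2))
```

The adjustments are spread over all months instead of being concentrated at the quarter boundary. A series obtained by prorating satisfies the same constraints, so the Denton objective evaluated at the Denton solution can never exceed its value at the prorated series.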
A multivariate benchmarking method can be refined by applying weights to the adjustments made to each series. These weights should reflect the relative accuracy of the estimated growth rates of the high-frequency series. Usually, growth rates of reliably measured series are preserved more strongly than the growth rates of inaccurately measured series. Bikker et al. (2013) extended the method to include other modelling features, such as constraints that have to be satisfied only approximately (soft constraints), ratio constraints and inequality constraints.
The reconciliation methods in this section cannot be used for data with (large) systematic errors, because of a smearing effect: an error in one value contaminates other values' estimates. Hence, it is important to check the time series for large systematic errors and to correct those before applying benchmarking. This is usually carried out interactively by confronting the preliminary data with the constraints.
After benchmarking, one should always inspect the corrections to judge the plausibility of results. Guidelines on how to apply benchmarking in specific situations can be found in European Commission (2018).

Discussion
We are fully aware that the basic situations we have considered in this paper do not offer a complete description of all situations that may arise in practice and that our basic situations give a simplified view of reality. At the same time, we do feel that this paper offers useful guidelines to producers of multi-source statistics. Many situations arising in practice are variations of the basic situations that we have discussed in this paper or combinations of such basic situations. The basic situations and the corresponding methods we discussed in this paper should at least give producers of multi-source statistics a good starting point to handle such cases. For instance, when we are dealing with a combination of two basic situations, a logical starting point would be to consider using methods for these two situations in combination. As an example, for multi-source data with undercoverage and a common target variable with measurement for overlapping units, one could consider using capture-recapture techniques (Section 4.5) in combination with LC models (Section 4.4). This is indeed the approach taken at Statistics Netherlands.
In the discussion of the basic situations, we have pinpointed important issues that can occur for these situations. This will allow producers of multi-source statistics to anticipate the problems that may occur for their specific situation. In the discussion of the basic situations, we also described and gave references to important methods that can be used to overcome the problems. Hopefully, this will give the producers of multi-source statistics a flying start to overcome the problems for their own specific case. Many of the methods referred to in this paper have only recently been developed. These methods are therefore still in their infancy and will hopefully be improved upon in many different aspects in the coming years.
Finally, we remark that after combining data sets, one is usually interested in estimating the accuracy of the outcomes. Different quality measures and methods to compute them for various situations are currently under development for this purpose in the ESSnet on Quality of Multi-source Statistics, which is partly funded by the EU (see, e.g. De Waal et al., 2017b).
We hope that our discussion of various situations will inspire other researchers to do research on the highly important and interesting area of producing multi-source statistics.