Re‐identification in the Absence of Common Variables for Matching

A basic concern in statistical disclosure limitation is the re‐identification of individuals in anonymised microdata. Linking against a second dataset that contains identifying information can result in a breach of confidentiality. Almost all linkage approaches are based on comparing the values of variables that are common to both datasets. It is tempting to think that if datasets contain no common variables, then there can be no risk of re‐identification. However, linkage has been attempted between such datasets via the extraction of structural information using ordered weighted averaging (OWA) operators. Although this approach has been shown to perform better than randomly pairing records, it is debatable whether it demonstrates a practically significant disclosure risk. This paper reviews some of the main aspects of statistical disclosure limitation. It then goes on to show that a relatively simple, supervised Bayesian approach can consistently outperform OWA linkage. Furthermore, the Bayesian approach demonstrates a significant risk of re‐identification for the types of data considered in the OWA record linkage literature.


Introduction
Many important questions can be addressed (if not definitively answered) through the analysis of statistical data. Relevant data might be held by a variety of data stewardship organisations (DSOs) such as governments, charities and businesses. Although the dissemination of such data can benefit society, there are also dangers. Individuals in the data could be harmed if information about them is revealed, and there could also be harm to the relevant DSO through loss of trust. A business that leaks customer information is likely to suffer financially. A government might have a much harder task gathering reliable census data following a confidentiality breach.
On the face of it, we have a simple decision problem. If the utility of the data exceeds the risks, then release the data; otherwise do not. In reality, the problem is much more nuanced. Utility and risk are abstract concepts, and it is a non-trivial task to generate meaningful measures. We also need to bear in mind that the data are a means to an end. We want to use them to answer questions, and we might be able to answer those questions without releasing the complete raw data. The field of statistical disclosure limitation (SDL) is concerned with finding ways to allow data to be used to answer questions of interest while limiting the risks of disclosing sensitive information.
In this paper, we focus on risk, in particular the risk associated with linking two sets of microdata. By somehow merging the data in two (or more) datasets, it might be possible to recover both the identity of individuals and the information that those individuals would rather keep private.
Classical record linkage is carried out using variables that are common to both datasets (Fellegi & Sunter, 1969). Matching records (those which relate to the same individual) would be expected to match on the common variables, while non-matching records would only match on these variables by chance. However, matching between two datasets with no common variables has been attempted in the context of SDL. Nin and Torra (2005) used a data science approach and demonstrated 'better than random' linkage performance. They did not demonstrate levels of risk of practical significance. In this paper, we show that a relatively simple Bayesian approach can consistently outperform the approach of Nin and Torra (2005). Furthermore, we show that the risks can be of practical significance.
In Section 2, we provide a broad overview of SDL. Section 3 provides a brief review of classical record linkage-this is used as a part of the ordered weighted averaging (OWA) approach, and the theory underlying classical linkage is also used to justify the Bayesian alternative to OWA. Section 4 reviews existing SDL approaches to risk assessment that are based on record linkage. Section 5 describes the OWA approach. Section 6 presents details of the Bayesian alternative. Section 7 presents experiments that compare the OWA and Bayesian approaches. Risk assessment using the Bayesian approach is discussed in Section 8.

Statistical Disclosure Limitation
In SDL, we are usually concerned with the risks of either identification or attribution. Identification is the association of an individual with a data record. Attribution is the association of an individual with a (previously unknown) value of a variable. It is important to note that neither of these implies the other. Identification might require knowledge of all the data values in a record, in which case no new values are disclosed. Attribution can result from the association of an individual with a group of records that share common variable values. In the main, we are concerned about identification as a means of achieving attribution. This would be the case if we were to link two datasets, one containing directly identifying information such as name and address, and the other containing sensitive information relating to health status. But there is also concern regarding perceived risk. A DSO's reputation could be damaged if it released data that allowed individuals to be (re-)identified, even if no attribution occurred.
There will generally be uncertainty over whether a claimed identification or attribution is correct, not least because the data might contain errors. We use the term exact to describe identification/attribution that can be made with certainty under the assumption that the data are error free. We use the term approximate to describe identification/attribution that cannot be made with certainty even with error free data. Approximate inferences are a concern if the level of confidence in a hypothesis (usually expressed as a probability) exceeds a given threshold. The greater the confidence, the more likely a claim of identification/attribution will be made and the more likely it will be correct.
Disclosure risks are not limited to any particular form of data. In recent years, researchers have considered the risks posed by genetic data (Gymrek et al., 2013), biometric data (Kumar et al., 2008; Bohannon, 2015), spatio-temporal data (Brownstein et al., 2006; Montjoye et al., 2015; Tockar, 2014) and social network data (Backstrom et al., 2007). Here, we focus on the more traditional problems of risk assessment for microdata (lists of records) and aggregate data (tables of counts). These are the forms of data commonly held by statistical agencies. These agencies would assess the risks of identification/attribution before data were released.
For risk assessment, we generally assume some plausible attack scenario (Paass, 1988; Elliot & Dale, 1999). This will describe the motivation and skills of a data intruder and the precise method of attack. The target of the attack might be a specific individual (e.g. a neighbour of the data intruder). In other cases, the intruder's goal might be to discredit the DSO, in which case all the individuals in the relevant data might be under attack. The attack might focus on re-identification or attribution. Distinct attack scenarios give rise to distinct measures of risk.

Identification Risk
Microdata are often anonymised by removing the more obvious identifying fields such as name and address. However, this is often not enough to reduce identification risks to acceptable levels because combinations of variables can form quasi-identifiers. Sweeney (2002) found that 87% of the US population were distinct from all other members of the population on the quasi-identifier (date of birth, gender and ZIP code). Much of the research relating to identification risk is focused on this kind of distinctiveness, often termed uniqueness. If an intruder attempts to link sample records against a population by matching on the observed values for a given quasi-identifier, then the probability that any found link is a correct match is simply the reciprocal of the population frequency for those values. The sample frequencies constitute lower bounds on the population frequencies, so it is only the sample uniques that can be associated with match probabilities greater than 0.5. However, the population frequency might be well in excess of 1 for some sample uniques. So one issue in SDL is distinguishing which sample uniques are likely to correspond to low population frequencies. This exercise might be undertaken by an intruder in order to focus an attack but would also be carried out by a risk assessor who has no access to the population data. One approach to this problem uses methods developed for association rule mining. Others have chosen to model the population frequencies (Skinner & Holmes, 1998; Smith, 2006; Forster & Webb, 2007).
Another problem arises when a DSO considers releasing a number of aggregate tables from the same underlying dataset. Each released table will have lower dimension than the underlying dataset, perhaps tailored to a specific research question. The counts in such tables will be consistent: for example, two tables with an 'age' dimension will have the same marginal frequency distribution over 'age'. Armed with this information and the non-negativity constraint on counts, a data intruder can attempt to recover the counts in unpublished, higher-dimensional tables. Bounds on such counts can be found using integer linear programming approaches. A relatively simple but effective approach was developed by Dobra and Fienberg (2008), although it is not generally guaranteed to produce the tightest possible bounds. For certain collections of tables, the exact bounds can be calculated very efficiently (Dobra & Fienberg, 2000). The relevance for identification risk is that uniques that were thought to be disguised might be recoverable. Of course, bounds on counts can be used to generate bounds on any empirical probabilities associated with the underlying data. So these calculations are also of use when assessing attribution risk.
There are a range of measures for identification risk, some of which relate to specific attack scenarios. Fienberg and Makov (1998) proposed the proportion of sample uniques that are population unique, $\Pr(PU \mid SU)$. Skinner and Elliot (2002) favoured the number of sample uniques divided by the sum of the corresponding population frequencies, $\Pr(CM \mid UM)$. This is the probability of producing a (correct) match by randomly searching the population until an individual is found who matches (on the relevant variables) against any sample unique. Skinner and Elliot (2002) also present a measure, $\Pr(CM \mid SU)$, which is the probability of finding a (correct) match by randomly selecting a single sample unique and searching the population until a match (on the relevant variables) is found. Smith (2011) discusses these risk measures and others that can be generated under different search strategies. For instance, an intruder can find a (correct) match with probability greater than any of the earlier measures by selecting a number of sample uniques and searching the population until matching individuals (on the relevant variables) are found for each. The intruder selects the last individual found. Smith (2011) also shows that an intruder can do a lot better still by using the modelling approach(es) in Smith (2006) and Forster and Webb (2007), albeit at much greater search cost. (All the earlier approaches are based on a data intruder who has no access to a population dataset and who must perform some kind of search.)
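As an illustration of how these uniqueness-based measures can be computed, the sketch below (ours, not taken from the cited papers) calculates $\Pr(PU \mid SU)$, $\Pr(CM \mid UM)$ and $\Pr(CM \mid SU)$ from the quasi-identifier values observed in a sample and in the population from which it was drawn; the data layout is an illustrative assumption.

```python
# Sketch: uniqueness-based identification risk measures, assuming the
# quasi-identifier values are supplied as tuples for each sample and
# population record.
from collections import Counter

def uniqueness_risk_measures(sample_keys, population_keys):
    samp_freq = Counter(sample_keys)
    pop_freq = Counter(population_keys)

    sample_uniques = [k for k, f in samp_freq.items() if f == 1]
    if not sample_uniques:
        return {"Pr(PU|SU)": 0.0, "Pr(CM|UM)": 0.0, "Pr(CM|SU)": 0.0}

    # Pr(PU|SU): proportion of sample uniques that are also population unique.
    pu_su = sum(1 for k in sample_uniques if pop_freq[k] == 1) / len(sample_uniques)

    # Pr(CM|UM): number of sample uniques divided by the sum of the
    # corresponding population frequencies.
    cm_um = len(sample_uniques) / sum(pop_freq[k] for k in sample_uniques)

    # Pr(CM|SU): pick one sample unique at random and search the population;
    # the chance the found match is correct is 1 / population frequency.
    cm_su = sum(1.0 / pop_freq[k] for k in sample_uniques) / len(sample_uniques)

    return {"Pr(PU|SU)": pu_su, "Pr(CM|UM)": cm_um, "Pr(CM|SU)": cm_su}
```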

Attribution Risk
Attribution risk depends on the inferences that can be made by a data intruder as a consequence of released data. If the target is known (by the intruder) to be a member of the dataset, then inference is as simple as conditioning on known levels of variables. Any variable values that are associated with all the members of the resulting sub-population can be attributed to the target. Similarly, any variable values that are not associated with any members of the subpopulation cannot be associated with the target. So exact attribution is only possible if one or more combinations of variable values are missing or, equivalently, if we have one or more zeros in a table of counts. With no zeros, it is still possible for the intruder to perform approximate attribution. Smith and Elliot (2008) produced a risk measure for releases of tabular data that was based on an attack scenario where the intruder had knowledge of a number of the data subjects. This could be feasible when the data subjects all live in a small geographical area. By subtracting known individuals from the dataset, the intruder could produce a more disclosive dataset.
There are relatively few measures for attribution risk. Whereas measures for identification risk can focus on uniques (or low counts), attribution risk depends on the sensitivities of variable values. It is also the case that we should not simply use the posterior beliefs regarding sensitive values as the basis of risk measures. These might not be very different from prior beliefs or the posterior we might obtain from the dataset without conditioning on the information in a target record. Dwork (2008) argues that if we consider that making inferences about the general population is a legitimate use of data, then we should only be concerned about attribute disclosure to the extent that inferences relating to each member of the dataset differ from the inferences that would have been made if the individual had not been included in the data. If these differences are sufficiently small, then the data are said to be differentially private. The author goes on to develop a differential privacy framework for query systems that is discussed in Section 2.3.2. An alternative view is that analyses of data that might stigmatise certain groups (e.g. religious or ethnic) should be prevented (Institute of Medicine, 2015). Of course, suppressing data or analyses on this basis introduces bias, and this is an issue that has attracted insufficient attention in the SDL literature to date.
A third type of data disseminated by statistical agencies is magnitude data. The published data are totals for variables such as investment or turnover. If the data only cover a small number of enterprises, perhaps limited by geographic region, then a company could make inferences about the investment or turnover of their competitors. In the degenerate case of only two companies, each can discover the relevant quantities for their competitor by subtracting their own quantity from the published total. For magnitude data, there are published risk measures such as the p/q rule and the (n,k) rule (Willenborg & de Waal, 2001).

The Limitation of Risk
There are three general forms of disclosure limitation, although they are not mutually exclusive. These are restricted access, suppression and perturbation.
Restricted access avoids releasing data to the general public. End users would typically apply to a DSO with details of the intended use of the data. If approved, the end user might be supplied with the data, be required to access the data in a safe setting or be allowed to run a remote analysis using the data. A safe setting could be a room containing a non-networked computer holding the data. The end user would be allowed to analyse the data using that particular computer, and there would generally be steps to prevent the user copying the data to a removable storage device. A more restrictive approach would be to allow the end user to submit analyses (generally as code in a suitable programming language). If approved, the code would be run and the results sent back to the end user. This has the advantage that the end user does not need to be physically present.
Suppression involves removing some aspects of the data before release. Record suppression might remove records that are considered to be risky-perhaps in terms of the risk of re-identification or the sensitivity of the information they contain. Records might be randomly suppressed (sampled) in order to increase uncertainty over any possible re-identification or inference regarding variable values. Attribute suppression involves removing variables from the dataset. These might be potential identifiers (as in anonymisation) or variables that are considered sensitive. In tables of counts, we might suppress individual cells (essentially all records with a given set of attribute values). It is also common practice to suppress detail by aggregating values into broad(er) categories. A typical example would be to publish age in, say, 10-year intervals.
Perturbation involves changing the data before release. Random noise might be added to numeric values. Table counts might be rounded to some multiple of a chosen integer base n. Parts of microdata records might be swapped with other records-maintaining the correct marginal distributions for at least some subsets of variables. Details of these, and other schemes, can be found in more general publications on SDL such as Willenborg and de Waal (2001) and Duncan et al. (2011).
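As a simple illustration of perturbation, the sketch below (ours; the parameter choices are arbitrary) shows additive Gaussian noise for numeric values and unbiased random rounding of counts to a chosen base.

```python
# Sketch: two simple perturbation schemes for illustration only.
import random

def add_noise(values, sd=1.0):
    """Add independent Gaussian noise to each numeric value."""
    return [v + random.gauss(0.0, sd) for v in values]

def random_round(count, base=5):
    """Randomly round a count to a multiple of `base`, unbiased in expectation."""
    remainder = count % base
    if remainder == 0:
        return count
    # Round up with probability remainder/base, otherwise round down.
    if random.random() < remainder / base:
        return count + (base - remainder)
    return count - remainder
```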
Limitation methods are sometimes applied to ensure that the released data possess certain properties. A dataset is said to be k-anonymous if, for each individual in the data, there are at least k-1 other individuals who share the same information (Sweeney, 2002). k-anonymity is usually applied to quasi-identifiers, rather than all the variables in a dataset, and is achieved via the suppression of values or detail (via generalisation). It does not prevent attribution in cases where an intruder can associate a target with k (or more) records that share common values. In response to this issue, Machanavajjhala et al. (2007) introduced l-diversity. This extends k-anonymity by also requiring that there should be sufficient diversity in the values of sensitive variables for any equivalence class. Practitioners define diversity in various ways. Whatever the definition, it should create sufficient uncertainty in the mind of the intruder over the underlying values of the sensitive variables. t-closeness (Li et al., 2007) extends k-anonymity by requiring that the distribution of sensitive variables in any equivalence class must be sufficiently close to the marginal distribution of those variables in the full dataset.
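The sketch below (ours; the record layout, quasi-identifier indices and values of k and l are illustrative assumptions) checks whether a set of records satisfies k-anonymity on a chosen quasi-identifier and, for the same equivalence classes, distinct l-diversity on a sensitive attribute.

```python
# Sketch: checking k-anonymity and distinct l-diversity for integer-indexed
# records; qi_indices and sensitive_index identify the relevant columns.
from collections import defaultdict

def equivalence_classes(records, qi_indices):
    classes = defaultdict(list)
    for rec in records:
        key = tuple(rec[i] for i in qi_indices)
        classes[key].append(rec)
    return classes

def is_k_anonymous(records, qi_indices, k):
    return all(len(c) >= k
               for c in equivalence_classes(records, qi_indices).values())

def is_distinct_l_diverse(records, qi_indices, sensitive_index, l):
    classes = equivalence_classes(records, qi_indices)
    return all(len({rec[sensitive_index] for rec in c}) >= l
               for c in classes.values())
```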
For SDL approaches to be effective, while limiting the negative impact on data utility, we must also consider the environment into which the data are being released (Purdam et al., 2003; Smith & Elliot, 2014). The intruder might have access to other data, which could facilitate re-identification or attribution. These data (or their sources) are generally specified in an attack scenario. The DSO itself might be releasing other data, which could be helpful to the intruder. One case in point is the practice of releasing marginal totals for tables of counts. Clearly, the suppression of a single cell achieves nothing if the value can be recovered by subtracting the sum of the remaining cells from a published total. So at least one more cell would need to be suppressed. Even then, the total of the suppressed cells would need to be greater than zero to achieve any protection. This secondary cell suppression problem is discussed in Willenborg and de Waal (2001). Rounding of both counts and published totals does not guarantee that the original counts cannot be recovered. For a single rounded table and rounded total, the measure in Smith and Elliot (2008) can be calculated efficiently. For more general cases, the approaches in Dobra and Fienberg (2000) and Dobra and Fienberg (2008) can be used to generate bounds on underlying counts. The main point here is that we cannot generally risk assess data releases in isolation.

Risk, utility and sensitivity
Various jurisdictions will have different legal requirements for data protection. Once we have satisfied those, we need to decide what additional aspects of the data should be protected. The usual concerns are for harm to the data subjects and harm to the DSO. Standard practice is to identify sensitive variables and ensure they are protected. This is necessarily subjective as different individuals will have different privacy concerns. In many cases, it is only certain levels of those variables that are sensitive, and this might be reflected in the risk reduction measures adopted.
Assuming we have identified the sensitive aspects of the data and a potential means of protecting them, we should consider the impacts of protection on data utility. If analyses of the protected data are likely to be misleading, the best decision might be not to release them at all. Purdam and Elliot (2007) refer to 'reduction of analytical completeness' and 'loss of analytical validity'. The former would generally be a consequence of suppression, with analyses that would be possible with the underlying data not being possible with the protected data. The latter would often be due to perturbation. The same analyses would be possible with protected data, but they might lead to substantively different conclusions.
Many SDL methods are parameterised. For instance, with noise addition, we can choose to add noise from distributions with low or high variance; we can choose to categorise continuous variables into narrow or broad intervals; we can choose to randomly round counts to base 3 or 5; and so on. Duncan (2002) advocates plotting risk (R) against utility (U) for ranges of parameters in order to assess the trade-off between risk and utility and perhaps find parameters that minimise some objective function of the form $\lambda R + (1 - \lambda)(1 - U)$. The same approach can also be used to compare distinct SDL methods. Each $(R, U)$ pair is a potential solution. But in practice, measuring utility is difficult. We do not always know how the released data will be analysed, so there is often no clear measure of utility. It is not uncommon for some measure of data quality to be used instead, usually based on a measure of distance between the underlying data and the data prepared for release. So rather than trying to optimise an objective function, many choose to specify some threshold on the risk and then maximise utility conditional on not exceeding the threshold. If this approach is coupled with an assumption that risk is a monotone increasing function of utility, then utility might not even be considered at all. Clearly, this is not an approach to be encouraged.
Some DSOs prefer to keep aspects of the protection measures secret to improve security. A DSO might add noise but not publicise the details of the distribution from which the noise is sampled. This will generally decrease utility and make analysis more difficult for the end user. Furthermore, if the details are leaked, then all released datasets protected by the procedure become compromised. This is analogous to the cryptography principle that a system should be secure even if everything (but the key) is known about it (Kerckhoffs, 1883). There is also the prospect that protection measures could introduce biases that lead to algorithmic unfairness: discrimination against legally protected groups. Seen as a part of a machine learning process, disclosure control methods could be subjected to audit under law (Hacker, 2018), and protected data resulting from insufficiently transparent SDL methods could be very restricted in terms of their subsequent use.

Differential privacy
In recent years, there has been an enormous amount of research into differential privacy (Dwork, 2008). It is based on the notion that an individual in the data can only be harmed to the extent that inferences about them are different to inferences about similar individuals outwith the data. This is a powerful idea. Differential privacy is defined in Dwork (2008) thus: '(W)e move from comparing an adversary's prior and posterior views of an individual to comparing the risk to an individual when included in, versus when not included in, the database. This new notion is called differential privacy.'
It allows us to release useful data that might be very informative about hypotheses of interest, without necessarily considering someone to have been harmed because our beliefs about them have significantly changed. If we were to consider a significant change in beliefs to be disclosive in itself, then it is hard to see how we could ever release useful data. But when authors refer to 'differential privacy', they usually mean ε-differential privacy (also presented in Dwork, 2008) or one of its variants. ε-differential privacy is an implementation of differential privacy that builds upon the basic definition in a specific way. Strictly speaking, it only applies to query systems.
ε-differential privacy requires that if we have two datasets $D_1$ and $D_2$ differing in only a single element, then no possible response to a query should be significantly more or less likely if answered using $D_1$ or $D_2$. Thus, any posterior beliefs regarding any hypotheses would be similar, whether or not any targeted individual was a member of the dataset. Dwork (2008) presents an example for count queries and shows that the addition of Laplace noise to the underlying counts will produce responses such that the likelihood ratio $\Pr(R \mid D_1)/\Pr(R \mid D_2)$ for any response $R$ will be in a prespecified range (the bounds depending on the Laplace parameter).
An intruder submitting the same query sufficiently many times (or multiple users in collusion) would be able to identify the underlying count with a high degree of confidence. ε-differential privacy combats this form of attack by adding additional noise to queries that have been previously answered.
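A minimal sketch of the Laplace mechanism for a single count query is given below. It relies on the standard result that a count query has sensitivity 1, so noise with scale 1/ε yields ε-differential privacy for one query; repeated queries consume additional budget, as discussed above. This is an illustration under those assumptions, not the specific mechanism of any cited system.

```python
# Sketch: Laplace mechanism for a count query; sensitivity of a count is 1,
# so Laplace(0, 1/epsilon) noise bounds the likelihood ratio by exp(epsilon).
import math
import random

def laplace_noise(scale):
    """Sample from a Laplace(0, scale) distribution by inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count, epsilon):
    """Release a count with Laplace noise of scale 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon)

# Example: a true count of 42 released under epsilon = 0.5.
released = noisy_count(42, epsilon=0.5)
```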
There are several claimed benefits of ε-differential privacy. It can be justified mathematically: it is not ad hoc. It protects all individuals in the data with regard to all hypotheses. It is not affected by auxiliary information that might be held by a data intruder. It depends only on the type of data and form of query, not on the dataset itself. It can be used by non-experts. However, ε-differential privacy is no free lunch. It is clearly assumed that the data intruder does not know if the target is in the dataset, which is not the case under many attack scenarios. The amount of noise that needs to be added can significantly impact on data quality. In some cases, it can prevent data from being adequately exploited for legitimate use, unless (perhaps) the end user can get their queries submitted before the answers to those queries attract too much added noise. In contrast, traditional SDL approaches can take much more information into account in order to minimise the impact on data quality while limiting risks to acceptable levels. Also, the theoretical basis of ε-differential privacy is based on query systems rather than the more usual one-time data release. Bambauer et al. (2014) provide a detailed discussion of ε-differential privacy, the conditions under which it can be usefully applied and several examples of how it can fall short in the SDL arena. Garfinkel et al. (2018) discuss the challenges faced by the US Census Bureau in adopting ε-differential privacy as a data release mechanism.

Record Linkage
In previous discussion of measures of identification risk, the term 'match' was occasionally used to denote matching on variable values. The terminology was inescapable as it is used in the names of the measures. When discussing record linkage, it is important to distinguish between record pairs that are classified as belonging to the same entity and those that actually do belong to the same entity. Usually, the former are termed links, and the latter are termed matches (or matched pairs). That is the terminology we will use from here on. In this section, we simply summarise the theory behind classical record linkage so that we can refer to it in later sections.
In classical record linkage (Fellegi & Sunter, 1969), we have two databases A and B and seek to identify record pairs that correspond to the same population entities. A and B are assumed to be independent samples from a common population. The Fellegi-Sunter approach is essentially a Bayesian approach.
We have the set of all possible record pairs $A \times B$, which can be partitioned into the set of matched pairs $M$ and the set of non-matched pairs $U$. Assume the data are aligned so that each index $i \in \{1, \dots, n\}$ corresponds to the same variable in $A$ or $B$. Then, under Fellegi-Sunter theory, the posterior odds of a match are given by
$$\frac{\Pr((a, b) \in M \mid a, b)}{\Pr((a, b) \in U \mid a, b)} = \frac{\Pr(a, b \mid M)}{\Pr(a, b \mid U)} \times \frac{\Pr((a, b) \in M)}{\Pr((a, b) \in U)},$$
and the $m$ and $u$ probabilities are defined as
$$m_i = \Pr(a_i = b_i \mid (a, b) \in M), \qquad u_i = \Pr(a_i = b_i \mid (a, b) \in U).$$
Under an assumption of conditional independence between the variables, the Bayes factor is a product of terms of the form $m_i / u_i$ or $(1 - m_i)/(1 - u_i)$, depending upon whether $a_i$ and $b_i$ are equal. $m$ probabilities less than one allow for distortions in the data.
The log (to the base 2) of the Bayes factor is termed a match weight in Fellegi and Sunter (1969). Thresholds on the match weights are used to allocate possible matches to one of three sets: $A_1$, a set of positive links; $A_2$, a set of possible links; and $A_3$, a set of positive non-links.
Pairs of records allocated to $A_2$ are subjected to clerical review: they are manually inspected and subsequently allocated to either $A_1$ or $A_3$. Fellegi and Sunter (1969) present a decision rule that can be used to generate thresholds corresponding to specified conditional error rates.
There are a number of approaches for estimating the m and u probabilities and the marginal probability of a match, $p = \Pr((a, b) \in M)$. As the proportion of matches will often be very low, the u probabilities can be estimated from the proportion of possible matches where variable values are equal. For certain problems, the population size might be known, in which case p is simply the reciprocal of the population size. Jaro (1989) presents an expectation maximisation (Dempster et al., 1977) algorithm for generating maximum likelihood estimates of all the required parameters.
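The sketch below illustrates an EM scheme of the kind described by Jaro (1989) for binary comparison vectors under the conditional independence assumption. It is an illustrative reimplementation under our own choices of starting values and regularisation, not code from the cited papers.

```python
# Sketch: EM estimation of p, m and u from 0/1 field-agreement indicators,
# assuming conditional independence of fields given match status.
import numpy as np

def fs_em(gamma, p=0.5, n_iter=100, tol=1e-6):
    """gamma: (n_pairs, n_fields) array of 0/1 agreement indicators."""
    gamma = np.asarray(gamma, dtype=float)
    n_pairs, n_fields = gamma.shape
    m = np.full(n_fields, 0.9)   # illustrative starting values
    u = np.full(n_fields, 0.1)

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        log_m = (gamma * np.log(m) + (1 - gamma) * np.log(1 - m)).sum(axis=1)
        log_u = (gamma * np.log(u) + (1 - gamma) * np.log(1 - u)).sum(axis=1)
        num = p * np.exp(log_m)
        g = num / (num + (1 - p) * np.exp(log_u))

        # M-step: re-estimate p, m and u from the expected match indicators,
        # clipping to avoid estimates of exactly 0 or 1.
        p_new = g.mean()
        m = np.clip((g[:, None] * gamma).sum(axis=0) / g.sum(), 1e-6, 1 - 1e-6)
        u = np.clip(((1 - g)[:, None] * gamma).sum(axis=0) / (1 - g).sum(), 1e-6, 1 - 1e-6)

        if abs(p_new - p) < tol:
            p = p_new
            break
        p = p_new
    return p, m, u
```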

Record Linkage and Disclosure Risk Assessment
Disclosure inevitably involves some form of linkage. Sample data will often be linked against a population in order to re-identify an individual. Most risk measures are based on finding an exact match on the key variables, and it is the search strategy and the sample/population frequencies that provide the measure. The advantages of classical linkage are that we can accommodate errors on the key variables, and we can match between samples. A typical scenario would be where one of the datasets, say A, contained identifying information and another dataset, B, contained sensitive information. Another scenario would be where neither dataset alone contained enough information to re-identify an individual, but linked records would contain quasi-identifiers. Thus, linkage would increase the risk of re-identification as well as attribution.
An obvious and relatively straightforward approach to risk assessment is for the risk assessor to adopt the position of the data intruder and attack the data. The risk assessor is often in the position of having access to the original data and can check whether record pairs associated with high match probabilities actually are matches. In other cases, the assessor might have access to the dataset flagged for release but not the dataset that a data intruder would link it against. For example, the data intruder could be an individual with access to a dataset held by a private company. For an approach to risk assessment in this situation, see Smith and Elliot (2014). In the more usual situation, when the assessor can link the same datasets as the data intruder, it is important to carry out the linkage at least as well as the data intruder. Under 'knowledgeable intruder' attack scenarios, we must assume that the data intruder will use all means available to produce the best linkage performance. This might include knowledge of any disclosure limitation methods applied to the data before release, leading to specialised or ad hoc linkage algorithms (see, e.g. Winkler, 2004; Nin et al., 2008). The intruder might exploit similarities between values on key variables (see, e.g. Winkler, 1990; Smith & Shlomo, 2014) or matching constraints. Similarity scores help to distinguish between non-matching pairs of values that are due to typographical errors and non-matching pairs of values that are truly distinct. Matching constraints can improve linkage performance by, for example, specifying that a record in A can match at most one record in B and vice versa. In this instance, maximum likelihood bipartite matchings can be generated from the linkage results using the Hungarian algorithm (Kuhn, 1955). Others have sought to incorporate matching constraints into the linkage process itself (Fortini et al., 2001; Tancredi & Liseo, 2011; Gutman et al., 2013; Sadinle, 2017).
Under some scenarios, a data intruder might attempt to link between more than two datasets. Recent research has investigated approaches for such linkage. Sadinle and Fienberg (2013) present an approach to this problem that uses an extended expectation maximisation algorithm for parameter estimation, a generalised decision rule and a generalised approach for handling matching constraints. They report good linkage performance for three datasets, but it is not clear that the method will easily extend to larger numbers of datasets. A more promising Markov chain Monte Carlo (MCMC) approach is described in Steorts et al. (2016).
The data that might be exploited by a data intruder do not always present as lists of records containing no duplicates. A data intruder might seek to exploit the information in online profiles, usenet posts, tweets and so on. The links between such entities might be of many types, not only identity. For example, there might be familial links, membership links or ownership links. So there is a much more general linkage problem that is sometimes referred to as graph linkage. Entities are represented as graph nodes and relationships as graph edges. This type of linkage has been considered in the SDL literature. Backstrom et al. (2007) consider both passive and active attacks on anonymised social networks. Fu et al. (2014) combine record linkage and a form of graph linkage in order to match households, rather than individuals. Getoor et al. (2002) discuss probabilistic relational models (PRMs). These are designed to capture both probabilistic interactions between the variables associated with related entities and probabilistic interactions between the variables and the link structure itself. Taskar et al. (2004) present a relational modelling approach based on relational Markov networks.
Data come in many forms, as do the linkage algorithms that might be used by a data intruder to disclose sensitive information. The focus here has largely been on classical linkage and extensions to classical linkage. However, Torra (2004) proposed that there might be an element of risk from linking records that contain no common variable values. Nin and Torra (2005) showed that for certain types of dataset with certain matching constraints, it is possible to achieve linkage performance that is better than simply randomly pairing records. They did not demonstrate anything other than a negligible risk from an SDL perspective. The remainder of this paper will examine their approach, present an alternative approach and demonstrate that the risks of disclosure can be non-negligible. These risks are largely constrained to the types of data considered in Nin and Torra (2005): small datasets with a one-to-one correspondence between the records. However, this form of data could occur in some plausible attack scenarios.

Torra (2004) describes the use of OWA operators for record linkage (and therefore re-identification) when the two files A and B contain no common variables. The idea behind this approach is that files A and B will often contain common structural information and that this can be extracted via OWA operators. Each operator is used to construct a new variable, and these variables are then used for record linkage. OWA operators can only be applied to numeric variables, so categorical variables are integer coded before linkage is performed.

Ordered Weighted Averaging Operators
An OWA operator of dimension $N$ can simply be specified as a vector $W = [w_1, \dots, w_N]$ of non-negative weights that sum to 1. The operator computes a weighted mean of the ordered values:
$$\mathrm{OWA}(x_1, \dots, x_N) = \sum_{j=1}^{N} w_j\, y_j,$$
where $y_j$ is the $j$-th largest of the $x_i$.
It is possible to specify OWA operators that will calculate common summary statistics such as the minimum, maximum, mean or median.
An alternative way to specify an OWA operator is via a process that will generate the weight vector for a given $N$. Torra (2004) achieves this via a non-decreasing fuzzy quantifier. This is simply a non-decreasing function $F$ with domain [0,1] and range [0,1]. The weights for an OWA operator of length $N$ are then calculated as
$$w_j = F\!\left(\frac{j}{N}\right) - F\!\left(\frac{j-1}{N}\right), \qquad j = 1, \dots, N.$$
Thus, a non-decreasing fuzzy quantifier is a specification of an OWA operator that can be applied to vectors of differing lengths.
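A small sketch of quantifier-specified OWA operators is given below (ours, for illustration). It computes the weights from a quantifier, applies the operator to the sorted values of a record, and shows how a set of operators can be applied row-wise to a normalised file to construct new variables; the example quantifier $F(x) = x^{0.5}$ is an assumed member of the power family used later in the experiments.

```python
# Sketch: OWA operators specified by a non-decreasing fuzzy quantifier,
# w_j = F(j/N) - F((j-1)/N), applied to values sorted in decreasing order.
import numpy as np

def owa_weights(quantifier, n):
    js = np.arange(1, n + 1)
    return quantifier(js / n) - quantifier((js - 1) / n)

def owa(quantifier, x):
    x = np.asarray(x, dtype=float)
    w = owa_weights(quantifier, len(x))
    y = np.sort(x)[::-1]          # y_j is the j-th largest of the x_i
    return float(np.dot(w, y))

def apply_owa_operators(data, quantifiers):
    """Apply each operator to each (normalised) record to build new variables."""
    return np.array([[owa(q, row) for q in quantifiers] for row in data])

# Example quantifier (illustrative): F(x) = x**0.5.
f = lambda x: x ** 0.5
```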

The Linkage Process
Initially, all the variables in A and B are normalised, either by a transformation to the unit interval (range normalisation) or by standardising so that the variable values have mean 0 and variance 1. Then a set of OWA operators are applied to the records of A and B to construct new variables. The non-decreasing fuzzy quantifier specification is used to handle differing numbers of variables in A and B.
Two new files, A 0 and B 0 , are created with numbers of rows equal to the numbers of rows in A and B, respectively, and with a column for each OWA operator. Each OWA operator is applied to each row of A, and the resulting representative value is placed in the relevant row and column of A 0 . B 0 is similarly constructed from B using the same OWA operators. The representatives generated by a common OWA operator are treated as the same variable for record linkage purposes. Nin and Torra (2005) present the results of linkage experiments using data from the UCI machine learning repository (Lichman, 2013). The authors list two main assumptions: (i) Both files share a large set of common individuals. (ii) Data in both files contain, implicitly, similar structural information.
There are few details regarding the record linkage step, although the authors do seem to use a traditional approach. They demonstrate that when the partitioning of variables into A and B tends to separate pairs of highly correlated variables (one in A and one in B), it is possible to achieve better linkage performance than by randomly pairing records from A and B.

A Simple Bayesian Alternative
In assessing the risk of statistical disclosure, we should take into account all the useful information held by a data intruder. This is not only the data that they might hold regarding individuals but also information regarding the relationships between variables. The OWA approach attempts to exploit this information in a relatively unsupervised manner. However, we must assume that an intruder would be willing to exploit all prior knowledge or training data that were available. So here, we outline a supervised learning approach that a data intruder might adopt in preference to the OWA approach.
Fellegi-Sunter linkage only exploits the data in variables common to A and B. Without such variables, we need to exploit the data ignored by Fellegi-Sunter. From Bayes' theorem, we have
$$\frac{\Pr((a, b) \in M \mid a, b)}{\Pr((a, b) \in U \mid a, b)} = \frac{f(a, b \mid M)}{f(a, b \mid U)} \times \frac{\Pr((a, b) \in M)}{\Pr((a, b) \in U)},$$
and under the assumption that A and B are random samples from a common population,
$$\frac{f(a, b \mid M)}{f(a, b \mid U)} = \frac{f(a, b)}{f(a)\, f(b)}.$$
We have two estimation problems. We need to estimate the Bayes factor, and we also need to estimate $p = \Pr((a, b) \in M)$ if we want to produce posterior odds or probabilities. Firstly, we note that p can be estimated from a vector of Bayes factors using expectation maximisation, just as in Jaro (1989). For any given p, we can generate a vector of posterior match probabilities over the record pairs. The mean of the posterior probabilities is an estimator for p. So given a starting value for p, we can iteratively generate new posterior probabilities (expectation step) and new estimates for p (maximisation step). We iterate until the absolute difference between consecutive estimates of p is within a chosen tolerance.
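As a concrete illustration of this estimation of p, the short sketch below (ours; function and parameter names are our own) iterates the expectation and maximisation steps given a vector of Bayes factors, however those factors were obtained.

```python
# Sketch: EM estimation of p = Pr((a,b) in M) from a vector of Bayes factors.
import numpy as np

def estimate_p(bayes_factors, p=0.5, tol=1e-8, max_iter=1000):
    bf = np.asarray(bayes_factors, dtype=float)
    for _ in range(max_iter):
        posterior = p * bf / (p * bf + (1 - p))   # E-step: posterior match probabilities
        p_new = posterior.mean()                  # M-step: mean posterior probability
        if abs(p_new - p) < tol:
            return p_new
        p = p_new
    return p
```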
For the Bayes factor, the univariate marginals $f(a)$ and $f(b)$ could potentially be estimated from A and B, respectively, leaving us with the problem of estimating $f(a, b)$. We could also re-express the Bayes factor as $f(a \mid b)/f(a)$, leaving us with the problem of estimating $f(a \mid b)$.
Here, we choose to estimate the terms in the Bayes factor above via a full probability modelling approach exploiting the theory of decomposable graphical models.

Decomposable Graphical Models
A decomposable graph is an undirected graph $G = (V, E)$ that contains no unchorded cycles of length greater than three. The node set represents a set of variables $X = (X_v)_{v \in V}$, and the absence of an edge $\{v, w\}$ implies that $X_v$ is conditionally independent of $X_w$ given the variables in $(X_u)_{u \in V \setminus \{v, w\}}$. A decomposable graph can also be represented as a cluster tree. Each maximal pairwise-connected subgraph of $G$ is a cluster, and clusters are connected into a tree (or forest in the case of statistically independent components) so as to respect the running intersection property (Lauritzen & Spiegelhalter, 1988): if a node is contained in two clusters, $C_1$ and $C_2$, then it is contained in all clusters on the unique path between $C_1$ and $C_2$.
Each edge in the cluster tree is associated with a sepset: the intersection of the node sets associated with the clusters that it connects. A cluster tree implies a factorisation of the joint distribution of the variables in $X$,
$$\Pr(X) = \frac{\prod_{C \in \mathcal{C}} \Pr(X_C)}{\prod_{S \in \mathcal{S}} \Pr(X_S)},$$
where $\mathcal{C}$ is the set of clusters in the cluster tree (or forest) and $\mathcal{S}$ is the multiset of sepsets. For categorical variables, the marginal distributions associated with clusters are marginal probability tables. The tables for sepsets can be generated by marginalisation from cluster tables. Given a structural model, the table parameters can be estimated from data via maximum likelihood or via Bayesian estimation using a hyper Dirichlet prior.
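To make the factorisation concrete, the toy sketch below (ours; the probability tables are invented purely for illustration) encodes a three-variable chain $X_1 - X_2 - X_3$ with clusters $\{X_1, X_2\}$ and $\{X_2, X_3\}$ and sepset $\{X_2\}$, and evaluates the joint distribution as the product of cluster marginals divided by the sepset marginal.

```python
# Sketch: cluster-tree factorisation for a toy decomposable model over three
# binary variables; the joint is prod(cluster marginals)/prod(sepset marginals).
import numpy as np

p_x1x2 = np.array([[0.30, 0.20],   # rows index X1, columns index X2
                   [0.10, 0.40]])
p_x2x3 = np.array([[0.25, 0.15],   # rows index X2, columns index X3
                   [0.20, 0.40]])
p_x2 = p_x1x2.sum(axis=0)          # sepset marginal; consistent with p_x2x3.sum(axis=1)

def joint(x1, x2, x3):
    return p_x1x2[x1, x2] * p_x2x3[x2, x3] / p_x2[x2]

# The joint sums to 1 and encodes X1 independent of X3 given X2.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```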
Posterior beliefs over clusters given observed evidence can be generated via message passing in a cluster tree (Lauritzen & Spiegelhalter, 1988). This exploits conditional independencies and avoids calculating Pr.X /. Posterior beliefs over sets of variables not contained in a single cluster can be generated via variable firing (Jensen, 1996) or, at least as efficiently, by manipulating the tree so that the relevant variables appear in a single cluster (Smith, 2001).

Model Determination
Model determination algorithms for decomposable graphical models generally depend on two important results. The first result is that it is possible to move between any pair of decomposable graphs, G and G 0 , by iteratively adding or removing only a single edge at a time while remaining within the class of decomposable graphs (Frydenberg & Lauritzen, 1989).
The basic rules for edge addition/deletion in decomposable graphs are as follows: an edge $\{v, w\}$ can be added if, and only if, it is not already present and $v$ and $w$ are either in adjacent clusters or in distinct connected components; an edge $\{v, w\}$ can be deleted if, and only if, it is present in exactly one cluster.
The second important result is that the Bayes factor for two neighbouring models (differing in only a single edge) involves only four terms, which can be calculated locally.
Assume the variables in $X$ are categorical, taking values in finite sets $(\mathcal{I}_v)_{v \in V}$, and let $\mathcal{I} = \times_{v \in V} \mathcal{I}_v$ denote the possible configurations of $X$. Assume we have a random sample of $X$ contained in a contingency table of counts $n = (n(i))_{i \in \mathcal{I}}$. Let $n_Z$ denote the counts $n(i_Z)$ in the marginal table $\mathcal{I}_Z$ over the variables in $Z$. If we also specify a hyper Dirichlet prior as a contingency table of parameters $\alpha = (\alpha(i))_{i \in \mathcal{I}}$, then for any complete set $C$, the marginal likelihood is
$$f(n_C) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_0 + n_0)} \prod_{i_C \in \mathcal{I}_C} \frac{\Gamma(\alpha(i_C) + n(i_C))}{\Gamma(\alpha(i_C))},$$
where $n_0 = \sum_{i \in \mathcal{I}} n(i)$ and $\alpha_0 = \sum_{i \in \mathcal{I}} \alpha(i)$. Under the hyper multinomial Dirichlet law (Dawid & Lauritzen, 1993), the marginal likelihood for the full dataset is
$$f(n) = \frac{\prod_{C \in \mathcal{C}} f(n_C)}{\prod_{S \in \mathcal{S}} f(n_S)}.$$
If we have graphs $G = (V, E)$ and $G' = (V, E')$, where $E'$ contains the edges in $E$ and an additional edge $\{v, w\}$, then the Bayes factor (ratio of marginal likelihoods) is given by
$$\frac{f(n \mid G')}{f(n \mid G)} = \frac{f(n_C)\, f(n_S)}{f(n_A)\, f(n_B)},$$
where $C$ is the unique cluster in $G'$ containing $\{v, w\}$ and $A = C \setminus \{v\}$, $B = C \setminus \{w\}$ and $S = C \setminus \{v, w\}$.

These results have been exploited by various model determination algorithms. MCMC algorithms (e.g. Madigan & York, 1995) generate a posterior distribution over the model space. Averaging over this distribution takes into account uncertainty in the model structure and generally provides improved predictive performance (Hoeting et al., 1999). Madigan and Raftery (1994) use an alternative model selection strategy where they reject any models that are sufficiently poorer than the best model(s). Their Occam's razor strategy is based on comparisons of models differing by only a single edge. If the evidence favours the larger model to a sufficient degree (decided by a threshold on the Bayes factor), then the smaller model and all its submodels are rejected. A model $M_0$ is defined as a submodel of $M_1$ if all the edges in $M_0$ are also in $M_1$. Search can start from an arbitrary set of candidate models. If search starts from the complete graph, then only edge removals are considered (the down algorithm). If search starts from the model with empty edge set, then only edge additions are considered (the up algorithm). Otherwise, the down and up algorithms are run in turn to generate a set of candidate models. Finally, any candidates that are sufficiently poorer than the best model(s) are also removed. The posterior probabilities of the remaining acceptable models are normalised to sum to 1 for model averaging purposes.
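The sketch below (an illustrative implementation under our own data-layout assumptions, not the authors' code) computes the local log marginal likelihood terms and the resulting log Bayes factor for adding an edge $\{v, w\}$ within a cluster $C$, using a uniform hyper Dirichlet prior whose parameters sum to a chosen total.

```python
# Sketch: local marginal-likelihood scoring of an edge addition.
# The Bayes factor for adding {v,w} inside cluster C is
#   f(n_C) f(n_S) / (f(n_A) f(n_B)), with A = C\{v}, B = C\{w}, S = C\{v,w}.
# `data` is assumed to be a list of integer-coded records; `levels[v]` gives
# the number of categories of variable v.
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(counts, alpha):
    """counts, alpha: marginal tables (same shape) over one set of variables."""
    counts = np.asarray(counts, dtype=float).ravel()
    alpha = np.asarray(alpha, dtype=float).ravel()
    return (gammaln(alpha.sum()) - gammaln(alpha.sum() + counts.sum())
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

def marginal_table(data, alpha_total, variables, levels):
    """Marginal counts and a matching uniform prior over the chosen variables."""
    shape = [levels[v] for v in variables]
    counts = np.zeros(shape)
    for row in data:
        counts[tuple(row[v] for v in variables)] += 1
    alpha = np.full(shape, alpha_total / counts.size)
    return counts, alpha

def log_bayes_factor_for_edge(data, alpha_total, levels, C, v, w):
    """log Bayes factor in favour of the graph containing edge {v,w} within C."""
    sets = {"C": list(C),
            "A": [x for x in C if x != v],
            "B": [x for x in C if x != w],
            "S": [x for x in C if x not in (v, w)]}
    ll = {}
    for name, Z in sets.items():
        # The empty margin contributes a factor of 1 (log 0).
        ll[name] = (log_marginal_likelihood(*marginal_table(data, alpha_total, Z, levels))
                    if Z else 0.0)
    return ll["C"] + ll["S"] - ll["A"] - ll["B"]
```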
The model determination approach used here is described and justified in Section 7.2. It has some similarities to the Occam's razor approach.

Linkage Experiments
Experiments were carried out to compare the OWA approach with a Bayesian approach based on decomposable graphical models. In order to limit experimenter bias, we chose to emulate the simulations in Nin and Torra (2005) as faithfully as possible. Results are presented in a similar format so that they can be easily compared. Any departures from the approach in Nin and Torra (2005) (due to subjectivity or lack of detail in the original paper) are described and justified. In Section 7.3.1, we also present results for a more realistic partitioning scheme than that presented in Nin and Torra (2005).
We used four of the datasets used by Nin and Torra (2005): the abalone, dermatology, housing and ionosphere datasets from the UCI Machine Learning Repository (Lichman, 2013). The same preprocessing steps were used: non-numeric variables were recoded using integer codes, and records with missing observations were removed. Nin and Torra (2005) reported numbers of correct links for one-to-one correspondence on samples of size 30 and 100 and for three distinct sets of OWA operators. We report results for the same sample sizes using the same OWA operators. However, for each combination of sample size and OWA operator, we report the mean numbers of correct links over 100 randomly generated samples. There may be significant differences in the details of the linkage approach and in the exploitation of one-to-one correspondence. Details are contained in the following subsection. Nin and Torra (2005) only considered highly correlated variables and adopted a strategy of deliberately separating highly correlated variables when partitioning variables into A and B. Variables were chosen via inspection of the correlation matrix over all the variables in the relevant dataset. They used a threshold of 0.7: variables that had no correlations with other variables above 0.7 were ignored.

The Ordered Weighted Averaging Approach
We cannot reliably emulate the manual partitioning, so in the interests of transparency and objectivity, we formalise the process. A graph is constructed with variables as nodes and pairwise correlations as edge weights. From this, we generate a maximum weight spanning tree using Kruskal's algorithm (Kruskal, 1956), stopping when weights are below the threshold. The tree nodes are bi-coloured so that no pair of adjacent nodes are identically coloured (this is always possible for a tree or forest). The colouring provides us with our partitioning of variables into files A and B.
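A sketch of this formalised partitioning is given below (ours; it assumes the correlation matrix has already been computed and uses networkx for the spanning forest and two-colouring). Variables with no correlation above the threshold never enter the graph and are therefore ignored, matching the treatment described above.

```python
# Sketch: maximum-weight spanning forest over supra-threshold correlations,
# followed by a 2-colouring that splits the variables between files A and B.
import numpy as np
import networkx as nx
from networkx.algorithms import bipartite

def partition_variables(corr, threshold=0.7):
    """corr: square array of pairwise correlations between the variables."""
    corr = np.asarray(corr)
    g = nx.Graph()
    n = corr.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if abs(corr[i, j]) >= threshold:
                g.add_edge(i, j, weight=abs(corr[i, j]))
    # Sub-threshold edges were never added, so Kruskal's algorithm effectively
    # stops at the threshold; the result is a forest over the retained variables.
    forest = nx.maximum_spanning_tree(g, algorithm="kruskal")
    colours = bipartite.color(forest)          # a forest is always bipartite
    file_a = sorted(v for v, c in colours.items() if c == 0)
    file_b = sorted(v for v, c in colours.items() if c == 1)
    return file_a, file_b
```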
We used the following functions from Nin and Torra (2005) to generate our OWA operators: $Q_1 = \{x^{\alpha} : \alpha \in \{0.2, 0.4, \dots, 2\}\}$ and $Q_2 = \{1/(1 + e^{10(\alpha - x)}) : \alpha \in \{0, 0.1, \dots, 0.9\}\}$. For each function $F$ in each set, we define $F(0) = 0$ and $F(1) = 1$. Comparing representatives for equality would provide very poor linkage performance. We would expect very few (if any) matching values, and the vast majority (if not all) of the comparison vectors would be vectors of zeroes. However, it is still possible to use a standard record linkage approach if we generate binary comparison vectors by other means. There is little detail of the linkage in Nin and Torra (2005), so here, we choose to employ a similarity score, which is dichotomised to generate binary comparison vectors.
$$\mathrm{sim}(x, y) = \max(1 - |x - y|,\ 0).$$
A threshold of 0.95 was found to provide reasonable linkage performance. Linkage used the expectation maximisation approach detailed in Jaro (1989). A moderate degree of Bayesian regularisation (Dirichlet priors and maximum a posteriori estimation) was used to avoid parameter estimates of zero. The post hoc weighting scheme contained in Winkler (1990) was also used. This tends to improve linkage performance by more fully exploiting the information in similarity scores via piecewise interpolation on match weights. Nin and Torra (2005) considered only subsets of data containing 30 or 100 records, and no records were removed from A or B after partitioning. Thus, there was a one-to-one correspondence between the records in A and the records in B. This knowledge provides important additional information that can be used to improve linkage. Firstly, we can specify p: it is simply the reciprocal of the size of the subset. Using a fixed p can result in improved estimation of m and u probabilities. Secondly, we can attempt to find the best one-to-one matching, that is, one that maximises the product of the match probabilities for the selected links. We construct a bipartite graph connecting each record in A to each record in B, with edge weights equal to the log posterior match probabilities. A maximum weight one-to-one matching is then found using the Hungarian algorithm (Kuhn, 1955). The Hungarian algorithm was compared with a greedy algorithm where we iteratively linked the highest weight record pair (a, b) such that neither a nor b had previously been linked. The Hungarian algorithm generally produced better linkage performance, and those are the results presented here.
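The one-to-one matching step can be illustrated as follows (our sketch; it uses scipy's implementation of the Hungarian/assignment algorithm rather than any specific code from the paper).

```python
# Sketch: maximum-weight one-to-one matching from a matrix of posterior match
# probabilities; maximising the sum of log probabilities maximises the product.
import numpy as np
from scipy.optimize import linear_sum_assignment

def one_to_one_links(match_probs, eps=1e-300):
    """match_probs: (n, n) array; returns a list of (row in A, row in B) links."""
    weights = np.log(np.asarray(match_probs, dtype=float) + eps)
    rows, cols = linear_sum_assignment(weights, maximize=True)
    return list(zip(rows.tolist(), cols.tolist()))
```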

The Bayesian Approach
Using a threshold of 0.7 for partitioning the variables resulted in relatively low numbers of variables allocated to A and B. These could easily have been handled by MCMC and other model determination approaches. However, we also wanted to consider other partitioning schemes that would generate larger partitions and require model determination with larger numbers of variables. MCMC and the Occam's razor approach in Madigan and Raftery (1994) would have been too computationally demanding for such large numbers of variables. So here, we chose to use a simpler approach that searches for a single locally optimal model. In common with other approaches, we base our full probability modelling on adding and removing single edges while remaining within the class of decomposable graphs. We also choose to work with categorical variables, categorising continuous variables as necessary. This reflects the fact that the majority of datasets that we will be considering in practice will contain relatively few continuously scaled variables, and for those that do, these variables will often be categorised for disclosure risk limitation purposes. It also allows us to exploit the hyper multinomial Dirichlet law (Dawid & Lauritzen, 1993) presented earlier.
We use a greedy algorithm that has some similarity with the Occam's razor approach. We start with a single candidate model. In an upwards search, we iteratively improve the model by adding whichever edge produces the greatest increase in marginal likelihood. We stop when no improvement is possible. In a downwards search, we iteratively improve the model by removing the single edge that produces the greatest increase in marginal likelihood. We alternate between upwards and downwards searches until no improvement is possible.
The final model is locally optimal, but in most cases, many local optima will exist, and choice of initial model is highly influential on the selected model. The goal for the present application is to find a reasonably good model in a reasonable time, while acknowledging that a sufficiently motivated data intruder might be able to do better. Thus, we chose to start with the model with no edges (full independence model). This tends to produce much sparser models than starting with the fully connected graph (full dependence model). We have a preference for sparse models, not only with respect to Occam's razor but also for the reduced computational cost of performing inference. Experimentation showed that with a more manageable number of variables, this often produced the same model as the highest posterior probability model under the Occam's razor scheme of Madigan and Raftery (1994).
Sampled observations were used for linking, while the remaining data were used for model determination. All continuous variables were split into eight categories so that the data were evenly distributed across the categories. Model determination used hyper Dirichlet priors with all parameters equal and summing to 1. Subsequent estimation of probability tables used the same prior in order to avoid probabilities of 0. Again, we exploited the Hungarian algorithm for one-to-one matching. We report results for the same random samples generated for the OWA approach.

Results
As well as the results for a threshold of 0.7 (on pairwise correlations), we also considered thresholds of 0.5 and 1. These were used to investigate the impacts of including additional variables, and all variables, respectively (one variable in the ionosphere dataset contains only a single value and was removed). The numbers of variables are shown in Table 1.

Tabular results
Tables 2 and 3 show the numbers of matches for simulations using range normalisation for OWA. Standardisation was not used as the differences between representative values would not have been bounded and the choice of similarity score would have been less obvious. The numbers of matches reported by Nin and Torra (2005) are shown in braces. The largest proportion of matches within each dataset/threshold combination is shown in bold typeface.
We note that there appear to be some large differences between the OWA results and those reported in Nin and Torra (2005). This could be a result of the partitioning of variables, the record linkage approach or the use of the Hungarian algorithm for one-to-one matching. We also note that although many possible comparisons are highly statistically significant, we place little weight on this and do not report p-values. There are simply too many parameters that can be varied in both the OWA and Bayesian approaches that could affect performance. We restrict ourselves to the more general conclusions that we can reach from examination of the tables. Firstly, we note that the OWA approach performs relatively poorly with the dermatology and housing datasets. The mean numbers of matches are low across all thresholds. For the abalone dataset, we have a decline in performance for OWA as the threshold is reduced, whereas for the ionosphere dataset, performance is better at a threshold of 0.5.
The Bayesian approach generally seems to benefit from the inclusion of additional variables. For each dataset and sample size, the best performance is achieved with the inclusion of all variables. In fact, the Bayesian approach including all variables provides the largest mean number of matches for all datasets and sample sizes, except for the abalone dataset where performance is similar for all thresholds. Of course, the conditional independence relationships encoded in the model structure will sometimes dictate that a variable has no impact, because it contributes equally to the numerator and denominator of the Bayes factor. The limiting case, of course, is the full independence model, for which the Bayes factor equals one.
The partitioning of variables in Nin and Torra (2005) is a device to show how effective the OWA approach might be in a more or less ideal situation (for the data intruder). In practice, a data intruder simply has to work with the variables present in A and B. Prior knowledge could still allow the intruder to identify separated variables with correlations above a given threshold. So the most appropriate comparison is perhaps with randomly partitioned variables, where the OWA intruder removes all variables from A that do not have at least one correlation above a given threshold with a variable in B, and vice versa. Of course, the Bayesian intruder would still use all the variables.
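For concreteness, a minimal sketch of this filtering rule follows (the names are hypothetical, and `train` is assumed to be a pandas DataFrame of training data containing both sets of variables).

```python
# Sketch of the filtering rule described above: keep a variable only if it has
# at least one absolute correlation above `threshold` with a variable in the
# other file. `train` is assumed to be a pandas DataFrame.
def filter_variables(train, vars_a, vars_b, threshold=0.7):
    corr = train[vars_a + vars_b].corr().abs()
    keep_a = [v for v in vars_a if corr.loc[v, vars_b].max() >= threshold]
    keep_b = [v for v in vars_b if corr.loc[v, vars_a].max() >= threshold]
    return keep_a, keep_b
```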
Simulation results for randomly partitioned variables are shown in Table 4. Nin and Torra's threshold of 0.7 was used for OWA, except for the ionosphere dataset where a threshold of 0.5 produced better performance. Again, these are mean numbers of matches over 100 randomly generated samples. On each iteration, the variables are randomly partitioned such that the maximum difference in partition size is 1. Performance tends to be generally worse than under the original partitioning scheme. Again, we find that the Bayesian approach tends to be superior to the OWA approach, except for the abalone dataset. Fortunately, there are published metadata (Lichman, 2013), so we can investigate why the difference in performance is less marked for the abalone data.
The abalone data contain nine variables (Table 5). The first variable, Sex, is nominal and on preprocessing has its values replaced with integer codes. All other variables are numeric, and all but the final variable (Rings) relate to abalone size. All the 'size' variables are highly correlated.
The graph in Figure 1 shows the decomposable graphical model fitted from the whole dataset with nodes labelled by variable index. The blue and red nodes represent the partitioning of variables at the 0.7 threshold using Kruskal's algorithm. All the size variables have been included. Clearly, any OWA operator is going to generate some summary measure of size, and it is no surprise that these can be used for linkage purposes. This explains the similar performance of the OWA approach when compared with the Bayesian approach.
For the threshold of 0.5, the variable Rings (with index 8) is also coloured blue. The graph suggests that Rings is conditionally independent of the other variables given Shell Weight. In fact, it is approximately conditionally independent of the remaining variables given any size variable. Thus, its inclusion does little more than add noise to the OWA approach. Similarly, it is likely to be relatively uninformative for the Bayesian approach. The threshold of 1 additionally includes the final variable Sex, coloured red. This is similarly uninformative and results in another drop in performance for OWA. The drop is less marked for the Bayesian approach, as it is less reliant on separating highly correlated variables and will not be affected by the arbitrary integer coding of categorical variables in the same way as the OWA approach. In many ways, the abalone dataset is ideal for the OWA approach, which explains why performance is comparable with the Bayesian approach, even when the Bayesian approach exploits additional variables.
Given this, it is perhaps a little surprising that the results reported for the abalone data in Nin and Torra (2005) are not better (Tables 2 and 3).
We can contrast the abalone dataset with the housing dataset. There are few large positive pairwise correlations, and of 14 variables, only five are included in partitions at the 0.7 threshold. Performance is relatively poor across the board for the OWA approach. It does not appear to perform much better than a random matching strategy for thresholds of 0.5 or 1. On the other hand, there is plenty of structure for the Bayesian approach to exploit, and as larger numbers of variables are considered, performance increases substantially. The metadata also demonstrate that this is more typical of the datasets that we will meet in social statistics, with variables relating to crime rates, ethnicity, social class and so forth. The dermatology and ionosphere datasets are also atypical. The dermatology data relate to the differential diagnosis of erythemato-squamous diseases, and the ionosphere data consist of various radar measurements. The analysis of these datasets is included here simply for comparison purposes.

Precision-recall plots
We have seen that linkage performance drops when we use randomly partitioned variables. We might also expect it to drop if we do not have one-to-one correspondence that can be exploited by the Hungarian algorithm. That is not to say that we will not have structural information to exploit. We might have the constraint that each record in A can map to at most one record in B (injection) and vice versa. In some cases, we might need to entertain the possibility of duplicate records within a dataset. We can compare the OWA and Bayesian approaches without the benefits of postprocessing by generating precision-recall plots. The plots in Figures 2 and 3 were generated from the simulations used to generate Table 4. Results were aggregated over all 100 randomly generated partitions in order to assess the general performance of OWA and the Bayesian approach as classifiers.
For any given threshold on a score (here the posterior probability of a match), we will have a number of false positives fp and a number of false negatives fn. Similarly, we will have a number of true positives tp and a number of true negatives tn.
Precision = tp / (tp + fp) and Recall = tp / (tp + fn). A plot of precision against recall allows the comparison of classification approaches. Good classifiers will produce curves in the upper right of the plot. The area under the curve is sometimes used as a performance metric.
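A precision-recall curve can be traced by sweeping a threshold over the match probabilities; the sketch below assumes `scores` and `is_match` are aligned arrays over all candidate record pairs (aggregated across the simulated partitions), with names chosen purely for illustration.

```python
# Sketch: trace a precision-recall curve by sweeping a threshold over the scores.
import numpy as np


def precision_recall_curve(scores, is_match):
    order = np.argsort(-np.asarray(scores))              # highest scores first
    labels = np.asarray(is_match, dtype=float)[order]
    tp = np.cumsum(labels)                               # true positives at each cut
    fp = np.cumsum(1.0 - labels)                         # false positives at each cut
    precision = tp / (tp + fp)
    recall = tp / labels.sum()                           # tp / (tp + fn)
    return precision, recall


def area_under_curve(precision, recall):
    """Area under the precision-recall curve via a step-wise sum."""
    increments = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(increments * precision))
```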
Although these curves are based on the match probabilities, the curves generated on the basis of Bayes factors would be identical due to the constant marginal probability of a match (the reciprocal of the sample size). The expected performance of a random pairing strategy is shown by a broken line.
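This follows because, with a constant prior probability of a match, π = 1/n, the posterior probability is a monotone increasing function of the Bayes factor, so any threshold on one corresponds to a threshold on the other:

Pr(match | a, b) = π BF(a, b) / [π BF(a, b) + (1 − π)].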
The superiority of the Bayesian classifier is evident. In several cases, the record pair with the highest match probability (over the 100 randomly generated partitions) is a match. This suggests that an intruder could sometimes infer a match with a high level of confidence. It does not, however, imply that such high match probabilities are common. In fact, the curves for individual partitions are highly variable. This is only to be expected: the approach relies on the existence of dependencies between the variables in A and the variables in B and would have no discriminatory power if the variables in A were independent of the variables in B. This also highlights the fact that an informed intruder might be able to readily identify vulnerable datasets from prior information regarding the dependencies between variables.
We also see a general decline in performance when moving from n = 30 to n = 100. This is to be expected due to the reduction in the marginal probability of a match. A priori knowledge of this marginal probability would also help in the identification of vulnerable datasets.

Risk Assessment
Attack scenarios are generally paired with risk measures, and an obvious measure is the probability of a successful re-identification. The plots in Figures 2 and 3 are potentially useful for risk assessment, but the use of precision-recall plots is better illustrated if we consider an alternative attack scenario.
An intruder simply seeking to discredit a data stewardship organisation might attempt to maximise the probability of a successful re-identification by making a single claim of re-identification against the record pair with the highest match probability, and only if that probability is sufficiently high. Very few datasets might result in a claim, but where many datasets are released covering small geographical areas, there might be a significant disclosure risk. One-to-one correspondence between small datasets is also more plausible for this form of release.
So rather than constructing precision-recall plots from all record pairs, we restrict consideration to those that are the most probable matches for each randomly generated dataset/partition. The first thing to note is that the proportion of matches (precision at recall = 1) is no longer constant. Excluding all but the record pair with the highest match probability from each dataset/partition has increased the proportion of matches substantially. For n = 30 and the abalone dataset, we have proportions of 0.21, 0.52, 0.69 and 0.46 for Q1, Q2, Q3 and Bayes, respectively. For n = 30, housing and Bayes, we have a proportion of 0.83. These represent the empirical probabilities of a successful re-identification at a threshold of 0 on the match probability; that is, the proportion of record pairs with highest match probabilities that are matches.
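Under this scenario, the empirical probability of a successful re-identification can be computed by taking only the record pair with the highest match probability from each simulated dataset/partition and recording the proportion of such claims that are correct. A minimal sketch follows (names are hypothetical; `runs` is assumed to be a list of (scores, is_match) pairs, one per simulated dataset/partition).

```python
# Sketch: empirical probability of a successful re-identification under the
# "single best claim" scenario described above.
import numpy as np


def reidentification_probability(runs, threshold=0.0):
    claims = []
    for scores, is_match in runs:
        i = int(np.argmax(scores))                       # most probable pair in this run
        if scores[i] > threshold:                        # only claim if sufficiently high
            claims.append(bool(is_match[i]))
    return float(np.mean(claims)) if claims else 0.0     # proportion of correct claims
```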
Another notable feature is that choosing higher thresholds does not consistently increase the probability of re-identification. The data intruder certainly benefits from considering only the record pair with the highest match probability from each dataset, but only seems to clearly benefit from adopting a non-zero threshold on the match probability for the abalone dataset with the Bayesian approach. One reason for this perhaps counterintuitive effect is that the simulations use the non-sampled data for model determination. Thus, there will be some variation in the generation of match probabilities across distinct samples. The numbers of records for the abalone, dermatology, housing and ionosphere datasets are 4177, 358, 506 and 351, respectively, so we would expect less variation for the abalone dataset. In practice, a data intruder would use the same training data to attack multiple datasets (relating to a common set of variables), generating more consistent match probabilities, and might see some benefit in excluding record pairs with match probabilities below some non-zero threshold.
The results for n = 100 are consistent with those for n = 30. Again, only for the abalone dataset and the Bayesian approach does the use of a non-zero threshold on the match probability clearly increase the probability of re-identification. Risks are generally lower, although we still have a probability of successful re-identification of 0.57 for the housing dataset and the Bayesian approach.
In practice, a DSO might be interested in assessing risk for a single dataset or a collection of datasets. Nevertheless, as long as there is a notion of an intruder-specified threshold on the match probability, precision-recall plots can be useful risk assessment tools. Of course, this also applies in situations where there are common variables that can be used for linkage. Nin and Torra (2005) showed that the OWA approach could perform significantly better than a random pairing strategy. We have shown that a relatively simple Bayesian approach can consistently outperform OWA, the exception being a rather degenerate dataset that is ideally suited to OWA. Both approaches require prior information and/or training data. We need to generate a model for the Bayesian approach, and the OWA approach requires information regarding pairwise correlations. In fact, we should not really consider the OWA approach to be unsupervised.

Conclusions
We have shown that the disclosure risks are of practical significance for the one-to-one matching problems considered in the OWA literature. In this case, linkage performance can be significantly improved by exploiting structural information via the Hungarian algorithm. The precision-recall analysis demonstrated that there can still be appreciable risks of re-identification under more realistic scenarios and when this type of structural information is either not exploited or not present.
In order to exclude experimenter bias, analysis has been restricted to datasets that are considered in Nin and Torra (2005). These are suited to the OWA approach as they contain large numbers of numeric variables. In practical circumstances, we will tend to meet datasets containing categorical variables, not least because numeric variables are often categorised for statistical disclosure limitation purposes. The Bayesian approach was designed to deal with the more usual case, and numeric variables had to be categorised. We expended some effort trying to optimise the OWA approach and hardly any effort trying to optimise the Bayesian approach. Given the above, and the difference in performance, we would have to recommend the Bayesian approach over the OWA approach for risk assessment. Precision-recall plots are useful tools for risk assessment and can be generated for both collections of datasets and individual datasets, and under various attack scenarios.
An obvious extension to the Bayesian approach is to exploit the information in non-key variables to improve classical record linkage. This is an area for future work; some early work has shown that it can be effective for some linkage problems. Another consideration is that the Fellegi-Sunter approach is designed to accommodate errors (perhaps introduced via deliberate perturbation) through the m-probabilities. To some degree, this will also be true of the OWA approach. The simple Bayesian approach presented here does not accommodate errors in files A or B unless they are present in (or introduced to) the training data.