Model fitting in Multiple Systems Analysis for the quantification of Modern Slavery: Classical and Bayesian approaches

Multiple systems estimation is a key approach for quantifying hidden populations such as the number of victims of modern slavery. The UK Government published an estimate of 10,000 to 13,000 victims, constructed by the present author, as part of the strategy leading to the Modern Slavery Act 2015. This estimate was obtained by a stepwise multiple systems method based on six lists. Further investigation shows that a small proportion of the possible models give rather different answers, and that other model fitting approaches may choose one of these. Three data sets collected in the field of modern slavery, together with a data set about the death toll in the Kosovo conflict, are used to investigate the stability and robustness of various multiple systems estimate methods. The crucial aspect is the way that interactions between lists are modelled, because these can substantially affect the results. Model selection and Bayesian approaches are considered in detail, in particular to assess their stability and robustness when applied to real modern slavery data. A new Markov Chain Monte Carlo Bayesian approach is developed; overall, this gives robust and stable results at least for the examples considered. The software and datasets are freely and publicly available to facilitate wider implementation and further research.


Introduction
The original motivation for this work came from the estimation of the number of 'potential victims of human trafficking' in the UK, based on the National Crime Agency strategic assessment of 2013. This was part of the strategy leading to the Modern Slavery Act 2015. See Silverman (2014) and Bales et al. (2015). The method used was multiple systems estimation.
Quantifying modern slavery has crucial importance for policy. For example Cockayne (2015) has written 'without good data on where slaves are, how they become slaves and what happens to them, anti-slavery policy will remain guesswork', and goes on in this context to cite the use of multiple systems approaches as a significant innovative approach in a field where good quantification is in its infancy. It is not just in narrow policy terms that good prevalence estimates are important; they also play a vital role in raising the public and political consciousness of modern slavery.
Multiple systems estimation is a development of the classical capture-recapture approach and has been used in many different contexts, such as counting casualties in armed conflicts (Manrique-Vallier et al., 2013;Manrique-Vallier et al., 2019) and numbers of injecting drug users (King et al., 2013). Cases that come to light are recorded on a number of different lists. By identifying cases across the various lists, the numbers that fall on each possible combination of lists are tabulated. Then a mathematical model is used to estimate the 'dark figure' of cases that have not come to attention, and so are not recorded on any list. For an overall survey, see Bird and King (2018).
Crucial to this approach is the choice of model, in particular deciding which interactions or correlations to allow between the various lists. Some methods choose a particular model, while others seek a model averaging approach. This paper reviews a number of methods and investigates their performance on a range of real data sets. There is a deliberate focus on data collected in the area of modern slavery and human trafficking, because the primary aim of this paper is to develop methodology relevant to that area. In addition one of the data sets considered, drawn from the wider human rights area, relates to deaths in the Kosovo conflict in 1999. The choice of existing methods for discussion and review is again guided by our particular context, focusing on methods already proposed for the multiple systems analysis of human rights and modern slavery data.
The modern slavery context presents particular challenges for the use of multiple systems analysis. There is no 'ground truth' available to investigate the accuracy of any estimates, and so we need to assess other properties of estimation methods. For example, it is clearly desirable to have reasonable stability under operations such as combining or omitting lists with small counts or adjusting model parameters. Also, if multiple systems estimation is to be used more widely to quantify modern slavery, it is important to consider the performance of the various possible approaches specifically on data sets of the kinds likely to be observed. Furthermore it may be important that there should be an agreed standard approach, at least as a starting point for more detailed investigation, and it is hoped that our detailed comparative study may contribute to that.
Another issue that has to be borne in mind is the extremely sensitive nature of the data. Typically, much as one would like more details, such as covariate information, about the individuals observed in the study, these are not available to the statistical analyst. Without giving assurances of confidentiality to individual victims, for example, it would often not be ethical or even possible to collect their data. Collation of data between lists naturally involves sharing or matching information, but this is often done by a trusted individual who cannot reveal any details. Indeed, on some occasions all details of the lists themselves, and even of the type of organisation that provided particular lists, have to be obfuscated.
Our comparative study using real data sets and the methods so far proposed will demonstrate that, unfortunately, all the existing methods display instabilities of various kinds, sometimes dramatic, when tested on the real data sets. To address this issue, we introduce a Bayesian/thresholding approach that places prior distributions on the individual terms in the standard model.
In Section 2 below, the various data sets are reviewed and tabulated. Section 3 sets out the standard Poisson model which underlies various possible approaches. Section 4 then examines frequentist approaches to model selection, including the one used by Silverman (2014). Two other, rather different, Bayesian methods have been proposed and these are investigated in Section 5. In Section 6 our proposed Bayesian/thresholding method for the Poisson model is introduced. This casts the problem in a form where a standard Markov Chain Monte Carlo package can be used to estimate the parameters, but there are some mathematical aspects that have to be taken into account for this to work. The method is demonstrated on the various data sets; it appears to avoid some of the gross instabilities that can arise with the existing methods, but still requires care in its application. Finally, some conclusions are drawn in Section 7.
A key factor in developing a standard approach is the open accessibility of data and of methodology. All the data sets, together with R software to implement the methodology described in this paper, and to reproduce its results, are given in Silverman (2018b). For some additional remarks about the importance of open data and open research, see Silverman (2018a).

The data sets
The full data analysed by Silverman (2014), broken down into six lists, are given in Table 1. The abbreviations of the lists are explained as follows: LA Local authorities NG Non-government organisations such as charities PF Police forces GO Government organisations, such as the Border Force and the Gangmasters and Labour Abuse Authority GP The general public, through various routes NCA The National Crime Agency Some of the methods we consider do not deal with more than five lists, and so for some purposes we will combine the list PF with the list NCA to construct the 'UK five-list' data set. The NCA is not, strictly speaking, a police organisation, but it has many powers and characteristics in common with police forces and so combining these two lists is the natural way to reduce to a smaller number.
In addition, the list GP raises issues because cases on this list may not always Table 1: Potential victims of trafficking in the UK, 2013. Numbers of cases on each possible combination of lists. For example there are 54 cases that appear only on the LA list, and 15 cases that appear on the overlap between LA and NG, but not on any others. There is one case that appears on all four of LA, NG, PF and GO but not on the other two. Those combinations of lists for which no cases were observed are omitted from the table, but are still taken into account in the analysis. From Bales et al. (2015).
be specified in sufficient detail to allow for reliable matching with other lists. Therefore, at least to test for the robustness of any results, it will be helpful to consider, in addition to the full and five-list data sets, a 'UK four-list' data set constructed by omitting the GP list and combining the PF and NCA lists. The total number of observed cases is 2744 for the five-and six-list data, but only 2428 for the four-list data set. A second important data set Cruyff et al., 2017) comprises six lists for identified victims in the Netherlands for the period 2010-2015. The data are given in Table 2. For a five-list version of these data, we combine the two smallest lists I and O. The total number of observed cases in this data set is 8234.
The third example is constructed from data collected by eight agencies in the New Orleans-Metairie Metropolitan Statistical Area (Greater New Orleans) and analysed by Bales et al. (2019). These include 185 individuals who interacted with law enforcement and service providers in Greater New Orleans during the year 2016. They are given in Table 3. The sensitivity among the various agencies, partly for legal reasons, means that it is not possible even to label the lists themselves informatively. No further information was available to the statistical analysis than the table itself, with lists labelled A to H. Where it is necessary to reduce the number of lists, a five-list data set is constructed by combining the lists with the four smallest counts into a single list BEFG.
Finally, we consider a data set from a different area of human rights, that Table 2: Victims of trafficking in the Netherlands. Numbers of cases on each possible combination of lists, leaving out combinations for which no cases were observed. The lists are as follows: P = National Police; K = Border Police; I = Inspectorate SZW (Ministry of Social Affairs and Employment); R = regional coordinators; O = residential treatment centres and shelters; Z = others (for example, ambulatory care centres, organizations providing legal services, Immigration and Naturalization Service). Constructed from van Dijk et al. (2017), Table 3.
of determining the numbers of victims of armed conflict. The data, due to Ball et al. (2002), relate to the numbers of those killed in Kosovo in a three month period in 1999. They are available within the R package LCMCR (Manrique-Vallier, 2017) and are reproduced in Table 4. This four-list data set, which includes 4400 known victims, displays high correlation between lists, and has larger numbers in the higher order three-list and four-list overlaps than do the modern slavery examples. This is in the nature of the particular application and is highly unlikely to occur in any modern slavery data set.

Models and methods
In this section, we review the basic log-linear model as proposed by Cormack (1989). Suppose we have K lists labelled {1, 2, . . . , K}. For each subset A of {1, 2, . . . , K}, let N A be the number of cases that occur on all the lists in A but on no others. So, if K = 6 there are 64 possible subsets A, including the empty set ∅. The 'dark figure' is the number of cases N ∅ that do not appear on any list. Using the UK data as an illustrative example, Table 1 only gives counts for 26 subsets A, and the first step in the analysis is to reinstate all the rows in the table for which the observed count is zero, yielding 63 observations in all. There is no observed count for the dark figure.
The basic model is that each N A has, independently, a Poisson distribution with parameter λ A , with some structure on the λ A . This is quite a strong assumption, because it assumes that the cases each behave independently of one another and obey the same probability laws of appearing on the various lists. Especially if there are observed covariates, the model will only be a jumping off point for more detailed modelling, but it is at least a start. The model does not assume that the various lists are independent; interactions between the lists are allowed by appropriate modelling of the parameters λ A .
Under the model, the dark figure N ∅ ∼ Poiss(λ ∅ ). It is likely that the estimation error in λ ∅ will be much larger than the Poisson variation, and so in practice the parameter estimate of λ ∅ will be taken as the estimate of the dark figure, though if possible the Poisson variation should be taken into account too. To get an estimate of the total population, the estimate of the dark figure is added to the total number of cases actually observed.
The Poisson model can also be seen as an approximation to a multinomial model where there is a fixed (unknown) total population size, and cases independently fall on the various lists or combinations of lists with probabilities proportional to the expected values under the Poisson model. Cormack (1992) provides a way of using the profile likelihood under the Poisson model set out below, to obtain confidence intervals for the total population size under the multinomial model.
For the most part, the model we will investigate will be of the form For example, if K = 6 then there will be six main effects α i and 15 two-list interactions β ij , making 22 parameters altogether to be estimated from the 63 observable values N A . Within this model, we have log λ ∅ = µ. Therefore the estimate of the dark figure is exp µ; we do not actually need estimates of the other parameters to estimate the dark figure.
There are basically two approaches to model fitting in this context. One is to use a model selection criterion to choose a particular set of parameters to fit, constraining all the others to zero. The other is to use some sort of model averaging approach, usually of a Bayesian nature.

Frequentist model selection
The package Rcapture (Baillargeon and Rivest, 2007) can be used as the basis of various approaches, which are explored in this section. The simplest is to set all interaction terms to zero, fitting main effects α i only. Under this model, the lists themselves are independent, an assumption that may be unrealistic. Nevertheless this model may be a good reference point for more detailed analysis.

Adding parameters stepwise
In their original work on the UK data, Silverman (2014) and Bales et al. (2015) used a stepwise approach, starting with main effects only and then adding twolist interactions β ij stepwise. At each step, the interaction that best improves the AIC criterion is chosen. The process of adding interactions is stopped if the AIC cannot be improved by adding an interaction, or if the new interaction is not significant at some threshold. This variable selection method is implemented within the R package modslavmse (Silverman, 2018b), and makes use of the package Rcapture.  Table 5 shows estimates and confidence limits using main effects only, and the stepwise procedure with two different p-value thresholds, for the UK data summarised into six, five and four lists as set out in Section 2. The original work used the stepwise method with p = 5%. Here and subsequently in this section, the confidence intervals are constructed from the profile likelihood using the approach of Cormack (1992) as implemented within Rcapture. Both the six-and five-list data give a 95% confidence interval, conditional on the model choice, of 10,000 to 13,000 in round terms. Using main effects only, or a more stringent criterion for adding parameters to the model, gives larger estimates. The results for the four-list case, on the other hand, give smaller estimates, but none of these effects is dramatic.

Choosing from a large class of models using an information criterion
The stepwise method is not the only possibility. Another approach is to fit all possible models, considering every subset of the interactions, and to choose between these using some criterion. This can be done using the routine closedpMS.t within the package Rcapture. If the full six-list data are used, then there are 2 15 models even if only pairwise interactions are considered, which presents an excessive computational burden. To make the method computationally feasible in practice, the approach is only applied to the five-list data, allowing for two-list interactions only, leaving 2 10 models to be considered. The package Rcapture displays the results using the BIC rather than AIC as the primary method of There is a subsidiary cloud of results corresponding to a much larger estimate for the population size, and the estimate for the best BIC is actually within that cloud.
Closer examination of the top ten models chosen by each of BIC and AIC is instructive. There are no models in the top ten for AIC which yield estimates over 17,000, and only one which yields an estimate much outside the range suggested in the original analysis. On the other hand, BIC chooses models yielding a much wider range of estimates. Overall, the results for the five-list data demonstrate that the estimate of the total population can vary considerably depending on the model that is chosen, and that even concentrating on wellfitting models, by some criterion, does not necessarily resolve this issue.
Because of the caveats about the GP list, the analysis was repeated for the four-list data. Figure 3 shows all models and demonstrates that the cloud of points corresponding to the much larger estimate disappears altogether if GP is omitted.

Further examples
In Table 6 we present the results of applying the main-effects only and the stepwise AIC choice methods to the other three example data sets. There is a somewhat alarming instability in the analysis of the Netherlands data; combining the two smallest lists more than doubles the stepwise estimates. There is no such instability if main effects only are fitted. Some further intuition may be gained from Figure 4. There is a long tail of models with very large estimates.
Although the globally optimal model according to BIC is not in this group, the stepwise method is choosing one of these, indeed one yielding almost the largest estimate among all choices of model. For the full New Orleans data with eight lists, the lower threshold for the p-value yields a very different estimate, indeed one where the profile likelihood does not allow an upper 97.5% confidence value, and a warning is generated by the routine within Rcapture. With a large number of lists and so many possible parameters to fit, it is not surprising that it should be inappropriate to use p = 5%. All the other estimates are similar to estimates fitting main effects only, which are virtually unaffected by reducing to five lists.
The Kosovo data yield a quite different result if interactions are allowed. This is to be expected given the strong correlations evident in the data.

Identifiability and existence of estimates
The three data examples drawn from the study of human trafficking all give rise to contingency tables with some zero cell counts. This raises issues discussed in generality by Fienberg and Rinaldo (2012a), and in our particular context by Chan et al. (2019).
One possibility is that there are no finite maximum likelihood estimates of all the parameters, but that the likelihood is maximized when one or more parameters tend to minus infinity. This yields what Fienberg and Rinaldo (2012a) term an extended maximum likelihood estimate, which gives a bona fide estimate, possibly zero, of each λ A . This is handled within Rcapture, somewhat unsatisfactorily, by returning large negative estimates for some of the β ij . These then give estimates for the resulting λ A which are very close to zero. A consequence of this behaviour is that it is no longer possible to expand the log likelihood as a quadratic approximation around the maximum, and hence the standard likelihood theory, including the justification of information-based criteria, breaks down. This breakdown is discussed further and illustrated in a small simulation example by Chan et al. (2019). Other aspects will be considered in Section 6.2 below.
There are two other estimability issues for maximum likelihood, neither of them addressed in Rcapture. One is that the extended maximum likelihood estimate does not exist; consideration of a small artificial example in Chan et al. (2019) shows that this may manifest itself as an infinite (or, numerically, very large) estimate of the dark figure. Fienberg and Rinaldo (2012b) show in a very general context that non-existence of the extended maximum likelihood estimate can be checked by solving a linear programming problem, set out for our particular case in Chan et al. (2019). The other possibility is that, although the likelihood can be maximized, the parameters that attain this maximum are unidentifiable; this can be checked by finding the rank of a particular model matrix. Chan et al. (2019) derive an efficient algorithm that can demonstrate (without actually checking every single model) whether either check would be failed by any choice of the set of interaction parameters β ij to include in the model. For each of the data sets considered in this paper, every possible model passes the checks. Thus it seems unlikely that the instabilities and bimodalities in the estimation displayed in Section 4, and in some of the other methods discussed later in the paper, are due to the problems considered by Fienberg and Rinaldo (2012a).

Bayesian approaches
Two rather different Bayesian approaches have been proposed or developed specifically for human rights data. Their performance on our data sets is reviewed in this section. Unfortunately, neither method escapes the instabilities already seen on the actual data sets.

Graphical models
A graphical models method was developed by Madigan and York (1997) and implemented in the package dga (Johndrow et al., 2015). This uses every decomposable graph model of dependencies between the various lists, and obtains the joint posterior probabilities of the models and the total population size. The routine bma.cr which carries out the analysis requires an array of possible values of the dark figure. A reasonable standard range is from zero to 10 times the number of cases actually observed, but this will be discussed further below. The routine is only fully implemented for three, four and five lists, where the numbers of possible models are 8, 61 and 822 respectively. The combinatorial burden becomes excessive if six or more lists are used. Therefore, the method is only applied on the five-list versions of the UK, Netherlands and New Orleans data, as well as on the Kosovo data and the four-list UK data.
For the UK data, initial application of the method on the five-list data showed a strong bimodal distribution which extended beyond the standard range, and so the calculation was repeated with the range for the total population extended to 40000. The results for both the five-and the four-list data are shown in Figure Table 7, though of course in the case of the full five-list data these are not an adequate description of the bimodal distribution. Now turn to the Netherlands data, where the number of observed cases is 8234. The top panel of Figure 6 shows the posterior when calculated on the range of up to ten times this figure for the dark figure. In contrast with the UK data, there is no suggestion of any second mode within this range. However, Post. Prob. By Model Figure 5: The posterior distribution of the total population size for the UK data, for the five-list and four-list data, using the method of Madigan and York if the range is extended further, a noticeable mode appears, which has total posterior probability about 34%. The quantiles for the two estimates are given in Table 7.
Results are also given in Table 7 for the New Orleans and Kosovo data. In these cases the posterior distribution is definitely concentrated within the standard range. Interestingly, and in contrast with the other data considered, these two datasets illustrate two extremes of the method. For the New Orleans data, the largest posterior probability of any of the possible models is about 0.05, so no model is dominant, while for the Kosovo data, one model has posterior probability nearly 0.99. The corresponding probabilities (for the extended ranges) are 0.44 for the UK data and 0.67 for the Netherlands data.

Dirichlet process mixtures
Another approach that has recently been proposed is a Bayesian latent class method (Manrique-Vallier, 2016). This is implemented in the R package LCMCR (Manrique-Vallier, 2017). It provides a Markov Chain Monte Carlo estimate of the population size. In contrast with the method described in Section 5.1 above, there is no restriction on the number of lists. The results for the various data sets are shown in Table 8. Because the output from the method is a Monte Carlo estimate, it is necessary to check whether there has been sufficient burn-in and also whether the output demonstrates sufficient mixing to be reliable. In order to ensure reproducibility the seed was set to 12345 rather than the default setting which yields different results each time. To ensure better mixing than the default, the parameter thinning was set to 100 and the burnin value was set to 100K.
Comparing the two Bayesian methods of this section is instructive. For the UK data not omitting the GP list, the Dirichlet process approach essentially ignores the lower component of the posterior distribution found by the Madigan-York method and displayed in the upper panel of Figure 5. Once GP is omitted, the two methods give very similar results. For the Netherlands data, the Dirichlet approach homes in on the upper mode for the full data and the lower mode for the five-list case-the reverse of the behaviour of the AIC stepwise approach.
6 The Bayesian/threshold approach 6.1 Defining the prior and thresholding the results In this section, we return to the Poisson log-linear model as specified in Section 3, and set out a Bayesian/threshold approach to fitting the model, dependent on two prior parameters, λ and τ . The first step of the model is to specify a prior which does not constrain the intercept parameter or the main effects, but allows for the prior to shrink the interaction parameters towards zero. In the second step, those interactions for which there is no strong evidence that they are not zero are dropped from the model, and the analysis repeated. The steps of the model are as follows.
Step 1 Use a prior model under which • the parameters µ, α i and β ij for all i and j are independent • µ and the α i have uniform (improper) prior on (−∞, ∞) • the β ij have Gaussian prior with mean zero and variance 1/λ for λ ≥ 0. If λ = 0 this is interpreted as an improper uniform prior on (−∞, ∞).
In every case the R package MCMCpack (Martin et al., 2011) and in particular the function MCMCpoisson enable MCMC to be used to simulate from the posterior distribution. The improper uniform prior is the default for parameters within MCMCpoisson.
Step 2 Constrain to zero those β ij for which the ratio of their posterior mean to their posterior standard deviation does not pass some threshold τ , and repeat the MCMC analysis with these β ij omitted.
One justification for the thresholding step is that it is an approximation to a prior for the interactions which is a mixture of an atom of probability at zero and some other distribution, a prior which in other contexts leads to a thresholding approach; see, for example, Johnstone and Silverman (2004). The exact implementation of such a prior is a topic for future research. If τ = 0 then no thresholding is carried out. In broad terms, a case is exp(β ij ) times more or less likely to be on both the lists i and j than if occurrence on the lists is independent. This interpretation makes it seem unlikely that values of β ij much outside the range ±1 should be contemplated, and so, if a Gaussian prior is used, the precision parameter λ might be chosen in the range 1 to 10.
Turning to the thresholding parameter, two different approaches will be investigated. The first is to take a 'liberal' view, to include interactions where they are not clearly spurious; this would suggest using a threshold parameter of something like 2. The other is to take a 'parsimonious' view, using a much larger threshold, so that interaction parameters will only be included if there is very strong evidence that they are not zero. For this approach we use a threshold of 5, admittedly chosen rather arbitrarily.

Implementation issues
There are two implementation issues taken into account in the package modslavmse (Silverman, 2018b). Firstly, the routine MCMCpoisson in MCMCpack does not appear to deal properly with the case where some of the parameters have an improper uniform distribution while others have finite variance, so if λ > 0 the calling routine MCMCfit in modslavmse gives the intercept and main effects a prior with large finite variance 10 4 . Note, in passing, that a proper Bayesian approach will avoid the issues considered in Section 4.4 because there will necessarily be a well-defined posterior distribution for the parameters.
If an improper prior is used (λ = 0) for the interaction parameters, then some care is needed. Consider the UK data as in Table 1. No cases fall in both lists LA and GP, whether or not in combination with other lists. If the improper uniform prior is used for the corresponding interaction parameter β LA,GP , then we show that the posterior distribution of β LA,GP is concentrated at −∞ and set out the way that the other parameters can be estimated by MCMC. This is an instance where the maximum likelihood approach leads to an extended maximum likelihood estimate of the parameter; see Chan et al. (2019) for further discussion.
In the general MSE model, suppose there is a pair of lists, without loss of generality lists 1 and 2, which contain no case in common. Then N A = 0 for all combinations A of lists containing both 1 and 2. To find the posterior distribution of β 12 , for each combination B of lists, define Because no cases are observed in the overlap of lists 1 and 2, we will have N B = 0 for all B ⊇ {1, 2}. So the conditional likelihood of β 12 given all the other parameters satisfies log L(β 12 |no cases in common between 1 and 2, all other parameters) where C > 0 depends only on the parameters other than β ij . Whatever the value of C, the log likelihood (2) is maximized as β 12 → −∞.
The posterior density of β 12 is proportional to exp(−Ce β ). Although this appears at first sight to be an improper distribution, this function has the properties that, for all y, y −∞ exp(−Ce β )dβ = ∞ and ∞ y exp(−Ce β )dβ < ∞ so that P (β 12 > y)/P (β 12 ≤ y) = 0). This corresponds to the distribution where β 12 = −∞ with probability 1. Since this is true conditional on all the other parameters whatever their values, the unconditional posterior distribution is the same. Hence the posterior distribution of the Poisson parameter for every B that includes lists 1 and 2 is an atom of probability at 0. Given the value −∞ for β 12 , the distribution of every N B for each B ⊇ {1, 2} is then Poisson with parameter 0, in other words the constant value 0, regardless of the other parameters, while for all other B, N B ∼ Poiss(λ B ) with λ B defined as in equation (1) above. So, as asserted above, the likelihood of all the other parameters conditional on β 12 = −∞ is then obtained by simply omitting all combinations of lists which contain 1 and 2.
Returning to the UK data example, where there are 6 lists and hence 63 observable combinations B, we omit the 16 N B for which B included both lists LA and GP, leaving 47 observations from which to estimate the remaining 21 parameters. In fact there is a second pair of lists for which there is no overlap at all, namely LA and NCA, and by the same argument the parameter β LA,N CA is also estimated to be −∞ with probability one. Removing (from the 47 remaining combinations) all combinations of lists containing both LA and NCA leaves 39 observations from which to apply the MCMC approach to the remaining 20 parameters. Within the package modslavmse, the routine removeemptyoverlaps, which is called from MCMCfit, produces the relevant data matrix and also a list of those interaction parameters that take the value −∞ in the posterior.

Results
In this section, the results for the three examples are presented, exploring the effects of using various priors and various thresholds.
The results for the UK data are given in Tables 9, 10 and 11. The first line in each table shows the result of fitting the main effects only, with no β ij considered. Once interactions are considered, the results are not enormously sensitive to the prior, especially if a non-zero threshold is used. If there is no thresholding, so that all interactions are included within the model, then the posterior credible intervals are much larger, but the central estimate is similar.
Computationally, the uniform improper prior is the fastest, though the possibly more plausible prior with variance 1 gives much the same results. A model with every possible interaction is more complicated than the amount of data can bear, and the thresholding at a threshold of 2 is a liberal approach which nevertheless eliminates extraneous complication. This would tend to suggest a point estimate of about 12.3K for the overall prevalence, with an 80% credible interval (rounding to the nearest 500) of about 11.5K to 13.5K and a 95% credible interval, in round terms, of 11K to 14K. This is about a thousand more than the confidence interval obtained from the fixed model, but this is possibly because of the model averaging taking some note of the second group of models exemplified by the model chosen by the BIC criterion. Increasing the threshold to 5 makes little difference for the full data, and, as we see below, homes in on just a single interaction among the lists.   .3 9.7 10.6 11.5 11.9 Uniform 5 9.5 9.9 10.7 11.6 12.1 Variance 10 5 9.5 9.9 10.7 11.6 12.1 Variance 1 5 9.5 9.9 10.7 11.6 12.1 Variance 0.1 5 9.5 9.9 10.7 11.6 12.1  Now turn to the Netherlands data, the results for which are shown in Table 12. Again, and not surprisingly, if all interactions are considered in the model then the posterior intervals are much wider. However, if the thresholding procedure is used to restrict attention to a smaller number of interactions, then the width of the intervals is not dramatically different from the main effects model. Threshold 2 with variance 1 appears to be an exception; however, examination of the results shows that only 5 of the 15 two-factor interactions are thresholded out. As a check, the method was run on the five-list version of the data, with the two smallest lists consolidated. The results for the five-list data were, in general, slightly lower for thresholds 0 and 2 and slightly higher for threshold 5. The only substantially different case was variance 1 threshold 2, where the five-list data results are about 70% of the result for the six-list data.
The New Orleans data are a smaller set of observations, and also consist of eight lists with none of the overlap sets containing more than two cases. Therefore it does not seem appropriate to use more than the main effects model and that approach was adopted in the original analysis (Bales et al., 2019). However it is of interest to see what would happen if we use the Bayesian approach allowing for interactions. Some trials suggest that if the full 8 list data are used, the MCMC algorithm requires both a long burn-in period and then a long run, and possibly other adjustments to the control parameters, to give reasonable mixing in the posterior realisations. For simplicity, therefore, we analyse the five-list version, and the results are given in Table 13. The variance 1 threshold 2 model (and indeed some of the other models) give results identical to the main effects only model, and closer examination of the estimates within the package shows that the thresholding step in fact removes all the interactions leaving main effects only. However, the uniform prior, even with strong thresholding, gives different estimates. To understand why, note that there are 10 two-factor interactions β ij between the five lists. In three of these cases, the observed overlap between lists i and j is zero, and so the corresponding β ij is estimated as −∞ regardless of the thresholding. Even with a moderate threshold all the other interactions are thresholded out, but the model is fitted not just based on the main effects only but with three of the interactions included and estimated to −∞. If the original eight-list data are considered, then the effect is much stronger, with 18 of the 28 possible interaction parameters estimated as −∞.
The Kosovo data are unusual in that all the models allowing for interactions give broadly similar results; see Table 14. The thresholding has little or no effect, even at a threshold of 5, because most of the interactions are very strong.

Choosing the threshold for interactions
The Bayesian approach avoids the necessity of choosing a particular model, but it still contains tuneable prior parameters. The implausibility of very large positive or negative values for the interaction parameters suggests that a prior variance of 1 is a reasonable choice. The standard MCMC software does not allow for the mixed model with an atom of probability at zero for the parameters, a topic for future research, but the thresholding approach gives a simple alternative.
In Table 15 we see the interactions that exceed the threshold at the first stage for both thresholds considered. The three results for the UK data are en-  Table 15 suggest that the less restrictive threshold 2 is probably to be preferred at least as a starting point. Interestingly, reducing the threshold in the Netherlands data to 4.5 yields a similar result to threshold 2. For the New Orleans data, any reasonable level of thresholding leads back to the fitting of main effects only, which is probably the most realistic model given the number of lists and the numbers of cases in the various overlaps. On the other hand, for the Kosovo data, only one of the interaction effects is thresholded out, even at the high threshold. This is not surprising since the data clearly demonstrate strong inter-list correlations, and it is very reassuring that even the high threshold adapts well to data of this kind.
Overall, consideration of these examples suggests that the Bayesian/threshold model with variance 1 and threshold 2 adapts reasonably well to the characteristics of different data sets, although it is advisable not to apply the method completely blindly.

Conclusions
Estimating and keeping track of the numbers of victims is a crucial component of the fight against modern slavery. If multiple systems estimation is to be used as one of the standard methods, then the stability and robustness of point and interval estimation is an important consideration. The most stable method would, of course, be to ignore the possibility of interactions and simply fit main effects, but the Kosovo example shows that this would clearly be inadequate in some practical cases. It would also fail to take account of the correlations which are not unexpected between the lists obtained in the modern slavery context.
The Bayesian approach of this paper, with a threshold of 2 and a prior variance of 1 for the interaction parameters, is at least a candidate. On the data sets considered, it gives results which are stable and robust when smaller lists are combined, and it automatically rules out implausible secondary estimates which are almost certainly spurious. If it is desirable to obtain parsimonious explanations in cases where there may be interactions of particular interest, then the threshold can if necessary be increased. The approach adapts well between data such as the Kosovo data, with strong dependencies between lists, and those situations where few, if any, interactions are clearly present in the data.
One contrast between the modern slavery data sets and the Kosovo data is that the modern slavery data are much sparser, in that not every combination of lists is observed at all. This is not a reflection of the data quality, but is intrinsic to the field. In modern slavery, we will often wish to quantify the number of victims in a fairly constrained geographical area over a reasonably short time period, and so the total population size may be quite small, as in the Greater New Orleans example. Even when we consider larger data sets, such as the Netherlands data, the number of cases actually observed may only be a relatively small proportion of the total population. Sparse data, and lists that do not overlap at all, are the norm rather than the exception, and methods need to take account of that. Of course, it is to be hoped that as public and political consciousness about modern slavery increases, a larger proportion of cases will actually come to light, but this is likely to be a long process. Further recent work on this aspect by the author and colleagues is reported in Chan et al. (2019). It should also be noted that when multiple systems estimation is used for a census of an animal or an easy-to-count human population, attempts can be made to design the surveys or captures to be independent of one another and also to be large enough to avoid the sparsity issues raised by the modern slavery data sets; however, in most human rights contexts, there is no such control over the way that lists arise.
The availability of real data has been an important contribution to the study carried out in this paper, because data on modern slavery and human trafficking will have specific characteristics which need to be taken into account. It is to be hoped that more data sets will be put into the public domain, of course in formats that preserve the privacy of individuals and do not hamper the primary task of rescuing and supporting victims, bringing perpetrators to justice, and discouraging modern slavery in the future.
Multiple systems estimation is not a panacea, but part of the quest for better information and understanding. A key topic for discussion and for future research is how we can build on a whole range of different information and methods to gain a deeper understanding of modern slavery. For example, a promising development is the typology developed by Cooper et al. (2017) and the associated case file coding template. More widely, the important role of research in fighting modern slavery is underlined by the research priorities set out in the Government's 2018 Annual Report on Modern Slavery (HM Government, 2018). A broad discussion of the actual and potential modes of measurement, and how these fit in to the legal, definitional and historical background of modern slavery, is given by Landman (2019).
There are several avenues for future research on the multiple systems methodology set out in this paper. For example, how can it be developed to handle concomitant information and segmentation of populations? What is the best approach when the aim is to discern whether the overall level is different between two time points or between two different sectors or geographical areas? Can the approach be easily extended to the case of fuzzy matching, where it is not quite clear whether cases on different lists are or are not the same? Perhaps most importantly, are there particular patterns in data sets drawn in the context of modern slavery and human trafficking, and can these, as well as the prevalence estimates themselves, contribute to a deeper understanding of the problem itself?