A statistical modelling approach for source attribution meta‐analysis of sporadic infection with foodborne pathogens

Abstract Numerous source attribution studies for foodborne pathogens based on epidemiological and microbiological methods are available. These studies provide empirical data for modelling frameworks that synthetize the quantitative evidence at our disposal and reduce reliance on expert elicitations. Here, we develop a statistical model within a Bayesian estimation framework to integrate attribution estimates from expert elicitations with estimates from microbial subtyping and case‐control studies for sporadic infections with four major bacterial zoonotic pathogens in the Netherlands (Campylobacter, Salmonella, Shiga toxin‐producing E. coli [STEC] O157 and Listeria). For each pathogen, we pooled the published fractions of human cases attributable to each animal reservoir from the microbial subtyping studies, accounting for the uncertainty arising from the different typing methods, attribution models, and year(s) of data collection. We then combined the population attributable fractions (PAFs) from the case‐control studies according to five transmission pathways (domestic food, environment, direct animal contact, human–human transmission and travel) and 11 groups within the foodborne pathway (beef/lamb, pork, poultry meat, eggs, dairy, fish/shellfish, fruit/vegetables, beverages, grains, composite foods and food handlers/vermin). The attribution estimates were biologically plausible, allowing the human cases to be attributed in several ways according to reservoirs, transmission pathways and food groups. All pathogens were predominantly foodborne, with Campylobacter being mostly attributable to the chicken reservoir, Salmonella to pigs (albeit closely followed by layers), and Listeria and STEC O157 to cattle. Food‐wise, the attributions reflected those at the reservoir level in terms of ranking. We provided a modelling solution to reach consensus attribution estimates reflecting the empirical evidence in the literature that is particularly useful for policy‐making and is extensible to other pathogens and domains.

a modelling solution to reach consensus attribution estimates reflecting the empirical evidence in the literature that is particularly useful for policy-making and is extensible to other pathogens and domains.

| INTRODUC TI ON
It is estimated that, annually, bacterial foodborne infections cause ~350 million illnesses and ~187 thousand deaths globally, accounting for ~14.5 million disability-adjusted life years (DALY) (WHO, 2015).
Food reaches the consumer through complex production chains in which many opportunities for (re-)contamination exist. For policymaking, it is crucial to know which fraction of the disease burden is attributable to food and which food products contribute to that fraction. To this end, source attribution is used. Source attribution is defined as the partitioning of human disease burden (DALY, incidence, costs, etc.) for a given pathogen over their animal, food, environmental and human (if any) sources of infection (Pires et al., 2009).
The umbrella term 'source attribution' consists of several methods and data types, a detailed overview of which is available elsewhere (Mughini-Gras et al., 2019;Pires et al., 2009). Briefly, source attribution approaches can be divided into 'top-down' and 'bottom-up'. Top-down approaches attribute the human cases (i.e. the 'top' of the transmission chain) back to their sources of infection (i.e. the 'bottom'), and these approaches can be further divided into epidemiological methods, e.g. analysis of outbreak investigations (Pires et al., , 2019 and case-control/cohort studies (Domingues et al., 2012a(Domingues et al., , 2012b, and microbiological methods (i.e. microbial subtyping (Barco et al., 2013)), or a combination of both Mughini-Gras et al., 2018;Rosner et al., 2017).
Conversely, bottom-up approaches like comparative exposure assessment (Pintar et al., 2016) aim to predict the number of human cases caused by a given pathogen in a source, starting from the level of contamination (prevalence and concentration) and then incorporating effects of food processing, preparation, storage and consumption, and dose-response relationships.
When data are scarce, source attribution is often done using structured expert elicitations (Havelaar et al., 2008). Expert elicitations, however, can only provide an indication of consensus opinions among (some) experts in the field. Several limitations undermine the reliability of expert elicitations, such as the issues of properly accounting for qualitative arguments, cognitive heuristics and overconfidence, the selection/representativeness of experts' backgrounds and agendas, and the uncertainty and diversity in opinions.
Although expert elicitations remain a highly valuable instrument for policy-making, they should be considered a low-cost/low-effort alternative to the generation, analysis and interpretation of empirical data. Expert elicitations should actually build on these data and be undertaken only when knowledge remains inadequate or conflicting (Morgan, 2014). Some authors have therefore called for novel data-driven approaches to source attribution to combine different types of data in a single analytical framework (Havelaar et al., 2008;Koutsoumanis et al., 2019;Pires et al., 2018).
For several foodborne pathogens, numerous source attribution studies based on epidemiological and microbiological methods have been performed (Mughini-Gras et al., 2019). These studies provide empirical data for input into a modelling framework that synthesizes the quantitative evidence at our disposal and constrains it with the estimates from expert elicitations as to make best use of all available evidence. In the Netherlands, expert elicitations are routinely used to attribute the burden of 17 enteropathogens to five major transmission pathways (food, environment, direct animal contact, human-human transmission and travel) and 11 groups within the food pathway (beef/lamb, pork, poultry meat, eggs, dairy, fish/shellfish, fruit/vegetables, beverages, grains, composite foods and food handlers/vermin), but the elicitation in question was conducted in 2008 and has therefore become outdated (Havelaar et al., 2008).
Here, we present a way to integrate the existing expert estimates with available empirical data derived from source attribution anal- • Attribution estimates were robust and biologically plausible and allowed human cases to be attributed in several ways according to reservoirs, transmission pathways and food groups, which is particularly useful for policy-making.
• The four pathogens attributed were all predominantly foodborne, with Campylobacter mostly attributable to chicken, Salmonella to pigs, Listeria and STEC O157 to cattle. Food-wise, the attributions reflected those at the reservoir level.
A similar search strategy has been used in a recent collection of manuscripts on systematic reviews and meta-analyses of risk factors for 11 sporadic foodborne diseases globally (Gonzales-Barron et al., 2021). There were no restrictions in terms of year and type of publication. The search was limited to the languages English and Dutch. Each reference was manually and independently screened for relevance by two reviewers, and the following inclusion criteria were applied: the data reported in the study had to be collected from the year 2000 onwards, and the study had to be performed in the Netherlands. If there was no Dutch study available, the inclusion criterion was relaxed to include also studies performed outside the Netherlands. For this, the global overview of case-control studies provided by the aforementioned collection of systematic reviews and meta-analyses (Gonzales- Barron et al., 2021), and other recent review papers (Koutsoumanis et al., 2019;Mughini-Gras et al., 2019) were also used.
The data extracted from the studies based on microbial subtyping that passed the screening were pathogen, typing method, year(s) of data collection, country of the study, source attribution model, number of human cases attributed, sources which the human cases were attributed to and the per cent attribution estimates (the attribution estimates provided by these studies, by default, sum up to 100% over the sources). For the case-control studies, data extraction included the study characteristics (i.e. year, country, population studied and sample size) the statistical analysis performed, the statistically significant risk factors and the outcomes of the study (odds ratios [ORs] and, if available, the corresponding population attributable fractions [PAFs]). Only case-control studies of sporadic cases (i.e. not those conducted in outbreak settings) performed in the Netherlands from 2000 onwards using multivariable logistic regression and reporting ORs as a measure of association were considered. Following standard epidemiological terminology, sporadic cases are those occurring irregularly/occasionally in time and place (i.e. scattered or isolated) and differ from the outbreak-related cases because they cannot be linked to one another or to a common source of infection. The use of adjusted ORs was done to account for possible confounding that needed to be controlled for within each study through multivariable analysis. All studies included in the analysis were checked for redundancy, i.e. whether they used the same data and methods. The focus was on sporadic cases for different reasons: (a) sporadic cases are the majority of cases and are generally more difficult to attribute to sources, as they are not linked to one another or to a common source of infection (Domingues et al., 2012a(Domingues et al., , 2012b; (b) outbreaks are usually caused by one (or just a few) pathogen strain (as clonality is common within an outbreak) and one (or just a few) exposure; (c) outbreak data are usually synthetized using a specific type of meta-analysis (Pires et al., , 2019; (d) combining attributions for sporadic and outbreak-related cases are not recommended, as outbreaks tend to be investigated disproportionally more often than sporadic cases. The epidemiology of a same pathogen can be quite different when this pathogen appears sporadically or as part of an outbreak, with some sources being disproportionally represented among outbreak-related versus sporadic cases simply because these sources (e.g. drinking water) are more likely to be identified and have the potential to cause larger, albeit extremely rare, outbreaks .

| Analytical approach
For the four pathogens included here, there are different transmission routes from an animal reservoir. For example, if the reservoir is a food-producing animal, transmission can be foodborne, but also environment-mediated or occurring through direct contact with the animal. Source attribution based on microbial subtyping usually attributes human cases at the level of reservoir (or amplifying host) regardless of the transmission pathway, whereas case-control studies usually provide attributions at the exposure level (Mughini-Gras et al., 2019;Pires et al., 2009). To address these methodological differences while providing attributions comparable with those based on the aforementioned expert elicitation (Tables 1 and 2) of Havelaar et al. (2008), which were provided at exposure level, two different methods were applied.

| Attribution across reservoirs
For each pathogen, Bayesian statistical modelling was used to estimate the fraction of human cases attributable to each reservoir by accounting for the heterogeneity arising from the use of different typing methods, attribution models and year(s) of data collection in the different studies. This step provided attribution estimates at the reservoir level, which thereby included the contributions of both foodborne and non-foodborne transmission routes.
For each study s, we define ps i,s as the attribution estimates for each reservoir i (with i = 1,…,R) being R the total number of reservoirs. The reservoirs investigated vary among studies (i.e. not all studies considered the same reservoirs), which leads to missing estimates for some reservoirs that are not necessarily zero. Therefore, the values of ps i,s are partial estimates and would not add up to 1 over all reservoirs in the studies. To circumvent this, we normalized the attribution estimates for all reservoirs by considering the (complete) attribution estimate p i,s calculated using the following formula: to ensure that ∑ R i=1 p i,s is constrained to 1. This is, in fact, equivalent to using a Dirichlet distribution and a multinomial distribution likelihood.
We used the alternative method described above because of the missing data in the dataset, and the Bayesian analysis software JAGS (v4.3) (Plummer, 2003) we used is not able to model a partially observed multinomial distribution.
To ensure that ps i,s values are naturally bounded between 0 and 1, we introduced the parameter pp i,s , which relates to ps i,s by the logistic function The source attribution parameters pp i,s have prior normal dis- . The studies providing data for a certain reservoir were heterogeneous in terms of typing methods and types of source attribution models used, as well as years of data collection, and all these factors might influence the attribution estimate p i,s . To account for this, extra terms were considered: b j,s defined as the contribution (logodds) to the attribution estimate from the various typing methods j (j = 1,…,T; with T being the total number of typing methods); b k,s defined as the contribution (log-odds) to the attribution estimate from the various source attribution models k (k = 1,…,M; with M being the total number of source attribution models).
The log-odds b j,s and b k,s followed normal distributions b j,s ∼ N m j , e j and b k,s ∼ N m k , e k . The parameters m and ϵ followed vague prior normal distributions, m j ∼ N j , j ; m k ∼ N k , k ; and Then we considered L as the total number of unique combinations of i, j, k and s, and we defined the attribution estimate where ilogit(x) = e x ∕(1 + e x ). The estimates were calculated with L = 54 cases for Campylobacter, 110 for Salmonella, 128 for Listeria and 8 for STEC O157. In the absence of heterogeneities between studies, p l = p i,s . For each study-reservoir-typing method-model combination, the number of human cases attributed to each reservoir was modelled as drawn from a binomial distribution: with N eff,s being the effective number of human cases in study s. As mentioned above, the set of reservoirs investigated varies among studies. Therefore, unless a study covered all possible reservoirs when using a specific typing method and model, there are missing data for each study, and a few extra observations are needed to fill in the unobserved reservoirs in each study. Of note, these extra observations would also increase the precision of the estimates.
Consequently, we considered an effective number of human cases  N(0, 10). This allowed the model to fill in the missing reservoirs of each study with a proportion informed by the rest of the studies containing the corresponding reservoirs in question.
Recent studies are more likely to be up-to-date and so provide more relevant information for the current situation. Therefore, we gave more weight to the more recent studies by reducing the precision ( = 1∕ 2 ) of the older studies as follows: where hc = 1∕ (2020 − year) 2 , so if the year is 2020, then hc = 1. Thus, Cases l followed the actual number observed in a study ObservedCases l with increasingly less accuracy according to study's age. We then checked whether the different models and typing methods were predominantly used in older studies, as these studies had less influence on the results. However, even after ignoring the age of the studies, the effects of the model and typing method were minimal.

| Attribution across transmission routes
For each pathogen, we combined the PAFs of the significant risk factors from the case-control studies as to group them, where possible, according to the transmission pathway (Table 1) and the different food groups within the foodborne transmission pathway ( Table 2).
When PAFs were not reported in the study, we calculated them using Miettinen's formula (Miettinen, 1974) based on the study-provided multivariable ORs (a proxy for relative risks) and prevalence of exposure among cases, as follows: where pd is the prevalence of exposure to the risk factor among the cases. If necessary (i.e. adjusted ORs or pd not available), Levin's formula (Levin, 1953) was used for the calculation of the PAFs, as follows: where pe is the prevalence of exposure to the risk factor in the overall population as obtained from the Dutch National Food Consumption

Survey
(https://www.rivm.nl/en/dutch -natio nal-food-consu mptio n-survey) if the risk factors pertained to food consumption, or from a continuous population-based survey conducted in the Netherlands since 2008 from which the exposure to different risk factors in the general population can be derived (Friesema, van Gageldonk-Lafeber et al., 2015). This provided attribution estimates at the exposure level for the five transmission pathways and the 11 food groups, to be combined with the estimates from the expert elicitation (Havelaar et al., 2008). For Salmonella, STEC O157 and Listeria, the fractions attributable to 'travel' as transmission pathway could be derived directly from the cases reported to the Dutch national surveillance system for infectious diseases, which are published annually by the Dutch National Institutes for Public Health and the Environment (RIVM) (Vlaanderen et al., 2019).
In a similar way to how the attributions at reservoir level were estimated, a Bayesian framework was used in which prior information is combined with data to arrive at posterior parameter estimates for the fraction of human cases attributable to the different transmission pathways. For the attribution estimates per transmission pathway (also for each food group within the foodborne pathway), there is a total number of cases per study, N total,s . However, a few of them cannot be classified because the set of studied pathways does not cover all possible existing transmission pathways. Therefore, the number of human cases per pathway reported in a given study s that can provide information is N observed,s , which is less than N total,s , and consequently reduces the accuracy of a study's estimate p q,s (with q = 1,…, Q s ). Note that q denotes the transmission pathway and Q s , the number of studied transmission pathways in study s. Let where CP observed is the cumulative probability of a human case being attributed to one of the transmission pathways observed in study s (note that Q s ≤ Q, the total number of possible transmission pathways).
Therefore, the total number of cases in study s that can be classified into one of the observed transmission pathways can be written as To account for all existing pathways, we used a hypothetical effective number of human cases N effective,s . This effective number is to be used in the stochastic process to correctly calculate the posterior distributions for all pathways, even for those unmeasured in a study.
Note that the use of N effective,s modifies the precision of the estimates. Based on expert elicitation data (Havelaar et al., 2008), we inferred the total proportion of all transmission pathways observed in a given study s by simply adding the expert-provided fractional values of the pathways observed in the study. For this, we used the point estimates. Let us consider CP expert,s , the cumulative proportion (from expert opinions) over all transmission pathways that were observed in study s: Note that attribution estimates from the expert elicitation study (Havelaar et al., 2008) were normalized to sum to 1. We can then write: (5) ObservedCases l ∼ N Cases l , hc , N observed,s = CP observed,s × N total,s (10) CP expert,s = p 1,expert + p 2,expert + ⋯ + p Qs,expert .
(11) N observed,s = CP expert,s × N effective,s , which combined with Equation (9) provides the effective number of cases: The number of human cases observed within transmission pathway q in study s is specified as binomially distributed, allowing inference on the attribution estimate p q,s : To allow the data on the proportion for the various pathways from expert opinions, p q_expert , to provide information on the missing pathways per study, we assumed that these contribute to the likelihood as data points drawn from log-normal distributions.
Here, τ q_expert is the precision (τ = 1/σ 2 ) attributed to the logarithm of experts' fractional estimates, which for this study, we have arbitrarily set to 1, i.e. experts' fractional estimates were then as- The script of the model is available as Data S1.
Ethical approval was not required, as this is a meta-analytical modelling study synthesizing data from other published studies.
Only anonymized, aggregate statistics were available from the primary studies.

| Studies based on microbial subtyping
Eleven studies based on microbial subtyping passed the screening: 4 Dutch studies on Campylobacter Mughini-Gras et al., 2012Smid et al., 2013), three of which used Multilocus Sequence Typing (MLST) data with the Asymmetric Island Model (AIM) (Wilson et al., 2008) and one used core-genome MLST (cgMLST) data with the STRUCTURE algorithm (Pritchard et al., 2000). These studies provided attributions for a total of 3205 human campylobacteriosis cases (in 2002/2003, 2010/2011 and 2018/2019)  using different MLST and cgMLST schemes (Nielsen et al., 2017).
These studies provided attributions for 1115 human cases to eight reservoirs (broiler chickens, turkeys, beef cattle, dairy cattle, small ruminants, pigs, fish/shellfish and other/unknown).
The two studies on Salmonella were based on comparable data from 2002 to 2003, but one of the two studies  also included Salmonella serotypes other than Typhimurium and Enteritidis, which were the focus of the other study (Doorduyn et al., 2006). In total, they included 414 human cases and 3165 frequency-matched controls and provided PAFs for 16 significant risk factors from multivariable analysis.
The study on STEC O157, conducted in 2010-2014, included 342 STEC cases and 2260 controls and provided ORs (from which PAFs could be derived) for five risk factors from multivariable analysis (Mughini-Gras et al., 2018). The study on Listeria (Friesema, Kuiling, et al., 2015), however, did not identify any significant risk factor other than underlying chronic diseases, so it could not be used for the purposes of source attribution. We therefore used the pooled ORs of a recent systematic review and meta-analysis of case-control studies for human listeriosis (Leclercq et al., 2020) to calculate unnormalized PAFs using Levin's formula and the prevalence of exposure p obtained from the population-based surveys mentioned in section 2.2.2.
(12) N effective,s = CP observed,s ∕CP expert,s × N total,s (13) N observed,q,s ∼ Binomial(N effective,s , p q,s ) (14) Log(p q_expert ) ∼ N(Log(p q ), q_expert ). Figure 1a shows the fractions of human campylobacteriosis cases attributable to the different reservoirs projected by combining the estimates from the four studies based on microbial subtyping. The main reservoir was estimated to be chicken, accounting for 38.9% (95%CI 35.9%-42.0%) of human cases, followed by dogs and cats (15.4%, 95%CI
Most of the variability in the attributions at reservoir level was accounted for by the typing method for Campylobacter and Listeria, although generally the variability was limited (Figures S1 and S2).  in the Netherlands (Mulder et al., 2021) found that living in areas with high levels of exposure to small ruminants is associated with increased incidence of human STEC O157 infections, suggesting that environmental exposure to small ruminants is a significant risk factor for STEC O157. Also, for Campylobacter, a relatively large attribution to the environmental transmission pathway was estimated (15%).

| Attribution estimates for the different transmission pathways
Contamination of surface water with Campylobacter is widespread in the Netherlands and mostly concerns wild bird-associated strains . However, the contribution of poultry to surface water contamination with Campylobacter has been found to be associated with the magnitude of poultry production .
While results may be expected to vary over countries due to differences in food production and consumption patterns, these estimates are consistent with the general knowledge on these patho-  (Hoffmann et al., 2007). The attributions to some 'rare' sources tended to shrink as compared to the expert estimates, as these estimates were probably influenced by the considerable (media) attention that outbreaks usually receive, whereas the attributions reported here pertain to the sporadic cases, which are less likely to make the news. An example are the sources that experts are inclined to consider important due to some outbreaks that sparkled interest in past years, such as some campylobacteriosis outbreaks linked to unpasteurized dairy (Heuvelink et al., 2009;Kenyon et al., 2020;MMWR, 2013), although it is now clearer that campylobacteriosis is mostly sporadic and that those outbreaks were more an exception than a rule. That is probably why the attribution for Campylobacter to dairy decreased compared to the expert elicitation.
Our approach was also able to capture trends in the attributions by downweighing the importance of older studies. This is exemplified by the attributions of Salmonella to pork and eggs at the exposure level. Indeed, the expert elicitation was performed in a period (2008) in which the most frequently isolated Salmonella serovar from human cases was Enteritidis, a serovar strongly associated with laying hens (Mughini-Gras, Smid, et al., 2014) and therefore mostly transmitted with eggs (Doorduyn et al., 2006), followed by serovar Typhimurium, associated with pigs (Mughini-Gras, Smid, et al., 2014). However, over the years, Enteritidis has decreased spectacularly in the Netherlands and other European countries, whereas Typhimurium and its mono- Mughini-Gras, Smid, et al., 2014). Since 2011, Enteritidis is no longer the primary serovar, meaning that laying hens are nowadays no longer the primary reservoir and that pigs have taken the first place.
Therefore, the attribution estimates now better reflect the current situation. In general, the synthetized attribution estimates were more similar to those of the original studies than the expert elicitation.
This was particularly evident for the Campylobacter attribution to food as transmission pathway, which increased in importance, particularly for beef/lamb as food product. Accordingly, recent studies have highlighted a more prominent role of cattle as source of human campylobacteriosis Thepault et al., 2018).
For Salmonella, the contribution of travel gained importance as compared to expert estimates, which agrees in magnitude with recent findings of a significant drop in notified (travel-related) salmonellosis cases during the COVID-19 pandemic as a result of travel restrictions . Conversely, Listeria attribution to travel decreased compared to expert estimates, which seems to better reflect the reality since contrary to salmonellosis, listeriosis did not seem to be affected by travel restrictions during the COVID-19 pandemic (Benincà et al., 2021).
The attributions are estimated separately for the reservoirs, the transmission pathways, and the exposures within the foodborne pathway; thus, they provide separate pictures of the relative contributions of different sources at defined points in the transmission chain, without inferring a (quantitative) relationship between these points. Another limitation is intrinsic in the datasets used, as no empirical study is as comprehensive and complete (in terms of number and resolution of the sources) as an expert elicitation. This means that for some sources, only the estimates from expert elicitations are available and cannot be updated with other studies unless these become available in the future. Although our method can also be used with other types of data (e.g. from original studies, unpublished material, surveys, etc.) than data obtained from the literature alone, and a literature review was not in itself the goal of this study, the type and quality of the literature search (or data collection in general) will define the interpretation of the model outcomes. The method presented here may thus be used in other settings (i.e. studies), provided that the necessary data are collected from the literature or other sources of information in a way that suits the scope of the study in question. Indeed, the data collection phase must be tailored to the specific research question, not the method per se, and cannot therefore be taken as such from the present study. Yet, different research questions can be answered with the method presented here. While the literature search of this study is not generalizable to other situations as it was meant to fulfil the needs of a specific goal (i.e. model development with application to four major foodborne pathogens in the Netherlands), it can be amended for the purposes of other studies. The same is true for the way the attribution estimates are presented. For example, for the attributions at the exposure level, we used the source categorization defined by the expert elicitation available in the country (Havelaar et al., 2008) because this offered the most complete overview of source categories and corresponding attribution estimates. However, this could be different in another setting where it is possible that this categorization is inapplicable or the focus of the study is limited to, e.g. the foodrelated sources. In that case, both the data collection phase and the framework (i.e. categorization of sources) will have to be amended.
In conclusion, we developed a meta-analytical model that combines the attribution results from different studies and provides robust, biologically plausible estimates, being a valid solution to reach consensus estimates reflecting the available empirical evidence as to inform policy.

CO N FLI C T O F I NTE R E S T
The authors declare that there is no conflict of interest.