Fitting distributions to microbial contamination data collected with an unequal probability sampling design

Authors


Correspondence

Michael S. Williams, Risk Assessment Division, Office of Public Health Science, Food Safety Inspection Service, USDA, 2150 Centre Avenue, Building D, Fort Collins, CO 80526, USA. E-mail: mike.williams@fsis.usda.gov

Abstract

Aims

The fitting of statistical distributions to microbial sampling data is a common application in quantitative microbiology and risk assessment. An underlying assumption of most fitting techniques is that data are collected with simple random sampling, which is often not the case. This study develops a weighted maximum likelihood estimation framework that is appropriate for microbiological samples that are collected with unequal probabilities of selection.

Methods and Results

A weighted maximum likelihood estimation framework is proposed for microbiological samples that are collected with unequal probabilities of selection. Two examples, based on the collection of food samples during processing, are provided to demonstrate the method and highlight the magnitude of biases in the maximum likelihood estimator when data are inappropriately treated as a simple random sample.

Conclusions

Failure to properly weight samples to account for how data are collected can introduce substantial biases into inferences drawn from the data.

Significance and Impact of the Study

The proposed methodology will reduce or eliminate an important source of bias in inferences drawn from the analysis of microbial data. This will also make comparisons between studies and the combination of results from different studies more reliable, which is important for risk assessment applications.

Introduction

Statistical distributions are useful for modelling many different characteristics that describe microbial contamination. These distributions can be used in a quantitative microbial risk assessment or for computing risk-based metrics such as food safety objectives (FSO) and performance objectives (PO) (Whiting et al. 2006). A recent example is the fitting of Poisson-lognormal and Poisson-gamma distributions to data describing Escherichia coli O157:H7 contamination on carcasses across a collection of beef slaughter establishments (Gonzales-Barron and Butler 2011). These distributions could be used in a risk assessment to estimate the number of human illnesses that would be avoided by imposing new production standards on slaughter establishments.

One concern with using data collected across different establishments or producers is that factors, such as the production volume of the establishment, can be related to the characteristic of interest. For example, the degree of automation and the rate at which carcasses move through the slaughter process are often related to the volume of the commodity produced by the establishment. Another concern is that the sampling fraction will vary across establishments, with large establishments having a lower sampling fraction because of the substantial variability in production volumes.

While samples of food commodities are rarely strict probability samples, the variability in sampling fractions across establishments makes the assumption of simple random sampling untenable in many applications. Ignoring the effect of an unequal probability sample design can induce a bias in the estimated parameters of the statistical distribution that describes contamination because the population and sample distributions can be fundamentally different. In many applications, however, it is possible to specify an unequal probability sampling design that is a reasonable approximation.

Methods to account for differences in production volumes are currently used in the computation of simple descriptive statistics, such as the volume-weighted estimators used by the Food Safety and Inspection Service (FSIS) in the United States (FSIS 2009, 2011a). These methods estimate the proportion of test-positive samples for a variety of pathogens in meat and poultry products. The methodology to relate the population and sample probability density functions stems from weighted distribution theory (Patil and Rao 1978; Gove and Patil 1998; Fuller 2009), but few if any studies describe the application of these methods to food-safety applications.

This study applies weighted distribution methods to microbiological contamination data. The proposed solution draws on results from a number of subdisciplines in statistics and quantitative microbiology, such as unequal probability sampling theory and censored data techniques, and develops a general framework to summarize data when an assumption of simple random sampling would not be appropriate. We apply these methods to an artificial data set and use computer simulation to illustrate the performance of these techniques. This approach demonstrates the bias that occurs when data are not appropriately weighted. We also use these methods to construct a probability distribution describing between-slaughter establishment variability in the proportion of Salmonella test-positive samples in beef trim.

Methods

Methodology development begins with the definition of variables that describe the attributes of the population. This analysis aims to use laboratory-testing data to estimate a population-level attribute, such as the proportion of beef carcasses that would test positive for Salmonella or average concentration of Campylobacter (CFU ml−1) in broiler chicken carcass rinse samples. The next step defines a plausible approximation of the unequal probability sample design. While it is not possible to describe all possible unequal probability sampling designs that occur in practice, the collection of samples across multiple establishments is a common occurrence. The sampling methods used by FSIS in the United States will illustrate the approximation.

A goal of fitting a distribution function to data collected at multiple establishments is to provide a model for contamination at one point in the farm-to-table continuum. Parametric distributions that describe the proportion of contaminated samples and levels of contamination are often continuous distributions, but the data are integer valued. The final step defines the numerical methods to fit a distribution to the unequal probability sampling data using the maximum likelihood estimation techniques typically used in food-safety applications (Cochran 1977; Shumway et al. 2002; Shorten et al. 2006; Busschaert et al. 2010).

Population description

Samples of food commodities are regularly collected and tested for microbial contamination by individual producers, regulatory agencies (Lammerding et al. 1988; Altekruse et al. 2006) and industry groups (Hill et al. 2011). In many countries, sample collection and testing is a mandatory requirement for products, such as meat and poultry, to enter commerce. Given these conditions, we will assume there are M establishments (e.g. slaughter plants) producing the commodity of interest and the total production for establishment j is Vj. For example, Vj could be the total number of broiler chickens slaughtered by establishment j. So, the total annual slaughter across all establishments is $V = \sum_{j=1}^{M} V_j$. The number of samples collected from each establishment j will be denoted by nj, with the total number of samples being $n = \sum_{j=1}^{M} n_j$. For an individual sample, indexed by i and collected in establishment j, microbial testing is performed with test outcome yij. The test is often a qualitative screening test that is used to determine if the micro-organism of interest is present in the sample, in which case $y_{ij} \in \{0, 1\}$ (Hill et al. 2011). In some situations, the sample will be subjected to a series of tests, where samples undergo a qualitative screening test to determine if the organism is present in the sample and then samples that test positive undergo enumeration by methods such as direct plating or the most probable number technique (Haas 1989). Additional analyses, such as genotyping and biochemistry, also can be performed. It is difficult to provide a concise definition of yij in all situations, given the variety of analyses that can be performed.

Sample design

The sample design assumes that a sample of establishments is selected for testing, and samples of the commodity of interest are collected from each selected establishment. This is a typical application of two-stage cluster sampling, where establishments represent the clusters (Cochran 1977; Särndal et al. 1992). The sample design for selecting a sample from M establishments will define a first-stage probability of selection,

P(establishment j is selected) = π1j, j = 1, …, M.

The Horvitz–Thompson estimator (Cochran 1977; Fuller 2009) can be used to estimate a population total and is given by

$$\hat{T} = \sum_{j=1}^{m} \frac{\hat{T}_j}{\pi_{1j}} \tag{1}$$

where m is the number of establishments sampled, π1j represents the probability of selecting establishment j and $\hat{T}_j$ is the estimator for the target parameter in establishment j. Note, however, π1j = 1 under the assumption that samples will be collected from all M establishments, as is the case in the FSIS verification program for Escherichia coli O157:H7 in ground beef.

Within an establishment, sampling at regular intervals is typical for many product-pathogen pairs. For example, some large production establishments in the United States will perform testing on every 900 kg bin of beef trim (Bosilevac and Koohmaraie 2008) while Australian exporters (Kiermeier et al. 2011) collect 12 samples from every 700 cartons of beef trim. Regulatory agencies often collect samples, with an example being the FSIS verification testing program for E. coli O157:H7 in ground beef. This program usually collects one sample per week from large production establishments. Sample collection in smaller establishments occurs at lower sampling rates (e.g. monthly).

A common thread to all of these sampling programs is that samples within an establishment are collected at regular intervals, which makes an assumption of systematic or simple random sampling reasonable. Assuming that sampling at regular intervals within establishment j yields nj samples, sample unit (e.g. carcass) i has a second-stage probability of inclusion of $\pi_{2ij} = n_j / V_j$, where Vj is the total number of units (e.g. carcasses) produced. The Horvitz–Thompson estimator (Cochran 1977; Fuller 2009) of the total for the test outcomes in establishment j is

$$\hat{T}_j = \sum_{i=1}^{n_j} \frac{y_{ij}}{\pi_{2ij}} = \frac{V_j}{n_j} \sum_{i=1}^{n_j} y_{ij} \tag{2}$$

For example, if yij = 1 when a carcass tests positive for a pathogen and 0 otherwise, then $\hat{T}_j$ is the estimator of the total number of test-positive carcasses across the entire volume of production in establishment j. Therefore, in the case where yij is binary, the estimator of the proportion of test-positive carcasses is

$$\hat{p}_j = \frac{\hat{T}_j}{V_j} = \frac{1}{n_j} \sum_{i=1}^{n_j} y_{ij} \tag{3}$$

Alternatively, if yij is the microbial count per sample volume, then $\hat{T}_j / V_j$ is the average microbial count per sample volume across all carcasses produced by establishment j.

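The total across all establishments is

$$\hat{T} = \sum_{j=1}^{m} \frac{1}{\pi_{1j}} \frac{V_j}{n_j} \sum_{i=1}^{n_j} y_{ij} \tag{4}$$

Note, π1j = 1 under the assumption that samples will be collected from all establishments, so the population total for y is estimated by

$$\hat{T} = \sum_{j=1}^{M} \frac{V_j}{n_j} \sum_{i=1}^{n_j} y_{ij} \tag{5}$$

and

$$\hat{\bar{y}} = \frac{\hat{T}}{V} = \frac{1}{V} \sum_{j=1}^{M} \frac{V_j}{n_j} \sum_{i=1}^{n_j} y_{ij} \tag{6}$$

is an estimator of the mean.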
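The volume-weighted estimators described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' R code: it assumes that every establishment is sampled (first-stage probability of 1) and that each sampled unit in establishment j has inclusion probability nj/Vj. The function names and the data are hypothetical.

```python
def ht_establishment_total(y, volume):
    """Horvitz-Thompson total for one establishment.

    Assumes every unit has inclusion probability n_j / V_j, so each
    sampled unit stands for V_j / n_j units of production.
    """
    return (volume / len(y)) * sum(y)


def ht_population_mean(samples, volumes):
    """Volume-weighted mean across establishments (all establishments sampled)."""
    total = sum(ht_establishment_total(y, v) for y, v in zip(samples, volumes))
    return total / sum(volumes)


# Hypothetical presence/absence results (1 = test positive) and volumes.
samples = [[0, 0, 1, 0], [0, 1, 1, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]]
volumes = [1000, 5000, 20000]

weighted = ht_population_mean(samples, volumes)
unweighted = sum(sum(y) for y in samples) / sum(len(y) for y in samples)
print(f"weighted: {weighted:.4f}, unweighted: {unweighted:.4f}")
```

With these made-up data the weighted proportion is 4250/26000 (about 0.16), while pooling all 19 samples without weights gives 4/19 (about 0.21), because the small establishment with the highest positive rate is over-represented in the pooled sample.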

Weighted maximum likelihood estimation

Assume the yi values are realizations derived from the population-level probability distribution, so y ~ f(y; θ) with θ being the parameter vector for the distribution. Weighted distribution theory traces its development to environmental sampling applications where, rather than using simple random sampling, elements are selected with probability proportional to a power transformation of the variable y (i.e. y^ξ). The ξ value is a power transformation parameter similar to the family of Box and Cox transformations (Box and Cox 1964). So, for example, an ξ value of 0.5 is a square root transformation.

Let $\pi(y_i) \propto y_i^{\xi}$ represent the probability of inclusion for population element i based on the power transformation. When y ~ f(y; θ) describes the underlying population-level contamination distribution, the distribution describing samples selected with unequal probabilities is f*(y; θ), where f* is referred to as the size-biased sampling distribution. These two distributions are related by

$$f^{*}(y; \theta) = \frac{\pi(y)\, f(y; \theta)}{E[\pi(y)]} \tag{7}$$

where the term E[π(y)] serves as a normalizing constant. Many distributions have the property of being form invariant (i.e. both f(y; θ) and f *(y; θ) belong to the same family of distributions) with examples being the Weibull, lognormal, beta, gamma and exponential (Patil and Ord 1976; Gove and Patil 1998).
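As a worked illustration of form invariance, consider the lognormal case. This is a sketch under the assumption that π(y) ∝ y^ξ and that f is lognormal with log-scale parameters µ and σ; the result is a standard property of the lognormal rather than a formula taken from this paper. The size-biased density is again lognormal, with the log-scale mean shifted by ξσ² and the log-scale standard deviation unchanged:

```latex
\begin{aligned}
f^{*}(y;\mu,\sigma)
  &= \frac{y^{\xi}\, f(y;\mu,\sigma)}{E[y^{\xi}]},
  \qquad E[y^{\xi}] = \exp\!\left(\xi\mu + \tfrac{1}{2}\xi^{2}\sigma^{2}\right), \\
  &= \frac{1}{y\,\sigma\sqrt{2\pi}}
     \exp\!\left(-\frac{\left(\log y - (\mu + \xi\sigma^{2})\right)^{2}}{2\sigma^{2}}\right),
\end{aligned}
```

so that log y ~ N(µ + ξσ², σ²) under f*. Fitting an unweighted lognormal to such a sample would therefore recover µ + ξσ² rather than µ.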

When samples are collected from different establishments, sample elements are selected with inclusion probability π(Vj), where Vj and yij are possibly related. In this case, the sampling distribution is

display math(8)

This application differs from size-biased distribution theory in that f(y; θ) and f *(y; θ) need not belong to the same family of distributions, so the analyst selects one or more candidate distributions that are appropriate for the application.

Under the assumption of independence between sample elements ij and ij′, the parameter vector of f(y; θ) can be estimated by maximizing the weighted log-likelihood function. The likelihood function mimics a Horvitz–Thompson estimator and is given by (Fuller 2009)

display math(9)

Note, however, that the full likelihood

display math(10)

must be maximized if a numerical approximation of the variance–covariance matrix is to be derived by inverting the Hessian.
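The idea of a Horvitz–Thompson-style weighted log-likelihood can be sketched as follows. This is an illustration in Python rather than the authors' R code, and it rests on two assumptions that are not taken from the displayed equations: each observation is weighted by the inverse of an inclusion probability nj/Vj, and the candidate distribution is an uncensored lognormal, for which the weighted maximizer has a closed form (the weighted mean and standard deviation of log y). The function names and data are hypothetical.

```python
import math


def lognormal_logpdf(y, mu, sigma):
    """Log-density of a lognormal with log-scale mean mu and sd sigma."""
    z = (math.log(y) - mu) / sigma
    return -math.log(y * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z


def weighted_loglik(mu, sigma, samples, volumes):
    """Sum of log-densities, each weighted by V_j / n_j (assumed weights)."""
    total = 0.0
    for y_j, v_j in zip(samples, volumes):
        w_j = v_j / len(y_j)
        total += w_j * sum(lognormal_logpdf(y, mu, sigma) for y in y_j)
    return total


def weighted_lognormal_mle(samples, volumes):
    """Closed-form maximizer of weighted_loglik for uncensored data."""
    pairs = [(v / len(ys), math.log(y)) for ys, v in zip(samples, volumes) for y in ys]
    w_sum = sum(w for w, _ in pairs)
    mu = sum(w * x for w, x in pairs) / w_sum
    var = sum(w * (x - mu) ** 2 for w, x in pairs) / w_sum
    return mu, math.sqrt(var)


# Hypothetical concentrations from a small and a large establishment.
samples = [[0.2, 0.5, 0.4], [0.8, 1.5, 1.1]]
volumes = [1000, 9000]

mu_w, sigma_w = weighted_lognormal_mle(samples, volumes)
# Setting each "volume" equal to n_j gives unit weights, i.e. the unweighted fit.
mu_u, sigma_u = weighted_lognormal_mle(samples, [len(s) for s in samples])
print(round(mu_w, 3), round(mu_u, 3))
```

In this toy example the large establishment has higher concentrations, so the weighted log-scale mean (about -0.02) sits well above the unweighted one (about -0.49). Rescaling all weights by a constant leaves the point estimates unchanged but does change the curvature of the log-likelihood, which is why the scaling of the weights matters when the Hessian is used for variance estimation.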

In this application, the data are observed in clusters, and it is reasonable to assume that observations within a cluster could be correlated, while observations between clusters (e.g. establishments) are uncorrelated. The data generated from these types of sampling applications are referred to as cluster-correlated data (Williams 2000). Fuller (2009) provides additional results and points out that it is common to accept the assumption of independence between sample elements, as was carried out in this application. While it is possible to model this correlation using maximum likelihood estimation, the additional complexity usually does little to improve the fit of the distribution (Williams and Reich 1997). It is likely that the number of observations from each establishment will not be sufficient to estimate the parameter(s) describing within-cluster correlation (Shumway et al. 2002). We justify ignoring the cluster correlation by noting that the systematic nature of sampling in many food-safety applications (e.g. the selection of one sample a week from each establishment) makes strong dependency unlikely, it alleviates concerns regarding seasonal patterns in pathogen occurrence and it likely increases the precision of the maximum likelihood estimator.

Examples

The following two examples demonstrate the application of the weighted maximum likelihood method. Computer code, written in the R programming language, is available from the authors.

Example 1: simulation study

The first example is intended to test the proposed weighted maximum likelihood method with a small artificial population. It demonstrates the magnitude of the bias in the parameter estimates that occurs when the sample is treated as an equal probability sample. The population mimics sampling from M = 3 establishments that have different production volumes and levels of a microbial contaminant. The production volume of the three establishments is set at V1 = 200000, V2 = 400000, V3 = 600000. The contamination distribution for each establishment is yij ~ lognormal(µj, σj), with µ1 = −1.0, µ2 = −0.5, µ3 = 0.0 and σ1 = 0.4, σ2 = 0.25, σ3 = 0.4. To mimic testing for microbial contamination, each yij observation that was less than a limit of detection of 1 is treated as a censored observation (a nondetect), which results in only 26% of samples having concentrations above the detection limit (i.e. simulated samples will be heavily censored). The mean and standard deviation of a mixture of the three distributions are calculated on the log scale using

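$$\mu_{true} = \sum_{j=1}^{3} w_j \mu_j \tag{11}$$

and

$$\sigma_{true} = \sqrt{\sum_{j=1}^{3} w_j \left(\sigma_j^{2} + \mu_j^{2}\right) - \mu_{true}^{2}} \tag{12}$$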

with wj = Vj/V. The resulting values are µtrue = −0.333 and σtrue = 0.516.
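The reported population values can be checked directly from the stated establishment parameters. The short Python sketch below assumes the usual moment formulas for a finite mixture, applied on the log scale with volume weights wj = Vj/V, and also computes the share of the population above the limit of detection; the variable names are illustrative.

```python
import math

volumes = [200000, 400000, 600000]   # V_j
mus = [-1.0, -0.5, 0.0]              # log-scale means
sigmas = [0.4, 0.25, 0.4]            # log-scale standard deviations
lod = 1.0                            # limit of detection

V = sum(volumes)
w = [v / V for v in volumes]

# Mixture mean and standard deviation on the log scale.
mu_true = sum(wj * m for wj, m in zip(w, mus))
second_moment = sum(wj * (s ** 2 + m ** 2) for wj, m, s in zip(w, mus, sigmas))
sigma_true = math.sqrt(second_moment - mu_true ** 2)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Volume-weighted fraction of the population above the limit of detection.
frac_above = sum(
    wj * (1.0 - normal_cdf((math.log(lod) - m) / s))
    for wj, m, s in zip(w, mus, sigmas)
)

print(round(mu_true, 3), round(sigma_true, 3), round(frac_above, 2))
# prints: -0.333 0.516 0.26
```

The output agrees with the values quoted in the text, including the figure of roughly 26% of the population lying above the detection limit.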

The aggregated population of yij values from a mixture of three lognormal distributions is not necessarily lognormally distributed, but it is assumed that a single lognormal distribution (i.e. yij ~ lognormal(µpop, σpop), j = 1, 2, 3; i = 1, …, Vj) will adequately describe contamination across all establishments. Although such an assumption is typical in many applications because sample sizes are often insufficient to fit individual plant-level distributions (Williams et al. 2010; Gonzales-Barron and Butler 2011), the application to this example with just three establishments is primarily illustrative (i.e. the concept of an overarching industry distribution is best understood if the industry comprises many more than three establishments). The histogram in Fig. 1 demonstrates the shape of the distribution and suggests that a lognormal distribution would be a reasonable candidate for fitting.

Figure 1.

Representations of the distributions in Example 1. The histogram is derived from 120 000 simulated concentration values, with each generated from one of three different lognormal distributions. The lines represent the fits of a lognormal distribution to unequal probability samples from the population. Sample observations that were less than a limit of detection of 1 (vertical line) were treated as censored observations. (—) Weighted maximum likelihood estimates (MLE); (--) Unweighted MLE.

Maximum likelihood estimates (MLE) were computed using the optim routine in the R software package (R Development Core Team 2011). The population-level parameter vector $\hat{\theta}_{pop} = (\hat{\mu}_{pop}, \hat{\sigma}_{pop})$ was calculated by maximizing the censored data likelihood function (Shorten et al. 2006). This function is given by

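$$L(\theta) = \prod_{j=1}^{M} \prod_{i=1}^{V_j} f(y_{ij}; \theta)^{\delta_{ij}} \, F(\mathrm{LOD}; \theta)^{1 - \delta_{ij}} \tag{13}$$

where

$$\delta_{ij} = \begin{cases} 1 & \text{if } y_{ij} \ge \mathrm{LOD} \\ 0 & \text{otherwise} \end{cases}$$

with f() and F() being the probability density and cumulative distribution functions for the lognormal distribution, respectively.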
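A censored-data log-likelihood of this kind can be written compactly. The Python sketch below is an assumption-laden illustration rather than the authors' R implementation: detected values contribute the lognormal log-density, nondetects contribute the log of the cumulative probability below the limit of detection, and an optional list of weights (hypothetical argument) multiplies each contribution.

```python
import math


def censored_lognormal_loglik(mu, sigma, values, lod, weights=None):
    """Log-likelihood for lognormal data left-censored at lod.

    values  : observed concentrations; anything below lod is a nondetect
    weights : optional per-observation weights (defaults to 1 for all)
    """
    if weights is None:
        weights = [1.0] * len(values)
    # log F(lod): probability that an observation falls below the limit.
    z_lod = (math.log(lod) - mu) / sigma
    log_cdf_lod = math.log(0.5 * math.erfc(-z_lod / math.sqrt(2.0)))
    total = 0.0
    for y, w in zip(values, weights):
        if y < lod:
            total += w * log_cdf_lod
        else:
            z = (math.log(y) - mu) / sigma
            total += w * (-math.log(y * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z)
    return total


# Hypothetical sample with three nondetects at a limit of detection of 1.
values = [0.3, 0.7, 1.4, 2.1, 0.5]
print(censored_lognormal_loglik(-0.3, 0.5, values, lod=1.0))
```

Because nondetects enter only through F(LOD), their recorded values are irrelevant; only their number (or total weight) affects the fit.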

The simulation study selected samples from the population, with the number of samples selected from each establishment being nj = 50, j = 1, 2, 3. The sampling process was repeated 40 000 times. For each simulated sample, a lognormal distribution was fitted to the data using the unweighted maximum likelihood model (Shorten et al. 2006)

display math(14)

as well as the weighting adjustments in equation (1). The optim routine in R provides an estimate of the parameter vector $\hat{\theta}_{*}$, * = unweighted, weighted, as well as a numerical approximation of the Hessian at $\hat{\theta}_{*}$. The variance–covariance matrix is estimated by inverting the Hessian.
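A single run of this kind of comparison can be sketched in Python. This is not a replication of the simulation study: it draws one large sample (2000 per establishment rather than 50, a choice made here so that one run is stable), assumes weights of Vj/nj, and uses a coarse grid search in place of R's optim. All function names are hypothetical.

```python
import math
import random

LOG_LOD = math.log(1.0)                  # limit of detection of 1, log scale
VOLUMES = [200000, 400000, 600000]       # V_j from Example 1
MUS = [-1.0, -0.5, 0.0]                  # log-scale means
SIGMAS = [0.4, 0.25, 0.4]                # log-scale standard deviations


def draw_sample(n_per_plant, rng):
    """Draw log-concentrations; each carries the assumed weight V_j / n_j."""
    obs = []
    for v, m, s in zip(VOLUMES, MUS, SIGMAS):
        obs.extend((rng.gauss(m, s), v / n_per_plant) for _ in range(n_per_plant))
    return obs


def summarize(obs, use_weights):
    """Reduce the sample to the sums the censored log-likelihood depends on."""
    w_cens = w_det = sx = sxx = 0.0
    for x, w in obs:
        if not use_weights:
            w = 1.0
        if x < LOG_LOD:                  # nondetect: only its weight matters
            w_cens += w
        else:
            w_det += w
            sx += w * x
            sxx += w * x * x
    return w_cens, w_det, sx, sxx


def loglik(mu, sigma, stats):
    """Censored lognormal log-likelihood, dropping terms free of mu and sigma."""
    w_cens, w_det, sx, sxx = stats
    log_cdf = math.log(0.5 * math.erfc((mu - LOG_LOD) / (sigma * math.sqrt(2.0))))
    ss = sxx - 2.0 * mu * sx + mu * mu * w_det
    return w_cens * log_cdf - w_det * math.log(sigma) - ss / (2.0 * sigma * sigma)


def grid_fit(stats):
    """Coarse grid search standing in for a numerical optimizer."""
    best = (-math.inf, None, None)
    for i in range(-100, 21):            # mu from -1.00 to 0.20
        for k in range(20, 101):         # sigma from 0.20 to 1.00
            mu, sigma = i / 100.0, k / 100.0
            ll = loglik(mu, sigma, stats)
            if ll > best[0]:
                best = (ll, mu, sigma)
    return best[1], best[2]


rng = random.Random(1)
obs = draw_sample(2000, rng)
mu_u, sigma_u = grid_fit(summarize(obs, use_weights=False))
mu_w, sigma_w = grid_fit(summarize(obs, use_weights=True))
print("unweighted:", mu_u, sigma_u)
print("weighted:  ", mu_w, sigma_w)
```

The weighted fit should land near the population mean of about -0.33, while the unweighted fit is pulled towards the less contaminated, over-sampled establishments. A grid search gives point estimates only; it provides no Hessian, so it cannot be used for the variance estimates discussed in the text.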

The first comparison between the unweighted and weighted MLE methods is an assessment of the bias for the parameters $\hat{\mu}_{*}$ and $\hat{\sigma}_{*}$, * = pop, weighted, unweighted. Using the unweighted MLE as an example, the per cent bias is expressed as

$$\text{per cent bias} = 100 \times \frac{\bar{\hat{\mu}}_{unweighted} - \mu_{true}}{\mu_{true}} \tag{15}$$

where $\bar{\hat{\mu}}_{unweighted}$ is the average of the $\hat{\mu}_{unweighted}$ values across the samples drawn from the population. In addition, the performance of 95% confidence intervals was assessed, where the intervals were constructed using the variance estimates provided by the information matrix for each method (* = weighted, unweighted). For each iteration of the simulation, a test was performed to determine if the confidence intervals contained the µtrue and/or σtrue parameters. The proportion of times the confidence interval covered the population parameters was reported.
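The two performance measures can be sketched as small Python helpers. The per cent bias follows the description above; the coverage calculation assumes normal-theory (Wald) intervals of the form estimate ± 1.96 standard errors, which is an assumption about how the intervals were built. The function names and the illustrative inputs are hypothetical.

```python
def percent_bias(estimates, true_value):
    """Per cent bias of the average estimate relative to the true value."""
    mean_est = sum(estimates) / len(estimates)
    return 100.0 * (mean_est - true_value) / true_value


def coverage(estimates, std_errors, true_value, z=1.96):
    """Percentage of Wald intervals (estimate +/- z * SE) containing true_value."""
    hits = sum(
        1
        for est, se in zip(estimates, std_errors)
        if est - z * se <= true_value <= est + z * se
    )
    return 100.0 * hits / len(estimates)


# Using the rounded averages reported in Table 1 as a check on the definition.
print(round(percent_bias([-0.517], -0.333), 1))   # unweighted, about 55
print(round(percent_bias([-0.327], -0.333), 1))   # weighted, about -2

# Three hypothetical simulation iterations with a common standard error.
print(coverage([-0.50, -0.30, -0.35], [0.05, 0.05, 0.05], -0.333))
```

Applied to the rounded table entries, the helper gives roughly 55% for the unweighted estimator and roughly -2% for the weighted one, in line with the reported 55.17 and -1.91 (small differences come from rounding of the inputs).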

Example 2: beef trim case study

Samples collected from a group of establishments are often tested only for the presence/absence of the pathogen in question (Hill et al. 2011). For some food-safety applications, interest lies in estimating the proportion of product produced across all establishments whose proportion of test-positive samples exceeds some threshold value. For example, FSIS sets limits on the number of Salmonella-positive carcass rinse samples collected from broiler chicken slaughter establishments, with the current limit being a total of s = 5 positive samples in a set of nj = 51 samples (FSIS 2011b). FSIS also conducts surveys of slaughter establishments, referred to as ‘baseline’ surveys to assess overall industry performance. A recent example is the FSIS baseline survey for Salmonella and Escherichia coli O157:H7 in beef trimmings (FSIS 2011a). This survey collected 1719 beef trim samples from 157 establishments that produced beef trimmings.

Estimating the proportion of test-positive samples with the beef trim baseline survey data is problematic because of the small number of samples collected in each plant (nj = 11 on average, min (nj) = 1 and max (nj) = 25), the low number of test-positive samples inline image, and a total annual production volume of establishments that ranges from roughly 10 000 kg to 100 000 000 kg. This leads to two problems. First, few establishments have any test-positive samples, so a simple empiric estimate of the proportion of test-positive samples is 0 for the majority of establishments. Second, the small number of samples in each plant leads to a very high estimated prevalence when any samples are positive. For example, one plant had sj = 1 positive out of nj = 3 samples, which leads to an empiric prevalence estimate of sj/nj = 1/3 ≈ 0.33.

This analysis uses the limited integer count data to generate a distribution function that describes the proportion of trim volume produced as a function of the underlying test-positive sample fraction. Such a distribution would allow an analyst to determine what percentage of all beef trim produced has a fraction of test-positive samples in excess of some specified value. The analyst could then determine what proportion of beef trim is likely to be affected by a new regulation intended to reduce contamination in establishments that exceed the specified test-positive sample threshold.

A generalized (or mixture) distribution model is used to estimate the distribution of test-positive fractions from the integer count data (Mood et al. 1974). The desired form of the mixture distribution contains a parameter θj to describe the proportion of Salmonella test-positive trim samples in establishment j, so that p(sj | θj) describes the probability of observing sj positive samples in a set of nj samples from an establishment with an underlying probability of a positive sample θj. The θ parameter varies across the population of establishments, so that θj ~ f(θ; φ). Averaging p(s | θ) across all possible θ values, using the equation $p(s; \varphi) = \int_0^1 p(s \mid \theta)\, f(\theta; \varphi)\, d\theta$, provides the generalized distribution for the integer counts.

For this application, the population size in each plant is large so it is assumed that the integer counts follow a binomial distribution whose probability of a success parameter (θ) varies according to a beta distribution with parameter vector φ = (α, β). This distribution is the beta-binomial distribution, which is also known as a Skellam distribution (Skellam 1948). The probability mass function is

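$$p(s_j \mid n_j, \alpha, \beta) = \binom{n_j}{s_j} \frac{B(s_j + \alpha,\; n_j - s_j + \beta)}{B(\alpha, \beta)} \tag{16}$$

where B represents the beta function (Mood et al. 1974).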
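The beta-binomial probability mass function is easy to evaluate with log-gamma functions, which avoids overflow in the beta and binomial terms. The Python sketch below uses only the standard library; the function names are illustrative and the parameter values in the example are simply of the same order as those reported later in the paper.

```python
import math


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(s, n, alpha, beta):
    """P(s positives in n samples) when theta ~ Beta(alpha, beta)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(s + 1) - math.lgamma(n - s + 1)
    return math.exp(log_choose + log_beta(s + alpha, n - s + beta) - log_beta(alpha, beta))


# Probability of 0, 1, 2, ... positives in a set of 11 samples.
n = 11
probs = [beta_binomial_pmf(s, n, 1.06, 115.0) for s in range(n + 1)]
print([round(p, 4) for p in probs[:3]], round(sum(probs), 6))
```

Two quick sanity checks are that the probabilities sum to 1 over s = 0, …, n and that α = β = 1 reduces to a discrete uniform distribution on 0, …, n.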

The optim routine in R provides maximum likelihood estimates $\hat{\alpha}_{*}$ and $\hat{\beta}_{*}$ for the weighted and unweighted likelihood functions.

Interest lies in estimating the proportion of test-positive samples, given by

$$\bar{\theta}_{*} = \frac{\hat{\alpha}_{*}}{\hat{\alpha}_{*} + \hat{\beta}_{*}} \tag{17}$$

which is the mean of the beta distribution, where * = unweighted, weighted indicates the MLE method. A comparison of the shape of the beta distributions is also provided.

Results

Example 1 results and interpretation

The histogram in Fig. 1 illustrates the population distribution (i.e. both censored and uncensored elements). The smoothed lines are the lognormal probability density functions with parameters $\hat{\mu}_{*}$ and $\hat{\sigma}_{*}$, * = weighted, unweighted. The lognormal probability density function with parameters derived from the weighted MLE method (i.e. $\hat{\mu}_{weighted}$ and $\hat{\sigma}_{weighted}$) shows very good agreement with the population histogram, even though the majority of the mass lies below the limit of detection. In contrast, the lognormal probability density function derived via the unweighted MLE method is substantially left shifted from the true distribution.

Table 1 summarizes the parameter estimates derived from the different MLE methods. The discrepancies between the estimated $\hat{\mu}_{unweighted}$ and $\hat{\sigma}_{unweighted}$ values compared with population parameters µtrue and σtrue illustrate the bias in the parameter estimators due to ignoring the unequal probabilities of selection. The biases of roughly 55 and 8%, respectively, are substantial and lead to confidence intervals that include the true value at a much lower frequency than their nominal coverage value of 95%. Biases between −1.9 and −2.4% are reported for $\hat{\mu}_{pop}$, $\hat{\sigma}_{pop}$, $\hat{\mu}_{weighted}$ and $\hat{\sigma}_{weighted}$. That these four estimators exhibit a similar bias suggests that the bias is due to the minor discrepancy between the population distribution (i.e. the mixture of three lognormal distributions) and a true lognormal distribution.

Table 1. Summary statistics for a simulation study where weighted and unweighted maximum likelihood estimators are applied to unequal probability sampling data. The subscripted a indicates the data type and fitting methods

| Parameter | True | Population | Unweighted MLE | Weighted MLE |
|---|---|---|---|---|
| µ | −0.333 | −0.325 | −0.517 | −0.327 |
| inline image | NA | NA | 0.107 | 0.087 |
| σ | 0.516 | 0.506 | 0.555 | 0.503 |
| inline image | NA | NA | 0.083 | 0.076 |
| Per cent bias, µ | NA | −2.38 | 55.17 | −1.91 |
| Per cent bias, σ | NA | −1.86 | 7.51 | −2.44 |
| Achieved 95% confidence interval coverage, µ | NA | NA | 40.65 | 91.92 |
| Achieved 95% confidence interval coverage, σ | NA | NA | 89.24 | 89.50 |

The NA values indicate that a statistic is not applicable. For example, the true µ value is a constant so there can be no confidence interval.

The coverage rates for the confidence intervals derived from the weighted MLE were somewhat less than the nominal 95% value. Further investigation indicates that increasing the sample size from 150 to 450 samples improved the confidence interval coverage rates derived for the weighted MLE method to 93.5 and 91%, which demonstrates that the asymptotic properties of the MLE method have not been satisfied at the smaller sample size (Mood et al. 1974). This increase in sample size had no appreciable effect on the inline image, but it did magnify the effect of the bias in the $\hat{\mu}_{unweighted}$ and $\hat{\sigma}_{unweighted}$ parameters by reducing the coverage rates of the 95% confidence intervals to 2 and 76%, respectively. This occurs because the increased sample size reduces the width of the confidence interval about the very biased $\hat{\mu}_{unweighted}$ such that its confidence interval is very narrow and rarely contains µtrue.

Example 2 results and interpretation

The parameter estimates were $\hat{\alpha}_{weighted}$ = 1.06, $\hat{\beta}_{weighted}$ = 115.0, $\hat{\alpha}_{unweighted}$ = 1.49 and $\hat{\beta}_{unweighted}$ = 115.0. The estimated proportion of samples that would test positive was $\bar{\theta}_{weighted}$ = 0.0092 and $\bar{\theta}_{unweighted}$ = 0.0128. While both estimates are quite low, the unweighted estimate is roughly 40% larger than the weighted estimate. The resulting shapes of the beta distributions are shown in Fig. 2, where the distribution derived from the unweighted MLE is right shifted relative to the weighted MLE distribution.

Figure 2.

Beta distributions describing the proportion of test-positive beef trim samples derived using weighted and unweighted maximum likelihood estimation. The vertical lines are the mean values. (image) Weighted maximum likelihood estimates (MLE); (image) Unweighted MLE.

To illustrate the effect of such bias in the estimated distribution in a risk assessment application, consider the following hypothetical scenario. Suppose a regulatory agency plans to impose regulations intended to reduce the contamination of beef trim so that the test-positive proportion (θ) produced by all establishments would be < 0.02. A risk assessment could use the cumulative distribution for the beta distribution to assess the proportion of all product produced that is likely to be affected by the new regulation. This value is given by

$$P(\theta > 0.02) = \int_{0.02}^{1} f(\theta; \hat{\alpha}_{*}, \hat{\beta}_{*})\, d\theta \tag{18}$$

where $f(\theta; \hat{\alpha}_{*}, \hat{\beta}_{*})$ is the beta probability density function and * = weighted, unweighted indicates the method of parameter estimation. The estimated values are 0.198 for the unweighted method and 0.109 for the weighted method. This example demonstrates that improper weighting of the data would nearly double the estimated proportion of beef trim production that would be affected. If the regulation aims to reduce contamination in the affected product, then it would be reasonable to expect that an estimate of the reduction in illnesses, derived from the unweighted parameters, would be roughly twice the true number.
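The reported means and exceedance proportions can be approximately reproduced from the rounded parameter estimates. The Python sketch below is an independent check, not the authors' code: it evaluates the beta density with log-gamma functions and integrates the upper tail with Simpson's rule, standing in for a library routine for the beta cumulative distribution. Because the inputs are rounded, agreement with the published figures is only approximate. Function names are hypothetical.

```python
import math


def beta_pdf(theta, a, b):
    """Beta density; returns 0 at the endpoints, which is valid for a > 1 and b > 1."""
    if theta <= 0.0 or theta >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(theta) + (b - 1.0) * math.log1p(-theta))


def tail_prob(threshold, a, b, steps=20000):
    """P(theta > threshold) by Simpson's rule on [threshold, 1]; steps must be even."""
    h = (1.0 - threshold) / steps
    total = beta_pdf(threshold, a, b) + beta_pdf(1.0, a, b)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * beta_pdf(threshold + k * h, a, b)
    return total * h / 3.0


# Rounded estimates as reported in the text (alpha, beta).
weighted = (1.06, 115.0)
unweighted = (1.49, 115.0)

mean_w = weighted[0] / sum(weighted)
mean_u = unweighted[0] / sum(unweighted)
tail_w = tail_prob(0.02, *weighted)
tail_u = tail_prob(0.02, *unweighted)

print(f"means: weighted {mean_w:.4f}, unweighted {mean_u:.4f}")
print(f"P(theta > 0.02): weighted {tail_w:.3f}, unweighted {tail_u:.3f}")
```

The means come out at about 0.009 and 0.013, and the exceedance proportion for the unweighted parameters is close to twice that for the weighted parameters, consistent with the comparison drawn in the text.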

Discussion

Reliable estimation of a distribution describing microbial contamination requires consideration of how samples are collected. Failure to account for unequal sampling rates can lead to substantial biases in the parameter estimates in situations where factors such as production and processing practices can differ as a function of establishment production volume.

Sample collection for food-safety applications rarely follows a strict probability sampling design because enumeration of the elements to construct a sampling frame is impractical. The motivation for regulatory sampling rarely stems from a formal sampling design. It is common for regulators to want to sample all establishments to provide regulatory supervision across all production. Regulatory samples are often collected using a pseudo-random or systematic approach. Although this approach may not be ideal, if sample selection is not intentionally purposive [i.e. sample selection is noninformative in the sense the sample design is not dependent on the y values in the sample (Särndal 1978)], then inferences based on a model (e.g. an assumed distribution) are appropriate (Gregoire 1998).

This study examines how variability in production volume can influence inference, but other factors also should be considered when designing a study or analysing data. One particular concern is the possibility of temporal patterns in pathogen occurrence and levels. For example, numerous studies have shown seasonal patterns in E. coli O157:H7 and Salmonella, with the peaks occurring during the summer (Hill et al. 2011). Spreading the sampling effort across seasons in each establishment ensures that seasonal patterns are adequately captured. A further benefit is that a systematic design across seasons should improve the precision of the estimated distribution.

As is common with many sampling and estimation problems, there is interest in determining guidelines for the optimal allocation of samples within and across establishments (clusters) and/or strata. Formulas for optimal allocation are available for many sampling designs (Cochran 1977). When sampling costs are not considered, a general rule is that sampling efforts should be focused on the segments of the population that exhibit the highest degree of variability, which in many of the applications considered here would be establishments with the highest levels of contamination. Nevertheless, these optimal allocation results are based on minimizing the variance of a simple univariate estimator, such as the population mean or total (Cochran 1977). These results may not be directly applicable for the MLE application considered here because the distribution function has multiple parameters (e.g. inline image), and these parameters are correlated. For this application, finding an optimal allocation of sample resources would require the minimization of the determinant of the information matrix (e.g. inline image) with respect to different sample allocations. To our knowledge, this application of weighted distribution theory has received only limited attention (Patil 2002) and any solution would generally be unique to each distribution function (i.e. an optimal allocation for the gamma distribution would differ from the one for the lognormal).

While optimal allocations of sampling resources may not be possible, other practical problems should be considered. For example, the MLE in Example 1 is sensitive to the proportion of samples that exceed the LOD of the assay. A summarization of the performance of the MLE with censored data (Helsel 2005) suggests that a distribution function cannot be reliably fitted to a dataset where fewer than 20% of samples are enumerated unless sample sizes are very large. Given that few food commodities have levels of contamination that would lead to 20% of samples being test positive, an analyst planning a survey should focus sampling efforts on segments of the population where contamination is most likely to occur.

While this study looks specifically at the common problem of collecting samples from establishments with different production volumes, other factors could lead to the selection of samples with unequal probabilities. This study should encourage analysts to consider how the sample was collected and to at least attempt an approximation of the inclusion probabilities. If parameter estimates derived from the weighted and unweighted maximum likelihood methods are similar, then concerns of bias in the parameter estimates are lessened.
