Are transient environmental agents involved in the cause of primary biliary cirrhosis? Evidence from space–time clustering analysis


  • Potential conflict of interest: Nothing to report.


The cause of primary biliary cirrhosis (PBC) is unclear. Both genetic and environmental factors are likely to contribute. Some studies have suggested that one or more infectious agents may be involved. To examine whether infections may contribute to the cause of PBC, we have analyzed for space–time clustering using population-based data from northeast England over a defined period (1987–2003). Space–time clustering is observed when excess cases of a disease are found within limited geographical areas at limited periods of time. If present, it is suggestive of the involvement of one or more environmental components in the cause of a disease and is especially supportive of infections. A second-order procedure based on K-functions was used to test for global space–time clustering using residential addresses at the time of diagnosis. The Knox method determined the spatiotemporal range over which global clustering was strongest. K-function tests were repeated using nearest neighbor thresholds to adjust for variations in population density. Individual space–time clusters were identified using Kulldorff's scan statistic. Analysis of 1015 cases showed highly statistically significant space–time clustering (P < 0.001). Clustering was most marked for cases diagnosed within 1–4 months of one another. A number of specific space–time clusters were identified. In conclusion, these novel results suggest that transient environmental agents may play a role in the cause of PBC. (HEPATOLOGY 2009.)

The cause of primary biliary cirrhosis (PBC) is unclear.1 Both genetic2, 3 and environmental factors are likely to be involved. Some studies have suggested that certain infectious agents may be involved in the cause. These agents have included Escherichia coli, mycobacteria, and a retrovirus.4–8

If infections contribute to cause, then the distribution of cases may exhibit space–time clustering. However, such space–time clustering would only happen under very specific conditions. First, the infection would not be ubiquitous or endemic. Second, the latent period from exposure to diagnosis would be relatively constant. Third, the onset of PBC would result as a rare consequence of exposure to a common transient environmental agent (for example, an infection) or as a consequence of exposure to a rare transient environmental agent.

Our previous analysis of data from northeast England has found spatial clustering in the occurrence of PBC.9 This supported the involvement of spatially heterogeneous environmental factors in causing PBC but could not distinguish whether static (noninfectious) agents or possibly transmissible (infectious) agents might underlie this spatial clustering.

The current analysis focuses on space–time clustering. Space–time clustering is distinct from spatial clustering and is said to occur when an excess of cases is observed within small geographical areas over limited temporal periods, and this pattern cannot be explained by general excesses in those areas or at those periods.

The aim of the current study was to test predictions of space–time clustering that might arise as a result of a transient environmental (and especially an infectious) origin.


Abbreviations: AMA, antimitochondrial antibody; E, number of cases expected to be in close proximity; ICD, International Classification of Diseases; NN, nearest neighbor; O, number of cases observed to be in close proximity; PBC, primary biliary cirrhosis; S, strength of clustering.

Patients and Methods


Case Definition.

For this study, we included cases defined as either “definite PBC” or “probable PBC” in our original case-finding study.10 “Definite PBC” requires all three of the following: an antimitochondrial antibody (AMA)-positive titer of at least 1 in 40, cholestatic liver blood tests, and diagnostic or compatible liver histology. “Probable PBC” requires any two of these three criteria (usually AMA positivity at a titer of at least 1 in 40 and cholestatic liver blood tests in the absence of liver biopsy). These case definitions have since been widely accepted.1 For this reason, we refer to all patients meeting the above criteria, whether “definite” or “probable,” as “cases.”

Timeframe and Study Area.

The study included all incident cases diagnosed between January 1, 1987 and December 31, 2003 in people resident in an area of northeast England (Northumberland, Sunderland, North Durham, South Durham, Newcastle upon Tyne, North Tyneside, South Tyneside), defined by postal (zip) code. The total population of the area at the 2001 census was less than 2.05 million.11

Case Finding.

The methods have been described previously.12 Briefly, they were as follows:

  1. Requests were made to all gastroenterologists and hepatologists in the region to identify all cases of PBC under their care.
  2. Hospital admission data on Regional Information Systems for all 13 hospitals in the region were examined using the International Classification of Diseases (ICD)-9 code 571.6 (to April 1994) and ICD-10 code K74.3 thereafter.
  3. All hospital immunology laboratory data for patients with positive AMA of 1 in 40 or greater by indirect immunofluorescence were examined (more than 500,000 laboratory records examined).
  4. All listings from the Office for National Statistics of deaths within the region and study period in which PBC, ICD-9 code 571.6, or (subsequently) ICD-10 code K74.3 appeared anywhere on the death certificate were examined.

Case selection was approved by local ethical committees. After initial identification, the hospital records of all cases were reviewed.


Ordnance Survey four-digit grid references (Easting and Northing) were allocated to each case with respect to the centroid of the postcode (zip code) of the residential address at the time of diagnosis. In the United Kingdom there are approximately 1.7 million alphanumeric postcodes. These unique identifiers are used for postal delivery. Typically, a postcode may include approximately 15 to 20 houses, a smaller number of multiple occupancy residences, or a single commercial address.13 Thus, addresses were geo-referenced to the nearest 0.1 km.
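The distance between two geo-referenced cases can be sketched as follows. This is a minimal illustration that assumes one grid-reference unit equals 0.1 km (consistent with the stated resolution) and uses straight-line distance; the function name and the example coordinates are hypothetical and are not study data.

```python
import math

# Assumption: one unit of a four-digit Easting/Northing equals 0.1 km.
GRID_UNIT_KM = 0.1


def distance_km(case_a, case_b):
    """Straight-line distance in km between two (easting, northing) pairs."""
    d_east = case_a[0] - case_b[0]
    d_north = case_a[1] - case_b[1]
    return GRID_UNIT_KM * math.hypot(d_east, d_north)


# Hypothetical postcode centroids, for illustration only.
example_distance = distance_km((4250, 5640), (4253, 5644))
print(example_distance)
```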

Prior Hypotheses

The following etiological hypotheses were tested: (1) a primary factor influencing geographical or temporal heterogeneity of PBC is related to exposure to an infectious or other similar environmental agent occurring close to diagnosis or at similar times before diagnosis; and (2) geographical or temporal heterogeneity in the occurrence of PBC is modulated by differences in patterns of exposure that are related to level of population density.

Statistical Methods

Global space–time clustering was analyzed using a method based on K-functions,14 which may be regarded as a generalization of the Knox test.15 These methods have been used extensively for analyzing spatiotemporal patterning in the distribution of childhood cancer, type 1 diabetes, and congenital anomalies.16–18 The Knox test considers a pair of cases to be in “close proximity” if dates of diagnosis and residential addresses at time of diagnosis are close. The number of pairs of cases observed to be in close proximity is obtained (denoted O), and the number of pairs of cases expected to be in close proximity is calculated (denoted E). If O is greater than E, a formal significance test determines whether there is evidence of space–time clustering. The “strength of clustering” is estimated by calculating the quantity S = ([O − E]/E) × 100.
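The Knox quantities O, E, and S for a single pair of thresholds can be sketched as follows. The text does not state how E is computed; this sketch assumes the usual Knox expectation, E = (pairs close in space × pairs close in time) / (all pairs). The function name and the four example cases are hypothetical.

```python
import math
from itertools import combinations


def knox(cases, s_km, t_years):
    """Return (O, E, S) for cases given as (x_km, y_km, t_years) tuples.

    Assumption: E is the usual Knox expectation,
    (pairs close in space * pairs close in time) / (all pairs).
    """
    n_pairs = n_space = n_time = observed = 0
    for (x1, y1, t1), (x2, y2, t2) in combinations(cases, 2):
        close_space = math.hypot(x1 - x2, y1 - y2) < s_km
        close_time = abs(t1 - t2) < t_years
        n_pairs += 1
        n_space += close_space
        n_time += close_time
        observed += close_space and close_time
    expected = n_space * n_time / n_pairs
    # S is undefined when no pairs are expected to be close.
    strength = (observed - expected) / expected * 100 if expected else float("nan")
    return observed, expected, strength


# Hypothetical cases (not study data): two are close in both space and time.
cases = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.05), (10.0, 10.0, 1.0), (10.2, 10.0, 3.0)]
O, E, S = knox(cases, s_km=0.5, t_years=0.1)
print(O, E, S)
```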

One problem with the Knox test is the arbitrary choice of thresholds. Use of a number of different thresholds would lead to multiple testing. A simplified version of a K-function method was used to partly overcome the arbitrary choice of thresholds and to avoid multiple testing.14 This method involved a set of 225 Knox-type calculations. The boundaries changed over a prespecified set of values (for close in time, t = 0.1, 0.2, …, 1.5 years, and for close in space, s = 0.5, 1, 1.5, …, 7.5 km), giving 15 × 15 = 225 combinations. The observed value of the K-function was calculated, and the unknown distribution of the K-function was simulated for a total of 999 random permutations of time. Each simulation involved a random reallocation of the dates of diagnosis to each of the cases in the analysis, and hence a realization of the K-function was obtained. Statistical significance was determined by comparison of the observed value and the simulated distribution.
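The permutation procedure can be sketched as follows. This is an illustration only, not the FORTRAN 90 routines used in the study. It assumes the summary statistic is a discrete sum of R(s,t) = [K(s,t) − K1(s)K2(t)]/√[K1(s)K2(t)] over the 225 threshold pairs (the form given in the notes to Table 2), and that the one-sided P-value is (r + 1)/(n + 1), where r is the number of simulated values at least as large as the observed value. Function names and example data are hypothetical.

```python
import math
import random
from itertools import combinations

S_GRID = [0.5 * k for k in range(1, 16)]  # 0.5, 1.0, ..., 7.5 km
T_GRID = [0.1 * k for k in range(1, 16)]  # 0.1, 0.2, ..., 1.5 years


def k_statistic(coords, dates):
    """Sum of R(s,t) over the 15 x 15 grid of space and time thresholds."""
    pairs = list(combinations(range(len(coords)), 2))
    dist = [math.dist(coords[i], coords[j]) for i, j in pairs]
    gap = [abs(dates[i] - dates[j]) for i, j in pairs]
    n = len(pairs)
    total = 0.0
    for s in S_GRID:
        k1 = sum(d <= s for d in dist) / n
        for t in T_GRID:
            k2 = sum(g <= t for g in gap) / n
            k = sum(d <= s and g <= t for d, g in zip(dist, gap)) / n
            if k1 > 0 and k2 > 0:  # skip terms where R(s,t) is undefined
                total += (k - k1 * k2) / math.sqrt(k1 * k2)
    return total


def permutation_p_value(coords, dates, n_sim=999, seed=0):
    """One-sided Monte Carlo P-value from random reallocation of dates."""
    rng = random.Random(seed)
    observed = k_statistic(coords, dates)
    shuffled = list(dates)
    n_as_extreme = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        if k_statistic(coords, shuffled) >= observed:
            n_as_extreme += 1
    return (n_as_extreme + 1) / (n_sim + 1)


# Hypothetical data: three tight space-time clusters of four cases each.
centres = [(0.0, 0.0, 0.0), (20.0, 20.0, 3.0), (40.0, 0.0, 6.0)]
offsets = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.02), (0.0, 0.1, 0.04), (0.1, 0.1, 0.06)]
coords = [(cx + dx, cy + dy) for cx, cy, _ in centres for dx, dy, _ in offsets]
dates = [ct + dt for _, _, ct in centres for _, _, dt in offsets]

# Fewer simulations than the 999 used in the study, to keep the example fast.
p_value = permutation_p_value(coords, dates, n_sim=199)
print(p_value)
```

Only the dates are permuted, so the marginal spatial and temporal distributions of the cases are preserved in every simulated data set.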

The K-function method does not give a measure of the size of the clustering effect. Thus, S (derived from the Knox test, with critical values for distance: 0.5, …, 7.5 km and for time: 0.1, …, 1.5 years) determined the critical values at which the effect was most pronounced. ΣS (summed over all 225 combinations of s and t) was calculated to give an overall indication of strength of clustering.

It should be noted that analysis based on a nearest neighbor (NN) approach is likely to be more appropriate when both urban and rural areas are included. To allow for such heterogeneity in population densities, we repeated the K-function analyses, replacing fixed geographical distances by variable distances to the (N − 7)th, …, (N + 7)th NNs. N was chosen such that the mean distance was approximately 5 km and by inspection was found to be N = 26. This is similar to the method that was first suggested by Jacquez.19
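The NN threshold rule can be sketched as follows, using the definition given in the notes to Table 2: two cases are close in space if either lies within the distance to the Nth nearest neighbor of the other. This is a minimal illustration with hypothetical function names and coordinates, and with N = 2 rather than the study's range of 19 to 33.

```python
import math


def nth_nn_distance(coords, n):
    """Distance from each case to its n-th nearest neighbor (self excluded)."""
    result = []
    for i, p in enumerate(coords):
        others = sorted(math.dist(p, q) for j, q in enumerate(coords) if j != i)
        result.append(others[n - 1])
    return result


def close_in_space_nn(coords, i, j, nn_dist):
    """True if case i or case j is within the other's n-th NN distance."""
    d = math.dist(coords[i], coords[j])
    return d <= nn_dist[i] or d <= nn_dist[j]


# Hypothetical layout: four "urban" cases 0.1 km apart, two "rural" cases 10 km apart.
coords = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0), (50.0, 0.0), (60.0, 0.0)]
nn_dist = nth_nn_distance(coords, n=2)

rural_pair_close = close_in_space_nn(coords, 4, 5, nn_dist)  # 10 km apart
urban_pair_close = close_in_space_nn(coords, 0, 3, nn_dist)  # 0.3 km apart
print(rural_pair_close, urban_pair_close)
```

In this layout the rural pair counts as close even though it is 10 km apart, while the urban pair 0.3 km apart does not, which is how the variable threshold adapts to population density.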

The distribution of distances between the 26th NNs was highly skewed, with distances varying from 1.5 to 70.6 km. The median distance between the 26th NNs was 3.3 km. To test whether population density was associated with space–time clustering, cases were split into two groups: 50% allocated to a “more densely populated” group and 50% allocated to a “less densely populated” group, depending on whether the 26th NN was nearer or further away than the median distance (3.3 km) of the 26th NN. Analysis by population density then proceeded by considering clustering pairs that included at least one case from the “more densely populated” category and clustering pairs that included at least one case from the “less densely populated” category.
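The median split and the selection of case pairs can be sketched as follows. The handling of cases whose NN distance equals the median is an assumption (they are placed in the “less densely populated” group here), and the function names and distances are hypothetical.

```python
import statistics


def split_by_density(nn_distances):
    """Split case indices at the median distance to the chosen NN."""
    median = statistics.median(nn_distances)
    # Assumption: ties at the median go to the "less densely populated" group.
    more_dense = [i for i, d in enumerate(nn_distances) if d < median]
    less_dense = [i for i, d in enumerate(nn_distances) if d >= median]
    return median, more_dense, less_dense


def pairs_with_at_least_one(group, n_cases):
    """All case pairs (i, j), i < j, that include at least one case from group."""
    members = set(group)
    return [
        (i, j)
        for i in range(n_cases)
        for j in range(i + 1, n_cases)
        if i in members or j in members
    ]


# Hypothetical NN distances in km for five cases.
nn_distances = [1.5, 2.0, 3.3, 5.0, 70.6]
median, more_dense, less_dense = split_by_density(nn_distances)
dense_pairs = pairs_with_at_least_one(more_dense, len(nn_distances))
print(median, more_dense, less_dense, len(dense_pairs))
```

Because each analysis takes pairs with at least one case from the named group, a pair made up of one case from each group contributes to both analyses.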

Individual clusters were identified using Kulldorff's scan statistic based on a space–time permutation model.20 The complete study area and time period was covered by a cylindrical moving window with variable base and height (centered on each postcode centroid). A maximum of 10% of the entire time period or geographical area could have been included in this cylindrical moving window. This method has previously been applied to data on childhood leukemia and cerebral palsy.21, 22 Using a Bernoulli model, the scan statistic was used to test for differences between levels of population density.23
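The observed and expected counts for a single candidate cylinder can be sketched as follows. This is not SaTScan code. It assumes the usual expectation under a space–time permutation model: (cases in the circular base over the whole study period × cases in the time window over the whole study area) / (total cases). The search over all cylinders and the significance test are omitted. The function name and example cases are hypothetical.

```python
import math


def cylinder_counts(cases, centre, radius_km, t_start, t_end):
    """Return (observed, expected) for one cylinder.

    cases: (x_km, y_km, t_years) tuples.
    Assumption: expected = (cases inside the circular base at any time)
                         * (cases inside the time window anywhere)
                         / (total cases).
    """
    in_space = [math.dist((x, y), centre) <= radius_km for x, y, _ in cases]
    in_time = [t_start <= t <= t_end for _, _, t in cases]
    observed = sum(a and b for a, b in zip(in_space, in_time))
    expected = sum(in_space) * sum(in_time) / len(cases)
    return observed, expected


# Hypothetical cases (not study data): four lie near (0, 0) during early 2003.
cases = [
    (0.0, 0.0, 2003.0), (0.5, 0.0, 2003.1), (0.0, 0.5, 2003.1), (0.5, 0.5, 2003.2),
    (0.2, 0.2, 1990.0), (30.0, 30.0, 2003.1), (30.0, 31.0, 1995.0),
    (40.0, 10.0, 1998.0), (15.0, 45.0, 2001.0), (50.0, 50.0, 1988.0),
]
observed, expected = cylinder_counts(cases, (0.0, 0.0), 1.0, 2003.0, 2003.5)
print(observed, expected, observed / expected)
```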

The analyses were based on space–time interactions between date and place of diagnosis. Overall K-function and Knox analyses were performed using routines written in FORTRAN 90. Analysis using Kulldorff's scan statistic was implemented using SaTScan v7.0.24

Statistical significance, assessed using one-sided tests and 999 simulations (for both the K-function analyses and the scan statistic), was indicated if P < 0.05.


Results

The population-based register contained details of 1032 cases of PBC who were diagnosed during the period January 1, 1987 to December 31, 2003 within the specified region. However, 17 had missing or incomplete addresses at the time the diagnosis of PBC was first made12 or had entered the region after diagnosis or were living outside the study area and so were excluded from analyses. The study analyzed 1015 cases of PBC that had complete data available (916 female cases, 99 male cases, 509 cases with a residential address in a “more densely populated area,” and 506 cases with a residential address in a “less densely populated area”). Table 1 shows the number of subjects initially identified by each case finding method in the two parts of the study (1987–1994 and 1995–2003).

Table 1. Methods of Patient Detection

                            Part 1: 1987–1994    Part 2: 1995–2003
Total included              468                  547
Clinician identification    215 (46%)            344 (63%)
Hospital data               220 (47%)            71 (13%)
Hospital immunology         318 (66%)            203 (37%)
ONS deaths                  51 (11%)             14 (2.5%)

  1. In Part 1, most cases were initially identified by more than one method, whereas in Part 2, fewer than 15% of cases were apparently identified by more than one method. This is because of a change in how the method of initial case identification was recorded on the database. In Part 2, study cases were initially sought via clinician identification.

The results of the K-function analyses are presented in Table 2. Overall there was statistically significant evidence of space–time clustering (P < 0.001 using both the geographical distance and the NN threshold versions of the K-function method). The strength of clustering (S) was summed over all 225 combinations of space and time (ΣS) and was calculated as 1288.8 using the geographical distance version and 2063.4 using the NN threshold version of the Knox test. Furthermore, for 178 of 225 combinations of space and time, S was greater using the NN threshold version of the method compared with the geographical distance version. Thus, space–time clustering was more marked using the NN threshold metric.

Table 2. Results of K-Function Analyses of Space-Time Clustering for Cases of PBC Diagnosed in Northeast England During the Period 1987–2003

Group                                       Geographical Distance*    NN Threshold
All case pairs                              P < 0.001                 P < 0.001
“More densely populated: any” case pairs    P < 0.001                 P < 0.001
“Less densely populated: any” case pairs    P = 0.008                 P = 0.001

  • Cases are close in time if dates differ by <t, where t is in the range 0.1 year to 1.5 year.

  • I = ∫R(s,t)dsdt, where R(s,t) = [K(s,t) − K1(s)K2(t)]/√[K1(s)K2(t)].

  • K(s,t) = proportion of pairs of cases whose distances apart are ≤t in time and ≤s in space, K1(s) = proportion of pairs whose distance apart in space is ≤s, and K2(t) = proportion of pairs whose distance apart in time is ≤t.

  • P-value obtained by simulation (999 runs) with dates of diagnosis randomly reallocated to the cases in the analysis.

  • * Cases are close in space if the distance between their locations is <s, where s is in the range 0.5–7.5 km.

  • NN threshold: cases are close in space if either is within the distance to the Nth nearest neighbor of the other, where N is in the range 19–33.

  • P < 0.05.

Analyses by two shorter time periods (cases diagnosed between 1987 and 1994; and cases diagnosed between 1995 and 2003) found that space–time clustering was still present in both of these time periods. ΣS, calculated using the geographical distance version of the Knox method, was equal to 659.0 for cases diagnosed between 1987 and 1994, and 1182.8 for cases diagnosed between 1995 and 2003. Furthermore, for 136 of 225 combinations of space and time, S was found to be greater in the later time period compared with the earlier period. Thus, space–time clustering was more marked during 1995 to 2003 (and was also more statistically significant: P = 0.09, P = 0.02 for cases diagnosed between 1987 and 1994; P = 0.04, P < 0.001 for cases diagnosed between 1995 and 2003).

The strength of clustering (calculated using the geographical distance version of the Knox method) was most marked for cases diagnosed within 0.1 to 0.3 year (1–4 months) of one another (Table 3). Furthermore, the Knox test showed that a large number of moderately sized case aggregations (where an individual case had at least 10 other cases in close spatiotemporal proximity) were geographically widespread (using critical values for space and time of 5 km and 12 months, respectively). Kulldorff's scan statistic found 14 statistically significant (P < 0.05) space–time clusters. These significant clusters included between 12 and 19 cases. The most statistically significant occurred during 2003 (O = 12, E = 1.91, O/E = 6.29, P = 0.003).

Table 3. Strength (S) of Clustering Results from Knox Tests with Range of Spatial and Temporal Thresholds

Distance (km)
Time (year)   0.5 km   1 km    1.5 km   2 km    2.5 km   3 km    3.5 km   4 km
0.1 year      7.
0.2 year      30.
0.3 year      14.3     6.3     6.8      11.8    17.5     11.8    8.7      9.3
0.4 year      9.
0.5 year      8.
0.6 year      6.
0.7 year      6.
0.8 year      4.5      8.0     8.7      10.
0.9 year      7.1      9.3     10.
1.0 year      7.5      10.0    9.2      8.5     11.
1.1 year      8.8      12.0    9.0      8.3     10.
1.2 year      9.
1.3 year      7.
1.4 year      6.7      10.
1.5 year      5.

Distance (km)
Time (year)   4.5 km   5 km    5.5 km   6 km    6.5 km   7 km    7.5 km
0.1 year      3.
0.2 year      7.
0.3 year      6.
0.4 year      5.
0.5 year      5.
0.6 year      5.
0.7 year      4.
0.8 year      5.
0.9 year      5.
1.0 year      5.
1.1 year      5.
1.2 year      5.
1.3 year      5.
1.4 year      5.
1.5 year      5.

  1. Strength (S) = {(observed − expected)/expected} × 100 counts of pairs that are close in time and space. Cases are close in space if distances between their locations differ by less than 0.5, …, 7.5 km. Cases are close in time if dates of diagnosis differ by less than 0.1, …, 1.5 year. The maximum value of S for each critical spatial distance has been emboldened.

Analysis by level of population density showed that there was evidence of clustering both for pairs of cases that included at least one from a “more densely populated area” (P < 0.001 using both the geographical distance and NN threshold approaches) and for pairs of cases that included at least one from a “less densely populated area” (P = 0.008 using the geographical distance threshold and P = 0.001 using the NN threshold). A comparison using a Bernoulli-based model found no difference in the propensity to cluster between cases from more or less densely populated areas.


Discussion

This study has found novel evidence of space–time clustering among cases of PBC. Rigorous statistical methods have been used to analyze high-quality population-based data from a well-defined geographical region. The study area has very low inward or outward migration rates.11, 25, 26 The clustering was observed using both geographical and NN threshold critical values for distance. Thus, the presence of space–time clustering cannot be explained by variations in population density.

The statistical methods include two distinct types of geographical threshold. Space–time clustering based on fixed distance thresholds provides support for the role of transient geostationary exposures in cause, whereas clustering based on variable NN thresholds is more suggestive for the role of an agent that is transmitted by person-to-person contact. Because space–time clustering was present using both metrics, it is unlikely to be an artefactual occurrence. Furthermore, because clustering was more marked using the variable NN metric (as demonstrated by greater strength of clustering), the evidence overall is more supportive of an infectious causative agent than of an agent originating from a fixed geographical source.

Our earlier study of the same study region found evidence of spatial clustering.9 This was interpreted as providing support for an environmental cause in general without indicating whether this was geostationary or infectious. Additionally, there was a methodological shortcoming to the way in which population case-control data were derived. It should be noted that there is no such problem with the methodology pertaining to the current analyses, because they are reliant only on case data.

The finding of space–time clustering from the current study is consistent with the role of a transient agent in etiology. Because the population is relatively stable, the results suggest the possible role of an infectious agent. Furthermore, the temporal ranges over which the space–time interactions were most marked (1–4 months) are consistent with the involvement of such factors. Because we and many others have shown that there can be a long latency between development of AMA positivity in an individual and presentation of overt disease, we were somewhat surprised by these findings. However, the fact that we have demonstrated space–time clustering despite the potentially “diluting” effects of this latency in some individuals may make the findings even more powerful.

The methods for analyzing global clustering (K-function and Knox) are systematic and provide evidence of overall space–time clustering. They do not provide any information on individual clusters. We have identified a number of statistically significant space–time clusters using a scan statistic. Overall, there was geographically widespread evidence of space–time clustering, with moderately sized case aggregations. Some authors have suggested that shifts in the underlying population may artefactually cause space–time clustering.27, 28 We are not able to investigate population shifts, because this would require small area level population data for short time periods, which are not available in the United Kingdom. However, because space–time clustering was present in both the earlier and later parts of the study period (1987–1994 and 1995–2003), population shifts are unlikely to explain our results.

Previous studies have suggested a role for both acute and chronic bacterial and viral infections. Putative agents include Mycobacterium gordonae, Escherichia coli, and a retrovirus.4–8 Our findings are consistent with the possible involvement of a geographically widespread, but transient, causative agent, such as an acute infection (and not with a chronic infection). Addresses at residence were very stable. Thus, our results are consistent with either: (1) a very short “lag time” between exposure to an etiologically relevant agent and subsequent diagnosis or (2) a longer, but relatively constant, “lag time” between exposure and diagnosis.

In conclusion, this is the first study to find space–time clustering among cases of PBC. This clustering was highly statistically significant. These novel results suggest a role for transient environmental agents in causing PBC. Further research should consider putative transient causative agents, such as acute infections.