Local genes for local bacteria: Evidence of allopatry in the genomes of transatlantic Campylobacter populations

Abstract The genetic structure of bacterial populations can be related to geographical locations of isolation. In some species, there is a strong correlation between geographical distance and genetic distance, which can be caused by different evolutionary mechanisms. Patterns of ancient admixture in Helicobacter pylori can be reconstructed in concordance with past human migration, whereas in Mycobacterium tuberculosis it is the lack of recombination that causes allopatric clusters. In Campylobacter, analyses of genomic data and molecular typing have been successful in determining the reservoir host species, but not geographical origin. We investigated biogeographical variation in highly recombining genes to determine the extent of clustering between genomes from geographically distinct Campylobacter populations. Whole‐genome sequences from 294 Campylobacter isolates from North America and the UK were analysed. Isolates from within the same country shared more recently recombined DNA than isolates from different countries. Using 15 UK/American closely matched pairs of isolates that shared ancestors, we identify regions that have frequently and recently recombined to test their correlation with geographical origin. The seven genes that demonstrated the greatest clustering by geography were used in an attribution model to infer geographical origin which was tested using a further 383 UK clinical isolates to detect signatures of recent foreign travel. Patient records indicated that in 46 cases, travel abroad had occurred <2 weeks prior to sampling, and genomic analysis identified that 34 (74%) of these isolates were of a non‐UK origin. Identification of biogeographical markers in Campylobacter genomes will contribute to improved source attribution of clinical Campylobacter infection and inform intervention strategies to reduce campylobacteriosis.

recombined to test their correlation with geographical origin. The seven genes that demonstrated the greatest clustering by geography were used in an attribution model to infer geographical origin which was tested using a further 383 UK clinical isolates to detect signatures of recent foreign travel. Patient records indicated that in 46 cases, travel abroad had occurred less than two weeks prior to sampling, and genomic analysis identified that 34 (74%) of these isolates were of a non-UK origin. Identification of biogeographical markers in Campylobacter genomes will contribute to improved source attribution of clinical Campylobacter infection and inform intervention strategies to reduce campylobacteriosis. with Neolithic human hosts (Comas et al., 2013), lineages of Mycobacterium tuberculosis populations can be classified into geographical groups based upon local genetic diversification of DNA sequences (Achtman, 2008, Gagneux andSmall, 2007). Phylogeographical structure has also been observed in the human gastric bacterium Helicobacter pylori, where a rapidly evolving genome with high levels of horizontal gene transfer (HGT) allows the reconstruction of recent human migrations to the extent that genetic admixture among the bacteria reflects interactions among human populations (Falush et al., 2003, Moodley et al., 2009).
M. tuberculosis and H. pylori are primarily human pathogens.
However, in the foodborne pathogen Campylobacter, animals are the principal reservoir for human infection. International trade, particularly in agricultural animals including chicken and poultry products, provides a vehicle for global spread. In this case, local phylogeographical signals can be weakened not only by the rapid movement of lineages around the world, but also by genomic changes that occur within the reservoir host. This may make it difficult to attribute the country of origin based on the Campylobacter isolate genome alone. Sequence-based analyses have shown that populations of the main human disease-causing Campylobacter species, C. jejuni and C. coli, are highly structured into clusters of related lineages, which can be identified by MLST as clonal complexes (CCs). Members of CCs share four or more MLST alleles with a predefined central genotype, which gives the CC its name; for example, ST-21 defines CC-21 (Dingle et al., 2005, Sheppard et al., 2010b. In C.
jejuni, host-associated clonal complexes can be identified based upon the frequency with which particular genotypes are isolated from different hosts (Sheppard et al., 2011. Many of these lineages are globally distributed (Sheppard et al., 2010a)  Horizontal gene transfer in recombining bacteria, such as Campylobacter , Wilson et al., 2008, Sheppard et al., 2013a, can provide information about ecological differences between lineages. For example, when a Campylobacter lineage transfers to a new animal host it may acquire DNA from the resident population by HGT. This has been shown in host generalist C. jejuni lineages isolated from chicken that sometimes contain alleles that originated in chicken specialist genotypes (McCarthy et al., 2007, Wilson et al., 2008. In this study, we applied comparable approaches to investigate whether HGT can lead to signatures of recombination that discriminate between isolates from North America and the UK using genomic data.
Using matched pairs of North American and UK isolates, we identify genes that are prone to recombination, and will therefore pick up local DNA more rapidly, and hypothesize that these genes may acquire a biogeographical signal.

| Bacterial isolates and genome sequencing
A total of 294 sequenced isolates were analysed, of which 131 genomes were generated in this study and augmented by 163 previously published genomes , Sheppard et al., 2013a, Sheppard et al., 2013b. Sequencing reads for all genomes sequenced in this study are available from the NCBI short read archive associated with BioProject: PRJNA312235. All assembled genomes used in this study can also be downloaded from FigShare (https://doi.org/10.6084/m9.figshare.4906634).

| Canadian isolates
Isolates were collected from chicken and bovine faecal samples between July 2004 and July 2006 from farms at diverse locations in Alberta. Samples were placed on ice and processed within 6 h as previously described (Jokinen et al., 2010). Approximately 5 g of faecal matter was mixed with 5 ml of phosphate buffered saline (PBS) to form uniform slurry. One-millilitre aliquots of the PBS-faecal samples were added to 20 ml of Bolton broth containing 5% (v/v) lysed horse blood and selective supplement (Diergaardt et al., 2004) and incubated at 42°C for 24 h under microaerobic conditions prior to plating 20 ll onto supplemented charcoal cefoperazone deoxycholate agar (CCDA). The plates were incubated for a further 48 h at 42°C. Human samples were acquired from clinical laboratories in three Canadian provinces. These were replated from frozen glycerol stocks and the DNA extracted as described below.
Presumptive Campylobacter colonies were cultured onto blood agar plates and tested using biochemical oxidase and catalase tests.

| US isolates
DNA was extracted from a pure culture colony using the Wizard Genomic DNA Purification Kit (Promega, Madison, WI).
Campylobacter species was identified by 16S rDNA sequencing, using the primer pairs as described by Lane (1991). Genome sequencing was performed on an Illumina MiSeq sequencer using the KAPA Low-Throughput Library Preparation Kit with Standard PCR Amplification Module (Kapa Biosystems, Wilmington, MA), following the manufacturer's instructions except for the following changes; 750 ng DNA was sheared at 30 psi for 40 s and size selected to 700-770 bp following Illumina protocols. Standard desalted TruSeq LT and PCR primers were ordered from Integrated DNA Technologies (Coralville, IA) and used at 0.375 and 0.5 lM final concentrations, respectively. PCR was reduced to 3-5 cycles.
Libraries were quantified using the KAPA Library Quantification Kit (Kapa), except with 10 ll volume and 90-s annealing/extension PCR, and then pooled and normalized to 4 nM. Pooled libraries were requantified by ddPCR on a QX200 system (Bio-Rad, Hercules, CA), using the Illumina TruSeq ddPCR Library Quantification Kit and following the manufacturer's protocols, except with an extended 2-min annealing/extension time. Libraries were sequenced using a 2 9 250 bp paired end v2 reagent kit on a MiSeq instrument (Illumina, San Diego, CA) at 13.5 pM, following the manufacturer's protocols. Genomes were assembled using the Roche Newbler assembler (version 2.3).

| Published isolates
We augmented the collection of isolates sequenced in this study with 163 previously published Campylobacter isolate genomes from Canada, the United States and the UK collected between 1980 and 2012 from a range of sources, including cattle (54), chicken (80), pig (9), environmental (49), wild bird species (12) and human clinical cases (73) (Table S1) , Sheppard et al., 2013a, Sheppard et al., 2013b.

| UK clinical test isolates
In addition to this collection of sequenced and publicly available Campylobacter genomes, we used a further 383 clinical samples collected from the John Radcliffe Hospital in Oxford between June and October 2011 as a test data set to attribute source according to geography (Table S2) (Cody et al., 2013). These genomes were downloaded from http://pubmlst.org/campylobacter/.

| Population structure
Isolate genomes were archived on an open-source BIGSdb database which identifies gene presence and allelic variation by comparison with a reference locus list (Jolley and Maiden, 2010. This list comprises 1,623 locus designations from the annotated genome of C. jejuni strain NCTC11168 (GenBank accession no. NC_002163.1) (Gundogdu et al., 2007, Parkhill et al., 2000. Reference loci were identified in each of the 294 isolate genomes using BLAST. Loci were recorded as present if the sequence had ≥70% nucleotide identity over ≥50% of the gene length. Each gene was aligned individually using MAFFT (Katoh et al., 2002) and concatenated into a single multi-FASTA alignment file for each isolate for a total alignment of 1,585,605 bp. Phylogenetic trees were constructed from a whole-genome alignment of PASCOE ET AL. | 4499 C. jejuni (n=229) and C. coli (n=55) isolates based on 103,878 and 806,657 variable sites, respectively, using FastTree (version 2) and an approximation of the maximum-likelihood algorithm (Tamura et al., 2013, Kumar et al., 2016.

| Selection of isolate pairs
To minimize the effect of host adaptation and maximize the opportunity of identifying genetic signatures of geographical separation, a subset of 15 isolate pairs were chosen based upon their phylogenetic clustering. In each case, isolate pairs contained one Canadian and one UK isolate of the same clonal complex sampled from the same host species. Paired isolates shared 1,378 genes resulting in a core-genome alignment of 1,287,560 bp.

| Analysis of co-ancestry and inference of recombination hot regions
The co-ancestry of the paired isolates was inferred based on wholegenome sequences using chromosome painting and fineSTRUCTURE (Lawson et al., 2012), as previously described (Yahara et al., 2013).
ChromoPainter (version 0.02) was used to infer the number of DNA "chunks" donated from a donor to a recipient for each recipient haplotype, and the results summarized in a co-ancestry matrix indicating average isolate similarity across the entire genome. fineSTRUCTURE was then used for 100,000 iterations of both the burn-in and Markov chain Monte Carlo (MCMC) chain to cluster individuals based on the co-ancestry matrix. The results are visualized as a heat map with each cell indicating the proportion of DNA "chunks" a recipient receives from each donor.
The time to the most recent common ancestor (TMRCA) of each pair was estimated using the model described in Didelot et al. (2013) and summarized here briefly. Pairs of genomes share a common ancestor t years ago and have been subject to mutation at a rate l and recombination at rate q. The mutation rate of 2.9x10 -5 per site per year was used as reported in (Sheppard et al., 2010b), which is similar to the rates estimated in Wilson et al. (2008Wilson et al. ( , 2009). The effect of recombination is to introduce a high density of polymorphism similar to the ClonalFrame model Falush, 2007, Didelot andWilson, 2015) but with the advantage that this density can vary between recombination events to reflect differences in evolutionary distance between donors and recipients (Morelli et al., 2010, Didelot et al., 2013. In each pairwise comparison, the TMRCA and recombination rate parameters are estimated based on a coregenome alignment, with 95% credibility intervals.

| Epidemiological markers of geographical clustering
Neighbour-joining phylogenetic trees were constructed for all genes that demonstrated an average of above 1% pairwise nucleotide diversity across all 15 pairs of isolates. Individual gene phylogenies were constructed in MEGA for all 57 genes. Isolates were assigned to a putative source population based on the seven highly recombining genes that showed the greatest level of clustering by geography.
Probabilistic assignment of geographical source is based on the allele frequencies in the reference population data sets for each of the seven loci. This analysis was performed using Structure, a Bayesian model-based clustering method designed to infer population structure and assign individuals to populations using multilocus genotype data (Sheppard et al., 2010a, Pritchard et al., 2000. Canadian and US isolates were combined as a North American population for comparison with UK isolates.

| Attribution of clinical isolates to country based on seven geographically segregating genes
The source attribution model was tested with isolates of a known source. Self-assignment of a random subset of the comparison data set was conducted by removing a third of the isolates from each candidate population (n = 73). The remainder were used as the reference set (78 North American isolates to compare with 68 UK isolates). Structure was run for 100,000 iterations following a burn-in period of 10,000 iterations using the no admixture model to assign individuals to putative populations. The assignment probability for each source was calculated for each isolate individually and were attributed to origin populations when the attribution probability was greater than 0.50.

| RESULTS
Core genomes of isolates from North America and the UK were  Table S1).
3.1 | Matched isolates share more common ancestry with isolates from the same country To minimize the effect of host adaptation and maximize the opportunity of identifying genetic signatures of geographical separation, a subset of 15 isolate pairs were chosen based upon their phylogenetic clustering with less than 1,200 bp difference in 1,378 coregenome loci. In each case, isolate pairs contained one Canadian and one UK isolate of the same clonal complex sampled from the same host species (Table 1). The co-ancestry of the paired isolates was inferred based on core-genome alignments using chromosome painting and fineSTRUCTURE (Lawson et al., 2012) (Yahara et al., 2013) ( Figure 2). The total proportion of DNA "chunks" in a recipient from isolates within the same country (median 0.59) was significantly higher than that from isolates from different countries (median 0.33) (p<10 -9 , Wilcoxon rank-sum test).

| Matched isolates share recent common ancestors but have since experienced significant recombination
The estimated time since the most recent common ancestor (TMRCA) was calculated for each UK/American pair of genomes as previously described (Didelot et al., 2013), using the mutation rate of 2.9x10 -5 per site per year reported in Sheppard et al. (2010b), which is consistent with estimates in Wilson et al. (2009). In each pairwise comparison, the level of divergence along the genome (Figure 3) was used to estimate the TMRCA and recombination rate, with 95% credibility intervals around these parameters (Table 2). All pairs were estimated to have shared ancestors between one and five years ago, with two exceptions, namely the two C. coli pairs, for which the TMRCA was around 25 years ago. The ratio r/m of rates at which recombination and mutation introduce polymorphism was estimated to be around 20-30 except in the two C. coli pairs with larger TMRCA, for which a smaller value was estimated around r/ m=4. Most existing r/m estimates have been calculated using seven MLST housekeeping genes (Vos and Didelot, 2009, Wilson et al., 2008). Other estimates have been derived through comparison of relatively small numbers of Campylobacter genomes (Llarena et al., 2016). Estimates of r/m can vary considerably depending on the isolate collection and the genes used in the analysis, for example, ranging from 0 to 100 among Helicobacter isolates within a human population from within a single settlement in South Africa (Didelot et al., 2013). Given the potential for sampledependent variation, the r/m estimates in this study are consistent with previous estimates. Further, variation in TMRCA estimates and r/m between C. coli compared to C. jejuni pairs in this study may reflect differences between the species, but more sampling of the C.
coli population is necessary to investigate this further.

| Highly recombining genes as markers of geographical attribution
A pairwise comparison of the matched pairs was used to quantify the level of divergence in each gene within the core genome (1,147 genes) of the paired isolates. Most genes showed low diversity, indicative of closely related pairs. Polymorphisms in genes with less than 1% divergence between pairs (white and red in Figure 3) are likely to be the result of mutation or recombination with a tract of DNA with high nucleotide identity, so that only one or two substitutions are visible. Genes with greater than 1% divergence between pairs are likely to have recombined as numerous substitutions have been introduced (blue in Figure 3). Fifty-seven genes (e.g., Cj0034c and Cj0635) had a high level (>1%) of nucleotide divergence and high probability of recombination in all 15 pairs. This result did not arise just by chance: overall recombination was inferred in around 25% of the genes in each pair and so if recombination was random, the probability that all 15 pairs had recombined for a given gene would be extremely small (0.25 15 =9.3x10 -10 ).
Individual gene trees were generated for these 57 genes from which the most recombination could be identified (Fig. S1). The

(a) (b)
F I G U R E 1 Population structure of Campylobacter isolates used in this study. Phylogenetic trees were constructed from a whole-genome alignment of (a) C. jejuni (n=229) and (b) C. coli (n=55) isolates based on 103,878 and 806,657 variable sites, respectively, using an approximation of the maximum-likelihood algorithm (Tamura et al., 2013, Kumar et al., 2016. Leaves on the tree are coloured by source country, the UK (green circles), Canada (red) and the United States (blue). Ancestral C. coli clades (1, 2 and 3) (Sheppard et al., 2010b) are annotated and common clonal complexes (CC) based on four or more shared alleles in seven MLST housekeeping genes (Dingle et al., 2005) PASCOE ET AL.
| 4501 seven genes that gave the clearest geographical clustering were used for further analysis of geographical attribution using Structure as previously described (Sheppard et al., 2010a, Pritchard et al., 2000. A self-test was performed on a subset of our isolate collec-  F I G U R E 3 Pairwise comparison of nucleotide diversity in the core genome. Above: Estimated values of the per-nucleotide statistic reflecting relative intensity of recombination at each site plotted along the NCTC11168 reference genome. Left: Core-genome phylogeny of selected paired isolates (matched by CC and source host), with clonal complex indicated. Centre: Matrix of gene-by-gene pairwise comparison along the NCTC11168 reference genome of our selected pairs. Each row represents a pairwise comparison of selected paired of isolates. Each column is a gene from the NCTC11168 reference genome. Panels of the matrix are coloured based on nucleotide divergence for that gene in each pair: from no nucleotide diversity (0%, white), through some nucleotide diversity (~1%, red) to high levels of nucleotide diversity (up to 2%, blue). The per-nucleotide scan of relative intensity of recombination is aligned with our gene-by-gene pairwise comparison of nucleotide diversity, and the location of seven putative epidemiological markers for geographical segregation is indicated nucleotide substitutions in multiple lineages (that reflect adaptation to the host and drift) accumulate in physically isolated populations (Sheppard et al., 2013b). This host-associated genetic structure can be informative for understanding the evolution of C. jejuni (Dearlove et al., 2016), but can also be used in a more practical way to identify the source of isolates causing human infection by identifying genomic signatures (resulting from adaptation or drift) in the infecting isolate that are associated with populations in particular reservoir hosts (Sheppard et al., 2009, Wilson et al., 2008. Quantitative source attribution models, based upon the probability that a particular clinical isolate originated in different reservoirs, have been widely used to estimate the risk of human infection from different food production animals and other sources (Colles et al., 2008, French et  T A B L E 2 Shared ancestry analysis and estimation of pairwise recombination rates. The time to the most recent common ancestor (TMRCA) for each selected pair was estimated with 95% confidence intervals (TMRCA-CI). The ratio of rates at which recombination and mutation introduce polymorphism (r/m) was also calculated with 95% confidence intervals (r/m-CI). In addition, the number of recombined genes (probability > 95%) is also shown.  Griekspoor et al., 2013, Viswanathan et al., 2017, Thepault et al., 2017 and have informed intervention strategies and public health policy (Cody et al., 2013, Cody et al., 2012.
The accuracy of probabilistic source attribution models is influenced by the degree to which indicative markers in the isolate genome, such as MLST locus alleles, can be placed within a source population. This is relatively straightforward for markers that segregate absolutely by source, but in C. jejuni and C. coli it is common that alleles are present in more than one population, but at different frequencies.
In simple attribution models using MLST data, C. jejuni and C. coli isolates from chickens in the Netherlands, Senegal and the United States have been more closely related to UK chicken isolate populations rather than to populations from other host species in the same country (Sheppard et al., 2010a). While genomic signatures of host association can transcend geographical structuring within C. jejuni and C. coli populations, there can be differences in the genotypes that are isolated from different countries (Mohan et al., 2013, Asakura et al., 2012, Kivisto et al., 2014, Islam et al., 2014, Prachantasena et al., 2016. This presents challenges, not only for attributing the source of infections among travellers returning from foreign locations (Mughini-Gras et al., 2014), but also for understanding disease epidemiology in the context of a global food industry.
Following the occupation of a new niche C. jejuni and C. coli can acquire DNA signatures through recombination , Sheppard et al., 2013a and local DNA signatures via HGT, from resident strains. To quantify the extent to which isolates from the same country share DNA sequence, we compared 15 isolate pairs from different countries that, to minimize the effect of clonal inheritance and host-associated variation, were matched by both clonal complex and source. The predicted ancestry of co-inherited SNPs was nearly twice as high among isolates from same country compared to those from different countries. While this represents a relatively weak signal of geographical association, compared to host association, there was a quantifiable local (national) signal that can be used to investigate geographical clustering.
As recombination introduces more nucleotide substitutions than during mutation in C. jejuni and C. coli (Webb and Blaser, 2002, Morelli et al., 2010, genes with evidence of elevated recombination rates, which share a gene pool, will more rapidly acquire local signals of sequence variation than genes with lower recombination rates. These genes represent potential targets for use as biogeographical epidemiological markers. Pairwise isolate comparison revealed that nucleotide divergence was <1% across the majority of the genome (Table S3); however, some genes consistently had more sequence variation in multiple isolate pairs, potentially indicating enhanced recombination at these loci.
Several of these genes have been annotated with functions associated with DNA processing, transcription, repair and maintenance.
This may reflect the mechanisms of recombination and horizontal gene transfer. Other genes with evidence of elevated recombination included those associated with surface exposed proteins with roles in glycosylation, motility and secretion which would form part of an initial interaction with the host/environment (Table S3). Variation in recombination rate could be influenced by differential selection pressure. The C. jejuni N-acetyltransferase PseH (Cj1313) plays a key role in O-linked glycosylation, which contributes to flagellar formation, motility and pseudaminic acid biosynthesis (Song et al., 2015, McNally et al., 2006 and is important in host colonization . The variable outer membrane protein gene porA, which has been used as part of extended MLST schemes , Cody et al., 2009 was also among those genes with evidence of elevated recombination. This may explain why weak allopatric signals have been associated with sequence variation in the porA gene in addition to source attribution signals (Sheppard et al., 2010a, Smid et al., 2013, Mughini-Gras et al., 2014. Three efflux pump genes (Cj0034c, Cj0619 and Cj1174) which have been implicated in fluoroquinolone resistance, showed elevated recombination and phylogeographical variation (Table S3) (Luangtongkum et al., 2009, Ge et al., 2005. Clinical and agricultural prescription of broad-spectrum antibiotics such as quinolones varies worldwide. Since the late 1990s, the agricultural use of fluoroquinolones has declined following governmental intervention in Europe and North America (Chang et al., 2015, Nelson et al., 2007; however, resistant isolates remain common and the level of resistance can vary from country to country (Pham et al., 2016). Higher levels of fluoroquinolone resistance have been observed among isolates from infected individuals who have recently returned from foreign travel (Gaudreau et al., 2014). This is consistent with the higher levels of use in other parts of the world (Zhong et al., 2017).  Supplementary Table S1.

CONF LICT OF I NTEREST
The authors declare no competing interests.