Applied phyloepidemiology: Detecting drivers of pathogen transmission from genomic signatures using density measures

Abstract Understanding the driving forces of an epidemic is key to inform intervention strategies against it. Correlating measures of the epidemic success of a pathogen with ancillary parameters such as its drug resistance profile provides a flexible tool to identify such driving forces. The recently described time‐scaled haplotypic density (THD) method facilitates the inference of a pathogen's epidemic success from genetic data. Contrary to demogenetic approaches that define success in an aggregated fashion, the THD computes an independent index of success for each isolate in a collection. Modeling this index using multivariate regression, thus, allows us to control for various sources of bias and to identify independent predictors of success. We illustrate the use of THD to address key questions regarding three exemplary epidemics of multidrug‐resistant (MDR) bacterial lineages, namely Mycobacterium tuberculosis Beijing, Salmonella Typhi H58, and Staphylococcus aureus ST8 (including ST8‐USA300 MRSA), based on previously published, international genetic datasets. In each case, THD analysis allowed to identify the impact, or lack thereof, of various factors on the epidemic success, independent of confounding by population structure and geographic distribution. Our results suggest that rifampicin resistance drives the MDR Beijing epidemic and that fluoroquinolone resistance drives the S. aureus ST8/USA300 epidemic, in line with previous evidence of a lack of resistance‐associated fitness cost in these pathogens. Conversely, fluoroquinolone resistance measurably hampered the success of S. Typhi H58 and non‐H58. These findings illustrate how THD can help leverage the massive genomic datasets generated by molecular epidemiology studies to address new questions. THD implementation for the R platform is available at https://github.com/rasigadelab/thd.


| INTRODUC TI ON
In a globalizing world where re-emerging diseases and antibiotic-resistant pathogens threaten the health systems of numerous countries (Wirth, 2018), there is an urgent need for simple and efficient tools to quantify pathogen transmission. Correlating this information with ancillary parameters such as antibiotic resistance profiles, pathogen demogenetic measures or host characteristics (e.g., vaccine status) can then help to determine the driving forces behind the epidemic success of a pathogen. Finally, understanding these driving forces allows us to identify modifiable drivers of an epidemic and to inform intervention strategies against it (Heesterbeek et al., 2015).
The concepts of epidemicity and endemicity are central to infectious diseases epidemiology. Endemic areas for an infection are where the microbe and its related clinical cases persist over years, displaying a continuous incidence. The continuity rather than the intensity of the incidence is the effective indicator of endemicity. This concept was illustrated by the calculation of an index of endemicity relying on the study of cholera mortality over long periods of time in Bengal (Kamal, 1963). Contrasting with endemicity, epidemicity characterizes the dynamics of the infection. Epidemicity can reflect, for instance, the spread of a pathogen in an empty or naïve niche with a sudden burst followed by an abrupt ending, as exemplified by the Vibrio cholerae outbreak in Haiti in 2010 (Eppinger et al., 2014), or a novel zoonosis like HIV in the 1980s (Faria et al., 2014). Other situations are less clear-cut, such as when local outbreaks occur in otherwise endemic areas. When linked with environmental factors, such outbreaks can appear as seasonal or completely stochastic. Importantly, epidemicity is tightly linked to the concepts of epidemic success and fitness. In the following, we use the term epidemicity to qualify the spread of a disease in an epidemiological context, while the term epidemic success relates to the success of the causative pathogen from an evolutionary standpoint. Finally, the fitness of the pathogen qualifies the biological characteristic that underlies its epidemic success in specific conditions. These distinctions are meant to emphasize the chain of consequences in which a pathogen's fitness enhances its epidemic success and, in turn, the epidemicity of the disease.
Our understanding of the dynamics of infectious disease epidemics has been facilitated by the continued improvement of modeling techniques (Cauchemez, Hoze, Cousien, Nikolay, & Ten Bosch, 2019), based on two major paradigms: (a) epidemiological models (Anderson & May, 1992;Keeling & Rohani, 2008) that consider the dynamics of an entire population going forward in time and (b) coalescent theory (Kingman, 1982), which considers small samples of an infected population and operates backward in time until the common ancestor has been reached. Although historical data on transmission dynamics are frequently lacking, the genotypes of extant pathogens are increasingly available. Therefore, using these genetic data to infer the epidemic success of a pathogen or group of pathogens can greatly improve our understanding of the drivers of pathogen transmission and spread.
Here, we illustrate applications of the time-scaled haplotypic density (THD), a recently developed measure of epidemic success Rasigade et al., 2017), to model the driving forces behind major pathogen epidemics, using rich genetic datasets on three exemplary human pathogens. After discussing the motivation for the THD method and its mathematical foundations, we first revisit the Mycobacterium tuberculosis Beijing epidemic in Eurasia (Merker et al., 2015) and establish links between patterns of antimicrobial resistance and epidemic success. In a second example, we apply THD to infer whether fluoroquinolone resistance impacts fitness in Salmonella Typhi populations, including the pandemic, multidrug-resistant H58 lineage (Wong et al., 2015). Finally, we examine whether the epidemic success of distinct lineages of community-acquired methicillin-resistant Staphylococcus aureus was measurably influenced by two previously suspected success-associated traits, namely fluoroquinolone resistance and the arginine catabolic mobile element ACME (Glaser et al., 2016).

| MOTIVATI ON FOR AN IND IVIDUAL-BA S ED CORREL ATE OF EPIDEMI C SUCCE SS
In the following section, we relate the epidemic success of a pathogen to the density of transmission events in its ancestry. If a pathogen's evolution rate is fast enough, each transmission event results in a genetic divergence event that can be inferred as a node in the phylogeny reconstructed from the genotypes of the descendants of the transmitted pathogen. Based on this assumption, the phylogenetic signature of epidemic success is the density of divergence events in the pathogen's ancestry.
Several demogenetic and phyloepidemiology techniques estimate quantities that reflect the density of divergence events in the ancestry of a population. For instance, the expansion rate of a pathogen population r, or Malthusian parameter, can be estimated under the exponential growth model N(t) = e rt N 0 where N(t) is the population size after t epochs and N 0 is the initial population size.
More specialized phylodynamics models have been adapted to the epidemic setting to explicitly consider host infection and recovery and the detection probability of a pathogen (Popinga, Vaughan, Stadler, & Drummond, 2015;Volz, Kosakovsky Pond, Ward, Leigh Brown, & Frost, 2009). Such models can infer the basic reproduction number R 0 of an epidemic, which is the number of secondary cases of infection generated by an index case during the infectious period.
Importantly, correlates of epidemic success such as the Malthusian parameter or the basic reproduction number are defined at the level of a group of pathogens, that is, a population. These correlates are useful for comparing the epidemic success of distinct pathogen species or lineages. Because of their aggregated nature, however, such correlates of epidemic success cannot be used to infer conclusions at the individual level, as such conclusions can be contaminated by cross-sectional effects (Schechner, Temkin, Harbarth, Carmeli, & Schwaber, 2013). Correlation between group-level features, such as between a correlate of epidemic success and other pathogen characteristics, does not imply correlation at the individual level because between-group correlation can result from correlation between the features in distinct individuals of the same group. These limitations of group-level correlates of epidemic success in correlation analysis can be circumvented by constructing models that directly relate pathogen characteristics with the rate of divergence events in a phylogenetic tree. Popular models that build upon this idea belong to the category of state-dependent diversification models, such as the binary-state speciation and extinction (BiSSE) model (Maddison, Midford, & Otto, 2007) or the Bayesian analysis of macroevolutionary mixtures (BAMM) (Rabosky, 2014), whose validity was criticized by several authors (Rabosky, Mitchell, & Chang, 2017). Because of their sophistication and computational demand, however, state-dependent diversification models can be difficult to use with large datasets (typically, n > 1,000) and the multiple-feature hypotheses that are increasingly common in pathogen epidemic studies.
To circumvent the statistical drawbacks of group-based correlates of epidemic success and the complexity of state-dependent diversification models, our group proposed a simpler, alternative approach to epidemicity analysis that provides individual-based correlates. The expected benefit of defining correlates of epidemic success at the individual level is to facilitate the detection of factors associated with epidemic success using regression models that can control for multiple confounders (Box 1).

| REL ATI ON S HIPS B E T WEEN EPIDEMI C SUCCE SS AND DEN S IT Y IN THE S PACE OF G ENE TIC D IS TAN CE S
All else being equal, the epidemic success in a group of pathogens increases its prevalence faster than its diversity, resulting in a more uniform (i.e., less diverse) genetic population structure compared to other groups in the sample. Lower diversity results in smaller genetic distances between isolates. From a statistical standpoint, both the prevalence of and pairwise genetic distances between isolates in a group can be jointly quantified by a measure of density in the space of genetic distances, suggesting that density correlates with success.
Contrary to the aggregated measures discussed above, density is defined at all points in the space of genetic distances, hence on the level of individuals in the population.
Time-scaled haplotypic density defines the density measure using an application-specific adaptation of kernel density estimation (KDE) (Parzen, 1962). In the general case, KDE computes density based on distances between points and a kernel function with a bandwidth parameter to control the smoothness of the estimate.
In the THD setting, points are haplotypes represented as vectors of markers, distances are the pairwise numbers of allelic differences and the kernel function is based on the geometric distribution. To control the bandwidth of the analysis in an interpretable fashion, the THD bandwidth is expressed in units of time rather than genetic distance.
The THD computation proceeds as follows. Let X be a sample of n haplotypes defined over m markers, represented as an (n × m) data matrix, and let y be a haplotype of interest not in X, for which the density is to be computed. For each haplotype x i in X, let h i be the Hamming genetic distance from y to x i , that is, the number of differences between x i and y. A distance h is associated with a kernel density (formally, a probability) under the truncated geometric distribution with bandwidth b (formally, the failure probability of a Bernoulli trial) and truncation limit m. This distribution has probability mass function k (h|b,m) = 1−b 1−b m+1 b h where the bandwidth b is a real number in [0, 1). Remark that the density is proportional to b h , which illustrates how the bandwidth controls the influence of the distance h on the density: For each additional difference between y and x, the density is multiplied by b. Reducing b, thus, accelerates the decrease of the density for larger numbers of differences. Finally, the haplotypic density K (y | X, b, m) of y with respect to X is the average of the n densities associated with the distances from y to each x i in X,

BOX 1 Applied relevance
Fitness is a key concept in the field of evolutionary biology and describes how good a genotype is at spreading its genes to the next generations. This handy concept is however more difficult to grasp in the field of microbiology. Microbial pathogens' success is often associated with a higher epidemicity of the corresponding disease, but pertinent individual-based measures of epidemic success are still lacking. Here, we illustrate the application of a recently developed measure of epidemic success, the time-scaled haplotypic density (THD), to model the driving forces behind major epidemics of multidrug-resistant bacterial pathogens.
The THD index has the potential to be implemented in genome wide association studies (GWAS). Treated as a quantitative trait, THD could identify SNPs or more complex genetic architectures associated with epidemic success.
The widespread adoption of THD-based GWAS might help expand the spectrum of the candidate genes or mutations linked to pathogen success, thus accompanying a shift from the current, paradigmatic antibiotic resistance-targeted GWAS approaches to more general questions pertaining to pathogen ecology and evolution. From a more applied point of view, the THD might allow us to pinpoint SNPs of major importance linked to virulence and spread in pathogens as well as to extend our knowledge concerning the adaptive landscape of those germs.
Because b is a dimensionless constant, its choice is not intuitive.
To circumvent this issue, we exploit the existence of a bijective relationship between the genetic distance h and the maximum-likelihood estimate t of the time to the most recent common ancestor (TMRCA) under the infinite alleles model (IAM) (Kimura & Crow, 1964;Walsh, 2001). The IAM assumes that the m haplotype markers lie on a non-recombining DNA segment, that they evolve independently with a common evolutionary rate µ, and that at most one change per marker has occurred in both lineages since their MRCA. Assuming that µ is known, the IAM allows us to replace the bandwidth with a more intuitive timescale parameter t 50 , which is the TMRCA such that haplotypes with shorter TMRCAs account for 50% of the density. Practically, we solve the IAM relation it follows that h 50 is the median of a truncated geometric distribution whose bandwith b * must be determined. From the cumulative probability function of the truncated geometric distribution with param- 2 . While this last equation has no analytic solution, it is easily solved for b * numerically using a root-finding algorithm over the [0, 1) interval.
From a computational complexity standpoint, the THD method scales linearly with the size of the input matrix of genetic distances, hence quadratically with the sample size. The single-threaded computation time is ~0.1 s for a sample size of 1,000 isolates on a standard computer (Intel I7 CPU at 3.4 GHz). As a comparison point, this computation is two orders of magnitude faster than a typical, fast phylogeny reconstruction method (BIONJ) applied on the same dataset ( Figure S1).
The THD computation steps can be summarized as follows: (a) determine parameters m (number of markers), µ (evolutionary rate) and t 50 (timescale); (b) associate the timescale with a median distance h 50 ; (c) determine the corresponding bandwidth b * ; and (d) for each haplotype of interest, compute THD as the average kernel density under the truncated geometric distribution with bandwidth b * and truncation limit m. The resulting vector of THD values is typically used as the response variable of linear regression models using potential predictors of epidemic success as the explanatory variables.
A summary of the model estimation steps is shown in Figure 1.
Practical guidelines on parameter selection and model interpretation are proposed in Box 2. A reference implementation of THD for the R platform is available at https://github.com/rasig adela b/thd.

| G ROUP-LE VEL CORREL ATIONS B E T WEEN THD, THE MALTH US IAN PAR AME TER , AND D IVER S IT Y IND I CE S
Several aggregate measures can be interpreted as proxies of epidemic success at the level of a group of isolates: (a) the group's prevalence, which reflects possible dominance but provides no information on population dynamics; (b) the genetic diversity of the group, which is comparatively lower in recently expanding lineages; and (c) the Malthusian parameter r.
We examined the correlation patterns of THD with these aggregate measures in a well-documented dataset of M. tuberculosis genotypes from a collection of 4,987 isolates of the Beijing lineage F I G U R E 1 Overview of the THD method's inputs, estimation, and modeling steps. THD estimation is a special case of kernel density estimation where the observations are haplotypes and the kernel bandwidth b is calibrated relative to a timescale parameter t 50 defined as the median of the kernel distribution. The explicit estimation formula emphasizes that THD estimation has O (n 2 ) space and time complexity. THD values, used as proxies for the epidemic success of each haplotype, are typically regressed on a set of predictor variables. TMRCA, time to the most recent common ancestor between two haplotypes. ϵ is the error term of the regression model (Merker et al., 2015). This lineage is a major driver of the ongoing epidemic of multidrug-resistant tuberculosis (MDR-TB) in Eurasia (Merker et al., 2018;Wirth et al., 2008) and

BOX 2 Guidelines for THD modeling and interpretation
THD modeling seeks to examine associations between the epidemic success and the characteristics of a pathogen or its host. We propose a synthetic overview of the construction and interpretation of linear regression models that use THD is as the response variable.

Selection of explanatory variables
Potential predictors of epidemic success can relate to the host, such as the sociodemographic or clinical characteristics of infected patients (Rasigade et al., 2017), or to the pathogen, such as its phenotype or genotype, including single nucleotide polymorphisms.
Special care should be taken to identify potential confounders to be included as model covariates.

Timescale selection
The timescale t 50 controls how fast the contribution of isolates to the density decreases with the age of their divergence (the time to the most recent common ancestor, TMRCA). Technically, t 50 is the TMRCA such that isolates with a TMRCA at most equal to t 50 account for 50% of the density. Practically, the timescale reflects the time window of interest depending on the study question. A sensible choice for t 50 is the time elapsed since the onset of the pathogen spread. Shorter timescales can be used to model smallerscale phenomena such as the influence of host-related predictors that do not depend on pathogen evolution.

Controlling for population structure
THD values exhibit a strong phylogenetic autocorrelation (that is, closely related isolates tend to have similar THDs) because THD depends by design on the population structure. Comparing THD models with and without controlling for population structure provides information on the relationship of the predictor under study with epidemic success. In THD models, population structure is treated as a potential confounder of this relationship.
A predictor-THD association that remains significant in the controlled model is independent of population structure, that is, the association is not explained by a shared evolutionary history of the predictor and the pathogen's fitness. Population structureindependent associations typically arise when: (a) the predictor characterizes the pathogen's host or environment rather than the pathogen itself; and (b) the predictor is a pathogen trait that evolved repeatedly in different lineages or within them. This latter case indicates convergent evolution, which supports a causal role of the predictor in increasing the epidemic success.
If the predictor-THD association is significant in the uncontrolled model but not in the controlled model, the predictor is said to be conditionally independent of THD given population structure. A population structure-dependent association arises for instance when the predictor evolved in a unique lineage whose epidemic success differs from other lineages. In such a case, the predictor might or might not play a causal role because its evolutionary history is intertwinned with that of the variation of epidemic success.
In other words, the predictor can be considered a marker of the successful lineage but not necessarily the cause of this success.

Practical methods for population structure control
Controlling for population structure can be achieved using several methods (reviewed in Sul, Martin, & Eskin, 2018). We focus on the simplest ones, namely random-effects grouping and principal coordinates analysis. Random-effects grouping can be applied when the population structure can be described by grouping isolates in several well-defined clades, such as major lineages or clonal complexes. Random-effects models use between-group variance internally; hence, a sufficient number of groups (typically >5) must be present in the sample for the variance estimate to be reliable. If isolates cannot be grouped in a meaningful fashion, population structure can be accounted for by introducing genetic principal coordinates (PCs) as model covariates (Li & Yu, 2008). These PCs are obtained using multidimensional scaling of the matrix of genetic distances between the isolates. The PCs represent isolates as coordinates in a (virtual) space in which the genetic distances are approximately preserved, such that closely related isolates exhibit similar coordinates. Used as model covariates, these PCs represent the position of each isolate in the population structure and effectively account for its influence on the THD response variable.
Beijing lineages, is confined to the Korean and Japanese regions. The Beijing lineage has diversified during its geographical spread. CC1 is present in Central Asia, Russia, and Eastern Europe and exhibits ~30% MDR phenotypes. CC2 is mostly found in Russia and Belarus and exhibits the highest MDR prevalence at 75%. CC5 has spread to the Pacific islands and did not develop an MDR phenotype, probably because of specific treatment practices in these countries.
For each CC, we derived the Malthusian parameter r from previously established estimates of the current population size N 0 , the length of time t since the beginning of the last expansion, and the ancestral population size N t prior expansion, as r = −log(N 0 /N t )/r (Merker et al., 2015). The estimates for N 0 , N t , and t had been ob-  Table 1. The measure that most strongly correlated with THD was the Malthusian parameter, which is also the most direct aggregated correlate of epidemic success (Figure 2b). Among diversity measures, THD exhibited the strongest (negative) correlation with the Shannon entropy index. The weakest correlation was found between THD and prevalence. Details of the distribution of THD relative to CC prevalence and Shannon index are shown in Figure 2c,d, respectively. Collectively, these results indicate that the average THD mostly reflects population expansion in this dataset, as measured through the Malthusian parameter of each CC.

| EPIDEMI C SUCCE SS AND ANTIB I OTIC RE S IS TAN CE IN THE M . TUB ERCULOSIS B EIJ ING EPIDEMI C
To illustrate the use of an individual-based correlate of epidemic success, we examined the relationships between drug resistance and success in M. tuberculosis Beijing. The wide variations in drug resistance between CCs (Table S1) and regions (Table S2) make it difficult to disentangle the possible effects of each resistance on the success of Beijing lineages. Moreover, the patterns of associations between resistance, CCs and THD were intricate, as illustrated for streptomycin and rifampicin resistance using genetic principal coordinate (PC) visualization ( Figure 2e). Therefore, we used mixed-effects modeling to illustrate how THD analysis can isolate the confounding effect of geographic and population structure from the effect of drug resistance.
Time-scaled haplotypic density was regressed on resistance to major anti-TB drugs isoniazid, rifampicin, streptomycin, ethambutol, and pyrazinamide, which were coded as binary predictors. Drug resistances were either included in separate models or together in a multiple regression model (simple versus multiple regression models, respectively). In each case, the model was controlled or not for confounding by geographical and genetic structure by including the CC and the geographic region of the isolate as random effects in a mixed-effects model (uncontrolled versus controlled models, respectively).
Model coefficients and their significance levels are shown in Table 2. The comparison of bivariate and multiple regression models, along with the comparison between uncontrolled and controlled models, illustrates the patterns of the association between success and resistance. Bivariate models capture all marginal associations between drug resistance and THD, without considering correlation between the resistances. In multiple regression models, the effect of drug resistance is independent of other resistances. Finally, controlled models examine associations independent of the effect of between-CC and between-region variations.
In bivariate, uncontrolled models, THD was most strongly associated with streptomycin and isoniazid resistance (Table 2), which reflected the comparatively higher rates of resistance to these drugs in the highly successful CC1 and CC2 ( streptomycin resistance was the only independent predictor of success, most likely due to its enrichment in CC1 and CC2 (Figure 2e).
Interestingly, in the controlled multiple regression model, rifampicin resistance was found as an additional predictor of success.
Overall, the THD analysis suggests that drug resistance, especially to streptomycin and rifampicin, is a driver of transmission success in Beijing lineage isolates. This association was robust to confounding by geographical and genetic structure of the sampled isolates. This indicates that the emergence of drug resistance in the Beijing lineage is not measurably counteracted by a fitness cost, a finding that brings additional support to the need for susceptibility test-driven anti-tuberculosis therapy in areas where this lineage is prevalent (Merker et al., 2018).

| CHROMOSOMAL FLUOROQUINOLONE RE S IS TAN CE IN MULTIDRUG -RE S IS TANT T YPHOID FE VER
Typhoid caused by Salmonella Typhi is endemic in many countries and the emergence of MDR lineages further amplifies its burden.
The epidemic of MDR typhoid is mainly driven by the dominant   (Wirth, 2015;Wong et al., 2015). The accumulation of resistance determinants in the 1970s led to the emergence of strains resisting all first-line drugs including ampicillin, chloramphenicol and trimethoprim-sulfamethoxazole. This first wave of resistance was followed by the acquisition of chromosomal mutations conferring additional resistance to fluoroquinolones such as ciprofloxacin that have been increasingly used to treat typhoid since the 1990s (Menezes, Harish, Khan, Goessens, & Hays, 2012).
Understanding and quantifying the fitness impact ( In this context, we illustrate the use of THD to address the question of whether fluoroquinolone resistance has a measurable impact on pathogen success at population scale. We used a genomic dataset generated from an international study of H58 and non-H58 S. Typhi (Wong et al., 2015). The dataset is comprised of 1,832 whole-genome sequences of isolates from 68 countries, mainly in Asia and Africa. The genomic correlate of chromosomal fluoroquinolone resistance was the presence of FQ-R-associated mutations gyrA, gyrB, parC, or parE. Of note, non-chromosomal fluoroquinolone resistance conferred by plasmid-borne qnr genes was exceptional in the collection (n = 7). THD estimates were computed from the matrix of pairwise SNP distances using a timescale of 20y, an effective genome size of 4.6 × 10 6 bp and an average per-nucleotide substitution rate of 1.4 × 10 −7 year −1 as estimated previously using Bayesian coalescent-based analysis (Wong et al., 2015).
This pattern is compatible with the hypothesis that, at the population scale, chromosomal fluoroquinolone resistance entails a measurable fitness cost in both H58 and non-H58 isolates, but that widespread fluoroquinolone use sustains a high prevalence of resistance. To challenge this hypothesis while illustrating the use of THD in more complex applications, we constructed additional models controlling for population structure and geographic distribution. In our setting, the geographic origin can capture potential differences in fluoroquinolone use between regions, which might also correlate with the distribution of fluoroquinolone resistance, H58 prevalence and its transmission success (Figure 3b). To control for confounding by geographic origin, isolates were assigned to 6 regions whose areas were arbitrarily defined to avoid sample sizes <30. Details of the distribution of isolates in countries and regions can be found in the Table S3. The region of origin was included as a random effect in the controlled model.

| DRIVER S OF EPIDEMI C SUCCE SS IN COMMUNIT Y-ACQUIRED ME THICILLIN -RE S IS TANT S . AUREUS
In the late 1990s and early 2000s, an epidemic of methicillinresistant S. aureus (MRSA) stormed North America, driven by a highly clonal, dominant lineage, USA300, that belongs to the multilocus sequence type (ST) 8 (Diep et al., 2006). The USA300 lineage is thought to have diverged from ancestral methicillin- The ACME element, whose presence enhanced the virulence and persistence of S. aureus in animal models of skin infection (Thurlow et al., 2013), may have contributed to the successful spread of the USA300-NA variant. A few years after their emergence, USA300 isolates evolved chromosomal mutations conferring resistance to fluoroquinolones. In Bayesian coalescent-based demographic analyses, the successive acquisition of ACME and fluoroquinolone resistance coincided with two phases of expansion of the effective population size of USA300 (Glaser et al., 2016). These findings suggested that both acquisition events contributed additively to the success of USA300-NA. However, the associations of ACME and fluoroquinolone resistance acquisition with population expansion phases were coincidental in nature and do not provide a definitive evidence of a role of these traits in the epidemic success of USA300 relative to its ST8 siblings.
Interestingly, the role of fluoroquinolone resistance in the success of MRSA is supported by several lines of evidence. Fluoroquinolone resistance, which has emerged repeatedly in the history of USA300 (Glaser et al., 2016), has also been associated with the expansion of other MRSA lineages in other geographic regions (McAdam et al., 2012). Moreover, limiting prescription of fluoroquinolones was shown in independent studies to correlate with a decrease in MRSA prevalence (Charbonneau et al., 2006;Parienti et al., 2011).
Finally, fluoroquinolone concentrations as small as 1/100 of the F I G U R E 3 Distribution of the THD success index in Salmonella Typhi H58 and non-H58 isolates depending on chromosomal fluoroquinolone resistance (a) and geographic origin (b). (c) Genetic principal coordinate plots illustrate lack of overlap between higher THD values (left column) and fluoroquinolone resistance (right column) both in H58 and non-H58 lineages (upper and lower rows, respectively). ***p < .001, Mann-Whitney U test wild-type minimum inhibitory concentration allowed a FQ-R variant to outcompete its susceptible progenitor in co-culture experiments (Gustave et al., 2018).
Compared with fluoroquinolone resistance, the role of ACME in the success of USA300 is less supported. The role of ACME in the virulence of infection was not consistent in animal models (Montgomery, Boyle-Vavra, & Daum, 2009), and an epidemiological link of ACME with more severe disease in humans is still lacking. ACME acquisition coincided with the emergence of USA300-NA, but this acquisition was a unique event and ACME is very seldom found in other S. aureus lineages. Finally, phylogenetic analyses have shown that ACME was lost on several occasions by USA300 isolates (Jamrozy et al., 2016).
In this context, we leveraged the THD framework to examine the relative contributions of ACME and fluoroquinolone resistance to the epidemic success of USA300 and other representatives of the ST8 lineage. We used the genome sequences of 498 ST8 isolates from the United States and France that were previously analyzed by our group to examine the emergence of USA300 from the ST8 genetic background (Glaser et al., 2016). Chromosomal fluoroquinolone resistance was inferred based on mutations in gyrA and parC, and the presence of ACME was determined based on sequence alignment.
THD estimates were computed from core-genome SNP distances using a timescale of 20y, an effective genome size of 2.7 × 10 6 bp, and an average per-nucleotide substitution rate of 1.3 × 10 −6 year −1 as determined previously using Bayesian coalescent-based analysis (Glaser et al., 2016).
The median THD value was 1.6× higher in the 385 (77.3%) ACME-positive ST8 isolates compared to the 113 (22.7%) other ST8 isolates (2.7 × 10 −3 versus 1.5 × 10 −3 , respectively, p < 10 -6 , Mann-Whitney U test; Figure 4a). Interestingly, the median THD was very close in the 322 (64.7%) FQ-R and the 176 (35.3%) FQ-S isolates (2.8 × 10 −3 versus 2.6 × 10 −3 , respectively); however, the THD distribution in FQ-R isolates was stochastically greater (p < 10 -16 ). In separate subgroup analyses of ACME-positive and ACME-negative isolates, the THD distribution was stochastically greater in the FQ-R members of both groups (Figure 4a; p = 3.3 × 10 −8 in ACME-positive and p = 2.2 × 10 −8 in ACME-negative isolates). Using similar subgroup analyses in FQ-R and FQ-S isolates, the THD distribution of ACME-positive isolates was stochastically greater in the FQ-S group (p = 6.1 × 10 −6 ) but not in the FQ-R group (p =.46), suggesting that ACME positivity does not add to the success of FQ-R isolates. two geographic origins were present, we used a fixed effect rather than a random effect. Using a similar model-building procedure as F I G U R E 4 Distribution of THD success index in S. aureus ST8 isolates depending on the presence of ACME and of chromosomal fluoroquinolone resistance (FQ-R). (a), fluoroquinolone resistance is associated with increased THD values in ACME-negative and -positive isolates. (b) genetic principal coordinate plots of ST8 isolates in the two first PC planes identifies partial overlap between higher THD values (left column), ACME positivity (center column) and fluoroquinolone resistance (right column). ***p < .001, Mann-Whitney U test (a) (b) described for S.Typhi, we first identified the relevant genetic PCs predicting THD in a complete model (n=7 PCs with p<.01). In the final, controlled model with 7 PCs, fluoroquinolone resistance was predictive of higher THD values, with a coefficient of 6.1×10 −5 (95% CI, 4.4×10 −6 to 1.2×10 −4 , p=.03). The presence of ACME, however, did not predict THD (p=.25) and, of note, had a negative coefficient.
Collectively, these findings provide additional support for the hypothesis that fluoroquinolone resistance drives the success of MRSA lineages including USA300. We failed, however, to identify an independent contribution of ACME to the success of USA300.

| CON CLUS IONS
Our main goal here was to present applications of a fast and efficient technique to estimate the epidemic success of pathogens. As illustrated in the previous sections, the THD method provides researchers with a flexible index of epidemic success, which can be used in regression analysis in the same manner as a quantitative trait.
To resolve these questions with greater accuracy, density-based approaches such as THD bear the potential for being implemented in genome wide association studies (GWAS) (Chen & Shapiro, 2015;Jaillard et al., 2018;Power, Parkhill, & de Oliveira, 2017). Treated as a quantitative trait, THD might be associated with specific SNPs or with more complex underlying genetic architectures, as exemplified in a recent study of the impact of vancomycin resistance-associated SNPs on the success of the opportunistic pathogen Staphylococcus capitis . The widespread adoption of THD-based GWAS might help to widen the spectrum of the candidate genes or mutations linked to pathogen success, thus accompanying a shift from the current, paradigmatic antibiotic resistance-targeted GWAS approaches to more general questions pertaining to pathogen ecology and evolution (Farhat et al., 2019;Hicks, Carey, Yang, Zhao, & Fortune, 2019).

ACK N OWLED G EM ENTS
The authors thank J.T. Shapiro for fruitful discussion and editing support. This work was supported in part by FINOVI Foundation Grant R18037CC to JPR, Institut Convergences Migrations and the French grant ANR-16-CE35-0009 to TW. The funder had no role in study design, data collection and analysis, manuscript preparation, or decision to publish.

CO N FLI C T O F I NTE R E S T
None declared.

DATA AVA I L A B I L I T Y S TAT E M E N T
The data that support the findings of this study are available from the corresponding author upon reasonable request.