Multiple plasmid origin‐of‐transfer regions might aid the spread of antimicrobial resistance to human pathogens

Abstract Antimicrobial resistance poses a great danger to humanity, in part due to the widespread horizontal gene transfer of plasmids via conjugation. Modeling of plasmid transfer is essential to uncovering the fundamentals of resistance transfer and for the development of predictive measures to limit the spread of resistance. However, a major limitation in the current understanding of plasmids is the incomplete characterization of the conjugative DNA transfer mechanisms, which conceals the actual potential for plasmid transfer in nature. Here, we consider that the plasmid‐borne origin‐of‐transfer substrates encode specific DNA structural properties that can facilitate finding these regions in large datasets and develop a DNA structure‐based alignment procedure for typing the transfer substrates that outperforms sequence‐based approaches. Thousands of putative DNA transfer substrates are identified, showing that plasmid mobility can be twofold higher and span almost twofold more host species than is currently known. Over half of all putative mobile plasmids contain the means for mobilization by conjugation systems belonging to different mobility groups, which can hypothetically link previously confined host ranges across ecological habitats into a robust plasmid transfer network. This hypothetical network is found to facilitate the transfer of antimicrobial resistance from environmental genetic reservoirs to human pathogens, which might be an important driver of the observed rapid resistance development in humans and thus an important point of focus for future prevention measures.

Consequently, the interaction between conjugative relaxase enzymes and their DNA origin-of-transfer (oriT) substrates facilitates the majority of all AMR transfers in nature (Alekshun & Levy, 2007; Wintersdorff et al., 2016) and is especially important for those related to human infection complications (San Millan, 2018).
The standard approach for characterization of plasmid mobility involves the classification of conjugation and mobilization genes, especially typing of relaxase enzymes into the respective mobility (Mob) groups (Garcillán-Barcia et al., 2009; Garcillán-Barcia et al., 2020). However, besides the possibility of yet unidentified enzymes and mobility groups (Coluzzi et al., 2017; Garcillán-Barcia et al., 2009; Guzmán-Herrador & Llosa, 2019; Ramachandran et al., 2017; Soler et al., 2019; Wisniewski et al., 2016), multiple new processes have recently been uncovered that might confer additional mobility to plasmids and involve the oriT substrate. These include (a) broadened relaxase-binding specificities to multiple different oriT sequence variants (Chen et al., 2007; Fernández-González et al., 2016; Fernández-López et al., 2013; Jandle & Meyer, 2006; Kishida et al., 2017), which, according to the evolutionary theory of such DNA regions (Becker & Meyer, 2003; Parker et al., 2005; Zrimec & Lapanje, 2018), indicates the possibility of plasmids carrying multiple functional secondary oriTs, and (b) trans-mobilization of plasmids carrying oriTs triggered by relaxases from co-resident plasmids acting in trans on the non-cognate oriTs (Moran & Hall, 2019; O'Brien et al., 2015; Pollet et al., 2016; Ramsay & Firth, 2017). The latter mechanism demonstrates that oriT regions are the only elements of the conjugation machinery required in cis (Guzmán-Herrador & Llosa, 2019) and suggests that many plasmids classified as non-mobile due to the absence of putative relaxases may be mobilizable (Ramsay & Firth, 2017). However, although typing the oriT enzymatic substrates instead of the genetic scaffolds might present improvements to the current understanding of plasmid mobility, no systematic studies of oriTs across sequenced plasmids have yet been performed, likely due to the lack of available data and tools that would enable such oriT typing.
A major problem with uncovering oriT regions is that, apart from being experimentally laborious, it is computationally challenging due to multiple molecular mechanisms and a variety of DNA sequence elements present and coevolving in the DNA substrate (Zrimec & Lapanje, 2018), even among plasmids belonging to a single species such as Staphylococcus aureus (O'Brien et al., 2015). The oriT region contains recognition and binding sites for the relaxase enzyme as well as accessory proteins that help to initiate mobilization. These sites include inverted repeats and hairpins (Frost et al., 1994; Sut et al., 2009; Williams & Schildbach, 2007) as well as the nicking site nic, where relaxase cleaves the DNA to initiate plasmid transfer (Frost et al., 1994). They are characterized by specific DNA physicochemical and conformational features that underpin key protein-DNA readout and activity mechanisms (Kolomeisky, 2011; Rohs et al., 2009, 2010; Zrimec & Lapanje, 2015, 2018) as well as define conserved niches of structural variants that enable good resolution between Mob groups and subgroups (Zrimec & Lapanje, 2018). OriT typing thus requires algorithms beyond simple sequence-based alignment (Altschul et al., 1990; Li et al., 2018) that can recognize and process the more complex molecular motifs and underlying DNA physicochemical and conformational (i.e., structural) features.
The use of DNA structural representations has indeed led to improvements in algorithms for the identification of other regulatory regions, such as promoters and replication origins (Abeel et al., 2008; Bansal et al., 2014; Chen et al., 2012; Dao et al., 2018; Samee et al., 2019). Despite this, instead of using tools that probe the actual relaxase-oriT interaction potential by identifying molecular properties that are the basis of such interactions, conventional approaches for oriT analysis still rely on sequence alignment-based methods (Altschul et al., 1990; Li et al., 2018; O'Brien et al., 2015).
Here, we prototype a DNA structure-based alignment algorithm for finding oriT variants, which enables both locating and Mob-typing oriT regions across thousands of sequenced plasmids. Since the newly uncovered oriT variants can facilitate plasmid transfer both in cis and in trans, we use them to re-analyze the number of putative mobile plasmids and of putative mobile plasmid-carrying host species.
We then evaluate if and how the uncovered fraction of oriTs might help to overcome the known barriers to horizontal gene transfer, by reconstructing and analyzing a hypothetical network of potential AMR transfers between different species and habitats, especially those from the environmental reservoir to the human microbiota.

| M1. Datasets used for alignments
The full query dataset with known nic sites comprised 112 distinct oriT regions from 118 plasmids, where a single oriT sequence was selected to represent oriTs with sequence similarity below 15%, and 6 Mob groups {F,P,Q,V,C,T} (Table A1-1, Dataset S1: https://github.com/JanZrimec/oriT-Strast/blob/master/data/Dataset_S1.csv). For initial development and testing of the structural alignment algorithm, due to the lack of a sufficient number of elements from Mob groups C, H, and T for correct testing (below 10 elements per group), a 4 Mob group {F,P,Q,V} version of the query dataset with 106 elements was used (Table A1-1). The balanced dataset from 4 Mob groups {F,P,Q,V} used for s-distance testing was a subset of the query dataset containing approx. 16 elements from each Mob group (Zrimec & Lapanje, 2018). Negative (non-oriT) sequences were selected randomly from a region 200 to 800 bp upstream and downstream from experimental nic sites, thus containing different non-oriT coding and non-coding regions with low sequence similarity (p-distance >0.6). The testing datasets included (a) 51 plasmids with known oriT locations and Mob groups but unknown nic sites (Dataset S2: https://github.com/JanZrimec/oriT-Strast/blob/master/data/Dataset_S2.csv) and (b) 13 plasmids with 14 experimentally determined nic/oriT sites but unknown Mob groups, obtained from the OriTDB database (Li et al., 2018; Table A1-3).

| M2. Development and testing of alignment algorithms
A DNA structure-based alignment algorithm, termed Strast, was developed and tested. The algorithm: (a) takes as input a set of query and target DNA sequences, (b) encodes the input query and target DNA sequences into structural representations (Figure A1-1b), and (c) finds and returns the segments of target sequences most similar to the query sequences based on a structural distance measure (s-distance, Figure A1-1c: algorithm pseudocode). The practical implementation of the algorithm uses precomputed parameters for structurally encoding the DNA sequences as well as a precomputed distance matrix for computing the s-distance function.
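Step (c) above can be illustrated with a minimal Python sketch of the search alone. It assumes that the sequences are already encoded as lists of integer s-mer cluster indices, that a precomputed distance matrix is available, and that the query is compared to the target without gaps. The function name `best_hit` and the toy matrix values are hypothetical and are not taken from the Strast code.

```python
def best_hit(query, target, dmat):
    """Return (offset, s_distance) of the target window most similar to the query.

    query, target: lists of integer cluster indices (structurally encoded DNA).
    dmat: precomputed matrix, dmat[a][b] = distance between clusters a and b.
    Assumption: gapless comparison of the query against every target window.
    """
    n = len(query)
    if n == 0 or n > len(target):
        raise ValueError("query must be non-empty and no longer than target")
    best_offset, best_score = -1, float("inf")
    # Slide the query along the target and score every window.
    for offset in range(len(target) - n + 1):
        window = target[offset:offset + n]
        score = sum(dmat[q][t] for q, t in zip(query, window))
        if score < best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score


# Toy 3-cluster distance matrix (hypothetical values, for illustration only).
toy_dmat = [[0.0, 1.0, 2.0],
            [1.0, 0.0, 1.0],
            [2.0, 1.0, 0.0]]
hit = best_hit([1, 2], [0, 0, 1, 2, 0], toy_dmat)
```

In this toy case the query matches the target exactly at offset 2, so the lowest-scoring window has a distance of zero.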
To compute DNA structural representations, 64 models of physicochemical and conformational DNA properties important for protein-DNA interactions, such as those occurring in oriT regions, were compiled (Table A1-3). Next, to obtain the precomputed parameters for structurally defined groups of k-mers, termed s-mers (Figure A1-1b), structural properties of all permutations of k-mers of size s = 7 bp (a central nucleotide with 3 neighboring nucleotides on each side) were computed, after which dimensionality reduction and clustering were performed.
Dimensionality reduction was performed using principal component analysis (PCA), retaining the first 18 of 64 principal components, which captured over 99% of the data variance. The k-means clustering algorithm was used (MATLAB), where the number of clusters k was 128, and clusters with the lowest total sum of distances were chosen from 10 runs of up to 1000 iterations at default settings. The s-mer size s and number of clusters k were chosen by comparing the algorithm performance using s = {3, 5, 7, 9} and k = {4, 8, 16, 32, 128, 256} (Zrimec, 2020), respectively (Figure A1-7). Finally, the structural representation of a DNA sequence is obtained by encoding its k-mers into s-mers (Figure A1-1b), where the length of the structural representation equals the length of the nucleotide sequence minus the leftover nucleotides at the borders (3 bp at each border for s = 7) due to the neighboring nucleotides in s-mers.
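The final encoding step can be sketched as a sliding-window lookup. The snippet below is an illustration only: the real k-mer to cluster table results from the 64 structural models, PCA, and k-means clustering described above, whereas `toy_cluster_table` is a hypothetical stand-in that assigns cluster indices arbitrarily so that the example is self-contained.

```python
from itertools import product

S = 7             # s-mer size reported in the text
N_CLUSTERS = 128  # number of k-means clusters reported in the text


def toy_cluster_table(s=S, n_clusters=N_CLUSTERS):
    """Hypothetical stand-in for the precomputed k-mer -> cluster table.

    Cluster indices are assigned arbitrarily here; they carry no structural meaning.
    """
    table = {}
    for i, kmer in enumerate(product("ACGT", repeat=s)):
        table["".join(kmer)] = i % n_clusters
    return table


def encode(seq, table, s=S):
    """Encode a DNA sequence as the cluster indices of its overlapping k-mers."""
    seq = seq.upper()
    return [table[seq[i:i + s]] for i in range(len(seq) - s + 1)]


table = toy_cluster_table()
encoded = encode("ACGT" * 10, table)  # 40 bp -> 34 positions for s = 7
```

With s = 7, a 40 bp sequence yields 34 encoded positions, since 3 bp are lost at each border.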
The s-distance between two DNA substrates was the sum of the distances between the cluster centroids of the s-mers at equal positions in both substrates,

$$ s\text{-distance} = \sum_{i} d(C_{1i}, C_{2i}), $$

where $C_{ni} = (c_{n1}, c_{n2}, \ldots, c_{nk})$ are the cluster centroids of the s-mer at position $i$ of the first ($n = 1$) and second ($n = 2$) sequences, respectively, and $d$ is the Euclidean distance (Figure A1-1c). For algorithmic efficiency, the distances between all s-mers were precomputed and stored in a distance matrix. The p-distance was equal to the Hamming distance corrected for sequence length. The Jaccard distance between two DNA sequences was defined as the intersection over the union of sets of either their unique k-mers, with nucleotide sequence representation, or s-mers, with structural representation, respectively.
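The distance measures can be sketched in a few lines of Python. This is a sketch under stated assumptions: the s-distance is taken as the sum of Euclidean distances between cluster centroids at equal positions (whether the original implementation squares the distances is not recoverable from the text), and the centroid coordinates below are hypothetical toy values.

```python
import math


def distance_matrix(centroids):
    """Precompute Euclidean distances between all pairs of cluster centroids."""
    n = len(centroids)
    return [[math.dist(centroids[a], centroids[b]) for b in range(n)]
            for a in range(n)]


def s_distance(enc1, enc2, dmat):
    """Sum of centroid distances over equally positioned s-mers (assumed form)."""
    if len(enc1) != len(enc2):
        raise ValueError("encoded sequences must have equal length")
    return sum(dmat[a][b] for a, b in zip(enc1, enc2))


def p_distance(seq1, seq2):
    """Hamming distance divided by sequence length."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return mismatches / len(seq1)


# Hypothetical 2-D centroids for three clusters (the paper uses 18 dimensions).
toy_centroids = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
toy_dmat = distance_matrix(toy_centroids)
example_s = s_distance([0, 1, 2], [0, 0, 0], toy_dmat)  # 0 + 5 + 10
example_p = p_distance("ACGT", "ACGA")                  # 1 mismatch in 4 bp
```

Precomputing the matrix once means that each window comparison during alignment reduces to table lookups and a sum.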
The performance of the alignment algorithm for typing oriTs in target sequences was tested by evaluating the correctness of both (a) oriT and nic location finding to within ±1 bp (Francia et al., 2004) and (b) Mob group typing. For comparison, Blast (Altschul et al., 1990; www.ncbi.com) was used with default settings (word size = 11, expectation threshold = 10, nucleic match/mismatch score = 2/−3, gap opening/extension costs = 5/2), where the same query and target data as with Strast were used to obtain alignment hits. The specific capability of Strast for locating oriT and nic regions was compared against the tool OriTfinder (Li et al., 2018; https://bioinfo-mml.sjtu.edu.cn/oriTfinder/), where the web-based version was used with default settings (Blast E-value = 0.01) by uploading fasta files of the target sequences and relying on the built-in query sequences.

| M3. Statistical analysis and machine learning metrics
The F-test was performed using PERMANOVA (Anderson, 2001) with sequence bootstraps. The statistical significance of s-distance scores was evaluated using permutational tests, where bootstrap resampling (n_bootstraps = 1e6 per sequence) of randomly selected query oriT sequences (n_seq = 10) was used to estimate the s-distance scores at different p-value cutoffs (from 1e-6 to 1e-1). Next, to obtain a mapping function of s-distance to permutational p-values in the whole range of 1e-132 to 1e-1 (Figure A1-1d), least-squares curve fitting to a second-order polynomial function was performed, where the theoretical limit of ~1e-132 was set to correspond to an s-distance of 0. For additional statistical hypothesis testing, the Python package Scipy v1.1.0 was used with default settings.
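As an illustration of how such a mapping could be applied, the sketch below evaluates the second-order polynomial with the coefficients reported in the caption of Figure A1 (d). It assumes that the fitted function returns log10 of the p-value, which is inferred from the intercept of −133 at an s-distance of 0 and is not stated explicitly in the text. The function names are hypothetical.

```python
import math

# Coefficients reported in the caption of Figure A1 (d).
P0, P1, P2 = -3.045e-05, 0.128, -133.000


def log10_p_value(s_distance):
    """Map an s-distance to log10(p-value), assuming f = log10(p)."""
    return P0 * s_distance ** 2 + P1 * s_distance + P2


def is_significant(s_distance, cutoff=1e-13):
    """Check whether a hit passes a p-value cutoff under the assumed mapping."""
    return log10_p_value(s_distance) <= math.log10(cutoff)


perfect_match = log10_p_value(0.0)  # intercept of the fitted polynomial
```

Working in log10 space avoids floating-point underflow for very small p-values.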
The following machine learning performance metrics were used to assess alignment algorithm performance: Precision, Recall/Sensitivity, Specificity, Accuracy, F1-score, and Matthews correlation coefficient (Table A1-6). To calculate these metrics, true- and false-positive and true- and false-negative counts were obtained from the alignment tests (Methods M2) by considering only the most significant hit per alignment. A true- or false-positive value was assigned if the result was above a specified significance cutoff and corresponded or did not correspond, respectively, to the known value (nic location, Mob group, or subgroup), and alternatively, a false- or true-negative value was assigned to results below the significance cutoff that corresponded or did not correspond, respectively, to the known value.
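For reference, the listed metrics can be computed from the four counts using their standard definitions, as in the following sketch. The function name and the example counts are hypothetical and do not correspond to results in this study.

```python
import math


def alignment_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    def ratio(num, den):
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),  # also called sensitivity
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts, for illustration only.
metrics = alignment_metrics(tp=8, fp=2, tn=7, fn=3)
```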

| M5. Simulations of plasmid mobility
To estimate the results that would be obtained with a larger oriT query dataset, the following procedure was applied. The oriT alignment results with the dataset of 4602 target plasmids were diluted according to 10-fold dilutions of the 102 query regions used to identify the hits (10 repetitions were used). Least-squares curve fitting was performed (Python package Scipy v1.1.0) using a linear function and the dataset dilutions, specifically between the size of the query oriT dataset and the variables corresponding to the numbers of oriT hits, putative mobile plasmids, putative mobile plasmid-carrying host species, and overlap with relaxase-typed plasmids.
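The extrapolation idea can be sketched with an ordinary least-squares line fit. The snippet uses a closed-form fit in plain Python instead of Scipy, and both the query dataset sizes and the response values are hypothetical numbers chosen only to make the example verifiable. They are not results from this study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def query_size_for(target_value, slope, intercept):
    """Invert the fitted line: query dataset size needed to reach a target value."""
    return (target_value - intercept) / slope


# Hypothetical diluted query dataset sizes and a response variable (y = 2x + 1).
sizes = [10, 20, 51, 102]
values = [21, 41, 103, 205]
slope, intercept = fit_line(sizes, values)
needed = query_size_for(401, slope, intercept)
```

Inverting the fitted line in this way is one means of estimating how many query oriTs would be needed to reach a given target, such as a full overlap with relaxase-typed plasmids.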

| M6. Network analysis
To study the co-occurrence of different oriT regions or Mob groups as nodes, shared across the putative multi-oriT plasmids as edges, an undirected multi-edged graph was constructed. The graph contained a total of 79,004 connections and the number of unique oriT nodes was 102, since each oriT hit was characterized by its closest-associated query oriT.
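The construction of such a co-occurrence graph can be sketched as edge counting over pairs of hits carried by the same plasmid. The sketch uses plain Python counters rather than NetworkX, and it assumes that every pair of hits on a plasmid contributes one edge, including pairs with the same label. The plasmid names and labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(plasmid_hits):
    """Count undirected edges between labels of hits carried by the same plasmid.

    plasmid_hits: dict mapping a plasmid name to a list of hit labels
    (closest-associated query oriT or Mob group of each hit).
    """
    edges = Counter()
    for hits in plasmid_hits.values():
        for a, b in combinations(hits, 2):
            # Sort the pair so that (a, b) and (b, a) count as the same edge.
            edges[tuple(sorted((a, b)))] += 1
    return edges


# Hypothetical multi-oriT plasmids labeled by Mob group.
toy_hits = {"pA": ["Q", "Q", "F"], "pB": ["Q", "V"]}
edges = cooccurrence_edges(toy_hits)
```

The edge multiplicities stored in the counter play the role of the parallel edges in a multi-edged graph.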
To study the potential for plasmid transfer between different habitats, host species of the oriT alignment results within the subset of multi-oriT plasmids were mapped across 9 habitat supertypes (Table A3-4) according to published data on environmental (Pignatelli et al., 2009) and human microbiomes (Dewhirst et al., 2010;Escapa et al., 2018;Forster et al., 2016;Human Microbiome Project Consortium, 2012;Lloyd-Price et al., 2017). This retained 43% (227 of 532) of the unique species carrying multi-oriT plasmids, where habitat sizes reflected those of the full habitat dataset (according to the number of unique species, on average 939 species) but were on average eightfold smaller (on average 119 species) varying less than 22% around this value. The habitat taxonomy was further expanded to include human commensal and pathogen types (Human Microbiome Project Consortium, 2012) as well as tissue subtypes (Pignatelli et al., 2009). Next, a directed graph representation of habitat nodes connected by potential plasmid transfers as edges was constructed, where habitats of donor hosts carrying the putative mobile plasmids (outbound connections) connected to habitats of potential acceptor hosts deduced from the query oriTs (inbound connections). The network comprised 141,395 connected habitat node pairs, with a total of 1,600,978 plasmid connections between the habitats.
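The directed habitat graph can be sketched in the same way. The snippet assumes that each oriT hit provides a donor species (the host of the plasmid carrying the hit) and a potential acceptor species (the host associated with the closest query oriT), and that species without a habitat annotation are skipped. The species and habitat names are hypothetical.

```python
from collections import Counter


def habitat_transfer_edges(hits, habitats_of):
    """Count directed habitat-to-habitat edges implied by donor/acceptor pairs.

    hits: iterable of (donor_species, acceptor_species) tuples.
    habitats_of: dict mapping a species to the habitats it was found in.
    """
    edges = Counter()
    for donor, acceptor in hits:
        for src in habitats_of.get(donor, ()):
            for dst in habitats_of.get(acceptor, ()):
                edges[(src, dst)] += 1  # outbound from src, inbound to dst
    return edges


# Hypothetical species-to-habitat mapping and oriT hits.
toy_habitats = {"species_1": ["soil"], "species_2": ["human gut", "water"]}
toy_pairs = [("species_1", "species_2"),
             ("species_1", "species_2"),
             ("species_2", "species_1"),
             ("species_3", "species_1")]  # species_3 has no habitat annotation
habitat_edges = habitat_transfer_edges(toy_pairs, toy_habitats)
```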
For network analysis, the Python package NetworkX v2.2 was used. For typing antimicrobial resistance genes in the plasmids, the webserver version of ResFinder v3.2 (Zankari et al., 2012) was used with default settings.

| Structural alignment algorithm improves oriT typing performance
A DNA alignment algorithm performs multiple comparisons between a query and a target sequence by evaluating a distance function. We thus developed a structural distance function, the s-distance (Methods M2). The benefit of using longer oriT regions with this measure was corroborated with the Jaccard distance, which significantly (ranksum p < 1e-16) decreased by 40% with increasing oriT region size when using structurally encoded k-mers (Methods M2, Figure A1-2), whereas it increased with nucleotide k-mers (ranksum p < 1e-9). The results suggested that our structural encoding approach leveraged the chemical information in longer query regions and could thus improve multiple sequence comparisons with alignments by increasing the statistical depth (Figure A1-3).
We next prototyped an alignment framework (Figure 1d) that employed the s-distance measure to find target hits to query oriTs (Table A1-1, Methods M1). The algorithm's performance was first tested by assessing the oriT location and Mob type of the highest-scoring alignment hits using the testing dataset of Mob-typed plasmids (Methods M2). By using full-length 220 bp query regions, on average, 19% more significant (permutation test p < 1e-13) oriT hits were recovered, and 25% more Mob groups were correctly predicted compared to using a 40 bp query size (Figure 1e, Figure A1-4). This corroborated that the use of longer queries indeed led to improved algorithm performance (Figure 1b,c). Furthermore, compared to Blast (Altschul et al., 1990), our approach uncovered on average 45% more significant (permutation test p < 1e-13) oriT hits and correctly predicted 30% more Mob groups.

Figure 1 (e) Comparison of the number of correctly identified elements between our algorithm (Strast) and Blast, and by using 220 or 40 bp oriT subsets, for oriT typing as well as discrimination of Mob groups and subgroups. Error bars denote 95% confidence intervals. (f) Comparison of machine learning performance metrics between our algorithm (Strast) and Blast, and by using 220 or 40 bp oriT subsets. Error bars denote 95% confidence intervals.

In the testing dataset with experimentally determined nic sites (Methods M1), our algorithm correctly identified (permutation test p < 1e-12) 6 oriT regions in 5 plasmids with 100% sequence identity and aligned to within ±1 bp of the nic sites (Francia et al., 2004; see Table A1-3). In contrast, the tool OriTfinder (Li et al., 2018) was able to correctly identify the approximate locations of 10 oriT regions; however, it correctly determined the nic locations in only 5 of these oriTs, to within ±1 bp (Table A1-4).
The results indicated that due to the lack of diversity in the query dataset our algorithm altogether missed certain oriTs in the testing datasets, which was also confirmed by using smaller query datasets that lowered the algorithm's performance especially for locating oriT regions ( Figure A1-6). Nevertheless, despite the limited oriT data availability, the results experimentally verified the algorithm's capacity for oriT typing.

| OriT typing reveals a twofold increase in the number of putative mobile plasmids
The structural alignment algorithm was used to explore the diversity of oriT regions in natural plasmids. To cover all available oriT regions, the query dataset was expanded to 112 unique oriTs from 6 Mob groups that included, besides oriTs from the major Mob groups F, P, Q, and V, also 3.6% and 0.9% of oriTs from groups C and T, respectively [...]. This also corresponded to a 1.4-fold increase in the number of distinct phyla, with putative mobile plasmids representing 19 out of the 23 phyla compared to 14 with relaxase typing (Figure A2-1).
Furthermore, out of the 907 plasmids where both oriTs and relaxases were identified, the same Mob group, indicating that the oriT was cognate to the relaxase, was identified in 75% of cases (Figure 2e). In the remaining 25% of these plasmids, the oriT hits could have been secondary oriTs (Becker & Meyer, 2003; Parker et al., 2005) or corresponded to either unknown or in trans acting (O'Brien et al., 2015) relaxases. The distribution of the oriT-identified Mob groups was found to be comparable to the one expected according to relaxase typing [...]. [...] starting from an initial value of 500 with 10 oriT queries, and 250 more putative mobile plasmid-carrying host species were uncovered (Figure A2-4c). Additionally, to achieve a full overlap with the relaxase-typed plasmids, a considerably larger query dataset than is currently available would be required, comprising 415 oriTs (95% lower and upper bounds were 328 and 532, respectively, Figure A2-4d). The demonstrated limitations of the query data suggest that the present published results (Shintani et al., 2015) and our findings might still be an underestimation of the true plasmid mobility present in nature.

| The presence of multiple putative oriT regions might aid plasmid transfer between habitats
A large part of the newly uncovered oriTs were additional regions to the primary ones that corresponded to the plasmid cognate relaxases (Figure 2f), resulting in 1331 multi-oriT plasmids (54% of the putative mobile plasmids) that carried on average 5 oriT hits (Figure A3-1). First, we analyzed the co-occurrence network between the different putative oriT regions (nodes), when they were carried by the same multi-oriT plasmids (edges; Figure 3a, Methods M6). Since each oriT hit was characterized only by its closest-associated query oriT, the actual oriT node diversity was limited to the 102 query oriTs (Figure A3-2). Indeed, specific oriT regions acted as hubs and co-occurred with multiple other regions across the Mob groups (Figure 3a), with the most highly connected pNL1-, BNC1 Plasmid 1-, and pBBR1-like oriTs from Mob F, Q, and V, respectively, co-occurring with over 50 unique oriTs from all 6 Mob groups (Table A3-1).
We next investigated the co-occurrence of Mob groups (Figure 3c, Methods M6) and measured a 75-fold increase in the number of Mob group co-occurrences compared to relaxase typing. Over 90% of the multi-oriT plasmids contained on average 2 unique Mob groups and 3 unique Mob subgroups (Figure A3-1). The most frequently co-occurring Mob groups were F, Q, and V, where 35,062 co-occurrences were measured within Mob Q, 15,081 between Q and F, and 12,637 between Q and V (Figure 3c, Table A3-2), with the main co-occurring subgroups being Mob Qu with Q2, Fu, and V2 (Table A3-3).
The above results suggested that each multi-oriT plasmid might contain the initial means for mobilization by conjugation systems belonging to different Mob groups, and could, under specific con- [...] bacteria can act as an interface for horizontal uptake of genes from the environment, which they might then disseminate to the pathogens within the human body (Forsberg et al., 2012; Marshall et al., 2009; Sommer et al., 2010). [...] (Hu et al., 2016; Pal et al., 2016).
Compared to established tools like OriTfinder, our method performs similarly, though with some complementarity (Tables A1-3 and A1-4), suggesting that it is a useful complement to the existing methods. However, its main advantage is the capability to determine Mob groups from mere oriT regions (accuracy >90%, Table A1-2) without the requirement for relaxase typing, which also enables typing oriTs in plasmids without a (known) relaxase [...].

Plasmids are vehicles for the transfer and long-term storage of 'common goods' that include, besides AMR, also virulence, heavy metal resistance, and other genes (Bukowski et al., 2019). Based on the usefulness of this cargo, one can expect that the global plasmid transfer network possesses at least some properties of a robust fault-tolerant system that would increase the guarantee for transfer and information storage (Gillings, 2013; Han et al., 2007). The putative oriT network topology via the closest-associated query oriTs is reminiscent of scale-free and even hierarchical networks (Figure 3a) and thus displays robust fault-tolerant properties. As sparsely connected nodes without many direct neighbors are linked to highly connected hubs, even if a large number of nodes is absent, the remaining ones are likely still well connected (Barabási & Oltvai, 2004; Seyed-Allaei et al., 2006). Moreover, plasmids bearing multiple putative oriTs that could be mobilized by different conjugative systems (Figure 3c) possess at least the initial means that could enable them to transcend some of the horizontal transfer barriers (Gillings, 2013; Haaber et al., 2017; Siefert, 2009). In this view, one can hypothesize that certain conjugative transfer mechanisms and their corresponding hosts might act as transfer hubs that help to ensure the flow of genetic information among the different global microbiomes (Manaia, 2017; Perry & Wright, 2013; Tamminen et al., 2012).
Interestingly, following these expected properties, the amount of identified plasmid-borne AMR genes is found to be proportional to the assessment of putative plasmid mobility ( Figure 3f and Figure A3-6).
Care should be taken with interpretation of the hypothetical network of plasmid transfers between different hosts and ecological habitats (Figure 3d), due to key limitations in its analysis. Actual plasmid transfer is not dependent merely on the correct combinations of oriT and relaxase but constrained by additional genetic context due to plasmids being highly modular systems (Acman et al., 2020; Nishida, 2012; Shintani et al., 2015), which was not accounted for here. Nevertheless, the hypothetical network displays interesting properties, not at all different from those that can be expected based on current knowledge (Haaber et al., 2017; Lopatkin et al., 2017; Manaia, 2017; Marshall et al., 2009). For instance, the considerably larger influx of plasmids to humans and animals compared to other environmental habitats (Figure 3d) might be a consequence of the increased amount of AMR transfers to these organisms (Bengtsson-Palme et al., 2018; Dolejska & Papagiannitsis, 2018; Wintersdorff et al., 2016). In accordance with published findings (Forsberg et al., 2012; Forslund et al., 2013; Marshall et al., 2009), human commensals might act as the main interface for horizontal uptake of genes from the environment in general (Figure 3e), whereas the transfer of the specific widespread AMR genes might be more highly targeted at pathogens (Figure A3-6).
Despite the hypothetical nature of the network analysis based merely on first principles (Figure 3a,c,d), the increase in putative plasmid mobility that it shows could be an important driver of the observed rapid resistance development in humans (Dolejska & Papagiannitsis, 2018; Manaia, 2017) and thus an important point of focus for further research as well as the development of prevention measures.

CONFLICT OF INTEREST
None declared.

ETHICS STATEMENT
None required.

DATA AVAILABILITY STATEMENT
All data are provided in full in this paper, except for Datasets S1, S2, and S3 and the software and code, which are available on GitHub: https://github.com/JanZrimec/oriT-Strast, as well as the ac-

Figure A1
Overview of the structural alignment framework. (a) Depiction of the plasmid conjugation process, which can be divided into 4 steps: (i) formation of a conjugative pilus that connects the donor and recipient cells for transmission of mobile DNA, (ii) expression of enzymes (e.g., relaxase) and accessory proteins, which recognize the binding sites at the DNA origin of transfer (oriT), where plasmid transfer is initiated, (iii) relaxase cuts into the oriT at the nic site, exposes the single-stranded DNA and, with the help of the protein transport system, transfers DNA to the recipient cell, (iv) in the recipient, either the missing DNA strand is synthesized and then circularized, in case of plasmid transfer, or the mobile DNA is integrated into the chromosome by recombinant mechanisms, whereas in the donor cell reconstruction of the missing DNA occurs. (b) Depiction of the encoding of DNA into structural representations, where consecutive k-mers of the DNA (of length 7 bp) are encoded with clustered DNA structural property embeddings (marked s. dim. 1, s. dim. 2, …, s. dim. n; n = 18 such embeddings used) into a compressed structural representation termed 's-mers'. To compute the s-mers, 64 DNA structural properties were predicted for all permutations of nucleotide k-mers, after which principal component analysis (first 18 components with >99% of data variance were used) and clustering (number of clusters 128) were performed. (c) Pseudocode giving an outline of the sequence alignment framework, which allows the use of the s-distance measures between the target and query sequences. The s-distance is the Euclidean distance between all respective embeddings of two such structurally encoded DNA sequences. The algorithm takes as input a set of query and target sequences, and for each query and target sequence, encodes them into structural representations, and returns the regions in the target sequence with the lowest s-distance to the query sequence. 
(d) Mapping of s-distance scores to p-values obtained using permutational (bootstrap) tests, where bootstraps of the query oriT sequences were used to estimate p-values at cutoffs from 1e-6 to 1e-1. These points together with the theoretically predicted limit ~1e-132 were then used to fit a second-order polynomial function ($f = p_0 x^2 + p_1 x + p_2$; $p_0 = -3.045 \times 10^{-5}$, $p_1 = 0.128$, $p_2 = -133.000$).

Figure A2
Distribution of pairwise Jaccard distances between oriTs, using structurally encoded k-mers (Methods M2) or nonencoded nucleotide k-mers, for subsets of different oriT sizes.

Figure A3
Schematic diagram of the estimated maximum statistical depth achievable with different sequence lengths, where an over 1e100-fold difference is observed between the 40 bp and 220 bp sizes of oriT.

Figure A6
The effect of query dataset size on the performance of the structural alignment algorithm, where a diluted set of 48 elements was compared to the full query dataset.

Figure A7
The effect of s-mer size and number of clusters on the combined F1-score of locating oriTs and Mob typing with the structural alignment algorithm. The combination of s-mer size 7 and the number of clusters 2^7 (128) resulted in the best performance.

Figure A8
Distributions of phyla in the query dataset as well as in the target dataset obtained by relaxase and structural alignment-typing.

Figure A9
Correlation analysis between the sequence homology and structural similarities (s-distance) among oriT hits and their closest-associated query sequences. All p-values were below 1e-16.