Predicting S. aureus antimicrobial resistance with interpretable genomic space maps

Increasing antimicrobial resistance (AMR) represents a global healthcare threat. To decrease the spread of AMR and associated mortality, methods for rapid selection of optimal antibiotic treatment are urgently needed. Machine learning (ML) models based on genomic data to predict resistant phenotypes can serve as a fast screening tool prior to phenotypic testing. Nonetheless, many existing ML methods lack interpretability. Therefore, we present a methodology for visualization of sequence space and AMR prediction based on the non‐linear dimensionality reduction method – generative topographic mapping (GTM). This approach, applied to AMR data of >5000 S. aureus isolates retrieved from the PATRIC database, yielded GTM models with reasonable accuracy for all drugs (balanced accuracy values ≥0.75). The Generative Topographic Maps (GTMs) represent data in the form of illustrative maps of the genomic space and allow for antibiotic‐wise comparison of resistant phenotypes. The maps were also found to be useful for the analysis of genetic determinants responsible for drug resistance. Overall, the GTM‐based methodology is a useful tool for both the illustrative exploration of the genomic sequence space and AMR prediction.


| INTRODUCTION
The discovery of antibiotics marked a revolution in bacterial infection treatment, leading to a significant decrease in associated mortality and morbidity.However, the ability of pathogenic bacteria species to adapt and resist the treatment, on par with inappropriate practice of antibiotic usage, lead to an increase in the number of deaths from bacterial infections.According to recent research, more than 1 million deaths were attributed to antimicrobial resistance (AMR) in 2019 [1].One of the important measures that can be taken to slow down the spread of AMR is the optimization of antimicrobial drug use and stewardship [2].
To establish a diagnosis and prescribe proper treatment for patients with bacterial infection, clinicians usually rely on in vitro antimicrobial susceptibility testing (AST) [3].AST is considered to be a 'golden standard' for choosing optimal antibiotics to which the pathogen in question is susceptible [4].However, AST is a time-consuming and labour-intensive process [5].Hence, the development of rapid front-line diagnostics tools for better clinical decision-making is urgently needed.
Due to increased availability and rapidity, genotypic methods represent a cost-effective alternative to phenotypic AST methods.Genotypic methods consist in the sequencing of the pathogenic organism's genome and subsequent application of rule-or ML-based approaches for resistance predictions.The rule-based approaches predict AMR based on preliminary knowledge of AMR genes or resistance-inducing mutation patterns.Hence, the performance of such approaches is especially good for well-studied organisms [6].Machine learning (ML) methods, on the other hand, correlate genomic features with phenotype; they are trained on raw genomic data associated with experimentally measured AMR phenotypes.In recent years, numerous approaches for MLbased antibiogram predictions were suggested [7,8].Various ML algorithms including AdaBoost [9], CatBoost [10] and XGBoost [11], Random Forest (RF) [10,12], Support Vector Machine (SVM) [13], Neural Networks (NNs) [12,14] were applied to predict resistance/susceptibility labels or minimum inhibitory concentration (MIC) values of antimicrobial agents for bacterial isolates.The genomic data used as an input to these algorithms can be represented in numerous ways: as wholegenome sequences, genome contigs, AMR genes, largescale pangenome data, short sequencing reads, etc.It is usually encoded by the means of alignment-based and alignment-free methods, e. g. sets of single nucleotide polymorphisms (SNPs) and k-mers respectively [15].For a detailed description of the approaches, the reader is referred to the recent review by ref. [16].
Besides high precision, the interpretability of ML plays an important role in the incorporation and adoption of the algorithm into the clinical diagnostics pipeline.To address the issue of clinical utility, several approaches to interpretable ML methods for AMR prediction were suggested [17,18].For example, some of the current interpretability-focused approaches rely on rule-based learning algorithms that are combined with sample compression theory [17] or lasso-penalized logistic regression model [18].Numerous ML tools were also suggested for genomic data visualization and analytics, namely, dimensionality reduction techniques such as tdistributed stochastic neighbour embedding (t-SNE) [19], generative topographic mapping (GTM) [19], stochastic cluster embedding (SCE) [20], principal components analysis (PCA) [21,22], uniform manifold approximation and projection (UMAP) [21], topology data analysis methods [23,24], etc.These methods enable gaining insight on population structure via illustrative exploration of the genomic space.
Ideally, methods should provide both data visualization and performance guarantees.For this purpose, we suggest a methodology for the analysis of bacteria genomic data and predicting AMR with emphasis on the model's explainability and examination of markers of resistance.The methodology is based on the non-linear dimensionality reduction method -Generative Topographic Mapping (GTM) [25,26].Previously, the GTM was shown to be effective for the analysis of large-scale chemical [27] and biological data [19,28,29].Recently, GTM was found to be particularly effective in predicting HIV drug resistance, visualization of sequence space of HIV proteins and analysis of the HIV isolates' mutation patterns [28].
Herein, the GTM was applied for genomic space visualization and AMR predictions for Staphylococcus aureus (S. aureus), one of the six most pathogenic bacteria species ESKAPE (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa and Enterobacter spp.) recognized by WHO.The uniqueness of the herein presented GTM-based methodology lies in its simultaneous ability of visualization of genomic sequence space and prediction of antibiotic resistance.Albeit the performance of the GTM was somewhat lower as compared to other state-of-the-art ML methods, the GTM-based models appeared to be useful for predictions (BA values � 0.75 for all antibiotics) and illustrative analysis of S. aureus genomic space in the context of the resistance.

| Data acquisition and pre-processing
Pathosystems Resource Integration Center (PATRIC) database [30] was used to retrieve bacterial genomes of S. aureus with available AMR profiles, i. e. labelled as resistant or susceptible according to experimental minimal inhibitory concentration (MIC) breakpoints.A contigs-based representation of S. aureus genomes was downloaded from the PATRIC database.The final number of pre-processed S. aureus isolates was 5773, in association with 11 different antibiotics.Drug-specific subsets with highly imbalanced AMR records distribution (class ratio > 3.5) were subjected to random undersampling.The detailed pre-processing protocol is provided in Text S1, Supporting Information.The precise list of antibiotics covered in this study and the corresponding number of resistant and susceptible data points after undersampling are reported in Table S1, Supporting Information.

| Descriptors
Two commonly used approaches for genomic information representation are single nucleotide polymorphism (SNPs) matrices and k-mer profiling.The former, capturing alterations of base pairs in a DNA among collections of aligned sequences, is multiple sequence alignment (MSA)-dependent, is computationally costly and can be error-prone due to lateral gene transfer events frequently encountered in bacteria [17].The latter approach, being alignment-free, describes each genome by the subsequences of nucleotides of a defined length (k-mers) that are present within a given DNA sequence.In this way, sequences can be described by counting the number of times each k-mer occurs in a sequence or kmer presence/absence profile within the sequence, thus avoiding computationally expensive MSA [17].Furthermore, equivalent k-mers can be combined into "representative k-mers" denoted as unitigs.In this work, bacterial genomes were represented using alignment-free approaches, namely by computing k-mer counts and unitigs.The procedure used for the generation of unitigs [31] is explained in Text S2 and Figure S3, Supporting Information.

| Generation of k-mers
Given the lengths of bacterial genomes, the number of generated k-mers can soar up to tens of millions, depending on k-mer length.Such datasets with an extremely small number of samples relative to the number of features, commonly denoted as 'fat data', are reputedly difficult ML challenges [17].Therefore, an additional pre-processing procedure was applied to reduce the number of features.Namely, PCA and kernel PCA (kPCA) were used for feature space reduction.In more detail, k-mers (10-, 15-, and 30mers) were generated using two programs: dsk for 10-mers [32] and Kmer-db for 10-, 15-, and 30-mers [33].Herein, Kmer-db was used to calculate k-mer counts and estimate distances between sequences using different metrics (i.e. Jaccard, cosine, max, min, ani, Mash).The Kmer-db approach was applied for the generation of kPCA pre-processed descriptors, whereas for PCA pre-processed descriptors the dsk tool was used since it enabled direct extraction of k-mer counts in the text format.More details on libraries used for PCA and kPCA, as well as a list of all kernels tried herein are given in Text S3, Supporting Information.

| Generative topographic mapping
Generative Topographic Mapping (GTM) first proposed by Bishop [25] is a non-linear dimensionality reduction technique.The GTM principle consists in the insertion of a flexible hypersurface denoted « manifold » into the high-dimensional descriptor space to attain the maximum coverage of densely populated zones of the sequence space.The optimal manifold geometry allows to encompass as many items from the frame set (a set of representative genomes taken for the study) as possible.The manifold is defined by a set of Gaussian Radial Basis Functions (RBFs).The RBF is used as the centre of a Gaussian distribution that serves to estimate the likelihood of a sequence to be projected onto the manifold, and its responsibility (probability of a sequence to be assigned to a given node).Hence, through responsibilities, each sequence in the multi-dimensional space is associated with nodes of the manifold [27].The geometric centre of the responsibilities of a given sequence defines its position on the manifold-the latent space.The major GTM advantage over other non-linear dimensionality reduction methods such as Kohonen Self-Organizing Maps (SOM) [34] is its « fuzzy logics ».An item's projection is a probability (responsibility) vector.Namely, the sequence is not located in a particular node, but associated to several nodes, with varying "responsibilities" that sum up to 1.0.Such projection of items allows for the creation of GTM landscapes locally coloured by mean values of properties (density or drug resistance) associated to them.Multiple landscapes can be accommodated on one manifold, i. e., one sole manifold can be coloured by resistance values of bacterial isolates to different drugs, hence generating distinct resistance landscapes for each of the antibiotics on a same map.Manifold fitting is controlled by four parameters: map resolution, number of Radial Basis Functions (RBF), regularization coefficient and width of an RBF.Foremost, GTM quality will depend on the initial descriptor space being mappedtherefore, the "hyperparameters" defining the GTM construction recipe must include descriptor choice in addition to the four manifold control parameters above.More details on the GTM algorithm are given in Text S4, Supporting Information.
There are several reasons for using GTM to visualize bacterial population structure and attempt antibiotic resistance predictions.Topological methods are believed to be especially useful for bacterial genomic data analysis and visualization.Events commonly occurring in bacteria such as recombination and horizontal gene transfer (HGT) complicate the sequence analysis by tree-based algorithms such as phylogenetic trees [24].The advantage of the GTM, as opposed to other non-linear dimensionality reduction methods such as t-SNE [35] relies in its reproducibility and ability to update the model with new data, without retraining of the manifold [26].In addition to that, its fuzzy-logical nature allows for capturing subtle differences between sequences by modulating sequence node responsibilities, as opposed to self-organizing maps (SOM), where subtle variations of an item would likely not alter its unique node of residence [27,34].

| Genetic algorithm
GA is a stochastic evolutionary algorithm allowing one to choose the best set of GTM hyperparameters [36].Herein, the GA was used for Pareto-front driven optimization of balanced accuracy (BA) values for a set of drug-specific models.The quality of the GTM (i.e., predictive model) with the given hyperparameters is estimated by its predictive propensity.Herein, the generated landscapes were evaluated by the averaged BA (average proportion of sequences predicted correctly for each class: susceptible and resistant) for each drug in a 5-fold cross-validation (CV) procedure repeated 5 times.

| Assessment of prediction accuracy of models
To assess the predictive accuracy of the models, the fivefold cross-validated averaged balanced accuracy was estimated upon five repeats: TP f -the number of truly resistant sequences predicted as resistant in the fold f, TN f -the number of truly susceptible sequences predicted as susceptible in the fold f, N -the total number of sequences actually belonging to the susceptible class, P -the total number of sequences actually belonging to the resistant class.
In more detail, herein, the drug-specific resistance landscapes were created using one common manifold.Each of the landscapes represents a classical single-task model aimed to predict the property (resistance) of genomes.Since each of the drug-specific landscapes possesses the same manifold hyperparameters, the quality of the landscape-based models relies on the inherent characteristics of the common manifold they are built on [37].The maps were considered as optimal if the GTM hyperparameters allowed for simultaneous maximization of BA values of drug-specific landscape-based models.
In summary, the hyperparameters for the manifold describing S. aureus sequence space were selected during cross-validation with respect to the mean performance in predicting the resistance of S. aureus isolates toward nine antibiotics, as indicated by BA values.Two more drugs (oxacillin and sulfamethoxazole/trimethoprim combination antibiotic (SMZ-TMP)) were chosen to validate the ability of the manifold to host predictive resistance landscapes for drugs not used during manifold selection and ensure that the model's validity expands beyond training cases.

| Comparison to other machine learning methods
GTM-driven antibiotic resistance predictive ability was compared with other state-of-the-art ML algorithms such as RF [38], SVM [39] and Gradient boosting (GB) [40] implemented in the scikit-learn library (v.1.1.1)[41].The hyperparameters tuned during optimization by grid search are reported in Text S5, Supporting Information.Evaluation of the model performance was made using five-fold cross-validation repeated five times (Equation ( 1)).

| Benchmarking of GTM and state-of-the-art ML methods
The performance of the GTM was benchmarked against state-of-the-art ML methods using consistent pipeline suggested in this work.GTM-based multi-task models (models optimized simultaneously for 9 antibiotics) are expectedly slightly weaker than state-of-the-art ML methods such as RF, GB, and SVM in single-task mode (comparison reported in terms of five-fold cross-validated BA evaluations averaged over five repeated estimations with reshuffled sets -Figure 1 and S9-S12, Supporting Information).Note that the two antibiotics not contributing to the manifold selection process have no special status in the single-task ML models (in which they did serve for training) and yet they do not display any significant disadvantage in multi-task GTM-driven prediction.
As opposed to other multi-task quantitative genotypephenotype relationship predictive models such as the one by Ren et al. [42], GTM does not require retraining of the manifold to accommodate new data.In the case of conventional multi-task learning, inclusion of new end points would require rebuilding of the models.In contrast, the GTM embeds a representative set of samples into a single manifold, which can support mapping of properties never considered during its selection process.Herein, this capability of GTM was demonstrated by using the manifold optimized for 9 drug molecules in order to build predictive landscapes for 2 more drugs unseen during the training, namely oxacillin and SMZ-TMP.
In more detail, reasonable BA values (BA > 0.75) along with corresponding clear delineation of zones populated by resistant and susceptible sequences in the map, validates the use of the GA-chosen parameters for creation of GTM-based models (landscapes) to predict resistance to the drugs unseen by GA during the parameter optimization phase.The landscapes generated for the two of the antibiotics (oxacillin, sulfamethoxazole/ trimethoprim) that were not used in the optimization of the GTM-based models are shown alongside other landscapes in Figure S5, Supporting Information.
The lower predictive performance of the GTM as opposed to other ML methods benchmarked herein (Figure 1) partly stems from dimensionality reduction and the subsequent loss of information.Also note that GTM-landscape-based predictors are actually zero-parameter, basic neighbourhood models in which "similar" sequences are predicted to be similarly susceptible or resistant.The non-linear manifold, bent to encompass the relevant high-dimensional sequence space encoded by kmer PCs is defining the functional form, and implicitly the weight each feature is contributing to this inter-sequence similarity.On the other hand, GTM provides a possibility to interpret maps in terms of phenotypes associated with genomes.In this context, the GTM-based models possess a distinct advantage: they enable a direct comparison of clusters of sequences in terms of resistance.In addition, the resistance landscapes can be combined with AMR-gene-specific landscapes to associate genetic biomarkers with resistance inherent to some sequence clusters on the map.Yet, one should note that such association is non-bijective, since more than one gene can contribute to the resistance development.To elaborate further, the maps will highlight any gene-resistance associations, whether mechanistic or coincidental.Overall, gene-specific landscapes provide additional and intuitively interpretable means for analysis of relationships between antimicrobial resistance and AMR genes.

| Cartography of S. aureus sequence space and antibiotic resistance profiling
The herein created GTM-based resistance landscapes encompass 5773 mutant sequences of S. aureus isolates and are coloured according to the mean resistance status of F I G U R E 1 Mean values of BA obtained in 5-fold cross-validation repeated five times for multi-task GTM (yellow) and single-task Random Forest (blue), Support Vector Machine (green), Gradient Boosting (purple) for 11 antibiotics.The descriptors used for building the models were kPCA pre-processed 15-mers' counts with 'max' kernel function.SMZ-TMP stands for sulfamethoxazole/trimethoprim. the isolates residing in each node of the map to the antibiotic of interest (Figure 2).The position of each isolate (a unique genomic sequence) is not determined by two dimensions (e. g., x-and y-axes).Instead, it is defined by a probability distribution across all the nodes (i.e., squares on the map).Namely, the position of each sequence is defined by a so-called responsibility (probability) vector, whose individual values reflect the probabilities of a sequence to reside in each node of the map.
The maps allow to easily delineate the zones containing exclusively genomes of resistant isolates (red zones) or susceptible ones (blue zones) from the zones jointly populated by both categories (colours between red and blue).White zones on the resistance landscapes correspond to the empty regions, where no sequences with drug-specific resistance profiles were found.Hence, the landscapes for drugs with the smallest number of labelled sequences contain more white zones.Saturated colours correspond to highly populated regions, while high transparency defines sparsely populated zones.Global density (technically referred to as "cumulated responsibility") is proportional to the number of mapped sequences while the map areas covered is an expression of the relative diversity of mapped items.Both global density and diversity increase for "classical" antibioticsmethicillin, penicillin for which a lot of (and, implicitly, diverse) data is available.
The highest prediction performance of GTM models was attained with Jaccard kPCA pre-processing for 10-and 30-mer counts and unitigs, and kPCA with 'max' kernel function for 15-mers.kPCA was hence the preferred pre-processing strategy.The GTMs built using pre-processed k-mers with different lengths (10-, 15-, 30mers) and unitigs have comparable predictive accuracy (Table S2-S5, Supporting Information).We have chosen the ones based on 15-mers for further analysis.The landscapes based on pre-processed k-mers with different lengths (10-, 15-, 30-mers) and unitigs are available in Figure S4-S7, Supporting Information.
Equivalently to other methods aimed to analyze the genomic space and diversity of S. aureus isolates [22,43], the GTM highlights the existence of several clusters of S. aureus genomes.This is visible on all maps (Figure S5, Supporting Information).Since all the landscapes rely on the same manifold, drug-specific landscapes can be thus directly compared.For example, the resistance of S. aureus isolates to β-lactam antibiotics such as penicillin and semisynthetic β-lactamase resistant penicillins, i. e., methicillin are notably different.Indeed, one can expect these differences since methicillin was developed to circumvent penicillin resistance [44].In both landscapes for these drugs, resistants (red) and susceptibles (blue) are well separated.However, the penicillin landscape (Figure S8, Supporting Information) displays some more red areas (resistants) which in the neighbouring methicillin map appear as blue-these are the penicillin-resistant genomes of S. aureus that were successfully targeted by methicillin.However, several nodes in the right F I G U R E 2 Resistance landscapes for five antibiotics against S. aureus built on kPCA pre-processed ('max' kernel function) 15-mers' counts of 5773 genomic sequences.Each node is coloured by the weighted average of the antibiotic resistance profiles of the residing S. aureus isolates.Red zones are occupied by the genomes of the resistant isolates, while blue zones contain the genomes of the susceptible ones.All colours in between correspond to mixed zones containing both of them.The transparency reflects the genomes' population density.
middle area and right bottom corner (Figure S8, Supporting Information) of the aforementioned landscapes accommodate methicillin-resistant S. aureus (MRSA) isolates that maintain susceptibility to penicillin, which is an example of penicillin-susceptible methicillin-resistant S. aureus.The incidence of such kind of resistance profile was already registered by ref. [45].
The MRSA isolates exert resistance not only to methicillin but also to all of the other β-lactam class antibiotics, i. e., cephalosporins and carbapenems [46] although they can be susceptible to the newest classes of cephalosporins [47].The β-lactam antibiotics such as penicillins e. g., penicillin, methicillin and cefoxitin, macrolides such as erythromycin, and fluoroquinolone antibiotics e. g., ciprofloxacin possess landscapes with resistant clusters that are widely distributed over the entire area of the map compared to the landscapes of other antibiotics considered here (Figure 2 and S5, Supporting Information).This testifies about a wider variety of genetic determinants that can influence the development of resistance to the antibiotics of these three classes given the current data.
Macrolides such as erythromycin are usually and solely prescribed in case of trivial MSSA infections [47,48].This tendency is visible upon comparison of erythromycin and methicillin resistance landscapes (Figure 2).Namely, almost all isolates susceptible to methicillin also appear to be susceptible to erythromycin, which is in accordance with literature data.All this insight can be easily gained by "overlapping" the landscapes and focusing on residents in relevant areas.

| Analysis of resistance determinants using genome features landscapes
Besides the comparative analysis of the resistance profiles, the GTM can be used to analyse the difference in feature distribution between zones (clusters) of resistant and susceptible sequences, thus offering further insights into the importance of specific features.For this purpose, data on genomic features, which are defined segments of genome usually related to protein and RNAcoding genes, present in the PATRIC database were used [30].The genomes from PATRIC are annotated by these features using either the original information provided in GenBank or annotations assigned using a RAST pipeline [30,49].In this context, the maps can be used either to visualize the distribution of sequences containing the specific genetic feature and compare it with the resistance landscapes or select a specific part of the map and determine which genetic features are abundant in this particular zone on the map.
To show the utility of described analysis we identified the zone on drug-specific landscapes accommodating sequences with different resistance profiles to penicillin and methicillin (Figure 3A).This zone of four GTM F I G U R E 3 A) Resistance landscapes for penicillin and methicillin.B) Gene features' landscapes (PBP2a, β-lactamase, Helicase CTD) for 5773 S. aureus genomes from the frame set.Colour scheme for the resistance landscape is the same as in the Figure 2. Colour scheme for gene features landscapes reflects the percentage of genomes with specified gene features.Red zones are predominantly occupied by the genomes with specified gene features, while the blue zones contain genomes without.All colours in between correspond to mixed zones with both types of genomes in various proportions.
nodes contained 40 genomes of isolates conferring resistance to penicillin but being susceptible to methicillin and other drugs.It appears in blue on the penicillin binding protein 2a (PBP2a) landscape, meaning that isolates populating this zone do not contain PBP2a and therefore it is not the reason of penicillin resistance development in them.From the analysis of other gene feature landscapes presented here, the penicillin resistance in some of the isolates from the selected zone can be related to, e. g., the expressed β-lactamase.
The PBP2a is however present in other red zones of the map populated by genomes prevailingly resistant to penicillin and methicillin.Hence, the expression of PBP2a alone or in combination with other genetic determinants such as β-lactamase may be important for protecting S. aureus isolates from β-lactam antibiotics action such as penicillin and methicillin [50].The PBP2a encoded by mecA gene is known to possess low affinity to β-lactam antibiotics, hence not allowing them to inhibit the bacterial cell wall biosynthesis [51].It is worth noting that the genomes from particular zones contain numerous other features, such as DEAD-box RNA helicase (Figure 3B) or epidermin biosynthesis protein (epiC) (Figure S13B, Supporting Information) that are not directly related to resistance, but might be related to bacteria adaptation in different environments [52] and selective advantages when competing with other bacteria [53] respectively.While one cannot directly attribute the specific resistance profile based on such observations, complex genotype-phenotype relationships involving the analysis of several resistance landscapes and gene features landscapes on the same map is a perspective tool for deciphering resistance profiles in the context of evolution.
In addition to the observations made for the combination of the resistance landscapes and gene feature landscapes, one can use the latter to investigate the structure of the bacterial population.HGT frequently occurring in bacteria leads to complex population structure: descendants of the same clone may acquire a diverse set of auxiliary genes.This complicates the analysis of the genomic space with the most common tool-phylogenetic trees.In this regard, network-based approaches and dimensionality reduction techniques represent a perspective alternative.Indeed, while some genomes comprising specific genome features resided in tight clusters (e. g., epiC) Figure S13B, Supporting Information, others appear at various zones on the map (e. g., PBP2a, β-lactamase, helicase CTD protein, CAAX Proteases and Bacteriocin-Processing (CPBP) metalloprotease) (Figure S13B, Supporting Information).In this way, the genome features landscapes can help to reveal complex evolutionary paths present in the dataset.
In summary, GTM allows for illustrative exploration of the genomic space rendering the sequences with genetic determinants responsible for drug resistance development in clusters that are in accordance with the predicted resistance profiles.In the context of this study, the conventional GTM model has been employed for AMR prediction.Nevertheless, it is worth noting that various GTM-specific techniques, including meta-GTM [54] and s-GTM [55], may hold promise for further in-depth investigation of genotype-phenotype relationships and validating model accuracy with respect to population structure [11].The thorough exploration of these techniques, their integration with feature selection methodologies [8], their applicability to diverse bacterial species as well as systematic GTM performance comparison with other models available to date constitute topics for the future research.

| CONCLUSION
This paper introduces a methodology for genomic space exploration and genotype-phenotype modelling in the context of AMR prediction.This methodology allows one to represent complex genomic data in the form of interpretable maps serving both as a visualization tool and a predictive ML model.The clusters (zones) on the map can be compared in terms of resistance/susceptibility and the presence of features (genes) that may be important for resistance development.While ML models can also elucidate features crucial for drug resistance development as well as multi-drug resistance patterns, we believe that the use of GTM for such tasks is more straightforward and facilitates pattern recognition.The fact that resistance can be reasonably well predicted on hand of similarity (neighbourhood)-based algorithms, not required to "learn" any specific feature weighing scheme is an interesting result per se, highlighting the potential biases in the so-far recorded data.Herein, the introduced methodology was tested for the analysis of S. aureus genome space, but it is readily amenable to other bacteria.Furthermore, the suggested approach is big data-compatible and is able, in principle, to accumulate much larger amount of data.At the same time, it is also capable to operate well with limited quantity of data for building models.In this context, GTM can be used as a tool for visualization and analysis of large bacterial genomic data being complementary to standard tools such as phylogenetic trees and network-based approaches.