HMSS2: An advanced tool for the analysis of sulphur metabolism, including organosulphur compound transformation, in genome and metagenome assemblies

The global sulphur cycle has implications for human health, climate change, biogeochemistry and bioremediation. The organosulphur compounds that participate in this cycle not only represent a vast reservoir of sulphur but are also used by prokaryotes as sources of energy and/or carbon. Closely linked to the inorganic sulphur cycle, it involves the interaction of prokaryotes, eukaryotes and chemical processes. However, ecological and evolutionary studies of the conversion of organic sulphur compounds are hampered by the poor conservation of the relevant pathways and their variation even within strains of the same species. In addition, several proteins involved in the conversion of sulphonated compounds are related to proteins involved in sulphur dissimilation or turnover of other compounds. Therefore, the enzymes involved in the metabolism of organic sulphur compounds are usually not correctly annotated in public databases. To address this challenge, we have developed HMSS2, a profiled Hidden Markov Model‐based tool for rapid annotation and synteny analysis of organic and inorganic sulphur cycle proteins in prokaryotic genomes. Compared to its previous version (HMS‐S‐S), HMSS2 includes several new features. HMM‐based annotation is now supported by nonhomology criteria and covers the metabolic pathways of important organosulphur compounds, including dimethylsulphoniopropionate, taurine, isethionate, and sulphoquinovose. In addition, the calculation speed has been increased by a factor of four and the available output formats have been extended to include iTol compatible data sets, and customized sequence FASTA files.

cycle is of great importance because the decomposition of organic sulphur compounds affects human health, bacterial virulence in infection (Dhouib et al., 2021), global warming, bioremediation processes such as wastewater treatment (Schäfer et al., 2010), and is linked to the biogeochemical cycling of sulphur between habitats (Koch & Dahl, 2018). Sulphonated compounds range from small molecules with only a C1 carbon skeleton to sulphonated lipids with long-chain alkanes, amino acids such as cysteine, or sulphur-containing cofactors with complex structures such as lipoate (Boden & Hutt, 2019; Goddard-Borger & Williams, 2017; Moran & Durham, 2019). Although new sulphonated compounds are constantly being discovered, their metabolic functions and their synthesis or degradation pathways are often not yet clear (Thume et al., 2018). Only the most abundant sulphonated compounds, such as sulphoquinovose, dimethylsulphoniopropionate (DMSP), taurine, isethionate, cysteine and methionine, have been studied biochemically in terms of synthesis and degradation pathways.
In aquatic environments, the antistress molecule DMSP is the most well-known organosulphur compound (Kiene et al., 2000).
Mainly produced by macroalgae and phytoplankton, it is emitted at a rate of around 600 million tonnes per year. Bacterial DMSP degradation in the oceans, salt marshes and coastal regions is the major source of dimethylsulphide (DMS), which is released at a rate of about 300 million tonnes per year (Moran & Durham, 2019). As a volatile compound, DMS affects atmospheric chemistry and global warming by forming cloud condensation nuclei that increase the reflection of solar radiation (Schäfer et al., 2010). In the context of the global sulphur cycle, DMS acts as a link between the terrestrial, atmospheric and aquatic environments (Lovelock et al., 1972). DMS-derived carbon and sulphur are used as electron acceptors or donors during dissimilation, or are assimilated via the intermediates dimethylsulphone and methanesulphinate (Figure 1). Sulphonated lipids are estimated to be the largest reservoir of sulphur in terrestrial ecosystems (Goddard-Borger & Williams, 2017).
Sulphoquinovose is a sulphonated glucose derivative and the most common part of the head group of sulpholipids, which are an integral part of the thylakoid membranes of chloroplasts and photosynthetic systems. Mainly produced by plants, algae and cyanobacteria, its turnover rate has been estimated at around 10 billion tonnes per year (Goddard-Borger & Williams, 2017). The bacterial decomposition of sulphoquinovose involves several different pathways similar to the degradation of glucose (Figure 2a), with the exception that smaller sulphonated compounds are often released, since complete utilization with release of free sulphur by a single organism is often not possible (Wei et al., 2022). Release and scavenging of sulphonated intermediates is achieved by various transport systems (Figure 2b).
Sulphoquinovose decomposition and release of inorganic sulphur is then completed by pathways linked to taurine, isethionate and/or sulphoacetate (Figure 2c). In summary, prokaryotic utilization of these organic compounds as sources of sulphur, carbon and energy is far from being a uniform process, and new metabolic pathways for the degradation of sulphonated compounds are constantly being discovered (Boden et al., 2010; Koch & Dahl, 2018; Sharma et al., 2022; Wolf et al., 2022).
These processes are also closely linked to the availability of inorganic sulphur as the released sulphur is either assimilated or excreted as sulphate (Ruff et al., 2003), sulphite (Koch & Dahl, 2018; Li et al., 2023; Sharma et al., 2022), thiosulphate (De Zwart et al., 1997), tetrathionate (Boden et al., 2010) or sulphide (Peck et al., 2019). Indeed, the complete consumption of the volatile sulphonated C1 compound DMS, coupled with the oxidation of the thiosulphate formed as an intermediate, has been reported for a single organism, providing a new link between the organic and inorganic sulphur cycles (Koch & Dahl, 2018). However, the fate of the sulphur released from sulphonated compounds is often not known or assumed to be the same as in dissimilatory sulphur oxidation or reduction. The physiology and interactions of bacterial communities that release sulphur from sulphonated carbon compounds have been sparsely explored and the few existing studies are based on, or assume, sulphur cycling via dissimilatory sulphite reductases (Burrichter et al., 2021; Hanson et al., 2021; Wolf et al., 2022).
Ecological studies of organic sulphur compounds are difficult because their metabolism is poorly conserved across bacterial phylogeny and can even vary between strains of the same species. Thus, even within a species, predictions based on taxonomic assignment are not possible (Schäfer et al., 2010). As the functional annotation pipelines of public databases mainly focus on the synthesis of methionine and cysteine, the enzymes involved in the metabolism of organic sulphur compounds are usually not correctly annotated.
Inaccurate annotation in public databases is exacerbated by the fact that several proteins involved in the conversion of sulphonated compounds are related to proteins involved in sulphur dissimilation or the turnover of other compounds, for example, the DMSO reductase family (Leimkühler & Iobbi-Nivol, 2016) or quinone oxidoreductase complexes (Duarte et al., 2021). For these reasons, the abundance of microbes utilizing organic sulphur compounds is likely to be underestimated (Carrion et al., 2019) and the role of sulphonated compounds is understudied (Wolf et al., 2022). Thus, there is a knowledge gap regarding the link between inorganic and organic sulphur cycling in ecological systems.
To fill this gap, we have extended HMS-S-S (Tanabe & Dahl, 2022). This tool was originally developed for rapid detection and annotation of inorganic sulphur dissimilation in prokaryotic genomes. With the substantial extension presented here, it now includes not only inorganic sulphur metabolism enzymes but also enzymes with characterized or at least strongly indicated function in the metabolism of sulphonated sulphur compounds. These include sulphoquinovose synthesis and degradation pathways, DMSP metabolism, taurine and isethionate conversion, and transport systems for various sulphonated compounds. For all these pathways, we developed individual profiled hidden Markov Models (HMM) and validated score thresholds by cross-validation and with an independent test data set. HMS-S-S itself has been completely redesigned, improving usability and output formats, and extending the file manipulation tool. By optimizing the underlying algorithms, the overall computing speed has been increased by a factor of four. Due to the complete overhaul, we have renamed the tool "HMSS2". HMSS2 now covers the known metabolism of inorganic and organic sulphur compounds, facilitating the exploration of the microbe-driven natural sulphur cycle.

| HMSS2 improvements and workflow
Speed and user-friendliness were improved by process optimization and the implementation of additional features. HMSS2 algorithms are now completely written in Python and precompiled versions are available. In this way, the number of dependencies that the user must install has been reduced to just two external programs. HMMER and Prodigal are still required, but installation and configuration of MySQL is no longer necessary. The installation was further simplified by preparation of a precompiled executable that will run directly on a Unix system. HMSS2 includes the basic design of HMS-S-S with further automation. The user supplies a directory containing files in FASTA nucleotide format, consisting of scaffolds or contigs. Alternatively, it is possible to provide amino acid sequences in FASTA files and the corresponding features in GFF3 formatted files. All files in the directory will then be processed in consecutive  that allows to save multidomain proteins with all domains. In the next step, the detected proteins are searched for genetic colocalization. This is done using the genomic features and a maximum nucleotide distance within which two genes are considered syntenic. Syntenic gene clusters are then compared with a set of predefined and named gene patterns. A new feature of HMSS2 is the detection of co-linear gene clusters. This is a special type of synteny where the genes occur in exactly the same order as the gene pattern.
Gene clusters that are similar to the pattern(s) provided are then named by characteristic keywords. NCBI, GTDB taxonomy files or custom files with a similar format can be used to assign taxonomic information. As the taxonomy may change over time, it is recommended that the user updates this information locally as required.
Results can be retrieved from the local database filtered by protein domains and/or keywords via HMSS2. The standard output now includes FASTA formatted files and iTol data sets.
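The colocalization and pattern-matching steps described above can be illustrated with a minimal Python sketch. It is not the HMSS2 implementation: the `Gene` record, the function names and the 3500-nucleotide default distance are hypothetical, strand information is ignored, and co-linearity is interpreted here as a contiguous run of genes in exactly the pattern order, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical record for one annotated protein-coding gene."""
    contig: str
    start: int
    end: int
    domain: str  # name of the HMM that matched the encoded protein


def syntenic_clusters(genes, max_distance=3500):
    """Group genes on the same contig whose gaps do not exceed max_distance.

    The default distance is an illustrative value, not the HMSS2 default.
    """
    clusters, current = [], []
    for gene in sorted(genes, key=lambda g: (g.contig, g.start)):
        if current and (gene.contig != current[-1].contig
                        or gene.start - current[-1].end > max_distance):
            clusters.append(current)
            current = []
        current.append(gene)
    if current:
        clusters.append(current)
    return clusters


def matches_pattern(cluster, pattern):
    """True if every domain of the pattern occurs in the cluster, in any order."""
    return set(pattern) <= {g.domain for g in cluster}


def is_collinear(cluster, pattern):
    """True if the pattern occurs as a contiguous run in exactly the given order."""
    domains = [g.domain for g in cluster]
    n = len(pattern)
    return any(domains[i:i + n] == list(pattern)
               for i in range(len(domains) - n + 1))


genes = [
    Gene("contig1", 100, 900, "SsuE"),
    Gene("contig1", 950, 1900, "SsuA"),
    Gene("contig1", 1950, 3000, "SsuD"),
    Gene("contig1", 20000, 21000, "TauD"),
    Gene("contig2", 100, 900, "SsuC"),
]
clusters = syntenic_clusters(genes)
```

In this toy input the three adjacent Ssu genes form one cluster, while the distant TauD gene and the gene on the second contig each form their own. A pattern such as `["SsuD", "SsuE"]` matches the first cluster as a synteny pattern but is not co-linear with it.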

| Training data set generation, annotation and HMM development
Data sets were generated from genomic data downloaded from NCBI RefSeq (Haft et al., 2018) or GenBank (Sayers et al., 2019) as of September 2022. The HMM training data set contained all assemblies from the NCBI RefSeq database with an assembly level of a complete chromosome. The independent test data consisted of assemblies originating from GenBank, again with an assembly level of the complete chromosome. GenBank covers a greater number of phyla and a wider range of quality and is, therefore, not entirely similar to the training data from RefSeq. Sequence annotation for hidden Markov model generation was performed using the training data set and a list of reference proteins for organic sulphur metabolism (Table S1). Methods for annotating the training and independent test data sets and for HMM generation were used as described previously (Tanabe & Dahl, 2022).

| Performance metric calculation
Performance was determined using balanced accuracy (Brodersen et al., 2010), F1 score (Forman & Scholz, 2010), and the Matthews correlation coefficient (MCC; Chicco & Jurman, 2020). The metric values were additionally corrected for the data set's skewness (Jeni et al., 2013; Table S2). Values for each hidden Markov model were calculated from a confusion matrix obtained by comparing the annotation of the training/test data set with the annotation assigned by the HMMs. Matching assignments were counted as true positives (TP), whereas hits to sequences unrelated to the HMM training sequences were counted as false positives (FP). Sequences that matched the annotation but were not recognized by the HMM were counted as false negatives (FN), and all other sequences were recorded as true negatives (TN).
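As an illustration of how the three metrics follow from one confusion matrix, the sketch below uses the standard textbook definitions. The function names are hypothetical, the example counts are invented for demonstration only, and the skew correction of Jeni et al. (2013) mentioned above is not included.

```python
import math


def balanced_accuracy(tp, fp, fn, tn):
    """Mean of sensitivity (recall) and specificity."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return (sensitivity + specificity) / 2


def f1_score(tp, fp, fn, tn):
    """Harmonic mean of precision and recall; true negatives are not used."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient, returning 0.0 if it is undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Invented example: 50 true members among 1000 sequences,
# of which 40 are recovered alongside 10 false hits.
example = dict(tp=40, fp=10, fn=10, tn=940)
scores = (balanced_accuracy(**example), f1_score(**example), mcc(**example))
```

Because true negatives dominate such a data set, balanced accuracy stays high here (about 0.89) while F1 (0.80) and MCC (about 0.79) react more strongly to the 20 misclassified sequences.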

| Thresholding and cross-validation
Thresholding and cross-validation were executed as previously described (Tanabe & Dahl, 2022). For each HMM, bit scores for noise cutoff, trusted cutoff, and an optimized threshold were determined prior to cross-validation. The noise cutoff corresponded to the score of the lowest scoring TP hit. The trusted cutoff corresponded to the score of the highest scoring FP hit. The optimized cutoff was computed during a nested cross-validation procedure with a 10-fold outer loop and a five-fold inner loop (Varma & Simon, 2006). The optimized cutoff corresponded to the median of the thresholds with the highest F1 scores across all inner folds. Outer folds were analysed after all thresholds were set.
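The three cutoffs defined above can be sketched as follows. This is an illustrative reading of the text, not the published code: hits are represented as hypothetical `(bit_score, is_true_member)` pairs, a hit counts as positive when its score is at or above the threshold, only observed scores are tried as candidate thresholds, and ties in F1 are resolved in favour of the lowest threshold.

```python
from statistics import median


def noise_and_trusted_cutoffs(scored_hits):
    """Return (noise, trusted) as defined in the text above.

    noise   = score of the lowest scoring true positive hit
    trusted = score of the highest scoring false positive hit
    """
    tp_scores = [score for score, is_member in scored_hits if is_member]
    fp_scores = [score for score, is_member in scored_hits if not is_member]
    noise = min(tp_scores) if tp_scores else None
    trusted = max(fp_scores) if fp_scores else None
    return noise, trusted


def best_f1_threshold(scored_hits):
    """Return the observed bit score that maximizes F1 when used as a cutoff."""
    total_positive = sum(1 for _, is_member in scored_hits if is_member)
    best_threshold, best_f1 = None, -1.0
    for threshold in sorted({score for score, _ in scored_hits}):
        tp = sum(1 for s, m in scored_hits if m and s >= threshold)
        fp = sum(1 for s, m in scored_hits if not m and s >= threshold)
        fn = total_positive - tp
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold


def optimized_cutoff(inner_folds):
    """Median of the best-F1 thresholds found on each inner fold."""
    return median(best_f1_threshold(fold) for fold in inner_folds)


# Invented bit scores for three inner folds of one HMM.
fold_a = [(120.0, True), (95.0, True), (80.0, True), (60.0, False), (30.0, False)]
fold_b = [(100.0, True), (70.0, True), (65.0, False)]
fold_c = [(90.0, True), (50.0, False)]
cutoff = optimized_cutoff([fold_a, fold_b, fold_c])
```

With these invented scores the per-fold optima are 80, 70 and 90, so the median, and thus the optimized cutoff, is 80.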
Each cross-validation fold was generated from the HMM training data. Sequences were randomly sorted into the 10 outer folds of equal size, followed by the equal division of each outer fold into five inner folds. A cross-validation procedure was then performed on all folds. The inner folds were used to determine the optimized thresholds. The overall performance of each HMM was then assessed with a confusion matrix created for the outer folds using the optimized thresholds as a cutoff. Balanced accuracy was calculated as the average of all accuracies from each fold. F1 score and MCC were calculated from the sum of the confusion matrices from all folds (Forman & Scholz, 2010). The same procedure without fold generation was performed for the independent test data set (Chicco, 2017).
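A minimal sketch of the fold generation and of pooling confusion matrices across folds is given below. It follows the wording above literally, with each outer fold subdivided into five inner folds; the function names, the fixed random seed and the `(tp, fp, fn, tn)` tuple layout are assumptions made for illustration.

```python
import random


def split_into_folds(items, k, rng):
    """Shuffle items and deal them into k folds whose sizes differ by at most one."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_folds(sequences, outer_k=10, inner_k=5, seed=0):
    """Return a list of (outer_fold, inner_folds) pairs."""
    rng = random.Random(seed)
    outer = split_into_folds(sequences, outer_k, rng)
    return [(fold, split_into_folds(fold, inner_k, rng)) for fold in outer]


def sum_confusion(matrices):
    """Element-wise sum of (tp, fp, fn, tn) tuples, one tuple per fold."""
    return tuple(sum(column) for column in zip(*matrices))


def pooled_f1(matrices):
    """F1 computed once from the summed confusion matrix of all folds."""
    tp, fp, fn, _tn = sum_confusion(matrices)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


folds = nested_folds([f"seq{i}" for i in range(100)])
```

Summing the confusion matrices before computing F1 or MCC avoids undefined per-fold values when a single fold happens to contain no positive hits, whereas balanced accuracy is averaged over the per-fold values as described above.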

| Performance testing
The performance of HMSS2 was compared with that of HMS-S-S version 1 (Tanabe & Dahl, 2022). The HMM library included all 164 HMMs of the original library, detecting dissimilatory sulphur metabolism. A quadratically increasing number of randomly selected genomes, ranging from 2 to 64, was chosen from the training data set described for version 1 and used as input for the performance comparison. The input data were in FASTA nucleotide format. Each run was repeated three times with newly randomized input data to reduce performance bias caused by the input data. Both program versions were benchmarked for the execution time required for the workflow from data entry to the final annotated hits with appropriately named gene clusters, but without taxonomy assignment. Time was measured as the required wall-clock runtime when running HMS-S-S or HMSS2 with four parallel threads on an Intel Core i7-6700 CPU.

| RESULTS
Here, we created a comprehensive database of reliable HMMs based on archaeal and bacterial proteins associated with organic sulphur metabolism. The same approach has already been used for the enzymes of dissimilatory metabolism of inorganic sulphur compounds (Tanabe & Dahl, 2022). Not only sequence similarity but also integrated synteny was considered to assign a protein to a specific functional group. The HMMs created here focus on the most abundant organic sulphur compounds in terrestrial and aquatic environments. The compounds covered here include DMSP, dimethyl sulphide (DMS), dimethyl sulphoxide (DMSO), dimethyl sulphone (DMSO2; Figure 1), 2,3-dihydroxypropane-1-sulphonate (DHPS), isethionate, taurine and membrane sulpholipids (Figure 2). The HMMs for the enzymes of the metabolic pathways for degradation of individual compounds are described in full below. Normally, prokaryotes do not code for the entire degradation pathways, but only for parts of them (Boden & Hutt, 2019; Liu et al., 2021).

| HMM development: DMSP degradation
DMSP is primarily produced by single-celled phytoplankton and algal seaweeds, where it acts as an osmolyte and anti-stress molecule (Kiene et al., 2000). Degradation of DMSP requires either a demethylation pathway or a DMSP lyase (Figure 1). The demethylation pathway is encoded by the dmdABCD gene cluster and starts with the demethylation of DMSP via DmdA to form methylmercaptopropionate. This intermediate is further catabolized by DmdB, DmdC and finally DmdD with the release of acetaldehyde and methanethiol (Bullock et al., 2014; Reisch et al., 2011). For each of the enzymes, one HMM was generated, making four in total. Several nonorthologous DMSP lyases, DddL, DddP, DddQ, DddW and DddY, have been characterized which cleave DMSP with the release of DMS and acrylate. The latter is then converted to 3-hydroxypropionate by AcuNK (Curson et al., 2011) or to propionyl-CoA by AcuI (Todd et al., 2012). The DMSP lyase DddD catalyses formation of propionyl-CoA and DMS from DMSP in a single reaction without the formation of an acrylate intermediate. 3-Hydroxypropionate can be further converted to acetyl-CoA via DddA and DddC (Curson et al., 2011).
HMMs were generated for AcuI, AcuN, AcuK, DddA and all DMSP lyases. As fewer than ten sequences were identified for DddQ, DddW and DddC, HMMs could not be constructed for these three enzymes.

| HMM development: assimilation of methanethiol and DMS
DMS and methanethiol are C1-organosulphur compounds derived mainly from the degradation of DMSP. Both can be assimilated by bacteria as a source of sulphur and carbon, where methanethiol is first converted to DMS, followed by oxidation and assimilation (Figure 1). The conversion of methanethiol to DMS is catalysed by methanethiol S-methyltransferase, MddA. This membrane-bound enzyme transfers a methyl group from S-adenosylmethionine to methanethiol (Carrion et al., 2015). The resulting DMS can be further oxidized by either DMS cytochrome c reductase, DdhABCD, also known as DMS dehydrogenase (McDevitt, Hanson, et al., 2002), or by multicomponent DMS monooxygenase DsoABCDEF (Horinouchi et al., 1999). The periplasmic DdhABC DMS dehydrogenase couples the oxidation of DMS to the reduction of two c-type cytochromes, producing DMSO as the final product. DdhD is a cytoplasmic protein that is not part of the DMS dehydrogenase but has a proposed function in the assembly of the DdhAB complex and its secretion via the Tat pathway (McDevitt, Hugenholtz, et al., 2002). For DdhA and DdhB, it was possible to generate individual HMMs, while this was not the case for DdhC and DdhD, which had fewer than ten validly annotated sequences in the training data set. The multicomponent DMS monooxygenase DsoABCDEF oxidizes DMS in a two-step reaction to DMSO2 with DMSO as intermediate. As the sulphur moiety is specifically oxidized, this enzyme is also referred to in the literature as assimilatory DMS S-monooxygenase (Boden & Hutt, 2019). A total of six HMMs were generated for this complex. After the oxidation of DMS to DMSO2, the next step in sulphur assimilation is the oxygen-dependent conversion of DMSO2 to methanesulphinate, catalysed by FMN-dependent DMSO2 monooxygenase SnfG (Wicht, 2016).
SnfG was represented by a single HMM. Methanesulphinate is chemically oxidized to methanesulphonate, which is further oxidized to sulphite and formaldehyde by the assimilatory methanesulphonate monooxygenase MsuDE in an NADH- and oxygen-dependent reaction. For MsuDE, an HMM was trained for each subunit.

| HMM development: dissimilation of DMSO2
Dimethylsulphone is mainly derived from oxidation of DMS. The degradation of dimethyl sulphone (DMSO2) begins with its reduction to DMSO by a DMSO2 reductase in an NADH-dependent reaction (Figure 1). Although the activity has been measured in crude extracts of some methylotrophic Actinobacteria and Alphaproteobacteria (Borodina et al., 2000, 2002), the enzyme has not been characterized. DMSO is then further reduced to DMS. Two types of DMSO reductases have so far been characterized (Boden & Hutt, 2019).
The first, membrane-bound enzyme is composed of the three subunits, DmsABC, and uses electrons from the quinol pool for DMSO reduction (Bilous & Weiner, 1985). For this enzyme, one HMM was trained for each subunit. The second DMSO reductase uses NADH for this purpose and probably consists of only one subunit with high similarity to DmsA, indicated by its cross-reaction with DmsA antibodies. A separate HMM could not be trained for this enzyme, because it is only known by its activity in crude extracts (Borodina et al., 2002). In addition to the Dms-type DMSO reductases, a soluble periplasmic DMSO reductase, DorCAD, has been characterized (McEwan et al., 1998). The corresponding genes are regulated by DorS and DorR (Kappler & Schäfer, 2014). For each of these five proteins/subunits, we constructed one HMM. The DMS, which is released by DMSO reductase of both types, is oxidized to methanethiol (CH3SH) and formaldehyde by a DMS monooxygenase, DmoAB, in another NADH-consuming reaction (Boden et al., 2011).
As only dmoA has been validly identified so far, we trained an HMM specifically for DmoA, but not for DmoB. Further oxidation of methanethiol by a methanethiol oxidase MtoX leads to the final release of sulphide and another molecule of formaldehyde (Eyice et al., 2017).
A single HMM was trained for MtoX.

| HMM development: dissimilation of methanesulphonate
Methanesulphonate is formed by spontaneous chemical oxidation of DMS in the atmosphere (Figure 1).It is used by diverse aerobic bacteria as a sulphur source and by some specialized methylotrophic prokaryotes as a source of carbon and energy (Kelly & Murrell, 1999).
The dissimilatory methanesulphonate monooxygenase catalyses the conversion of methanesulphonate to formaldehyde and sulphite (Henriques & De Marco, 2015). This enzyme is encoded by the msmABCD operon, which is often located adjacent to the msmEFGH operon, usually in the opposite direction. The latter encodes a putative ABC-type transporter (Figure 2b) proposed to facilitate the import of methanesulphonate into the cytoplasm (Henriques & De Marco, 2015). Six HMMs were developed to represent these proteins; MsmC and MsmD had to be excluded due to the small number of sequences in the training data sets.

| HMM development: alkanesulphonate oxidation and transporters
The ssuEADCB gene cluster encodes the two-component alkanesulphonate monooxygenase SsuDE and the alkanesulphonate ABC transporter SsuABC (Figure 2b). Alkanesulphonate monooxygenase catalyses the oxidation of various sulphonated alkanes as substrates with variable affinity, including phenylated organic compounds like N-phenyltaurine. After transport into the cell via SsuABC, the sulphonate is cleaved by SsuDE in a reaction dependent on NADH and molecular oxygen (Eichhorn et al., 1999). Electrons are provided by SsuE via an FMN cofactor. SsuD then cleaves the sulphonate group and oxidizes the terminal carbon atom. For this pathway, five HMMs, one for each encoded protein, were created.

| HMM development: sulphoquinovose synthesis
Sulphoquinovose (SQ) is a sulphonated derivative of glucose in which the 6-hydroxyl group is substituted by a sulphonate group. SQ is a constituent of the unique head group of the membrane-bound glycolipid sulphoquinovosyl diacylglycerol (SQDG) present in thylakoid membranes and in photosynthetic prokaryotes. On a genetic level, five genes, sqdA, sqdB, sqdC, sqdD and sqdX, have been described to be involved in SQDG synthesis in bacteria so far (Benning & Somerville, 1992a, 1992b; Guler et al., 2000; Rossak et al., 1995).
The functions of SqdA and SqdC have not been completely resolved (Benning & Somerville, 1992b; Rossak et al., 1997). The synthesis begins with the exchange of the 6-hydroxyl group of uridine-diphosphate (UDP)-glucose for a sulphonate group by UDP-sulphoquinovose synthase, SqdB. The formation of SQDG is then catalysed by SQDG synthase, SqdD or SqdX (Rossak et al., 1995). A total of five HMMs were trained to detect the enzymes of this pathway.

| HMM development: sulphoquinovose degradation and transport
As sulphoquinovose is a sulphonated derivative of glucose, it is catabolized in a similar manner and can serve as a carbon and energy source (Hanson et al., 2021). Several pathways resembling glucose degradation have been characterized, including the sulpho-Embden-Meyerhof-Parnas pathway (Denger et al., 2014), the sulpho-Entner-Doudoroff pathway (Felux et al., 2015), the transaldolase-based pathway related to the pentose phosphate pathway (Frommeyer et al., 2020) and a complete degradation pathway based on a sulphoquinovose monooxygenase (Sharma et al., 2022; Figure 2a).
The sulpho-Embden-Meyerhof-Parnas pathway (Figure 2a) begins with import of sulphoquinovose by the transporter YihO. A sulpholipid α-glucosidase YihQ may also be involved and other SQ derivatives may also be imported. Analogous to the EMP pathway, SQ is then cleaved to dihydroxyacetone phosphate (DHAP) and 3-sulpholactaldehyde (SLA) via the isomerase YihS, kinase YihV and aldolase YihT. In an NADH-dependent reaction, the reductase YihU then reduces SLA to the final product 2,3-dihydroxypropane sulphonate (DHPS), which is transported out of the cell again via YihP. A separate HMM was created for each of the Yih proteins.
The sulpho-Entner-Doudoroff pathway is analogous to the ED pathway (Figure 2a). As there was no specific abbreviated name assigned to these enzymes by the original publication (Felux et al., 2015), we assigned names to enhance HMSS2 output readability. SQ is cleaved by a dehydrogenase SedA, a lactonase SedB, a dehydratase SedC, and an aldolase SedD to pyruvate and SLA. Another dehydrogenase, SedE, then oxidizes SLA in an NAD-dependent reaction to 3-sulpholactate (SL), which is then exported. A separate HMM was generated for each of the proteins mentioned, for a total of five HMMs.
The third SQ degradation pathway contains a transaldolase as the key enzyme (Figure 2a; Frommeyer et al., 2020). SQ is imported into this pathway via the transporter SftA and converted to sulphofructose by the isomerase SftI. This product, together with glyceraldehyde-3-phosphate, is then converted by the transaldolase SftT to SLA and fructose-6-phosphate. SLA, in turn, is converted to SL in an NAD-dependent reaction by the dehydrogenase SftD and exported via the transporter SftE or reduced to DHPS in an NADH-dependent reaction by the reductase SftR. A separate HMM was generated for each of the Sft proteins, for a total of six HMMs.
The fourth known degradation pathway for SQ (Figure 2a) differs from the others described so far, because it involves oxidation of the entire molecule, including cleavage of sulphur (Sharma et al., 2022).
The pathway described begins with the import of sulphoquinovosyl glycerol by an ABC transporter called SmoEFGH.In the cytoplasm,

| HMM development: isethionate and taurine degradation
Isethionate and taurine are C2-sulphonates which are produced by eukaryotes from cysteine or methionine (Moran & Durham, 2019).
Bacterial degradation of these compounds includes sulphoacetaldehyde as an intermediate, which is a point of convergence with sulphoacetate degradation (Weinitschke, Hollemeyer, et al., 2010; Figure 2c). Two different transporters are proposed for the import of isethionate (Figure 2b). These are the TRAP transporter IseKLM and IseU from the major facilitator superfamily. After import into the cytoplasm, isethionate is oxidized to sulphoacetaldehyde by the isethionate dehydrogenase IseJ (Weinitschke, Sharma, et al., 2010).
In some organisms, isethionate is not converted, but the sulphonate group is cleaved off by isethionate sulphite lyase IslAB, releasing sulphite and acetaldehyde (Peck et al., 2019).
Taurine import is postulated to be facilitated by the ABC transporter TauAB1B2C or the TRAP transporter TauKLM (Figure 2b).
There are several possibilities for the further pathway. Taurine can either be oxygenated by TauD to form 1-hydroxy-2-aminoethane sulphonic acid, which decomposes to aminoacetaldehyde and sulphite (Eichhorn et al., 1999), or it is oxidized in an NADH-dependent reaction by the taurine dehydrogenase TauXY, which produces sulphoacetaldehyde. The same product is also produced by the transfer of the amino group to pyruvate by taurine:pyruvate aminotransferase Tpa (Bruggemann et al., 2004) or to 2-oxoglutarate by taurine:2-oxoglutarate aminotransferase Toa (Krejcik et al., 2010).
Sulphoacetaldehyde can be converted by the NADPH-dependent sulphoacetaldehyde reductase IsfD to isethionate, which is then exported by the IsfE transporter (Krejcik et al., 2010). Another possible fate of sulphoacetaldehyde is desulphonation coupled to a phosphorylation by sulphoacetaldehyde acetyltransferase Xsc to acetyl phosphate, which is further converted to acetyl-CoA by phosphate acetyltransferase Pta (Weinitschke, Sharma, et al., 2010). Sulphite released in each of these processes is exported via TauE (Weinitschke et al., 2007). An individual HMM was developed for each protein/subunit mentioned here. An exception was made for TauB1 and TauB2, which were combined into a single HMM due to their similarity. Additionally, we trained an HMM for TauZ, a protein of unknown function, and the regulator TauR. Both are commonly found genetically associated with other tau genes.

| HMM development: cysteine synthesis
Cysteine is an essential amino acid with a thiol side chain. Here, we started to cover the relevant proteins with HMMs primarily based on knowledge collected with enterobacterial model organisms.
PAPS reductase CysH then reduces the activated compound to sulphite. In some bacteria, including most cyanobacteria, APS can be reduced to sulphite directly, without phosphorylation to PAPS (Bick et al., 2000). The assimilatory APS reductases catalysing this reaction exhibit similarity to the assimilatory PAPS reductases (Abola et al., 1999; Bick et al., 2000) and are covered by the same HMM (CysH) in this work. In Enterobacteria, sulphite is reduced to sulphide via CysIJ. Finally, cysteine is synthesized from sulphide and O-acetyl-L-serine by the cysteine synthase CysK. A total of 10 new HMMs were generated for the mentioned proteins/subunits. An HMM for YeeE/YedE-like transporters was already available through HMS-S-S (Tanabe & Dahl, 2022).

| HMM validation: cross validation and independent test data set
The HMMs developed were validated by cross-validation and with an independent test data set. In cross-validation, sequences unrelated to the tested HMM training data were added as true negative examples in addition to the omitted training sequences (Chicco, 2017; Refaeilzadeh et al., 2009). The omitted sequences from each fold served as true positive examples. Cross-validation was performed using the optimized thresholds calculated prior to cross-validation.
Thus, the suitability of the threshold values was checked as well. Performance was measured using the MCC. This metric ranges from −1 to 1, with 0 corresponding to random assignment, 1 corresponding to perfect assignment with no misclassification, and −1 corresponding to complete misclassification. Here, the individual occurrence of FP or FN lowers the MCC, while the combination of both misclassifications lowers it more dramatically than the single occurrence of either type of error (Chicco & Jurman, 2020).
The majority of the HMMs developed showed high precision and recall in the cross-validation and on the test data set (Figure 3). Of the 134 HMMs covering proteins of organic sulphur compound metabolism, 127 stayed above an MCC of 0.80 during the cross-validation (Figure 3; Table S2). The evaluation of the 134 HMMs against the independent test data set resulted in 120 HMMs with an MCC of 0.80 or higher. HMMs for the alkanesulphonate transporter subunits SsuB and SsuC narrowly missed the cross-validation threshold of 0.80 by 0.02 points but performed better on the independent test data set.
These were the only cases where the cross-validation performance was insufficient but the performance on the test data set was above the threshold.From the HMMs with an MCC >0.8 during crossvalidation, seven scored below 0.8 in the test data set.These were MsmG with an MCC of 0.78, SmoI (0.76), MsmB (0.66), DddA (0.62), DorA (0.46) and SftD (0.03).For SftD, MsmB, MsmG and DddA this was due to a high number of sequences which were falsely classified as negative, probably due to a low training sequence diversity.Thus, these HMMs had a high precision and did not generate high numbers of false positive hits, but they performed low in recognition resulting in a high number of unrecognized sequences.The opposite was the case for the DorA HMM, which generated too many false positive hits but no FN ones.Sulphoquinovosidase SmoI interfered in the detection with sulphoquinovosidase named YihQ.The same holds true for transporters HpsU and IseU.All sequences that were falsely classified by one of these two HMMs belonged to the other HMM.

| HMM validation: case study
HMSS2 was also validated with 24 complete genomes from bacteria with organic sulphur compound metabolism (Table S3), which were screened for the presence of enzymes for the utilization of taurine, isethionate, DHPS, sulphoquinovose and DMS (Figure 4).
Proteins for taurine utilization were found mainly in the known taurine-utilizing genera Octadecabacter, Roseobacter, Roseovarius and Ruegeria of the Roseobacterales, including the taurine degraders Roseovarius nubinhibens (Denger et al., 2009) and Ruegeria pomeroyi (Gorzynska et al., 2006). These strains encoded the TauABC taurine importer as well as Tpa and Xsc, which constitute the complete degradation pathway from free taurine via sulphoacetaldehyde to acetyl phosphate with the release of sulphite. Roseobacter denitrificans additionally possessed genes for the taurine dehydrogenase TauXY and the taurine:2-oxoglutarate aminotransferase Toa, which can also convert taurine to sulphoacetaldehyde. The sulphoacetaldehyde acetyltransferase Xsc was present in all genomes examined. This is probably because sulphoacetaldehyde is an intermediate not only of taurine degradation but also of isethionate, sulphoacetate and DHPS degradation, and possibly of other

FIGURE 3 Validation of the 134 HMMs generated in this work. Performance was assessed by cross-validation (blue dots) and on an independent test data set (red diamonds). For each HMM, the Matthews correlation coefficient was calculated. HMMs were ranked by their performance in cross-validation.
In line with this possibility, genes encoding isethionate dehydrogenase IseJ, which converts isethionate to sulphoacetaldehyde, were found in almost all analysed Rhodobacterales, Hyphomicrobiales and Gammaproteobacteria genomes, consistent with earlier reports (Weinitschke, Sharma, et al., 2010). Leminorella grimontii, Hyphomicrobium denitrificans and all Methylophaga species were exceptions, consistent with the inability of H. denitrificans and Methylophaga to consume organosulphur compounds with more than one carbon atom.
Instead, it contains sulphoacetaldehyde reductase IsfD (or SarD), which is also present in Bilophila wadsworthia. In both cases, this enzyme may provide an endogenous source of isethionate (Burrichter et al., 2021).
Most analysed genomes possessed the potential for sulph-

Sulphoquinovose degradation via the sulpho-Entner-Doudoroff pathway is present in eight bacteria, including Pseudomonas putida and other bacteria for which this pathway has been described or postulated (Felux et al., 2015). The complete sulphoquinovose degradation pathway based on a sulphoquinovose monooxygenase was found in seven proteobacteria, in accordance with previous reports (Sharma et al., 2022). The other known sulphoquinovose degradation pathways were not detected, most likely because the sulpho-Embden-Meyerhof-Parnas pathway (Denger et al., 2014) occurs primarily in Enterobacterales and the transaldolase-dependent sulphoquinovose degradation in Firmicutes (Frommeyer et al., 2020).
Bacteria from these taxonomic groups were not included in the case study.

FIGURE 4
Presence/absence of proteins involved in the metabolism of organic sulphur compounds. Occurrence of genes for proteins involved in taurine degradation, isethionate degradation, 2,3-dihydroxypropane-1-sulphonate, sulphoquinovose and DMS metabolism is indicated by filled orange, violet, purple, green and light brown circles, respectively. The function of the individual proteins can be deduced from Figures 1 and 2.
In summary, our case study on characterized organosulphur compound degraders has shown that in all cases the detection by HMSS2 agrees with the published analyses of other authors.

| HMSS2 improvements
HMSS2 has a redesigned engine and additional features for protein annotation and output format customization (Figure 5). Proteins with multiple domains are now stored with all domains and not just the domain with the highest score. This was accomplished by improving the local relational database structure. This requires that the recognized domain regions in the primary sequence do not overlap, so that domains with high scores are not overwritten by lower scores.
On the other hand, high-scoring domains may still overwrite one or more lower-scoring domains during annotation.
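The rule described above (all non-overlapping domains are kept, and a higher-scoring domain takes precedence over overlapping lower-scoring ones) can be sketched as a greedy selection. This is an illustration under stated assumptions, not the HMSS2 implementation: the `DomainHit` record, the function names and the example coordinates are hypothetical.

```python
from typing import List, NamedTuple


class DomainHit(NamedTuple):
    hmm: str       # name of the matching HMM
    start: int     # first residue of the hit region (1-based, inclusive)
    end: int       # last residue of the hit region (inclusive)
    score: float   # bit score of the hit


def overlaps(a: DomainHit, b: DomainHit) -> bool:
    """True if the two hit regions share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def keep_non_overlapping(hits: List[DomainHit]) -> List[DomainHit]:
    """Keep hits from best to worst score, skipping any that overlap a kept hit."""
    kept: List[DomainHit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if not any(overlaps(hit, other) for other in kept):
            kept.append(hit)
    # Report the surviving domains in sequence order.
    return sorted(kept, key=lambda h: h.start)


# Hypothetical protein with three candidate domains.
candidates = [
    DomainHit("DomainA", 1, 100, 200.0),
    DomainHit("DomainB", 50, 150, 80.0),    # overlaps DomainA, lower score
    DomainHit("DomainC", 160, 300, 120.0),  # separate region
]
for domain in keep_non_overlapping(candidates):
    print(domain.hmm, domain.start, domain.end, domain.score)
```

In this example DomainB is dropped because it overlaps the higher-scoring DomainA, while DomainC is stored alongside DomainA because its region is separate.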
Gene arrangement can now be used by HMSS2 for annotation as a nonhomologous criterion. Hits below the threshold are also considered and annotated if they lie within a gene cluster and the potentially assigned annotation would complete a known gene cluster arrangement. Thus, a gene that very likely belongs to a gene cluster only needs to reach a lower cutoff than usual to be detected if it is encoded within such a cluster.
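The idea of rescuing sub-threshold hits by genomic context can be sketched as a two-pass annotation. The sketch below is only an assumption about how such a rule might look; the cutoff values, the neighbourhood window, the cluster definition and all names are hypothetical and do not reproduce the HMSS2 algorithm or its thresholds.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Gene = Tuple[str, str, float]  # (gene id, best-matching HMM, bit score)


def annotate_with_context(
    genes: List[Gene],                  # genes in their order on the contig
    strict: Dict[str, float],           # regular per-HMM cutoffs
    relaxed: Dict[str, float],          # lower cutoffs used inside clusters
    clusters: List[FrozenSet[str]],     # known gene cluster compositions
    window: int = 3,                    # neighbours inspected on each side
) -> List[Optional[str]]:
    # Pass 1: homology only, using the regular cutoffs.
    labels = [hmm if score >= strict[hmm] else None for _, hmm, score in genes]
    rescued = list(labels)
    # Pass 2: accept a sub-threshold hit if it completes a known cluster.
    for i, (_, hmm, score) in enumerate(genes):
        if labels[i] is not None or score < relaxed[hmm]:
            continue
        lo, hi = max(0, i - window), min(len(genes), i + window + 1)
        neighbours = {label for label in labels[lo:hi] if label is not None}
        if any(hmm in c and c - {hmm} <= neighbours for c in clusters):
            rescued[i] = hmm
    return rescued


# Hypothetical contig and cutoffs for illustration only.
contig = [("g1", "TauA", 300.0), ("g2", "TauB", 90.0),
          ("g3", "TauC", 250.0), ("g4", "Xsc", 90.0)]
strict_cutoffs = {"TauA": 200.0, "TauB": 150.0, "TauC": 200.0, "Xsc": 400.0}
relaxed_cutoffs = {"TauA": 100.0, "TauB": 60.0, "TauC": 100.0, "Xsc": 80.0}
known_clusters = [frozenset({"TauA", "TauB", "TauC"})]
print(annotate_with_context(contig, strict_cutoffs, relaxed_cutoffs, known_clusters))
```

Here the weak TauB hit is accepted because its neighbours already complete a tauABC arrangement, whereas the equally weak Xsc hit stays unannotated because no known cluster supports it.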
The output formats have been greatly expanded, and new features were added to improve usability and readability. It is still possible to retrieve sequences filtered by protein type, genomic proximity and the presence of proteins or gene clusters in the same genome. HMSS2 automatically recovers a list of all hits with genomic features and a separate protein sequence file in FASTA format.
Additionally, two subsets of the latter file are created. One subset includes all hits that are unique to their genome, while another subset includes all hits that occur at least twice in the same

S4). The observed increase in execution speed for HMSS2 became more significant as the number of genomes processed increased and scaled linearly with the number of input assemblies. While HMS-S-S required around 26 min to process 64 assemblies, HMSS2 needed only 7 min for this task. Thus, the introduced improvements led to a roughly fourfold increase in computation speed for HMSS2 (26 min versus 7 min, a factor of about 3.7).

| DISCUSSION
Here, we present a substantial update that provides an HMM-based search tool for proteins involved in the metabolism of inorganic and organic sulphur compounds. The high accuracy of the advanced tool presented here provides a reliable basis for genome analysis and is

We also significantly broadened the applicability of HMSS2 by adding the conversion of sulphonated carbon compounds. HMSS2 now covers pathways from the entire sulphur cycle, enabling studies on the link between the cycles of inorganic and organic sulphur compounds. Beyond the operon structure information provided to support equivalence prediction, the accessibility and display of the annotated proteins have been greatly enhanced. Not only can sequences now be filtered by annotation, but the presence of genes and their genomic context can also be displayed using other specialized applications, further extending the capabilities of synteny analysis.
Such analyses are not limited to studies of the ecological role of prokaryotes but also include the evolution of metabolic pathways (Garcia et al., 2022), the distribution of new pathways (Sharma et al., 2022) and genomic context visualization (Garcia et al., 2019; Letunic & Bork, 2021).
The expansion to the metabolism of organic sulphur com-

Furthermore, the reliability of prediction is raised when genomic context is paired with the prediction made by the HMM detection, as already discussed above.

| CONCLUSIONS
In summary, HMSS2 is an advanced comprehensive HMM-based tool for annotation and synteny analysis of prokaryotic sulphur metabolism. It has a higher speed and a much wider coverage than its predecessor HMS-S-S and now includes proteins involved in the metabolism of inorganic and organic sulphur compounds. The use of curated functionally equivalent sequences for HMM training resulted in HMMs with high precision and recall. This also fills a gap in the coverage of sulphur metabolism prediction by HMMs. The application possibilities also include the combination with other HMMs from public databases or user-defined models and can therefore be extended according to the user's needs. The improved output formats are also applicable to ecology and evolutionary research.
order. Nucleotide input files are first searched for open-reading frames and translated into protein sequences by Prodigal. This step is omitted if protein sequences are provided. Profile HMMs are then queried against the protein sequences of the current file with validated bit score cut-offs via hmmsearch. Hits are saved in a local database together with corresponding genomic features and protein amino acid sequences. The local database now uses the SQLite database engine and an improved database table structure.
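The workflow described above (gene prediction with Prodigal, HMM search with hmmsearch, storage of hits in SQLite) can be sketched as follows. This is a simplified illustration and not the HMSS2 code: the function names, the single-table schema and the way the cutoff is passed to hmmsearch are assumptions. The command builders only assemble argument lists, so the external programs are not executed here.

```python
import sqlite3
from typing import Iterable, Iterator, List, Tuple

Hit = Tuple[str, str, float]  # (protein id, HMM name, full-sequence bit score)


def prodigal_cmd(nucleotide_fasta: str, protein_fasta: str) -> List[str]:
    """Command line for ORF prediction and translation (-a writes proteins)."""
    return ["prodigal", "-i", nucleotide_fasta, "-a", protein_fasta]


def hmmsearch_cmd(hmm_file: str, protein_fasta: str, table: str, cutoff: float) -> List[str]:
    """Command line for an HMM search with a bit score reporting threshold (-T)."""
    return ["hmmsearch", "--tblout", table, "-T", str(cutoff), hmm_file, protein_fasta]


def parse_tblout(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield hits from hmmsearch --tblout output, skipping comment lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        # Columns: target name, accession, query name, accession, E-value, score, ...
        yield fields[0], fields[2], float(fields[5])


def store_hits(conn: sqlite3.Connection, genome: str, hits: Iterable[Hit]) -> None:
    """Store hits in a hypothetical single-table schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits "
        "(genome TEXT, protein_id TEXT, hmm TEXT, score REAL)"
    )
    conn.executemany(
        "INSERT INTO hits VALUES (?, ?, ?, ?)",
        [(genome, protein_id, hmm, score) for protein_id, hmm, score in hits],
    )
    conn.commit()


# Example with one made-up result line instead of a real hmmsearch run.
example_table = [
    "# target name  accession  query name  accession  E-value  score  bias ...",
    "prot_1 - TauA - 1.2e-80 265.3 0.1 1.5e-80 265.0 0.1 1.0 1 0 0 1 1 1 1 hypothetical protein",
]
connection = sqlite3.connect(":memory:")
store_hits(connection, "genome_A", parse_tblout(example_table))
print(connection.execute("SELECT * FROM hits").fetchall())
```

In a real run, the two command lists would be passed to `subprocess.run` and the resulting table file would be read line by line; genomic features and sequences would need additional tables.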

FIGURE 2 Prokaryotic metabolism of organosulphur compounds with two or more carbon atoms and relevant transporters. (a) Pathways of sulphoquinovosyl glycerol degradation. (b) Transport systems for import and export of organic sulphur compounds. (c) Degradation pathways of C2 and C3 organosulphur compounds. Usually, the same cell does not contain all the pathways. All proteins shown have a corresponding HMM in HMSS2. Cytc, Cytochrome c; DHPS, 2,3-dihydroxypropane-1-sulphonate; FMN, flavin mononucleotide; FMNH2, reduced flavin mononucleotide; 2-OG, 2-oxoglutarate.
sulphoquinovosyl glycerol is cleaved by the sulphoquinovosidase SmoI to SQ. In contrast to the other pathways, SQ is now transformed to 6-oxo-glucose and sulphite by an alkanesulphonate monooxygenase, SmoC. The electrons for this reaction come from NADPH via the flavin reductase SmoA. 6-oxo-glucose is converted in another NADPH-dependent reaction by SmoB into glucose, which is then available for glycolysis. Eight HMMs were generated for this pathway, one for each protein. An additional HMM was trained for SmoD, a putative regulator encoded in the smo operon.
Together, these two HMMs performed well in detecting isethionate and DHPS transporters of the major facilitator superfamily. The situation was similar for YihO and SftA, which are both postulated sulphoquinovose importers that fulfil the same function in the context of sulphoquinovose degradation. In summary, 112 of 134 HMMs were successfully tested via cross-validation and with an independent data set. Two other pairs of HMMs can be used together for the reliable detection of sulphoquinovosidases and of the transporters YihO and SftA.
opyruvate and (R)-sulpholactate generation from DHPS and (L)-sulpholactate. The potential for (R)-DHPS oxidation via HpsN, generating 2 NADH equivalents, was found in all analysed strains, and most also encoded the isomerization of (S)-DHPS to (R)-DHPS via HpsP (17/24 genomes). The predicted presence of genes for desulphonation of sulphopyruvate by ComDE and sulpholactate by SuyAB as found here is also in accordance with previous reports for the Roseobacterales clade (Chen et al., 2021; Denger et al., 2009), the Hyphomicrobiales (Chen et al., 2021), Desulfovibrio desulfuricans and B. wadsworthia (Hanson et al., 2021). Even without the ability to desulphonate sulphopyruvate or sulpholactate, the conversion of DHPS to sulphopyruvate or sulpholactate and export of these as end products provides 2-3 NADH equivalents and thus a growth advantage for the organism.
genome. Multidomain proteins, retrieved by the requested protein type, are listed separately if at least one other domain has been annotated. An output module for iTol compatible data sets was also included. This module integrates the generation of iTol data sets for the presence/absence of the keywords/domains for each genome. Range data sets, which mark specific proteins in a phylogenetic tree, can now also be generated by HMSS2, as well as iTol-compatible data sets for displaying gene clusters. HMSS2 also comes with several utilities to modify the output protein FASTA files. It is now possible to assign the taxonomic name of the source organism to each sequence. Files can now be filtered by length and merged without duplicating sequence identifiers, and sequences from multiple FASTA files originating from the same organism can be concatenated into a single sequence. With a FASTA-formatted file as input, a list of neighbouring genes is now accessible to support searches for conserved but previously undiscovered gene constellations.

FIGURE 5 Algorithm overview of HMSS2. New features added in HMSS2 are highlighted in yellow. The only external programs required are HMMER3 and Prodigal.

The execution time of HMSS2 was compared to that of HMS-S-S to demonstrate the scalability and efficiency of HMSS2. For this test, increasing numbers of genomes were randomly selected from the assemblies of the training data set and gene clusters were annotated and determined with the 164 HMMs of the original library. Time measurements were performed in triplicate with random selection of input assemblies for each replicate. The execution time was then averaged over all replicates. Comparison between the two versions showed a large difference in the required execution time (Figure 6; Table

further supported by the genomic context detection. The HMSS2 algorithm now uses homologous and nonhomologous criteria already in the protein annotation step, not just for the later identification of gene clusters. In addition, the overall execution time was accelerated fourfold compared to the previous version, further speeding up the detection of sulphur metabolism pathways in genomes and metagenomes. With the increasing number of available genomes, faster protein annotation is required to handle the immense amount of available data.

pounds resulted in the generation of 134 new HMMs in addition to the 164 HMMs previously included in HMS-S-S, almost doubling the total number of proteins included. The accuracy of the newly generated HMMs and the respective thresholds were demonstrated by cross-validation and a test data set. Observed deviations between both testing methods are likely due to an uneven distribution and abundance of protein sequences influencing the number and diversity of testable sequences. The quality of the 134 novel HMMs was ensured by selection of high-quality genomes derived from the RefSeq and GenBank databases. The overall development process had already been successfully applied for the proteins of inorganic sulphur metabolism (Tanabe & Dahl, 2022). The test data set was obtained from the full diversity of phyla accessible from GenBank and should therefore reflect the widest possible range of sequence variation. However, although the cutoff values have been validated, they are likely to need adjustment for newly discovered phyla (Anantharaman et al., 2018; Jaffe et al., 2020). The diversity of proteins involved in the metabolism of organic sulphur compounds covered by HMSS2 also includes less prominent pathways for degradation and conversion of compounds such as sulphoquinovose or DMS. Although a considerable proportion of sulphur in the biosphere is bound in substrates or intermediates of these pathways, they are not commonly included in annotation pipelines and are often unrecognized or incorrectly annotated. This is illustrated by the fact that only 16 of the 124 proteins included here for the conversion of sulphoquinovose, taurine, isethionate or DMSP have an exact counterpart in PFAM (El-Gebali et al., 2019) or TIGRFAMs. In contrast, eight of ten HMMs covering sulphate assimilation for cysteine biosynthesis have a TIGRFAM equivalent. A common problem in the functional annotation of enzymes involved in the metabolism of organic sulphur compounds is posed by enzymes, such as DmsA or DorA, that belong to the DMSO reductase superfamily. This family includes tetrathionate reductase, polysulphide reductase and thiosulphate reductase, as well as several other proteins unrelated to sulphur metabolism. Tertiary structure and complex composition are conserved throughout all members of this family (McEwan et al., 2010) and substrate specificity may only arise through a small number of conserved amino acids at the active site (Struwe et al., 2021). The validation performed here showed that related complexes in the DMSO reductase family did not negatively affect the HMMs for DmsA and DorA.

FIGURE 6 Computing time required by HMS-S-S compared to HMSS2. Tests were performed in triplicate with defined numbers of randomly selected sulphur-oxidizing or sulphur-reducing prokaryotes and 164 HMMs. White circles: HMS-S-S, orange diamonds: HMSS2.