MetaTrass: A high‐quality metagenome assembler of the human gut microbiome by cobarcoding sequencing reads

Abstract Metagenomic evidence of great genetic diversity within the nonconserved regions of the human gut microbial genomes appeals for new methods to elucidate the species‐level variability at high resolution. However, current approaches cannot satisfy this methodologically challenge. In this study, we proposed an efficient binning‐first‐and‐assembly‐later strategy, named MetaTrass, to recover high‐quality species‐resolved genomes based on public reference genomes and the single‐tube long fragment read (stLFR) technology, which enables cobarcoding. MetaTrass can generate genomes with longer contiguity, higher completeness, and lower contamination than those produced by conventional assembly‐first‐and‐binning‐later strategies. From a simulation study on a mock microbial community, MetaTrass showed the potential to improve the contiguity of assembly from kb to Mb without accuracy loss, as compared to other methods based on the next‐generation sequencing technology. From four human fecal samples, MetaTrass successfully retrieved 178 high‐quality genomes, whereas only 58 ones were provided by the optimal performance of other conventional strategies. Most importantly, these high‐quality genomes confirmed the high level of genetic diversity among different samples and unveiled much more. MetaTrass was designed to work with metagenomic reads sequenced by stLFR technology, but is also applicable to other types of cobarcoding libraries. With the high capability of assembling high‐quality genomes of metagenomic data sets, MetaTrass seeks to facilitate the study of spatial characters and dynamics of complex microbial communities at enhanced resolution. The open‐source code of MetaTrass is available at https://github.com/BGI-Qingdao/MetaTrass.


INTRODUCTION
Through sequencing and analyzing the DNA of microbial communities directly from the environment, metagenomics has shown significant promise in advancing the study of uncultured microbiomes [1,2].A high level of genetic diversity has been unveiled by the growing number of comprehensive metagenome-assembled genomes (MAGs), particularly from human gut microbiomes [3,4].The progress in metagenomics has shed new light on the study of spatial distribution and dynamics of complex microbial communities from the human gut [5][6][7].
Functional mining of high-quality strain-resolved genomes revealed a strong correlation between genotypic differences among strains and their phenotypic differences [8,9].Intra-species nonhomologous genes can serve as biomarkers to distinguish pathogenic strains from their commensal counterparts within a species [10][11][12].The percentage of conserved intra-species homologous genes shared among strains could be as low as 40% [13], and the remaining nonconserved genome sequences are considered to have a significant contribution to the phenotypic diversity of microorganisms.Therefore, complete genomes from a microbial sample at the species level will enable a more comprehensive view of intra-species genome diversity.But it is still a challenge to generate sufficient high-quality genomes from metagenomic data sets.
Most of the current approaches to analyzing microbial communities are designed to work with economical next-generation sequencing (NGS or high-throughput sequencing [HTS]) reads [14].Many highly modularized computational tools have been developed, including genome assemblers, genome binners, taxonomic binners, and taxonomic profilers [15][16][17].The assembly-first-andbinning-later strategies have been commonly used to generate MAGs.In these conventional strategies, short reads from a microbial community are first assembled into longer sequences by metagenomic assemblers with the consideration of uneven coverage depths of different microbial species [18][19][20] and further grouped into individual genomes by genome binners based on K-mer composition and read coverage [21][22][23].However, these conventional strategies often failed to resolve the long inter-species repeats during contig reconstruction.Therefore, the contiguity of the draft genomes recovered from NGS reads remains not long enough to study large structural variations in microbial genomes.
Various sequencing technologies with long-range information accompanied by specialized computational tools have been released to overcome the problem of long repeats.The third-generation single-molecule real-time sequencing (TGS) technologies developed by Pacific Biosciences and Oxford Nanopore Technology (ONT) can produce contiguous reads with lengths up to hundreds of kilobases, and they show great potential to generate complete genomes from both cultured and uncultured microbial communities [24][25][26].With the use of chromatin-level contact probability information generated by high-throughput chromosome conformation capture (Hi-C) technology, more high-quality MAGs with improved contiguity can be retrieved [27].For NGS data sets, the coabundance of species in multiple samples with the common K-mer composition is also used to improve the capability to achieve better assembly quality [28].However, these approaches have several limitations.The high sequencing error rate in TGS long reads hampers the distinction between true biological variations and sequencing errors.An effective contact map with an Hi-C library can only be established for a draft genome with long contiguity.Constructing coabundance in multiple samples ignores the genome characteristics of a single sample and increases the sequencing cost.
The cobarcoding sequencing library [29][30][31][32][33], an improved short-read sequencing technology with longrange genomic information, can provide an alternative way to improve metagenomic analysis.In cobarcoding library construction, long fragments of DNA molecules are first distributed into different isolated partitions and further sheared into shorter subfragments.Then these subfragments are indexed with a unique barcode (DNA cobarcoding).Finally, the cobarcoded subfragments are sequenced by standard short-read sequencing platforms.For different cobarcoding libraries, such as BGI's single tube long fragment reads (stLFR) [32], 10X Genomics' linked-reads [34], and Illumina's contiguity preserving transposase sequencing [30], technical differences in the number of barcodes and the short-read coverage of the original long fragment have a great impact on the downstream analysis [35][36][37][38].The cobarcoding correlation on the draft sequences or the assembled graph has been successfully applied to improve the contiguity of assembled genomes for both large eukaryotic genomes [39][40][41] and metagenomes [31,42,43].However, none of these conventional strategies can conquer the inherent problem of long repeats among species with uneven abundance for complex microbial communities.
In this study, we introduce a pipeline named Metagenomic Taxonomy Read Assembly of Single Species (MetaTrass) based on cobarcoding sequencing data and reference genomes.Unlike conventional strategies, MetaTrass is featured with an integrative binningfirst-and-assembly-later approach.The cobarcoding information is mainly used for two purposes.First, it facilitates the initial taxonomic binning procedure by clustering reads with the same barcode.Second, it improves the assemblies by providing long-range information.We applied MetaTrass to stLFR data sets of a mock microbial community and four human gut microbial communities to evaluate its capability of producing high-quality draft genomes with long contiguity and high taxonomic resolution.The microbiome composition and genetic diversity in the four human gut microbial communities were quantitatively analyzed using the high-quality draft genomes assembled by MetaTrass.In addition, the high-quality draft genomes assembled by MetaTrass have clear taxonomic information determined by the taxonomic binning, thus benefiting the microbial downstream analysis.All results were benchmarked by comparison with the existing mainstream tools.

MetaTrass pipeline
In this study, we developed a metagenome assembly pipeline named MetaTrass that integrates the information of reference genomes and DNA cobarcoding technology.As shown in the flowchart (Figure 1A), the taxonomic binning of stLFR reads was carried out before the genome assembling, unlike conventional assemblyfirst-and-binning-later strategies.In taxonomic binning, the metagenomic stLFR reads were classified into different taxonomic ranks by Kraken2 [44].Due to the inherent limitations of Kraken2, only reads from conserved species-specific regions can be successfully classified into the corresponding species.On the contrary, reads from inter-species repeat and unique genome-specific regions could not be effectively classified (Figure 1B).Here, the species-specific regions represent similar or repeat sequences among genomes from the same species.The inter-species repeat regions represent similar or repeat sequences among genomes from different species.The unique genome-specific regions represent dissimilar sequences between genomes from the same species.The reads from inter-species repeat regions are classified into the higher taxonomic ranks of the target species and those from unique genome-specific regions are categorized as unclassified.In total, about 10% of the reads were classified into ranks higher than the bacterial species level for the four human fecal data sets and about 9% of the reads were unclassified (Supporting Information: Table S1).In the step of cobarcoded read refinement, the cobarcoding correlation between the reads from species-specific regions and those from inter-species repeat or unique genome-specific regions was used to refine the final read set for a target species (Supporting Information: Figure 1B).The barcodes of the species-specific reads were first extracted as the candidate barcodes.Then, the final barcodes were collected by a constraint of data size and the quality of cobarcoding information.Finally, reads of the target species were collected based on the final barcodes to be assembled in the following step.A data size threshold was set to reduce computational consumption for the species with extremely high abundance.The data size threshold was set to 300× by default according to the parameter sweep results of the sample P_Gut_Meta01 (Supporting Information: Table S2).The quality of cobarcoding information of a specific barcode was quantified by the number of reads with a confident species classification and the proportion of these reads among the total reads.In cobarcoded read assembly, the refined reads of each species were independently assembled by Supernova [39].As aforementioned, there is a very small chance that long fragments from different species could be indexed with the same barcode in the stLFR libraries (Supporting Information: Figure S1).So, some false-positive reads were introduced in the cobarcoded refinement procedure, and they were assembled to form contaminant sequences in draft assemblies.In the step of contamination removal, we eliminated these contaminant assemblies according to their dissimilarity to the reference genomes.The thresholds of average nucleotide identity (ANI) and alignment fraction (AF) were set to 90% and 50% by default according to the optimal results of the sample P_Meta_Gut01 (Supporting Information: Table S3).Overall, the comprehensive use of cobarcoding information and references in our approach could reduce the false-negative effects of taxonomic binning and the false-positive effects of cobarcoded read refinement.

Assembly of the mock microbiome
The binning-first-and-assembly-later strategy has been widely adopted to assemble haplotype genomes for eukaryotes with large genome sizes [36,45].However, it has rarely been used to assemble metagenomes.We first applied MetaTrass to assemble stLFR read sets of the mock microbial community.In total, up to 99.4% of the reads were confidently assigned species-level taxonomy given the simplicity of the mock microbial community with low inter-species repeats and unique genomespecific regions (Supporting Information: Table S4).To investigate the efficiency of our strategy, we compared it with the mainstream mixed assembly approaches (Figure 2A).Besides the MetaTrass analysis, the stLFR reads were also directly assembled by IDBA-UD [20], MEGAHIT [18], Supernova [39], CloudSPAdes [42], and Athena [31].Additionally, the optimal mixed assemblies of ONT and Illumina NGS reads in Nicholls's work [46] were also included for comparison, and assembled by WTDBG [47] and by SPAdes [48], respectively.The draft genome of each species in a mixed assembly was refined by our contamination removal module.
Overall, our pipeline was superior in the production of draft genomes with high genome fractions and long contiguity (Figure 2).Two species, Enterococcus faecalis and Lactobacillus fermentum, were incompletely assembled by Supernova with genome fractions as low as 17.7% and 8.9%.On the contrary, both species were properly recovered by MetaTrass, indicating that the assembly complexity caused by uneven abundances was reduced by taxonomic binning.When compared to the results of TGS reads assembled by WTDBG, the assemblies by MetaTrass had high genome fractions, similar to those by NGS or cobarcoding assemblers designed for metagenomes.As compared with the NGS assemblers, the cobarcoding and TGS assemblers produced draft genomes with better contiguity, among which MetaTrass showed the best performance.MetaTrass produced seven draft assemblies with NG50 around 2 Mb and obtained the highest number of assemblies with NGA50 around 2 Mb.Further, MetaTrass yielded fewer assembly errors as compared to the TGS assembler (Supporting Information: Figure S2).The average numbers of mismatches and indels per 100 kb in assemblies of stLFR reads were 60 and 10, respectively, which were smaller than those of the ONT assemblies.It was worth noting that the improvement of the assembly of Pseudomonas aeruginosa genome by MetaTrass was not large, and this may come from the fact that this species had the least amount of valid cobarcoding information (Supporting Information: Table S5).The valid cobarcoding information about each species was evaluated by counting the number of valid long fragments from each species.A valid fragment was defined as a cluster with more than five paired-end reads and a maximum distance between two paired-end reads longer than 10 kb, by aligning paired-end reads to reference.

Assembly of four human gut microbiomes
To evaluate the robustness of our approach, we applied MetaTrass to four human fecal samples to study human gut microbial communities.The comprehensive genome references of the Unified Human Gastrointestinal Genomes (UHGG V1.0) were used to classify NGS reads by Kraken2 [44], and the community compositions were estimated through the classified reads at different taxonomic ranks (Supporting Information: Figure S3-S6).The three healthy samples had a similar microbial community, in which the major microbiomes were from the bacterial phylum Firmicutes A. This microbial community was different from the patient microbial community dominated by the bacterial phylum Proteobacteria, which is demonstrated to be strongly correlated with enteric diseases caused by dysbiosis in gut microbiota [49].The numbers of species with an abundance higher than 10× were 113, 108, 93, and 158 in samples H_Gut_Meta01, H_Gut_Meta02, H_Gut_Meta03, and P_Gut_Meta01, respectively.
The genome fraction of an assembly to the reference was used to evaluate the completeness of genome assembly.The genome fraction for all samples widely ranges from 0% to 90%, and the distributions of H_Gut_Meta01 and H_Gut_Meta02 were more concentrated than those of H_Gut_Meta03 and P_Gut_Meta01 (Figure 3A).More than half of the assemblies with confident species-level taxonomy had a genome fraction higher than 50%.Considering the large genetic diversity between sample genomes and reference genomes [8], these results suggested that our pipeline was able to obtain complete genomic information for the species with an abundance higher than 10×.The genetic diversity was also confirmed by the significant differences in genome fraction and the ratio of assembled length to the reference length among the four samples (Supporting Information: Figure S7).The distribution of genome N50 values was generally dispersed ranging from a few kb to several Mb, and the medians of H_Gut_Meta02 and H_Gut_Meta03 were higher than those of H_Gut_Meta01 and P_Gut_Me-ta01 (Figure 3B).Nevertheless, the third quartiles in the box plots for all the samples were larger than 100 kb, demonstrating that our pipeline has a strong capability to generate draft genomes with long contiguity.Note that from these three healthy samples, plenty of draft genomes with ultra-long contiguity (N50 > 1 Mb) were obtained, which provide possibilities to study the large genome difference in the microbiome.
Considering the intra-species genetic diversity, we also evaluated the quality of metagenomic assemblies by CheckM.The medians of the completeness of all assembled genomes in the three healthy samples were larger than 92%, while the medians of contamination were smaller than 2% (Figure 3C).The median completeness of the patient sample was about 83%, and the median contamination was about 7% (Supporting Information: Figure S8).A great number of high-and medium-quality genomes were recovered by MetaTrass from the four samples (Figure 3D).Fifty-two high-quality and 37 medium-quality genomes were produced for H_Gut_Meta01, 55 and 24 for H_Gut_Meta02, 47 and 16 for H_Gut_Meta03, and 24 and 28 for P_Gut_Meta01, respectively.

Comparison to the assembly-first-andbinning-later strategies
To further evaluate our approach's efficiency, we compared it with conventional assembly-first-andbinning-later approaches as listed in the Section "Methods."It should be noted that currently, there are still no genome binning tools to directly exploit the cobarcoding information.With regard to the number of genomes with a completeness of >50% (Supporting Information: Table S6), MetaTrass outperformed all the other methods on all data sets.Especially for P_Gut_-Meta01, MetaTrass yielded 117 draft genomes with completeness higher than 50%, much more than the 66 ones obtained by the optimal combination of Supernova [39] and Maxbin2.0 [22].
Through the comprehensive analysis of the completeness, contamination, and taxonomic rank of each genome, we assessed MetaTrass and conventional strategies on the ability to get high-and medium-quality genomes and resolution of taxonomic rank (Figure 4).For different samples, the best combinations to produce optimal results were different.The combinations of MetaSPAdes [19] and Maxbin2.0,Supernova and MetaBAT2 [23], MetaSPAdes and MetaBAT2, and Athena [31] and MetaBAT2 were optimal for H_Gut_Meta01, H_Gut_Meta02, H_Gut_Me-ta03, and P_Gut_Meta01, respectively.For the four samples, the optimal results of the conventional strategies were still inferior to those of MetaTrass.For the example of H_Gut_Meta01, the combination of MetaSPAdes and Maxbin2.0 produced 41 high-and medium-quality genomes, which was much less than the 90 obtained by MetaTrass.There were only 3 out of a total of 18 highquality genomes with a taxonomic rank lower than the bacterial order, but 15 out of 52 for MetaTrass.Compared with the strategies those only used NGS-read information, MetaTrass outcompeted all these approaches by producing higher quality and finer resolution.These results demonstrated that the usage of cobarcoding information in MetaTrass was more efficient and accurate than those in the conventional approaches.
The human gut microbiome composition attracts much attention for many reasons, one of them being its strong correlation with personality traits [50].To compare the microbiome composition of high-quality genomes obtained with different methods, we uniformly used GTDB-Tk [51] to annotate these genomes.The annotated taxonomic information was listed in Supporting Information: Table S7.Taxonomic trees of the high-quality genomes obtained by MetaTrass were constructed based on the taxonomic information, and the corresponding N50 values were attached in the left histogram as shown in Figure 5.The high-quality genomes obtained by the conventional strategies were marked in red in the middle of the heatmap (Figure 5) if the genome of the same species was also assembled by MetaTrass.The topology of the taxonomic tree of genomes assembled by MetaTrass gave comprehensive insights into the microbial composition structure.From the trees in Figure 5 and Supporting Information: Figures S9-S11, the numbers of the bacterial order with high-quality genomes assembled by MetaTrass were 9, 11, 7, and 7 for H_Gut_Meta01, H_Gut_Meta02, H_Gut_Meta03, and P_Gut_Meta01, respectively.Notably, some bacterial orders contained more than five highquality genomes, and this could be used to study the microbiome structure at the genome-wide scale in the same sample.For the sample H_Gut_Meta01 (Figure 5), there were 27 and 14 high-quality genomes classified into the bacterial order Lachnospirales and Oscillospirales, respectively.These two were exactly the dominating bacterial orders according to the taxonomic abundance distribution.Similar results were obtained for the other two healthy samples (Supporting Information: Figures S9  and S10), indicating that the microbiome with higher sequencing coverage could be better assembled in MetaTrass.In contrast, the bacterial orders with more than five high-quality genomes were Enterobacterales and Actinomycetales for P_Gut_Meta01 (Supporting Information: Figure S11).The obvious difference between the healthy and patient samples was consistent with the microbial composition differences observed in the taxonomic binning results.MetaTrass successfully assembled most of the high-quality genomes of all the conventional strategies in our tests.Only 25 out of 137 genomes with high-quality generated by all the conventional strategies were not assembled by MetaTrass (Figure 5).The heatmaps showed that most of the conventional strategies could assemble draft genomes for each bacterial order, but the numbers of genome in the same bacterial order were relatively small.The maximal number of genomes in one bacterial order was six and obtained by the combination of Supernova and Meta-BAT2 for the bacterial order Lachnospirales.Moreover, 146 of 179 high-quality genomes had N50 values larger than 100 kb, demonstrating that MetaTrass had a strong ability to improve the contiguity of assemblies.

Genetic diversity in different samples
Different types of genomic variations in gut microbiomes are strongly associated with host health, and the genetic diversity among different microbiomes has been intensively studied to unravel the genetic origin of phenotypic differences among people of different geographical origins or health statuses [52,53].By aligning draft genomes to the references, we called variations for highquality genomes for each species in different samples, including single nucleotide (SNV) as well as small and large indels.For different variations, the number of SNVs was significantly larger than those of the small and large indels for the four samples (Supporting Information: Figure S12).Meanwhile, median values of variation numbers in the three healthy samples were similar but obviously larger than those in the patient sample.The smaller number of variations in the patient sample was due to the fewer alignments than those in the healthy samples according to the QUAST [54] evaluation.However, when we removed the effect of the total aligned length by calculating the SNV density, the patient sample showed denser SNV than the healthy samples (Supporting Information: Figure S12).The median was about 21 for the patient sample, but around nine for the healthy samples.This difference could be related to the individual's physiological state, which was related to the diseases, territory, and race [4].
Based on the taxonomic information of the highquality genomes, we found 15 species shared by three samples, among which 14 species appeared in the three healthy samples and one species Escherichia appeared in the patient and two healthy samples.By analyzing the SNV density and intersection of variations between different samples for each species in three healthy samples, we further investigated the genetic diversity between species from different samples.The SNV densities were different for distinct species even in the same sample but were similar for the same species in different samples (Figure 6A).From Figure 6B-D, the number of unique and shared variations in different types significantly fluctuated for different species, but their difference among samples showed great consistency.H_Gut_Meta01 and H_Gut_Meta02 shared the highest number of variations among all sample pairs.Furthermore, the ratio of large indels shared by all three healthy samples to the total was much smaller than those of SNVs and small indels.These results demonstrated that large variations were more sample-specific than small variations, which is consistent with previous observations of the association between host health and structural variations in the human gut microbiome [52].

Computational performance
Runtime and used thread number of each assembler were recorded for all the human fecal data sets (Table 1).
F I G U R E 5 Taxonomic tree of the high-quality genomes assembled by MetaTrass for H_Gut_Meta01.The taxonomic tree is on the left.The distribution of the high-quality genomes assembled by other methods is colored red in the middle heatmap.N50 of each high-quality genome is shown in the right histogram.
Most of the assemblers were tested on a small server of 24 Intel(R) Xeon(R) Silver 4116 CPU @ 2.10 GHz, except for Athena [31] and Supernova [39], which were tested on other high-performance computing clusters due to their large memory requirements.The thread number set in each assembler was the same for different samples.The time consumption of the format conversion from stLFR reads to 10X linked-reads was not included, which was about 500 min for a data set with 50 Gb in a single thread.We found that MetaTrass was less F I G U R E 6 Single nucleotide variation (SNV) density and the number of unique and shared variations for each species appearing in all three healthy samples.(A) is the SNV density.(B), (C), and (D) are the number of SNVs, small, and large indels, respectively.The species numerical order in subfigures (B), (C), and (D) corresponds to the appearance order of species from left to right in subfigure A.
T A B L E 1 Runtimes and thread number of each assembler for all the human gut data sets Note: The exact runtime of assembling sample P_Gut_Meta01 by Athena was not collected correctly due to several uncontrolled interrupts on the high-performance computing cluster.
time-consuming than Athena but more time-consuming than other assemblers.This may be because both MetaTrass and Athena contained many subassembling, which took most of the time among all subprocesses in MetaTrass (Supporting Information: Table S8).Since the subassembling was independent, it could be run in parallel to further speed up by increasing the parallel number.Under the default parallel number of 8, the memory peaks used by MetaTrass were more than 50 Gb, but not more than 100 Gb for the four human fecal samples (Supporting Information: Table S9).

DISCUSSION
High-quality genomes at the species level are strongly demanded to investigate the genetic origins of diseases associated with the human gut microbiome.However, how to get a sufficient number of them in one sample is still a challenge due to the inter-species repeats and uneven abundance in metagenome assembly.In this study, we developed a tool to get high-quality genomes with fine taxonomic resolutions by combining the cobarcoding information with public references.Compared with conventional strategies, our pipeline generated a large number of high-quality genomes for the human microbiome cobarcoding data sets.Meanwhile, plenty of draft genomes were also assembled with an NG50 value larger than 1 Mb, some of which were even longer than the references for both the mock and human fecal data sets.For all four human fecal samples, 178 draft genomes with high completeness and low contamination were generated by MetaTrass, but their genome fractions relative to the references were low.The differences between the sample genomes assembled by MetaTrass and the reference genomes demonstrated that the cobarcoding information could be used to reduce the false-negative reads in taxonomic binning.These reads retrieved from interspecies repeat and genome-specific regions by cobarcoded read refinement could significantly improve the quality of assemblies.For the patient sample, the number of high-quality genomes with long contiguity assembled by MetaTrass was significantly larger than that generated without cobarcoded read refinement (Supporting Information: Figure S13).MetaTrass was originally designed for stLFR, but is also suitable for other kinds of cobarcoding sequencing reads.We analyzed the 10× Genomics linked-reads data set of ATCC Mock-20 used in the development of Athena [31], which contains two bacterial genera with multiple species.As a result, MetaTrass recovered all these species with highquality genomes, while the conventional strategies only gained several metagenomes with various degrees of missing, heterogeneity, and contamination (Supporting Information: Figure S14).This also demonstrated the capability of MetaTrass to recover metagenomes at species or strain resolution.Noted that Supernova [39] was replaced with CloudSPAdes [42] in the step of cobarcoded read assembly.Since the number of long fragments with the same barcode in linked-reads is greater than that of stLFR reads [32], more false-positive reads were introduced into the cobarcoded refined read sets, leading to the unsuccessful assembling of several species by Supernova.
The efficiency of our pipeline depends on the cobarcoding information quality and the cobarcoding assembler.The cobarcoding information quality has strong effects on the precision and sensitivity of cobarcoded read refinement.By comparing genome fractions of different read sets including the TRs defined in the Section "Methods," the refined reads, and all reads for species with medium abundance in P_Gut_Meta01 (Supporting Information: Table S10), we evaluated the sensitivity of the cobarcoded read refinement.We observed that the fraction with high depths of the refined reads was higher than that of the TRs, but still lower than that of all aligned reads.These results indicated that there were still some false-negative reads introduced by the low coverage or the short length of long fragments.The cobarcoded read assembly of Supernova consumed most of the computational time in the tests of the four human fecal samples and it cannot effectively assemble data sets with a high ratio of false-positive reads in the test of linked-reads.Thus, improvements on the cobarcoding library and algorithms of the cobarcoded read refinement and assembly would improve the performance of MetaTrass.

CONCLUSION
In summary, the application of MetaTrass in human fecal samples showed great promise in generating high-quality genomes for a real complex microbial community at a fine resolution.With the increasing number of reference genomes from various microbial communities and the development of cobarcoding sequencing libraries, the binning-first-and-assembly-later strategy in MetaTrass will be strengthened and facilitate the investigation of the association between host phenotypes and microbial genotypes for different microbial communities.

Data sets
A mock microbial and four human gut microbial communities were analyzed to evaluate the efficiency of MetaTrass.The mock microbial community (Zymo-BIOMICS™ Microbial Community DNA Standard.Catalog D6305, Lot ZRC190812) consists of eight isolated bacteria and two fungi with an average abundance of about 12% and 2%, respectively (Supporting Information: Table S4).The four human gut microbial communities were sampled from the feces of three healthy volunteers and one patient with inflammatory bowel disease.The stLFR libraries were constructed according to the standard protocol [32].The DNA samples were first sheared into long fragments, and then the long fragments were captured by magnetic microbeads with a unique barcode sequence.Finally, each long fragment was further fragmented and hybridized with a unique by the Tn5 transposase on the surface of the microbead.The stLFR libraries of the mock microbial community and the patient sample were sequenced on the BGISEQ-500 platform, and those of healthy samples were sequenced on the MGISEQ-2000 platform.The length of paired-end reads is 100 bp for all data sets.The three healthy sample libraries were individually allocated to a half lane and generated a total of about 50 Gb raw data on average.The mock and patient library was individually allocated to a full lane, resulting in about 85 and 100 Gb raw data, respectively.Barcode sequences were extracted from the end of the reverse read in a pair and then replaced with numerical symbols in the read names in the FASTQ file with our public tool stLFR_barcode_split [55].SOAPfilter_v2.2 with parameters (-y -F CTGTCTCTTATACACATCTTAGGAAGA CAAGCACTGACGACATGA -R TCTGCTGAGTCGAG AACGTCTCTGTGAGCCAAGGAGTTGCTCTGG -p -M 2 -f -1 -Q 10) was used to remove low-quality raw reads with adaptors, excessively confusing bases, and high duplications.Finally, 59.45 Gb of clean data were retained for the mock microbiome, 34.48 Gb for the first healthy sample (H_Gut_Meta01), 35.33 Gb for the second (H_Gut_Meta02), 37.88 Gb for the third (H_Gut_Meta03), and 97.20 Gb for the patient sample (P_Gut_Meta01).The clean data of all samples are available in the CNGB Sequence Archive (CNSA) [56] (https://db.cngb.org/cnsa/) of the China National Gene-Bank DataBase (CNGBdb) [57] with accession number CNP0002163.

Taxonomic binning
We adopted Kraken2 (version 2.0.9-beta)[44] to classify stLFR reads into different species.First, customized Kraken databases were constructed using the reference genomes of a studied microbial community.Then, the stLFR reads were classified according to the database.
Both processes of Kraken were run with default parameters.Specifically, references attached to the ZYMO product were used to construct the database for the mock sample.The Kraken2 database of the UHGG collection was downloaded for the human fecal samples, which included 4542 representative genomes at the species level [4].

Cobarcoded read refinement
A taxonomic tree of references was established to reduce the number of multiple hits of a K-mer from inter-species repeat sequences in Kraken2 [44].The reads from these repetitive regions were classified into the lowest common ancient (LCA) rank higher than its corresponding species.Several previous studies tried to reallocate these reads to species by statistical inferences using the coverage depth or cobarcoding information [58,59].In the MetaTrass pipeline, the cobarcoding correlation between reads classified into a species and those classified into the LCA ranks higher than that species was used to reduce the false-negative reads assigned to higher taxonomic ranks.Reads with a confident taxonomic assignment at the species level were treated as taxonomy reads (TRs) of the corresponding species.We collected and refined reads for each barcode according to the number of reads categorized as TRs (Num_T) and the ratio of these reads to the total reads (Ratio_T).Barcodes attached to TRs were first extracted as candidates.We ranked candidates according to Num_T in descending order first and then by Ratio_T also decreasingly.Finally, reads not more than 300× were collected to improve computational efficiency based on the rank of the barcodes.Since sufficient read coverage is required for assembling a complete genome, only the species with an abundance higher than 10× were retained in this step.Reads assigned to these species were further clustered and filtered with the cobarcoding correlation.The abundance of each species was estimated by the quotient of the total base number of TRs to that of the reference.The refined reads of each species were extracted using Seqtk (version 1.3-r114-dirty) [60].Note that there were still some false-positive reads, probably resulting from the collision of long fragments from different species captured by the same microbead, even though Ratio_T was set to minimize the number.Sequences assembled from these reads would be further filtered out in the way described in the Section, "Contamination removal."

Cobarcoded read assembly
Reads of a single species with an abundance higher than 10× were assembled in Supernova (version 2.1.1)[39], a high-performance cobarcoding de novo assembler for single large eukaryotic genomes.Supernova was designed for linked-reads of 10× Genomics (http:// www.10xgenomics.com/),which have different barcodes and formats from stLFR reads.Thus, we converted the stLFR reads into linked-reads FASTQ format before running Supernova.Additionally, the parameteraccept-extreme-coverage was set to yes to allow for large coverage depth differences.

Contamination removal
As long fragments from different species could be captured by the same microbead in the stLFR library at low odds, reads from different species could putatively be labeled with the same barcode.As a result, a draft assembly of reads after cobarcoded read refinement might contain contaminated sequences from other species.We cleaned the assembly based on its similarity to the corresponding reference genome.The AF and ANI have been commonly adopted to circumscribe species [3,4].MetaTrass also adopted the metrics of AF and ANI.The ANI was independently calculated for each alignment.The AF was defined as the ratio of total alignment length with sufficiently high ANI to the total contig length.The alignments with ANI values higher than 90% were used in the AF calculation, and the contigs with AF values higher than 50% were retained to form the final assembly.The initial alignments between contigs and references were generated by QUAST (version 5.0.2) [54] with default parameters, except that the identity threshold to obtain valid alignment was set to 90%.

Evaluations
Both reference-based and reference-free assessments were used to evaluate the quality of assemblies obtained using different strategies.For the mock microbial community, the reference-based tool QUAST [54] was used to evaluate the contiguity and accuracy of metagenomic assemblies.Minimap2 (2.17-r974-dirty) [62] was used to map assemblies to references and get valid alignments with the identity threshold of 95%.Then, statistics such as genome fraction, NG50/NGA50, and the number of misassemblies were assessed from the alignments with default parameters.For the human gut microbial communities, the reference-free tool CheckM (version 1.1.2)[63] was implemented with default parameters to evaluate the completeness and contamination of each genome.A definite number of marker genes conserved across almost all bacteria were used as the basis in CheckM.Following the guidance proposed in CheckM, we defined high-quality assemblies (completeness > 90% and contamination < 5%) and mediumquality assemblies (completeness > 50% and contamination < 10%, but not meet both completeness > 90% and contamination < 5%).In addition, the statistics of each genome notably N50, genome size, and taxonomic rank were also obtained by CheckM, and the taxonomic rank was used to show the resolution of a genome.

Genomic variation estimation and taxonomic classification
All the high-quality genomes assembled by MetaTrass were used to call variations for the four human fecal samples.We aligned each genome to their corresponding reference using Minimap2 [62] with parameters (-x asm5) to prevent an alignment extending to regions with diversity >5%.SAMtools (version 1.9) [64] and PAFtools (2.17-r982-dirty) [65] were used to convert the BAM file of initial unsorted alignments into a PAF file of sorted alignments.We identified variations using the "call" module in PAFtools with parameters (-L 10000) to filter out the alignments shorter than 10,000 bp.We considered single-nucleotide substitutions as SNVs but ignored single-base insertions or deletions.Insertions or deletions with lengths shorter than 50 bp were defined as small indels, and the rest were defined as large indels.Those genomic variations with the same position and sequence information that belong to the same species in different samples were determined as shared variations.We used the "classify_wf" function of GTDB-Tk (version 0.3.1)[51] to conduct taxonomic annotation of the genomes obtained by both MetaTrass and the conventional strategies with default parameters.Considering the procedure of the UHGG database construction [4], genomes were assigned at the species level if the AF to the close species representative genomes was higher than 30% and ANI was higher than 95%.Interactive Tree of Life (iTOL version 4.4.2) [66] was used to visualize taxonomic constructed from the annotated species information.
Flowchart and scheme of MetaTrass.(A) Flowchart of the MetaTrass assembling pipeline.(B) Scheme of cobarcoding correlation and taxonomic distribution of reads from different regions.

F I G U R E 2
Scheme and evaluations for different strategies.(A) Difference labels of the assemblies based on different sequencing and assembling strategies.(B) Genome fraction, NG50, and NGA50 evaluated by QUAST for the assemblies.
and CheckM evaluations of MetaTrass assemblies for the four human fecal samples.(A) Genome fraction.(B) Scaffold N50.(C) Box plot of completeness and contamination.(D) Number of high-and medium-quality genomes

F I G U R E 4
Comparison of metagenome assembly for different methods.(A) Number of high-and medium-quality genomes assembled with different methods.(B) Number of high-quality genomes with high-and low-rank with different methods.