Assessment of bacterial and viral gut communities in healthy and tumoral colorectal tissue using RNA and DNA deep sequencing

Abstract Background Colorectal cancer (CRC) is known to present a distinct microbiome profile compared to healthy mucosa. Non‐targeted deep‐sequencing strategies enable nowadays full microbiome characterization up to species level. Aim We aimed to analyze both bacterial and viral communities in CRC using these strategies. Materials & Methods We analyzed bacterial and viral communities using both DNA and RNA deep‐sequencing (Novaseq) in colorectal tissue specimens from 10 CRC patients and 10 matched control patients. Following taxonomy classification using Kraken 2, different metrics for alpha and beta diversities as well as relative and differential abundance were calculated to compare tumoral and healthy samples. Results No viral differences were identified between tissue types, but bacterial species Polynucleobacter necessarius had a highly increased presence for DNA in tumors (p = 0.001). RNA analyses showed that bacterial species Arabia massiliensis had a highly decreased transcription in tumors (p = 0.002) while Fusobacterium nucleatum transcription was highly increased in tumors (p = 0.002). Discussion Sequencing of both DNA and RNA enables a wider perspective of micriobiome profiles. Lack of RNA transcription (Polynucleobacter necessarius) casts doubt on possible role of a microorganism in CRC. The association of F. nucleatum mainly with transcription, may provide further insights on its role in CRC. Conclusion Joint assessment of the metagenome (DNA) and the metatranscriptome (RNA) at the species level provided a huge coverage for both bacteria and virus and identifies differential specific bacterial species as tumor associated.


| INTRODUCTION
Human microbiome studies have grown exponentially in the last decades, documenting several key associations to health and disease.2][3][4][5] The Hallmarks of Cancer now includes polymorphic microbiomes as new enabling characteristics, highlighting its importance in both the carcinogenic processes as well as in prognosis or development of resistance to chemotherapy. 6olorectal cancer (CRC) is the third most common incident cancer worldwide and the second in terms of mortality. 7CRC is one of the most studied cancers when it comes to the human microbiome. 8Diet is a major CRC risk factor and the microbiome has a direct effect on the nutrient metabolism in the colorectal epithelium. 1,3,9The ecological composition of colonic mucosa can directly influence tissue microenvironment and its functionality not only by the presence of certain microorganisms but also by their metabolic outputs. 9It has been widely demonstrated how gut microbiome shifts can lead to proinflammatory states that favors tumorigenic processes. 6,10Many bacterial taxa such as enterotoxigenic Bacteroidetes fragilis (ETBF) or Escherichia coli pks + have been related with tissue damage and DNA mutations via the production and secretion of bacterial enterotoxins. 11][14][15] Although many studies have described the human gut microbiome, 16,17 microbiome assessment techniques vary greatly both in the laboratory phase and in the bioinformatic analysis phase.As next generation sequencing (NGS) technologies are nowadays widely available at low costs, the power of this technology should be used to increase the resolution of the microbiome profile in CRC.In previous work with part of the samples used in the present study, we investigated the differences between healthy mucosa and tumor tissue from CRC patients as well as the differences between those CRC patients and healthy subjects using 16S rRNA sequencing. 18These analysis focused only on bacterial communities and were restricted most of the times to a genus level classification due to the technical limitations of only sequencing the 16S rRNA gene. 19In the present study, these issues were addressed by studying both bacterial and viral communities at the species level in colorectal human samples using both metagenomics and metatranscriptomics (both DNA and RNA deep-sequencing).

| Clinical study protocol
The present study includes biopsies from 10 patients who were diagnosed with Stages I-III colon cancer after colonoscopy examination and 10 patients that acted as controls where no tumor (neither malignant nor benign) was seen during colonoscopy examination.Cases and controls were matched by age and sex and description of all tumors and 2/10 controls as well as results on their 16S rRNA sequencing analysis have previously been published. 18haracteristics of age, sex, body mass index (BMI) and tumor stage from these 20 patients can be seen in Table 1.
The study was approved by the Regional Ethical Review Board in Gothenburg under study number 233-10 and registered at Clini calTr ials.gov(ID: NCT03072641).Informed consent was obtained from all study subjects.All research was performed in accordance with relevant guidelines and regulations.

| DNA and RNA extraction
DNA was extracted using the AllPrep DNA/RNA/Protein Kit (Qiagen), followed by polymerase chain reaction (PCR) inhibitor removal with the OneStep-96 PCR Inhibitor Removal Kit (Zymo Research), and DNA concentration measurement with the Qubit 2.0 Fluorometer (Life Technologies).Total RNA was extracted with the AllPrep DNA/RNA/Protein Kit (Qiagen) and reverse transcribed to single-stranded cDNA using the high-capacity cDNA reverse transcription kit (ThermoFisher Scientific) following manufacturer's instructions.Double-stranded cDNA was prepared following step 2 of the Maxima H Minus Double-Stranded cDNA Kit (ThermoFisher Scientific) and the cDNA was cleaned using the GeneJET PCR T A B L E 1 Clinical information for patients included in the study.

Non-CRC subjects
Sample

| Whole genome and whole transcriptome sequencing
Extracted material (DNA and cDNA) was thereafter subjected to library preparation, using the Nextera XT DNA library preparation kit (Illumina) following the manufacturers' reference guide, starting with 1 ng of DNA/cDNA (as recommended by the manufacturer) and using unique indexed adapters to facilitate pooling of the libraries.
Libraries were individually quantified using the Quan-tiFluor system (Promega) and the library sizes were measured using the Bioanalyzer (Agilent) as quality analysis.All 20 libraries were normalized to 2 nM and pooled prior paired-end sequencing using NovaSeq 6000 system (Illumina) at 2 × 150 bp aiming for 100 M high-quality paired end reads/sample.

| Bioinformatic preprocessing
Indexes, included in the Illumina adapters, were used to assign raw sequence reads obtained from the NextSeq500 (Illumina) platform to the originating samples.Reads were quality checked and adapter trimmed with Trimmomatic using 36 bp as minimal length for the reads. 20Highquality reads were screened against the human reference genome hg19 using NextGenMap 21 and only reads that did not map to the human genome, with >95% identity over 75% of their length, were considered as non-human and further analyzed for microbiome.Once human reads were filtered from the data set, high-quality non-human reads were classified using Kraken2 v. 2.1.1, 22which was run against a reference database containing all RefSeq bacterial and viral genomes (built in December 2020) with a 0.1 confidence threshold.

| Diversity analysis and statistics
All diversity analyses were performed at the species level using R (v.4.2.2).Packages used in this analysis were biomformat, 23 phyloseq, 24 ggvenn, 25 tidyr, 26 ggpubr, 27 vegan, 28 vtable, 29 metagenomeSeq, 30 funrar, 31 and superheat. 32he biom files generated with kraken-biom 33 were used, together with sample metadata, to construct phyloseq objects.Results reported all species which comprised more than 0.1% (in at least one sample) of total bacterial or viral reads, separately.Bacterial and viral species relative abundance comparison using group F-tests were carried out for healthy mucosa and tumors in both DNA and RNA datasets (Tables S1-S4).Relative abundance plots for top 10 species were plotted for tumor and healthy mucosa in both DNA and RNA datasets.Observed species, Shannon and inverse Simpson alpha diversity indexes were calculated after rarefaction (between 68,535 and 5,034,325 reads, depending on the subset analyzed) to standardize species representation regardless of sample depth.Differences between groups for alpha diversity were calculated using unpaired Wilcoxon tests.Bray-Curtis and Jaccard beta diversity indexes were calculated to analyze differences between bacterial and viral communities, visualized using principal component analysis (PCoA) and ANOSIM tests were used to stablish if differences between groups were greater than differences within groups.Finally, differential abundance (DA) analysis comparing tumor and healthy tissue was performed for DNA and RNA separately.In this last step, species counts were transformed using cumulative sum scaling (CSS), log2 transformation, as well as a pseudocount addition to handle data sparsity.P-values obtained from DA analysis were corrected using False Discovery Rate (FDR).Statistical significance was obtained when p < 0.01.

| Sequencing coverage
Raw input fastq files including both human and non-human reads had an average size of 18

| Taxonomic classification
Bacterial taxonomic resolution at the species level was at 89.94% for DNA sequencing reads and 88.36% for RNA sequencing reads.We identified up to 5918 species corresponding to 6580 taxa when performing metagenomics and 5915 species within 6694 taxa when analyzing metatranscriptomics.For viral species, taxonomic resolution was slightly higher, reaching 90.23% for DNA sequencing reads and 89.68% for RNA sequencing reads (628 species were identified from 696 taxa when performing metagenomics and 391 species from 436 taxa when performing metatranscriptomics).
Filtering by a minimum of 0.1% of relative abundance in at least one sample translated into detection of 55 and 64 bacterial species for DNA and RNA sequencing analysis, respectively.The top 10 bacterial species per group are presented in Figures 1A (DNA

| Bacteria
Comparing the bacterial species between colorectal tumor specimens and healthy mucosa by analyzing the relative abundance revealed statistically significant differences (p < 0.01) for 12 bacterial species (Table S1).Among those, Polynucleobacter necessarius (reported not only to be found mainly in water ponds, but also to be significantly enriched in patients with septic shock and as an important antibiotic-resistant gene host [34][35][36] ), was the species with highest F-value, being more abundant in tumors (Figure 1A, Table S1; F = 24.652,p < 0.01).
Alpha diversity indexes, which informed about diversity within samples, did not show any statistical differences between groups (Figure 1B).Beta indexes, which informed about differences in bacterial communities diversity among tumors and healthy tissue, were assessed with two different indexes (Figure 1C,D), Jaccard and Bray-Curtis indexes, which showed a clear clustering but not reaching statistically significance.DA analysis revealed only one bacterial species being more abundant in tumors when comparing with healthy tissue: Polynucleobacter necessarius (Table 2; log2 fold change of 6.5 and adjusted p = 0.001).

| Virus
Up to 10 viral species were identified when performing metagenomic analysis (Table S2).However, there was no statistically significance found when analyzing relative abundance of species, alpha diversity, beta diversity, or DA analysis (Figure 2).S3).Alpha diversity indexes, which informed about diversity within samples, did not show any statistical differences between groups (Figure 3B).Bacterial communities showed a clear clustering when evaluating Jaccard (p = 0.001) and Bray-Curtis (p = 0.001) beta diversity indexes (Figure 3C,D).DA analysis reported two bacterial species: Arabia massiliensis being more abundant in healthy tissue (log2 fold change of −12.68 and adjusted p = 0.0001) and Fusobacterium nucleatum (log2 fold change of 5.31 and adjusted p = 0.002) being more abundant in tumors (Table 2).

| Virus
Up to six viral species were detected when analyzing metatranscriptomes for both tumors and mucosa (Table S4).
However, there was no statistically significance found when analyzing species relative abundance, alpha diversity, beta diversity nor DA analysis (Figure 4).

| DISCUSSION
We report an unbiased and deep sequencing of the metagenome and metatranscriptome in CRC cases with non-CRC controls, identifying three specific bacterial species associated with the disease.P. necessarius and F. nucleatum were highly increased in tumor specimens when analyzing metagenomics and metatranscriptomics, respectively.Furthermore, A. massiliensis was highly decreased in tumor tissue when analyzing RNA data.P. necessarius did not reach the established cut-off when analyzing RNA sequencing data (present in <0.1% of total bacterial transcripts in at least one sample).This species has been mostly identified in freshwater habitats, 34 but there are studies where the corresponding genus has been reported to be significantly enriched in patients with septic shock and as one important antibiotic resistance gene host. 35,36When encountering environmental microorganisms, it is important to identify if those really come from the tumor/specimen, or if those correspond to deposition, and therefore, we highlight the importance of using negative controls to be able to search for microorganisms that may represent deposition/contamination.Also, both DNA and RNA analyses were performed within the exact same samples, and a systematic identification was not detected.Furthermore, observing a lack of RNA transcription casts doubt on possible role of this microorganism in colorectal tumors.Previous studies on skin have detected several hundreds of human papillomavirus types when analyzing DNA sequences, but most of these viruses were not actively transcribed (apparently representing deposition). 37oth A. massiliensis (recently identified in stool material) 38 and F. nucleatum (the by far most CRC-associated bacterium in previous studies) were highly detected among the RNA sequences of the mucosa and the tumors, respectively.Compared to DNA analysis, A. massiliensis did not surpass the established cut-off, and F. nucleatum did reach it, but its relative abundance presence was low and not different between tumor and healthy mucosa specimens.The absence or low prevalence of these species may be explained by the higher abundance of other species.
Regarding viruses, there were no viral species detected as statistically significant in any of the groups (tumors vs mucosa) despite the depth of sequencing achieved.A recent study on CRC studied the relevance of the interplay between viruses and bacteria, highlighting the importance of the co-occurrence of bacteriophages and their influence on the bacterial communities. 39The present study identified four different phages: all of them were identified while performing metagenomics (Acionnavirus Synechococcus phage S-CAM8 [closely related to Synechococcus phage s rim8, known to be enriched in late CRC stages 40 ], Eganvirus Salmonella phage SW9, Aeribacillus phage AP45, and uncultured crAssphage) and only one of them was detected when analyzing metatrascriptomics (Aeribacillus phage AP45).While 3/4 phages detected in metagenomics showed an increased relative abundance in CRC, as reported by other authors, 41,42 the present study did not find any statistical significance.
The strengths of the study include that we compared different analysis (relative abundance, alpha and beta diversities as well as DA) for both DNA and RNA sequencing data, aiming to detect bacteria and viruses up to the species level.Prior to analysis, bacterial and viral species were filtered (0.1% of total bacterial/viral reads the presence in at least one sample) to reduce complexity, noise and technical variability while preserving data integrity and representing main communities. 43Choosing the appropriate cutoff for filtering is a key step in microbiome analysis since it can strongly influence the downstream results and result in false positivity/negativity. 43We also included multiple metrics for each analysis.Up to three different indexes were used to analyze the alpha diversity, offering a wide perspective of within sample diversity, increasing sensitivity for both richness and evenness. 44Beta diversity analyses included non-phylogenetic indexes accounting for both qualitative (Jaccard) and quantitative (Bray-Curtis) indexes to avoid bias of undersampling meanwhile maintaining sensitivity to rare species 45 and DA analysis to provide more reliable results for specific species enrichment (more than just comparing relative abundances) since this normalized and transformed taxa can account for data sparsity. 30he limitations of the study may include the size of samples analyzed.The total number of different specimens was 20 (10 colorectal tumors and 10 healthy mucosa specimens).However, each specimen was sequenced using both DNA and RNA unbiased approaches, and each library generated (total of 40 libraries) provided a very large amount of data (with an average of 18 M and 47 M high-quality nonhuman sequencing reads in DNA and RNA, respectively) detecting >6000 species.Adding data from different studies present in public repositories to increase sample size was avoided to not introduce biases in selection of patients or differences in the exact analysis methods used.The present study was based on a very careful matching of cases and controls as the most reliable strategy.
This study demonstrated how deep sequencing of both DNA and RNA enables a wider perspective of microbiome profiles: individual microbial features can vary depending on whether the metagenome or metatranscriptome is analyzed.Our combination of both DNA and RNA deep sequencing in the same samples revealed that tumors had more diversity in terms of microorganism presence while healthy tissues were more diverse when considering microorganisms with active transcription, in agreement with previous literature. 46hile F. nucleatum is already known to be a potential CRC biomarker for screening, diagnosis, and prognosis prediction for patient outcomes, [12][13][14][15] the fact that we found a stronger association with transcription of F. nucleatum (than merely the presence of the organism) may provide further insights on the role of this microorganism in CRC.Additionally, the great depth achieved with those sequencing methods provides a significant tool for the study of specific genes or transcripts of interest in microbiome-host interactions.
The role of the recently identified A. massiliensis 38 and P. necessarius needs to be further investigated to elucidate their implication in human health.The present study comprised strict cut-offs to focus the analysis only on abundant species (species present only in small amounts may conceivably represent contamination and were therefore not studied); however, validation of these "putative" findings would be desirable and findings may need to be considered carefully.

| CONCLUSION
Combining deep sequencing of both metagenome and metatranscriptome allowed a refined microbiome profile of CRC.Diversity differences were stable between DNA and RNA approaches, but the specific tumor-associated bacteria had some difference depending on whether the metagenome or metatranscriptome was analyzed.
) and 3A (RNA), and the relative abundances of the complete list of filtered species are available in Tables S1 (DNA) and S3 (RNA).Ten viral species were detected by metagenomics and six viral species were identified using metatranscriptomics, after filtering.Viral species relative abundances are presented in Figures 2A (DNA), 4A (RNA) and in Tables S2 (DNA) and S4 (RNA).

F
I G U R E 1 DNA Bacterial diversity of filtered reads.(A) Relative abundance grouped for top 10 bacterial species per group.(B) Observed, Shannon and Inverse Simpson alpha diversity indexes rarefacted at 5M reads.(C) Jaccard beta diversity PCoA.(D) Bray-Curtis beta diversity PCoA.

3. 4 |
Tumor versus healthy mucosa (RNA) 3.4.1 | Bacteria Comparison of colorectal tumor and healthy mucosa tissue revealed a decreased relative abundance of F I G U R E 2 DNA Viral diversity of filtered reads.(A) Relative abundance grouped for top 10 virus species per group.(B) Observed, Shannon and Inverse Simpson alpha diversity indexes rarefacted at 68K. (C) Jaccard beta diversity PCoA.(D) Bray-Curtis beta diversity PCoA.T A B L E 2 Differentially (p < 0.01) abundant bacteria (healthy mucosa vs tumor) in DNA and RNA datasets.Arabia massiliensis (F = 26.021,p < 0.01) in tumors (Figure 3A, Table

F I G U R E 3
RNA Bacterial diversity of filtered reads.(A) Relative abundance grouped for top 10 bacterial species per group.(B) Observed, Shannon and Inverse Simpson alpha diversity indexes rarefacted at 771K reads.(C) Jaccard beta diversity PCoA.(D) Bray-Curtis beta diversity PCoA.

F
I G U R E 4 RNA Viral diversity of filtered reads.(A) Relative abundance grouped for top virus species per group.(B) Observed, Shannon and Inverse Simpson alpha diversity indexes rarefacted at 416K.(C) Jaccard beta diversity PCoA.(D) Bray-Curtis beta diversity PCoA.