Characterization of Exogenous Sequence Fragments in Extracellular Vesicles from Human

Extracellular vesicles (EVs) play a crucial role in mediating intercellular communication. Small RNA is an important component of EVs. However, the proportion of small RNA sequencing (smRNA‐seq) reads in EVs that map to the human genome is much lower than that in cells, suggesting the existence of many nonhuman sequences in EVs. To date, there has been no systematic study of EV fragments that do not map to the human genome. Herein, using EV smRNA‐seq data, the landscape of exogenous RNA cargoes in human EVs is portrayed. The results show the distribution of nonhuman sequence fragments in 1838 EV samples; an average of 21.82% of reads are unmapped to the human genome, and 12.33% are mapped to the collected exogenous reference sequences. Furthermore, the proportion of exogenous sequences is lowest in plasma EV samples and much higher in cell line EV samples, where it derives mainly from animals, bacteria, or contaminants. Exogenous sequences from plants are mainly from food, and the exogenous bacteria are mainly gut microbiota. Virus‐derived sequences reflect the high prevalence of viruses in the population, such as herpesvirus and hepatitis virus. This study provides the first landscape of exogenous fragments in human EVs and implies diverse RNA sources in the human body.

discussion of exogenous RNAs in human EVs; there is a research gap in the composition of exogenous fragments in human EVs.
To systematically investigate nonhuman fragments in EVs isolated from the human body, we collected 1838 human EV small RNA sequencing (small RNA-seq) datasets from EVAtlas [6] and six types of exogenous genomic sequences from authoritative databases. In this study, we showed that the distribution of exogenous fragments varied under different conditions and identified the main sources of exogenous fragments, including food-related species, pathogens, and intestinal bacteria. A comprehensive overview of the exogenous sequence fragments would provide a landscape of RNA components and implicate the potential source of nonhuman fragments in human EVs.

Sample Profiles and Exogenous Reference Sequences
We collected small RNA sequencing datasets of human EVs from the EVAtlas database. Samples with fewer than 40% of reads mapping to the human genome were discarded as unqualified (Figure S1, Supporting Information), and 1838 samples were used in the study. Descriptive statistics of the samples are shown in Figure 1. Most samples were from plasma (1038), followed by serum (312), cell lines (253), saliva (98), cells (53), sperm (46), breast milk (36), and urine (2) (Figure 1A). The samples were taken from 13 countries (Figure 1B) and included 19 cancer types, 5 noncancer diseases, and 7 other physiological statuses (Figure 1C). To study the exogenous reads in human EVs, we collected six types of reference sequences: animal (40 species), plant (19 species), bacteria (4644 species), fungi (1506 genomes), virus (63 825 sequences), mycoplasma and chlamydia (9 common human pathogens), and contaminant (6291 sequences) (Table S1, Supporting Information). Then, we simultaneously mapped the human EV small RNA-seq data to these exogenous references. Reads mapped to multiple references were assigned to one reference according to the reads dynamic assignment algorithm (RDAA). [6] As shown in Figure 1D, across the 1838 samples, an average of 21.82% of reads were unmapped to the human genome, and an average of 12.47% of all reads were mapped to exogenous reference sequences (maximum 48.78%, minimum 0.20%). Most samples (1508) fall between 0.20% and 20%.
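The quality filter and the per-sample percentages described above can be illustrated with a minimal sketch. It assumes that per-sample read counts are already available; the `SampleCounts` record, its field names, and the `summarize` helper are hypothetical and are not part of the original pipeline, which relied on read alignment and RDAA.

```python
from dataclasses import dataclass
from statistics import mean

# Samples with less than 40% of reads mapped to the human genome are discarded.
HUMAN_MAPPING_CUTOFF = 0.40


@dataclass
class SampleCounts:
    """Hypothetical per-sample read counts (field names are assumptions)."""
    sample_id: str
    total_reads: int
    human_reads: int       # reads mapped to the human genome
    exogenous_reads: int   # reads assigned to any exogenous reference


def passes_qc(sample):
    """Keep a sample only if at least 40% of its reads map to the human genome."""
    return (sample.total_reads > 0
            and sample.human_reads / sample.total_reads >= HUMAN_MAPPING_CUTOFF)


def summarize(samples):
    """Average the per-sample percentages over the samples that pass QC."""
    kept = [s for s in samples if passes_qc(s)]
    if not kept:
        raise ValueError("no sample passed the human-mapping cutoff")
    pct_not_human = [100 * (s.total_reads - s.human_reads) / s.total_reads for s in kept]
    pct_exogenous = [100 * s.exogenous_reads / s.total_reads for s in kept]
    return {
        "n_samples": len(kept),
        "mean_pct_not_human": mean(pct_not_human),
        "mean_pct_exogenous": mean(pct_exogenous),
    }
```

Note that the averages here are means of per-sample percentages, not pooled read counts, which is one plausible reading of "an average of 21.82% of reads".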

The Distribution of Exogenous Sequences in EVs of Healthy Samples
First, we studied the proportion distribution of nonhuman reads in healthy samples by summarizing the mapping results. The results demonstrated that the distribution of exogenous reads was significantly different under various conditions, as shown in Figure 2A. Healthy samples were from plasma (108) and serum (49). In plasma samples, the proportion of all exogenous sequences was much lower than in serum samples. The highest mapping rates in plasma samples were found in bacteria (average 1.99%) and animals (average 1.39%). However, it is worth noting that most serum samples may contain a higher proportion of contaminants (6.34%).
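As an illustration of how averages such as "bacteria (average 1.99%)" in plasma can be obtained, the following sketch groups per-sample mapping percentages by body fluid and reference category. The record layout `(source, category, percent)` and the function name are assumptions made for this example only.

```python
from collections import defaultdict
from statistics import mean


def mean_rate_by_source(records):
    """Average mapping percentage for every (source, category) pair.

    `records` is an iterable of (source, category, percent) tuples,
    one tuple per sample and reference category (hypothetical layout).
    """
    grouped = defaultdict(list)
    for source, category, percent in records:
        grouped[(source, category)].append(percent)
    return {key: mean(values) for key, values in grouped.items()}
```

For example, `mean_rate_by_source([("plasma", "bacteria", 1.0), ("plasma", "bacteria", 3.0)])` returns a dictionary whose single entry maps `("plasma", "bacteria")` to the mean of the two values.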

Exogenous Species with the Highest Mapping Rate in EVs of Healthy Samples
To explore which species the exogenous fragments came from, we summarized the top 20 species with the highest mapping rate in the references of plants, animals, bacteria, fungi, and viruses in 157 healthy samples (Figure 2B). For the animal reference genomes, the top 20 species were fishes (10 types: Scleropages formosus, Plectropomus leopardus, Mastacembelus armatus, etc.), parasites (5 types: Entamoeba histolytica, Paragonimus westermani, Taenia solium, etc.), and domestic animals (5 types: chicken, sheep, goat, bovine, pig). It is worth mentioning that bovine-mapped sequences may also be caused by experimental contamination because fetal bovine serum is a common preparation in the laboratory. The top 20 plants were familiar fruits (Malus domestica (apple), Citrus reticulata (orange), Vitis vinifera (grape), etc.), vegetables (Solanum lycopersicum (tomato), Cucumis sativus (cucumber), Brassica rapa, etc.), and staple foods (rice, Solanum tuberosum (potato), etc.).
The top 20 subtypes of viruses with the most abundant sequence fragments included 10 types of Herpesviridae, 4 types of viruses causing hepatitis (hepatitis C and E), and 2 types of rotaviruses. For fungi, five of them were fungi that infect crops (Zymoseptoria tritici, Histoplasma ohiense, Fusarium pseudograminearum, Parastagonospora nodorum, and Fusarium verticillioides), and other fungi occurred in brewing (Schizosaccharomyces pombe), seasoning, and food spoilage (Aspergillus fischeri; other Aspergillus species are used to ferment alcohol and soy sauce). Among the bacteria, those with the highest mapping rates were commonly gut microbiota, such as the genera Collinsella [12] and Butyricicoccus. [13] Because contaminants cannot be ignored, we curated various reference sequences of model organisms commonly used in the laboratory, such as yeast, Escherichia coli, zebrafish, and Arabidopsis thaliana, as well as vector contamination (Table S2, Supporting Information). Furthermore, contamination that occurs during experimental preparations, such as mussels, is also included. [14]
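The "top 20 species per reference type" summary can be sketched as a simple ranking step. This is only an illustration under the assumption that a mean mapping rate per `(category, species)` pair has already been computed; the input layout and the function name `top_species` are hypothetical.

```python
import heapq
from collections import defaultdict


def top_species(rates, n=20):
    """Return the n species with the highest mapping rate in each category.

    `rates` maps (category, species) -> mean mapping rate (assumed input).
    Species are listed from highest to lowest rate.
    """
    by_category = defaultdict(list)
    for (category, species), rate in rates.items():
        by_category[category].append((rate, species))
    return {
        category: [species for _, species in heapq.nlargest(n, pairs)]
        for category, pairs in by_category.items()
    }
```

Ranking within each reference type, rather than across all references, keeps abundant categories such as bacteria from crowding out plants or viruses in the summary.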

The Distribution Difference of Exogenous Sequences in EVs of Cancer Samples
The proportion of exogenous sequences from various sources differed significantly in EVs from the same cancer type (Figure 3A). In samples of the same cancer, EVs from the cancer cell line had a higher proportion of contaminants or bacteria than plasma or serum samples. For example, in colorectal cancer (CRC) or breast cancer (BC) cell line samples, the proportion of bacteria was higher than in samples from plasma or serum. In glioblastoma (GBM) cell lines or cell samples, the proportion of contaminants was higher than that in serum samples. Among the BC samples from serum, one sample contained a high percentage of reads mapped to plants. On closer inspection, we found that it had a much higher proportion of reads mapped to fruit, staple food, and vegetables than the other three samples. In renal cell carcinoma (RCC) samples, it is noticeable that samples from urine contained a significantly higher percentage of reads mapped to bacteria than cell line samples. This may be because urine has a richer microbiome than cells, and studies have shown that urinary tract infection is positively correlated with RCC development. [15]

The Difference in the Proportion of Nonhuman Sequences Between Disease and Normal Samples
EVs contain nucleic acids from parental cells and are involved in many biological processes, which can reflect the physiological and pathological changes of their parental cells. [16] Thus, the nonhuman sequences in disease samples may differ from those in matched healthy samples. We compared the proportion of exogenous sequences in five types of cancers and two types of noncancer diseases with matched healthy samples (Figure 3B and S2A, Supporting Information). The bacterial proportion in EV sequence fragments of cancer samples was significantly lower than that in matched healthy samples (Wilcoxon p-value = 3.63 × 10⁻⁵) (Figure S2B, Supporting Information), although there were variations among samples. Specifically, in LUAD, PAAD, PH, and preeclampsia (PE) samples, the proportion of bacteria was higher than that in normal samples (Figure 3B), while in CRC, neuroblastoma (NB), and PRAD samples, the opposite result was observed (Figure S2A, Supporting Information). The cancer/disease samples and their matched healthy samples of each type are from the same project. However, when comparing the samples of the same disease from different projects, a study-specific effect emerged (Figure S3, Supporting Information). This may be because of the heterogeneity of EVs and their source samples.
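For readers who want to reproduce a comparison of bacterial proportions between two groups of samples, a minimal sketch of a two-sided Wilcoxon rank-sum (Mann-Whitney U) test is given here. The text only states "Wilcoxon p-value", so the choice of the unpaired rank-sum variant is an assumption, as is the use of a plain normal approximation without tie or continuity correction; the function name is hypothetical. A statistics package would normally be used instead.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test using the normal approximation.

    Returns (U statistic for x, approximate p-value). Ties receive average
    ranks; no tie or continuity correction is applied (simplifying assumption).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must contain at least one value")

    # Pool the values, remembering which group each one came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    # Assign 1-based ranks, averaging over runs of tied values.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    # Under the null hypothesis, U has mean n1*n2/2 and this variance.
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mu) / sigma
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_value
```

The normal approximation is reasonable only for moderately large groups; for very small groups an exact test is preferable.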
The gut microbiota plays an important role in cancer initiation and progression, [17] and we also found that the subtypes of bacteria with a high proportion in EV exogenous sequences mainly belonged to gut bacteria (Figure 2B). Thus, we further explored the function of exogenous sequence fragments from human EVs mapped to bacterial reference genomes in cancer and healthy samples. Enrichment analysis for bacterial gene function was performed using the Clusters of Orthologous Genes (COG). [18] As shown in Figure 4A, the enrichment results in cancer and normal samples were similar, except that in normal samples, defense mechanisms had the second highest number of genes, whereas this was not a dominant class in cancer samples.
Viruses are also closely related to cancer. [19] Therefore, we also summarized the viruses with the highest mapping rate in cancer samples, as shown in Figure 4B. It has been reported that herpesvirus is associated with BC. [20] Consistent with this conclusion, our results show that human herpesvirus was among the three viruses with the highest mapping rates in BC samples (0.0025%, with nonhuman sequences as the denominator). Similarly, parainfluenza virus and mammalian rubulavirus were found only in LUAD samples, and their respective mapping rates were very high (1.5% and 0.3%) (Figure 4B).

Exogenous Sequences in Human EVs Correlated with Dietary Habits in Different Countries
Food is another potential source of cargoes in human EVs; thus, we analyzed whether exogenous sequences in human EVs are influenced by the dietary habits of the population in different countries. To explore the distribution of food-related reads in EVs from human samples, we classified the exogenous references into six food types: poultry, livestock, fish, vegetable, staple food, and fruit (Figure 5A). Samples from the USA had a significantly higher mapping rate of fish (16.54%) than those from Australia (2.94%) and China (2.36%). In samples from Australia and China, the sum of vegetables and fruits (Australia 2.68%, China 1.60%) was higher than the sum of poultry and livestock (Australia 1.02%, China 1.22%), while in samples from the USA, the opposite situation was observed (0.85% versus 2.25%). We also investigated the proportion of different types of staple foods in 108 healthy plasma samples (Figure 5B). In detail, we found that reads mapped to wheat accounted for the largest proportion (67.96%), followed by maize (13.17%), rice (9.2%), soybean (5.65%), and potato (4.02%). To explore whether sequences in plants can influence human cellular phenotypes, we aligned these sequences to miRNA reference sequences [21] (Figure 5C). Interestingly, the sequence ACAACGAGAGAGAGCACGCT was mapped to miR535, albeit in low abundance (127 counts) and in only three samples (NB plasma samples SRR8696986, SRR8696987, and SRR8696988 from China). miR535 is found in many plants, such as rice, grape, apple, and peach. In rice, osa-miR535 could modulate plant height, panicle architecture, and shape and suppress rice immunity against Magnaporthe oryzae by regulating OsSPLs genes. [22,23] We predicted the gene targets of the mapped reads and conducted functional annotation of these target genes. The results showed that synapse organization and Ras protein signal transduction were enriched with over 40 genes. Ras proteins are mutated in many human cancers and are targets of novel approaches to cancer treatment. [24]
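The diet-related percentages reported in this section use nonhuman reads as the denominator (Figure 5A). A minimal sketch of that calculation is given here, assuming read counts per species are available for one sample. The `FOOD_TYPE` lookup table is a small hypothetical excerpt built from species named in the text, not the full classification used in the study, and the function name is an assumption.

```python
# Hypothetical excerpt of a species-to-food-type table; the study's full
# classification covers six food types and many more species.
FOOD_TYPE = {
    "chicken": "poultry",
    "pig": "livestock",
    "bovine": "livestock",
    "apple": "fruit",
    "grape": "fruit",
    "tomato": "vegetable",
    "cucumber": "vegetable",
    "rice": "staple food",
    "wheat": "staple food",
    "maize": "staple food",
    "potato": "staple food",
}


def food_type_percentages(species_counts, nonhuman_reads):
    """Percentage of nonhuman reads assigned to each food type in one sample.

    Species without an entry in FOOD_TYPE (for example bacteria) are ignored.
    """
    if nonhuman_reads <= 0:
        raise ValueError("nonhuman_reads must be positive")
    totals = {}
    for species, count in species_counts.items():
        food = FOOD_TYPE.get(species)
        if food is not None:
            totals[food] = totals.get(food, 0) + count
    return {food: 100 * count / nonhuman_reads for food, count in totals.items()}
```

Summing the returned values for, say, "vegetable" and "fruit" gives the kind of combined figure that is compared with poultry plus livestock in the text.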

Discussion and Conclusion
EVs are emerging as critical players in intercellular communication by delivering their cargo. In addition to the human sequences in EVs, human EVs also contain many exogenous sequences. [7] However, the report by Zhang et al. that rice MIR168a can be absorbed by the gastrointestinal tract and regulate LDLRAP1 expression in the liver has attracted much controversy. Doubts have focused on the inability of different laboratories to reproduce the result, [25] the extremely low levels of dietary miRNAs detected in vivo, and the effects of contamination from laboratory preparations and cross-contamination between samples, which should not be underestimated. [26] In response to this question, Zhang emphasized that the detection of food-derived plant miRNAs needs to focus on the selection of proper miRNAs and appropriate isolation methods. The subsequent effect of herbal-derived miRNAs on the influenza virus in mice also provided evidence for the retention and function of orally taken plant miRNAs in mammals. [27]
To comprehensively characterize the exogenous sequence fragments in EVs from humans, we collected myriad reference genomes and applied read alignment. In the results, we showed the distribution of exogenous reads mapped to animals, plants, viruses, bacteria, fungi, and contaminants in the healthy samples and summarized the species with the highest mapping rate in Nightingale rose charts. Compared with animal-derived sequences, plant-derived sequences have the 2′-O-methyl modification at the 3′ ends, which protects them from uridylation, [28] and have a lower degradation rate during cooking or in low-pH environments in the gastrointestinal tract. [8] Therefore, the plant species with the highest mapping rate are common foods, especially vegetables and fruits, which often enter the human body as raw food. We identified miR535-mapped reads only in NB samples. The predicted functions of miR535-mapped reads are related to synaptic activity, dendrite morphology and development, and Ras protein regulation. In the fungi chart, there were fungi infecting crops and fungi associated with food processing, which suggests that the fungi-derived sequences in human EVs could also reflect the composition of the human diet and microbes associated with food processing. Diet is an important and diversified source of exogenous sequences in the human body. [7]
The distribution of nonhuman reads was different between conditions in some cancer samples, especially the proportion of bacteria. Bacteria and viruses in the human body are closely related to cancer. As the bacteria with a high mapping rate mostly belong to the gut microbiota, the difference in enriched functions between cancer and normal samples may reflect the change in host-microbiota interactions. The relationship between cancer and viruses has been discussed extensively. [29,30] Notably, parainfluenza virus and mammalian rubulavirus were detected only in LUAD samples and had significant abundance. In the contaminant category, it is difficult to determine whether mycoplasma, chlamydia, and E. coli are present as contaminants or pathogens. We divided mycoplasma and chlamydia into human pathogens (high prevalence) and contaminants (low prevalence) based on their prevalence in the population. [33][34]
The sequencing datasets used in this study were not originally generated to detect exogenous RNA; only a small quantity of reliable exogenous miRNAs was detected, and the source of contamination in the experiments was unknown. We tried our best to distinguish the mapping results that might be caused by contamination, but we still could not discriminate all the contaminated sequences.

Figure 1. Summary of sample statistics and reads mapping of 1838 human EV small RNA sequencing samples. A) Material and body fluid source distributions of 1838 samples. B) Country distributions of 1838 samples. C) Numbers of samples of each kind of disease and condition. D) Proportions of reads mapped to the human genome, six types of exogenous references, and unmapped to any references in 1838 samples.

Figure 2. The sequence distribution of EVs from plasma and serum in healthy samples, and their top exogenous species. A) Bar plot for nonhuman reads distribution in 108 plasma and 49 serum healthy samples. B) Top 20 species with highly abundant exogenous sequence fragments for animals, plants, bacteria, viruses, and fungi in 157 healthy samples.

Figure 3. Proportions of EV reads from eight diseases mapped to the human genome and six types of exogenous references. A) Proportion of reads mapped to different types of references in breast cancer (BC), colorectal cancer (CRC), glioblastoma (GBM), lung adenocarcinoma (LUAD), prostate adenocarcinoma (PRAD), and renal cell carcinoma (RCC) samples with different material and body fluid sources. B) Proportion of reads mapped to different types of references in LUAD, pancreatic cancer (PAAD), pulmonary hypertension (PH) samples, and their matched normal samples.

Figure 4. COG annotation of EV bacterial sequences and proportion of virus sequences in EVs from cancer and normal samples. A) Enriched COG function of mapped bacterial reads in cancer (492) and normal (157) samples. B) Viruses with the highest mapping rate in breast cancer (BC) (18) and control (1) samples and lung adenocarcinoma (LUAD) (19) and control samples (5). The x-axis represents the proportion of virus-mapped reads to all nonhuman reads of samples.

Figure 5. The proportion of diet-correlated nonhuman reads of healthy samples and plant-derived miR535-mapped reads. A) The proportion of diet-correlated reads to nonhuman reads in healthy samples from China (8, plasma), Australia (47, serum), and the USA (100, plasma). B) The ratio of exogenous sequences aligned to each type of staple food to sequences aligned to all types of staple food in human EVs from healthy samples (108, plasma). The sizes of the food genome are shown in the legend. C) Identified possible exogenous miRNAs with the highest abundance from all samples and the functional enrichment of the target genes of the reads.