A priori estimation of sequencing effort in complex microbial metatranscriptomes

Abstract Metatranscriptome analysis or the analysis of the expression profiles of whole microbial communities has the additional challenge of dealing with a complex system with dozens of different organisms expressing genes simultaneously. An underlying issue for virtually all metatranscriptomic sequencing experiments is how to allocate the limited sequencing budget while guaranteeing that the libraries have sufficient depth to cover the breadth of expression of the community. Estimating the required sequencing depth to effectively sample the target metatranscriptome using RNA‐seq is an essential first step to obtain robust results in subsequent analysis and to avoid overexpansion, once the information contained in the library reaches saturation. Here, we present a method to calculate the sequencing effort using a simulated series of metatranscriptomic/metagenomic matrices. This method is based on an extrapolation rarefaction curve using a Weibull growth model to estimate the maximum number of observed genes as a function of sequencing depth. This approach allowed us to compute the effort at different confidence intervals and to obtain an approximate a priori effort based on an initial fraction of sequences. The analytical pipeline presented here may be successfully used for the in‐depth and time‐effective characterization of complex microbial communities, representing a useful tool for the microbiome research community.


| INTRODUC TI ON
The study of the human microbiome has dramatically expanded our understanding of the role that microbes play in health and disease. These studies have been facilitated by the development of next-generation sequencing (NGS) technologies, which are capable of generating enough sequences to cover most of the diversity present in a sample. However, capturing the full composition is still a challenge, even when estimated by the distribution of 16S rDNA sequences (Ni et al., 2013;Tamames et al., 2012). The study of transcriptomes of whole microbial communities, or metatranscriptomics, has increased exponentially in the last few years (Shakya et al., 2019;Zhang et al., 2019), and trying to analyze these kind of data has produced a new set of challenges. Thus, García-Ortega | 13383 MONLEON-GETINO aNd FRIaS-LOPEZ and Martínez (2015), using a nonparametric estimator for the number of undetected genes, found that on average, approximately 10% of the expressed genes per accession remain undetected if individual sequencing libraries are analyzed. The power and accuracy of such experiments depend substantially on the number of reads sequenced, so a crucial step in the experiment design should be determining the optimal read depth for a particular study or verifying whether the experiment has adequate depth (Robinson & Storey, 2014).
Most RNA-seq studies have focused on assessing the depth of transcriptome sequencing in eukaryotic systems, using a wide range of estimated sequencing depths to cover the full patterns of expression. In the human transcriptome, the sequencing depth estimated as necessary to observe differences in expression profiles varies from 100 to 700 million sequences (Toung et al., 2011;Westermann et al., 2012). In the case of prokaryotic RNA-seq experiments, Haas et al. have shown that the reads typically produced in a single lane of the Illumina HiSeq sequencer far exceed the number needed to saturate the annotated transcriptomes of diverse bacteria growing in monoculture (Haas et al., 2012).
In metatranscriptome sequencing, saturation is reached when an increment in the number of reads does not result in an additional increment in the number of expressed transcripts, or no additional ORFs are detected in the case of shotgun metagenomic analysis.
One way of estimating the point of saturation is by using rarefaction curves, a method commonly used in ecology to estimate the species richness as a function of sampling effort. In the case of RNA-seq/ DNA-seq, a higher sequencing depth will only prolong the curve but is otherwise comparable to a lower sequencing depth curve for the same regions. Once the curve reaches a plateau, where additional sequencing would only marginally increase the number of transcripts observed, the curve can be considered as saturated, and there is therefore no need to increase the sequencing effort to describe the gene expression profiles of the community. Another useful feature of saturation curves is that they allow the complexity of the sample to be assessed: Expressed transcripts will be numerous in highly complex communities and low in those with low complexity.
We have developed a method to calculate the sequencing effort needed to reach the maximum number of existing genes using rarefaction curves extrapolating from a small initial sequencing depth (10%-20%) and estimating the confidence intervals at 90%, 95%, or 99% of the maximum sequencing effort.

| Methodological overview
We first simulated more than a thousand different metatranscriptomic/metagenomic matrices. On those matrices, we computed rarefaction curves using the function iNEXT( ) from the iNEXT R library for Interpolation and Extrapolation for Species Diversity (Hsieh et al., 2016). We then used a nonlinear growth model to compute the maximum number of genes expected and to estimate the sequencing depth (reads) required for 90%, 95% or 99% of the maximum sampling effort.
Finally, using a method based on machine learning, we predicted the 90%, 95%, or 99% of the maximum number of genes using only a minimum number of sequencing depth (reads) and the sampling effort needed. All these functionalities were included in some functions of R. The method was tested, as an application thereof, on metatranscriptomic samples of an oral microbiome. The results are presented in the supplementary material of this article.

| Simulation of metatranscriptomic/ metagenomic matrices
Metatranscriptomic/metagenomics matrices were simulated as described in Rodríguez-Casado et al. (2017) and in Monleon-Getino et al. (2019). In order to simulate gene data and the associated reads in each sample, it is necessary to know which underlying probability model best explains the distribution of the data, for example, the binomial (distribution of the reads per gene in a sample), multinomial (distribution of the counts for the set of genes in the sample), and complex distributions such as the Dirichlet-multinomial (distribution of reads for the set of genes and samples in the experiment). The following is a brief theoretical introduction to these distributions, which once known allow the simulation of new samples by Monte-Carlo simulation, a statistical method used to solve complex mathematical problems through the generation of random variables. Usually, for convenience, in M we change the notation of p by k; also during the statistical analysis, we use the transpose M′structure (k rows: genes, p columns: samples), which shows the samples (e.g. individuals) in the columns and the identified gene in the rows (Table 1).
As a result of genomic analysis, M′ can be very large and usually has thousands of genes, most of them with small frequencies or 0, that is, M′ is typically a sparse matrix. This matrix is truncated because some characteristics were not observed in the sampling.
From the statistical point of view, it is highly convenient to formalize the probability distribution underlying this matrix structure, so each sample from M′ can be represented by one k-dimensional random vector X j ;X j = m 1j , m 2j , …, m kj , where m kj represents the number of times that gene k is observed in sample j.
The probability distribution of each random vector X i. (vector row) and X .j (vector column) can be associated individually with a multinomial distribution, The multinomial distribution is a multivariate generalization of the binomial distribution, where the marginal distribution of each X ij is: for example, if we consider the partition of all sample space Ω j the j-sample space in k parts: One individual selected randomly has the probability kj of belonging to the gene A kj in the partition: If we wish to calculate the probability of sample j having N .j individuals, m 1j belongs to class A 1j , m 2j to class A 2j ,...,m kj to class A kj , with the restriction Furthermore, using the multinomial function of density (mass function) we can calculate this probability, MN N .j ; j = 1j , 2j , …, kj : where 0 ≤ ij ≤ 1 for all i in 1 to k, and 1j +…+ kj = 1 (∀j), and if k = 1, the mass function is reduced to the binomial, ∀j = 1, . . , n.
The conjugate prior of the Multinomial distribution is the Dirichlet distribution, the multivariate generalization of beta distribution. Hence, the parameter vector k = 1j , 2j , …, kj ; ∀j has a prior distribution given by: In (10), the density function is given by: In Bayesian inference, p ( |x) is known as posterior distribution and is proportional to likelihood (p(x|θ))x prior distribution (p(x)), so The posterior distribution of j given X is: Thus, in order to implement a new method that calculates the depth of the sample and conveniently estimates the sampling effort, as well as whether it is necessary to sequence more samples

| Calculating rarefaction curves
There are many methods for calculating the rarefaction curve for each M′; here, we chose to use one of the most recent ones, the iNEXT( ) function of R iNEXT for Interpolation and Extrapolation for Species Diversity (Hsieh et al., 2016). This library provides simple functions to compute and plot two types (sample size-and coverage-based) of rarefaction and extrapolation of species diversity (based on Hill numbers) for individual-based (abundance) data or sampling unit-based (incidence) data. (1) Using the iNEXT( ) function, we calculated the rarefaction curves for each metatranscriptomic/metagenomic matrix (M′) simulated previously.

| Calculating sampling effort
Unfortunately, iNEXT( ) cannot calculate the maximum number of genes or estimate the sampling effort, and the reads covering 90%, 95%, and 99% of the maximum number of genes in the case of nonsaturated rarefaction curves. To address this caveat, we propose a nonlinear parametric model.
In this type of study, it is common to perform an initial analysis for model selection. Thus, rarefaction curves were first fitted based on previous experience and a selection of possible nonlinear models   Several functions, including Weibull, logistic, asymptotic regression through the origin (or a two-parameter Weibull growth model), Gompertz, and Michaelis-Menten models, were tested using nonlinear regression for use as extrapolations of the rarefaction curves . The regression analysis was performed using the R-package function nls( ), and the model accuracy was tested with the function accuracy( ) of the R-package rcompanion (R Companion, 2018), which produces a table of statistics that can fit multiple models. The model accuracy was tested using Efron's pseudo r-squared, Min.max.accuracy (for minimum, maximum accuracy, more substantial indicates a better fit, and a perfect fit is equal to 1), and root-mean-square error (RMSE), which has the same units as the predicted values.
The Weibull sigmoid model obtained the best scores and was selected as a useful function that fits and extrapolates the rarefaction curve.
The Weibull growth model used in our studies is derived from the one-parameter Weibull function (10), given by: where γ is a shape parameter and x > 0 and γ > 0. The distribution function has a point of inflection at The following equation can then be used to obtain the sigmoidal curve for empirical use: Moreover, the Weibull function of four parameters can be de- where W (x) is the potential number of genes being expressed for each number of reads (x) and now a = , b = − , c = and m = . a, b, c, and m are parameters to be estimated and e is the base of the natural logarithms. a is the asymptote of limiting value of the response

| Estimation of the amount of sequencing (reads) needed to cover the total expected microbial metatranscriptome/metagenome (confidence band)
The maximum potential number of genes being expressed and the 95% confidence band was used as an estimation of the asymptote of limiting value in a Weibull growth model of four (12)

| A priori gene prediction using a few initial total reads
We used different algorithms to fit a regression model to predict the potential number of genes, effort/reads covering 90%, 95%, or 99% of the maximum number of genes based on the first 10%-20% of (10) sequences (reads). As a first strategy, a classical linear regression of the function lm( ) was optimized using a function step( ) to perform the stepwise model selection, and the model was validated using the function cv.lm (data, model, m) from the DAAG library (Maindonald & Braun, 2010. This function gives internal and cross-validation measures of predictive accuracy for multiple linear regression. Two other strategies were applied: the so-called machine learning algorithms such as support vector machines (SVM) and Extreme Gradient Boosting (XGBoost), in which we used the training data (with multiple features) x i (here the genes in each sequencing depth) to predict a target variable y i (maximum number of genes). Support vector machines (SVM) constitute a data classification method that separates data using hyperplanes, which is useful in the case of regression (Cortes & Vapnik, 1995). If we have labeled data, SVM can generate multiple separating hyperplanes, so that the data space is divided into segments, each containing only one kind of data. The SVM technique is generally useful for data which has nonregularity, that is, without a known distribution. We used the function SVM( ) in R for the calculation (Chang & Lin, 2011).
Extreme Gradient Boosting is an efficient implementation of the gradient boosting framework from Chen and Guestrin (2016). In order to check the accuracy of different models, it is common to use the coefficient of determination (R 2 or R-squared), the mean absolute error (MAE), and the root-mean-square error (RMSE) (Hyndman & Koehler, 2006). where n is the number of pairs of observations, ŷ i the predicted value and y i the observed value.

| Metatranscriptome databases used in the method application
We used metatranscriptome datasets from three different sources for the application of the proposed method. The first set was generated in our lab as described in Yost et al. (2015) and project, id 935 (http://metag enomi cs.anl.gov/linkin.cgi?proje ct=935). The third dataset was generated by Jorth et al. (2014) and is available at DNAnexus study number SRP033605 (http:// sra.dnane xus.com/studi es/SRP03 3605). All databases were bioinformatically cleaned of rRNA sequences, and in the case of SRP033605, we also removed low-quality sequences from the query files. Fast clipper and fastq quality filters from the Fastx toolkit (http://hanno nlab.cshl.edu/fastx toolk it/) were used to remove sequences shorter than 50 bp with a quality score > 20 in > 80% of the sequence.

| Metatranscriptomic/ Metagenomic matrix simulation, rarefaction computation, and estimation of parameters
Our focus was to study the transcriptome of whole complex microbial communities rather than individual transcriptomes, using an oral community as a model. The oral microbiome is one of the best characterized human body sites (Belda-Ferre et al., 2012;Haffajee et al., 2008;Marsh, 2006;Paster et al., 2001;Peterson et al., 2013;Socransky et al., 1998), comprising an extremely complex and highly organized biofilm community (Kolenbrander, 2000;Kolenbrander et al., 2002). More than 700 bacterial species have been identified in the oral cavity [Paster et al., 2001;Dewhirst et al., 2010]. Many oral bacterial species have not yet been cultivated, and the only information we possess about them derives from their 16S rRNA phylogenetic affiliation.
In the current study, we investigated the proposed mathematical Weibull model, using nonlinear regression modeling. This model is a generalization of the asymptotic growth model in that it reduces when the parameter m is unity (see Methods).

TA B L E 3
Model accuracy for the prediction of the maximum number of genes using only 20% of total reads in a simulation of 1,587 metatranscriptomic/ metagenomic genomic sequences Using an R script (see Supplementary Material), we simulated 1,587 metatranscriptomic/metagenomic matrices containing more than 9 9 reads, with random numbers of genes (min = 267, max = 339,319) and reads (min = 550, max = 6,823,774), and always 3 samples (replicas). The simulations had a high computational cost of more than 2 weeks and were carried out on a Linux Xeon SP 4114 2.2 GHz computer server with 40 cores. This information has been collected in a data frame for further analysis.
A rarefaction curve using the 1,587 simulated cases was computed using the function iNEXT( ), and the vector obtained (n = 100 points, x = reads, y = genes) was saved and used later to compute (a) the maximum number of genes, (b) the sampling effort to reach the maximum number of genes (minimum = 1%, maximum = 100%; see Table 2), and c) the reads covering 90, 95 and 99% of the max- Four examples of the results obtained are shown in Figure 1. The results can be distinguished in four different types of rarefaction curves: • Over-sampling curves: minimum sampling effort to obtain the maximum amount of genes in a quick rarefaction curve (Figure 1.a).
• Correct sampling curves: medium sampling effort to obtain the maximum amount of genes in a saturated rarefaction curve ( Figure 1.b).
• Under-sampling curves: maximum sampling effort to obtain the maximum amount of genes in a nonobserved saturated rarefaction curve (Figure 1.c).
• Very under-sampling: very maximum sampling effort to obtain the maximum amount of genes in a nonobserved saturated rarefaction curve (Figure 1.d).
Moreover, in the curves of Figure 1, we can distinguish the vertical lines of the reads covering 90%, 95%, and 99% of the maximum number of genes.

| A priori gene prediction using only a few total reads
Using the simulated data and the parameters estimated previously, we fitted a regression to predict the potential number of genes and F I G U R E 2 RMSE and absolute error bands (mean (red), 95% (blue), and 99% confidence (magenta)) of different methods [(a) support vector machine, (b) linear regression model, and (c) XBoost] using 20% of sequencing depth (reads) to predict the maximum number of genes. 300 random resamples were performed the reads covering 90%, 95%, or 99% of the maximum number of genes using only the first 20% of sequences (reads). To implement this method, we used three algorithms (linear model (lm), Extreme Gradient Boosting (XB), and support vector machine (SVM)) to predict the aforementioned values. Several predictors were tested to predict the maximum number of genes as a function of the first 20% of sequences (reads). Using the simulated data, several good predictors were detected, such as the asymptote, using a four-parameter Weibull model or other similar and well-known models such as the logistics curve model . Other predictors used were the minimum-maximum number of genes observed and, finally, the minimum-maximum number of reads observed (see Table 3, central column; model 1 and model 2 and supplementary material).
After testing the prediction of the proposed models using the three aforementioned prediction algorithms (lm, Xboost and SVM), it was found that the results of the prediction of interest (maximum number of genes and reads covering 90%, 95%, or 99% of the maximum number of genes) for the total curve with the 1,587 simulated samples were very similar, with an R 2 > 0.99, which indicates a possible over-fitting (see Table 3, right column).
To validate the method and the models, we initially used only the first 20 points of the rarefaction curve (reads of 20% of the total amount of the curve obtained) and then divided the total number of simulated rarefaction curves (n = 1,587) and the estimated parameters (maximum number of genes, sampling effort, etc.) into two parts using cross-validation: (a) In the training set, 70% was used to train and estimate the prediction models (lm, XB, and SVM), and (b) in the test set, 30% was used to check the model fit and capacity to predict the maximum number of genes, and reads covering 90%, 95%, and 99% of the maximum number of genes using only the first 20% of sequences (reads).
We used 300 random resamplings, and a significant computational effort was made to obtain the predictions using models 1 and 2. We determined that the XB and lm are useful methods to predict the maximum number of genes using only 20% of sequencing depth.
To prove the accuracy of the method, we used the mean absolute error (MAE), root-square-mean error (RSME), and the coefficient of determination (R 2 ) between estimations using the Weibull model with 100% and 20% of the rarefaction curve.
The results of the validations of the three prediction methods (XB, lm, and SVM) and model 1 are presented in Figures 2 and 3 (prediction of maximum number of genes) and 4 and 5 (prediction of reads covering 95% of the maximum number of genes), which show the absolute error (MAE), RMSE bands (mean and 95% and 99% confidence), and R 2 for the 300 random resampling test sets. It can be F I G U R E 3 Coefficient of determination (R 2 ) bands (mean (red), 95% (blue) and 99% confidence (magenta)) of the different methods used Finally, an XB model 1 including the total amount of simulated data (n = 1,587) was estimated and saved. The R 2 of all data and prediction models (lm, XB, and SVM) are presented in Figure 6. This model will be used to predict the described parameters of interest (maximum number of genes: Figure 6 a,b,c, reads to cover 95% of the maximum number of genes: Figure 6d,e,f, effort, etc.). Also, the confidence interval (95%) of the prediction was obtained by applying a "bagging" method, which was possible with the XB model and involves creating the same model many times (with randomness).
Finally, using 100 subsamples we obtained the prediction mean and 95% by means of the function ci.mean( ) of the library(Publish) for R. The final XB model estimated to predict the maximum number of genes has an MAE = 86 and R 2 = 0.9999997 between the observed and predicted values (Figure 6c). The final XB model estimated to predict reads covering 95% of the maximum number of genes has an MAE = 634 and R 2 = 0.9999999 between the observed and predicted values (Figure 6g).
F I G U R E 4 RMSE and absolute error bands (mean (red), 95% (blue), and 99% confidence (magenta)) of different methods [(a) Support vector machine, (b) linear regression model, and (c) XBoost] using 20% of sequencing depth (reads) to predict the reads covering 95% of the maximum number of genes. 300 random resamples were performed

| Application of the proposed method to real data
External validation (real data not used before) was performed to check the algorithms developed previously. To this end, we used a set of 15 datasets of metatranscriptomes from the oral cavity. These RNA sequences consist of vectors of 10 5 -1.5 × 10 7 read depth with a 10,000 and 600,000 gene size, most of them with saturation but in some cases with a definite unsaturation. We used these sequences

| Conclusions
This proposed method to estimate the maximum number of genes and the reads covering 90, 95, and 99% of the maximum number of genes, using an algorithm based on a rarefaction curve + Weibull model + machine learning prediction, will help researchers to know whether sampling is sufficient or needs to be increased. The method should be used with precaution when predicting the sequencing depth, especially with unsaturated samples. However, although the proposed model can cause predictive problems, it was found to work in most cases. Further studies using real sequences and typologies should be carried out to fully validate the model and the simulation-based methodology.
Estimating the sequencing depth required to adequately sample the target metatranscriptome/metagenome using RNA-seq, and Shotgun is an essential first step in obtaining robust results in subsequent analysis and avoiding overexpansion once the information contained in the library reaches saturation. Our method allows the use of an initial shallowly sequenced sample (in this case 20% of the total amount of reads sampled) to estimate the sequencing effort needed to cover the whole metatranscriptome/ metagenome from the same sample and therefore to estimate the sample size. The initial number of sequences is low enough for current NGS methods to analyze a considerable number of samples at a low cost.

Our thanks to Andreu Paytuví and Walter Sanseverino from
Sequentia-Biotech for his advice with metagenomics. Also, our thanks to Javier Mendez from Department of Microbiology of the University of Barcelona for his advice with the methods.

CO N FLI C T O F I NTE R E S T
The authors declare no conflict of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
All code is open source and available in Github. All functionalities shown in Figure 1 (rarefaction curve, Weibull nonlinear model, F I G U R E 6 Prediction of maximum number of genes (a, b, c) and reads covering 95% of the maximum number of genes (d, e, f) using a SVM, lm, and XGBoost model and only 20% of the reads versus the observed value. All the samples were used (n = 1,556) effort estimation, extrapolation of the maximum number of genes, reads covering 90, 95, and 99% of the maximum number of genes) have been compiled in two new functions in R: PILI3( ) and monle.
predict.max( ) and added to the library(Sequencingeffort) and can be found at the repository https://github.com/amonl eong/Seque ncing effort. For each described function, at least two examples of use are given, as well as the explanation of the arguments of the functions.

S U PP O RTI N G I N FO R M ATI O N
Additional supporting information may be found online in the Supporting Information section.