Single-cell polyadenylation site mapping reveals 3′ isoform choice variability

Cell-to-cell variability in gene expression is important for many processes in biology, including embryonic development and stem cell homeostasis. While heterogeneity of gene expression levels has been extensively studied, less attention has been paid to mRNA polyadenylation isoform choice. 3′ untranslated regions regulate mRNA fate, and their choice is tightly controlled during development, but how 3′ isoform usage varies within genetically and developmentally homogeneous cell populations has not been explored. Here, we perform genome-wide quantification of polyadenylation site usage in single mouse embryonic and neural stem cells using a novel single-cell transcriptomic method, BATSeq. By applying BATBayes, a statistical framework for analyzing single-cell isoform data, we find that while the developmental state of the cell globally determines isoform usage, single cells from the same state differ in the choice of isoforms. Notably this variation exceeds random selection with equal preference in all cells, a finding that was confirmed by RNA FISH data. Variability in 3′ isoform choice has potential implications on functional cell-to-cell heterogeneity as well as utility in resolving cell populations.


.1 Selection of genes for modeling
Due to the relatively low capture efficiency of single-cell transcriptomics, we observe only a single strongly expressed 3' isoform for most genes. We therefore restrict the following analysis to those 493 genes for which we observe at least two isoforms at an average of at least 0.2 barcodes per cell each, which corresponds to approximately 8 RNA molecules per isoform in each cell based on the estimates of the model (see Table EV2 for an overview of all genes included). For the majority of genes retained, only two isoforms are expressed at that level; we therefore restrict the following analysis to the two most abundant isoforms for each gene.

Formulation of the BATBayes model
To separate the observed variance into technical noise, variance induced by random partitioning of mRNAs to isoforms and potential contributions 1 by variance in isoform preference (Figure 5a), we developed two Bayesian models. The null model ("first model") is that isoform preference is the same in all cells, whereas the alternative model ("second model") allows isoform preference to vary (see Figure 5b for a visual depiction of both models). For each combination of gene g, polyadenylation isoform i, and cell c, we observe N gic molecular barcodes. To account for technical noise, we model N gic as a sample of the number of RNA molecules M gic expressed in the cell where β c is the capture efficiency observed in cell c. β c is determined by regressing the number of observed ERCC spike in molecules against the number of molecules put into the reaction (Figure 2c, inset). For an overview of all quantities used in BATBayes, see inline table 1.  For both models, we assume that upon expression of an RNA molecule, the isoform is chosen randomly with a gene-(and possibly cell-) specific probability p gc . If a total of Q gc RNA molecules are expressed in a given cell, isoform 1 is chosen M g1c times according to binomial partitioning of RNA molecules to two isoforms: and It is evident that under these assumptions, the observed isoform ratios can be highly variable even if p gc is equal in all cells merely due to technical 2 noise and binomial partitioning. For the first model, we therefore assume equal isoform preference in all cells: In the second model, we assume that p gc is variable across cells. We model this using the conjugate prior of the binomial distribution in (2), the Beta distribution: which we parametrize by its mean and the so-called concentration parameter, a quantity related to the variance: We chose this parametrization because c g scales independently of the mean, whereas the variance is dependent on the mean (Appendix Figure S4a). We further note that this parametrization allows the model to easily be extended to more than two isoforms by using a Dirichlet distribution as the generalized form of the Beta distribution. For ρ, we assume a flat 0, 1 prior: c g is unknown, but of primary interest to our model as it describes the cellcell variability in isoform choice. For c → ∞, the Beta distribution becomes narrowly peaked and the model degenerates to the first model. In a simpler version of the second model ("second model A"), we assume that c g is equal for all genes: i.e. that all genes display equal variation in isoform preference. We parameterize by 1/η based on the consideration that at large values of the concentration parameter, even large changes have little effect on the resulting Beta distribution; a Beta distribution with concentration parameter 100 is almost identical to a Beta distribution with concentration parameter 200, but radically different from a Beta distribution with concentration parameter 0. We then place an exponential hyperprior on η: We note that the prior variance of η was 1, and the posterior variance of was 0.00007 or smaller, indicating that η was well defined by the data and that the exact choice of prior does not influence the conclusions drawn by the modeling.
We then allowed c g to vary across genes ("second model B"). Initially, we fitted second model A to all genes independently. We observed similar values for c g for most genes, but some outliers with very low concentration parameters (Appendix Figure S6c, x-axis). When we investigated these outliers, we found that they were mostly genes with very low expression level, i.e. typically only one barcode per cell observed. As beta distributions with extremely high variance (all density at 0 and 1) do not contain more prior uncertainty than extremely low variance beta distributions (all density at ρ), Bayesian parameter estimation greatly favored high-variance distributions in these cases. We therefore share information across genes and allow c g only to deviate from the typical variance in isoform preference if the data justifies it by using an empirically estimated lognormal prior on c g : We use the lognormal distribution because as described before, at small concentration parameters, smaller absolute changes in c translate to larger changes in the Beta distribution than at large concentration parameters. The distribution was parameterized by its mean η and a precision parameter τ , upon which we place a conjugate hyperprior: The posteriors of τ are only shifted against the prior by 30% (Appendix Figure S6b). Based on the DIC and simulations, we use second model B as final model ("BATBayes"), but we note that the prior on τ influences the spread of final estimates of gene-wise variances displayed in the lower panel of figure 5e. The use of a rather conservative exponential prior with rate 0.1 avoids an overestimation of gene-wise differences that is not backed up by the data. To model the number of RNA molecules for any gene, we assume a negative binomial distribution, based on the observation that RNA molecule counts are typically more disperse than a Poisson distribution [5]: parametrized by its expected value and rate parameters, for which we assume flat priors: The BATBayes source code is given in Code EV1. 4

Model fitting and comparison
We fit the model to the data (molecule count tables N gic and capture efficiencies β c ) by using JAGS, a program for Bayesian parameter estimation using Monte Carlo Markov Chains [3]. Starting parameters for mean gene expression and isoform ratio can readily be estimated from the data; for all other parameters, we initialize at values drawn randomly from the prior distributions. For each combination of model and data considered in the main text, we ran at least 3 chains for at least 300,000 iterations each. We recorded values of ρ, η, τ, c, µ and q at each 100 th iteration to avoid autocorrelation. We further monitored the variance of the isoform ratio N g1· /(N g1· + N g2· ), which were used for creating parts of figure 5e. We further recorded the posterior means of all latent variable. We use the CODA package of the programming language R to analyze MCMC chain output [4]. We discard a burn-in of up to 100,000 iterations and make sure that chains converge by using Gelman and Rubin's diagnostic [1], which was consistently below 1.1, as well as by manual inspection of diagnostic plots (see Appendix Figure S4b-d for representative examples). The deviance information criterion was computed using the JAGS DIC module. The DIC is defined as Where D avg (y) is the mean deviance of data y across samples from a MCM chain, and p D is the number of free parameters [6]. Akaike's Informaiton Criterion (AIC) and Schwartz' BIC are not applicable in the context of hierarchical Bayesian models because parameters can be closely constrained by their priors. p D is estimated from the difference between the average deviance and the deviance at the posterior mean of the parameters:

Simulation of control data sets
To ascertain that the model provides correct estimates of variability in isoform preference, we simulated several control data sets and investigate the behavior of the model fit. We first simulate data under the assumptions of the second model B (isoform preference is variable across different cells, and this variance is different in different genes) by drawing random values for N , M , Q, p and c from the distributions specified in equations 1-3, 5 and 9. As parameters for ρ, η, τ, µ and q, we assume the posterior means of the model fit to the ESC-2i data. For β, we assume the measured values. The parameter estimates obtained from fitting the second model B to the simulated data deviate from the true values by consistently less than 30%; the estimate of the important hyperparameter η deviates from the real value by less than 1% ( Figure 5d).
We then simulated data under the assumptions of the first model (isoform preference is equal in all cells) by drawing random values for N , M and Q from equations 1-4, again assuming the posterior means of the model fit to the ESC-2i data as parameters. The variance of isoform preference obtained from fitting the second model B to this simulated data set is correctly estimated to be close to 0, approximately 10 times lower than the value estimated for the real data ( Figure 5d).
Besides the observed data N , the model depends on the capture efficiencies β, which were estimated based on the use of spiked-in in vitro transcripts. To make sure that the conclusions drawn from the model are insensitive to experimental inaccuracies in determining β, we simulated data under the assumption of the first model (no variability in isoform preference) and one of the following scenarios. In the first scenario, we assume that the true value β is really lower than the experimental estimate β by a fixed factor In the second scenario, we assume that β is different for different genes were S is the sigmoid function and In the third scenario, we assume that β is different for different genes and isoforms The second model B was fit to data simulated from these models for several arbitrarily chosen parameters ϕ, and the fits correctly displayed no evidence for the presence of variability in isoform preference (Appendix Figure S5bd).

BATBayes2: Clustering of single cells based on polyadenylation site usage
To cluster cells based on polyadenylation site usage, we assume that a global correlation structure exists on the matrix p cg . We initially fitted the second model B ("BATBayes") to the data pooled from all 107 cells, and performed a principal component analysis on the posterior means of p. While this approach allowed us to retrieve a rough population structure ( Figure  7b), the ESC-FCS and NSC population were difficult to separate using the first two principal components. A possible reason for this relatively weak separation may be that the BATBayes model assumes that no non-random 6 correlation on p exists between different genes. We therefore extended the second model to explicitly include such a correlation structure on p (see also Figure EV4a), and fit the extended model to the data using the MCMC methodology described above.
In analogy to independent component analysis, we define a score δ c for each cell and a loading ε g for each gene. We then allow the mean percentage of major isoform produced (equation 6) to be different for different cells and calculate the expected isoform usage as We replace the prior in equation 8 by The magnitude of the correlation on p is unknown, and determined by the product δ · ε. Priors on δ and ε therefore cannot be uniquely identified. We fix the prior on δ: And estimate the prior on ε g empirically with hyperprior θ ∼ exp(0.1) To make sure that this approach does indeed cluster only based on 3' UTR usage and does not require differences in gene expression levels to work, we simulated a mixed population, where the parameters governing mRNA levels (µ and q) were taken from the fit of BATBayes to the NSC population for half of the cells and from the fit to the ESC-2i population for the other half. The parameter governing isoform ratios (ρ) was taken from the fit to the ESC-2i population for all cells. Evidently, cells clustered apart using hierarchical clustering on downsampled gene expression level [2], but not using BATBayes2 ( Figure EV4b). We then simulated a population taking µ and q from the fit to the ESC-2i population for all cells, whereas ρ was taken from the ESC-2i population for half of the cells and from the NSC population for the other half. Cells could not be distinguished based on gene expression levels, but isoform-based clustering separated populations very well ( Figure EV4b).
The BATBayes2 source code is given in Code EV 2.
[2] D. Jaitin, E.   a Overview of filtering strategy. In total, 42.3 million reads were obtained, of which 10.3 million reads passed quality control filters.
b Venn diagram of unique molecules (UMI-cell-gene combinations) identified during four MiSeq runs. After constructing sequencing libraries on magnetic beads, PCR enrichment was performed prior to sequencing like in standard library preparation protocols. Sequencing deeper into a library did not yield higher coverage (PCR 2, seq. 1 and 2) whereas repeating the final enrichment PCR did increase coverage.
c Distribution of observed UMIs across cells. In total, 869.000 unique molecules were observed.
d Correlations of read and molecule counts. Each row/column of the heatmaps corresponds to a single cell.
e Use of a well-specific spike-in reveals that template switching occurs at very low frequencies. A concern in UMI-based protocols is template switching during PCR, which would result in single original molecules being represented by multiple UMIs in the final sequencing data. In our protocol, such template switches would not only affect the UMI, but also the cell barcode; we therefore included an additional synthetic RNA spike-in (pGIBS-THR) in rows C & G before cell lysis. As multiple rows were pooled prior to PCR, a switch in barcode would result in pGIBS-THR reads in other rows.
f Alternative strategy of simulating the experiment shown in Figure 2d. Here, true expression values were estimated by dividing the measured gene expression values from the bulk experiment by the capture efficiency estimate. Obtained correlations are quantitatively identical to the correlations shown in Figure 2d.    a Choice of parametrization. The Beta distribution governing isoform preference was parameterized by a concentration parameter instead of a variance because the concentration parameter is independent of the mean.
b Convergence diagnostics for a typical Monte Carlo Markov Chain, the second model fitted to ESC-2i data. Here, three randomly initialized chains were run for 300.000 iterations each, a burn-in of 100.000 iterations was discarded and a thinning interval of 300 was applied. For the hyperparameters η and τ , as well as for randomly selected gene-wise isoform variability parameters c, the plots display autocorrelation (b), trace (c) and density (d). a Simulations recapitulate data. Distributions of bulk isoform ratios, gene expression levels and molecular counts estimated from the ESC-2i data and for a data set that was simulated using the assumption of no variability in isoform preference (First Model).
b Model output is robust to changes in the estimated capture efficiency. Posterior densities of variability in isoform preference η for the second model fit to data sets simulated using the assumption that isoform preference is identical in all single cells, and that the BATSeq capture efficiencies were not measured correctly. In b) true RT efficiencies were assumed to be lower than the estimated value by a factor between 0.2 and 0.8. In c), capture efficiencies were assumed differ for different genes by a standard deviation between 0.1 and 1. In d), RT capture efficiencies were assumed differ for different genes and isoforms by a standard deviation between 0.05 and 0.5. The posterior density of the fit to the ESC-2i data was included for reference. a A frequentist approach to single-cell isoform analysis. For each gene, the observed variance in isoform ratios (x-axis) was plotted against the variance obtained by simulating the first model (isoform preference equal in all cells). Points (y-axis) denote medians, error bars denote the interquartile range. 1000 simulations were run for each gene. For genes below the diagonal (blue), the observed variance exceeds what is expected from simulations, for genes above the diagonal (red) the observed variance is smaller than expected from simulations.
b Different genes differ slightly in isoform noise level. Comparison of model A (isoform preference variability is equal in different genes) and model B (isoform preference variability is different in different genes). Upper panel: Posterior densities of the gene-gene variance in isoform noise levels (1/τ ) for the ESC-2i data (red) and a dataset that was simulated under the assumptions of model A(green). The prior is also shown (blue). Lower panel: Comparison of model A and model B by the DIC.
c Information sharing across genes leads to moderated estimates of isoform preference variation in the limit of data. Concentration parameters c were once estimated for all genes individually (x-axis) and once for all genes in parallel, assuming an empirically estimated log-normal prior. For genes with extremely sparse data, the model frequently returned very low estimates of c when no information sharing was applied.
14 B A C