Probabilistic harmonization and annotation of single‐cell transcriptomics data with deep generative models

Abstract As the number of single‐cell transcriptomics datasets grows, the natural next step is to integrate the accumulating data to achieve a common ontology of cell types and states. However, it is not straightforward to compare gene expression levels across datasets and to automatically assign cell type labels in a new dataset based on existing annotations. In this manuscript, we demonstrate that our previously developed method, scVI, provides an effective and fully probabilistic approach for joint representation and analysis of scRNA‐seq data, while accounting for uncertainty caused by biological and measurement noise. We also introduce single‐cell ANnotation using Variational Inference (scANVI), a semi‐supervised variant of scVI designed to leverage existing cell state annotations. We demonstrate that scVI and scANVI compare favorably to state‐of‐the‐art methods for data integration and cell state annotation in terms of accuracy, scalability, and adaptability to challenging settings. In contrast to existing methods, scVI and scANVI integrate multiple datasets with a single generative model that can be directly used for downstream tasks, such as differential expression. Both methods are easily accessible through scvi‐tools.

Appendix Figure 8: Supplementary study of harmonizing datasets with different cellular composition. We show here the case where each of the two datasets has one unique cell type and shares all the others. For each box plot, we report results over all possible combinations of left-out cell types. (a) Entropy of batch mixing for the unique population (lower is better). (b) k-nearest neighbor purity (unique and non-unique; higher is better). (c) Entropy of batch mixing for the non-unique populations (higher is better). The boxplots are standard Tukey boxplots: the box is delineated by the first and third quartiles, and the whiskers extend to 1.5 times the interquartile range beyond them. The dots are outliers that fall beyond the whiskers.
Appendix Figure 15: Supplementary study of label concordance. (a) k-nearest neighbors purity of the merged latent space with respect to the protein expression space, as a function of the size of the neighborhood. (b) Protein expression heatmap showing the consistency of PBMC-Sorted labels with protein expression in PBMC-CITE. The protein expression per cell type is based on k-nearest neighbors imputation from the harmonized latent space obtained from scANVI trained with pure population labels. (c) We select individual cells that were labeled as dendritic cells or natural killer cells in the original publications of the respective datasets, and compare the raw transcript counts of cells inside the scANVI T cell cluster (DC*, NK*) against cells outside the T cell cluster (DC, NK). The expression of marker genes suggests that DC* and NK* are more likely to be T cells, and thus that the scANVI latent space is more accurate. (d) The batch entropy mixing of the three datasets in the scVI, scANVI, and Seurat Alignment merged spaces. For scANVI, we perform semi-supervision using the cell type labels (not k-means cluster labels) from only one of the two datasets; we thus train two separate models, SCANVI1 and SCANVI2. To obtain a measure of clustering conservation, we first run k-means clustering in the latent space of dataset 1, then in the harmonized latent space using only cells from dataset 1, and compute the adjusted Rand index of the two clustering results. We then do the same for dataset 2, and the final score is the average over both datasets.

A related problem is that of domain adaptation, where one observes the cell labels for one dataset and wishes to transfer them to another (unannotated) dataset. This is a well-established research area in computer vision, where algorithms such as NBNN [Tommasi and Caputo, 2013] and JDA [Long et al., 2013] are used to transfer labels across datasets of images (e.g., collections of images with different lighting conditions). As these algorithms might not scale to millions of samples, a notable contribution is the DANN framework [Ganin et al., 2016], which learns an adversarial classifier in order to generalize to the new (unlabelled) dataset. This approach is justified from the perspective of statistical theory, as the adversarial regularization is equivalent to controlling the H-divergence between the source domain P^S_X and the target domain P^T_X. Another notable contribution is the variational fair autoencoder [Louizos et al., 2016], which focuses on semi-supervised domain adaptation with VAEs by adding a maximum mean discrepancy [Gretton et al., 2012] loss between the source and target latent distributions.
Moving away from the supervised domain adaptation scenario, recent work based on generative adversarial networks [Goodfellow et al., 2014] focuses on unsupervised domain adaptation of unpaired datapoints. This problem is relevant to scRNA-seq methodology, since one never observes the same cell twice (the protocol is destructive). Cycle-GAN [Zhu et al., 2017] is based on the idea of cycle consistency (a translator that transforms a French sentence into English and then back into French should be the identity map). This idea was then improved by Cycada [Hoffman et al., 2018], which adds further consistency terms to the objective function. StarGAN [Choi et al., 2018] extended CycleGAN to multiple domains while reducing the overall complexity. Finally, MAGAN [Amodio and Krishnaswamy, 2018] proposed adding a correspondence loss to further orient the manifold alignment and facilitate inference. Notably, MAGAN was also applied to merging scRNA-seq and CyTOF data.

A.1.2 Semi-supervised learning with variational autoencoders
Extending scVI to semi-supervised learning required some design choices, which we describe here. Our first attempt was based on the M2 model [Kingma et al., 2014], and is still implemented in our codebase. While this algorithm performed, a posteriori, as well as or better than the final version of scANVI (in terms of cross-validation estimates of cell type prediction accuracy), it constrained our latent space z to be conditioned on the cell type label (as in the conditional VAE [Sohn et al., 2015]), so that it was no longer possible to visualize a single latent space, which is a crucial practice in scRNA-seq data analysis [Becht et al., 2018]. We therefore turned to extending scVI based on the M1 + M2 model [Kingma et al., 2014], which enabled both flexible modeling of cell types and visualization of a joint latent space for all cells. We also tried more complicated variants of the M1 + M2 model, such as ADGM [Maaløe et al., 2016] and LVAE [Rasmus et al., 2015], but did not find that they contributed significantly to accuracy. This might be due to labeling errors in the datasets, to the fact that dividing cells into cell types is a relatively easy problem in most regimes (e.g., when not considering rare cell types), or to the limited sample size of the datasets we consider.

A.2.1 Why is harmonization a subtle problem?
Harmonization is a hard and ill-defined problem. In particular, it can be difficult to formulate its exact objective and the level of granularity at which "harmonization" is expected. Let us take the example of two scRNA-seq datasets of peripheral blood mononuclear cells. If we assume that these datasets are exact biological replicates obtained under the same experimental conditions, then the problem is well defined: we wish to identify latent variables that govern the underlying biological processes (for example, a clear demarcation of cell types). Fundamentally, this is possible because we have made a no-confounder assumption, and therefore removing, in a principled way, all the variation in the data that corresponds to batches is reasonable.
However, if we consider a more complex, though also more common, case, the biology might be slightly different from one dataset to another. For example, in the case of T cell activation (one stimulated sample and one control sample), we expect the overall clusters to stay similar: at a broad level, CD4 T cells should cluster together, as should all the other cell types. However, a non-negligible proportion of T cells will express markers of activation. In this case, forcing the latent spaces to overlap exactly might be problematic, in the sense that this subtle signal of activation will be lost. One might think of the activation signal as a confounding factor for harmonization, which makes the overall problem much more difficult (ill-posed).
Therefore, it is extremely important to state how different models perform harmonization and what their underlying assumptions and modeling capabilities are.
A.2.2 State-of-the-art approaches

scmap [Kiselev et al., 2018] proposes a new gene filtering method to select genes that are claimed to be invariant to batch effects. However, over-filtering can clearly lead to ignoring biological information. MNN [Haghverdi et al., 2018] assumes that the topology of cell types can be easily resolved by a k-nearest neighbors graph, where neighborhoods are defined with respect to the cosine distance. This allows MNN, and all the neighbor-matching-based approaches [Stuart et al., 2019, Hie et al., 2019], to merge batches remarkably well, at the risk of also merging cell types when they are not detectable under this normalization scheme. SAUCIE [Amodio et al., 2019] and MMD-ResNet [Shaham et al., 2017] both propose to perform batch correction by adding to their objective function a non-parametric measure of distance between distributions (maximum mean discrepancy). This approach implicitly assumes that each dataset has the same cell type composition and the same biological signal; SAUCIE and MMD-ResNet are therefore prone to over-correction. Seurat Alignment [Butler et al., 2018] relies on the milder assumption that there is a common signal exactly reproducible between the two datasets and that CCA captures most of the biological variation. This is not obviously true, considering the limited suitability of a linear Gaussian model for scRNA-seq. Seurat anchors [Stuart et al., 2019] relies on CCA and suffers from the same problems as MNN under its specific normalization scheme. Finally, the recent LIGER [Welch et al., 2019] method at first sight looks like a non-probabilistic version of scVI, since it also learns a degenerate conditional distribution via integrative non-negative matrix factorization (NMF being a noiseless version of a Gamma-Poisson generative model). However, it applies a further quantile normalization of the latent spaces within clusters. First, the output of the clustering algorithm is not necessarily correct and could perturb downstream analyses. Second, if a cell type were slightly different from one condition to another, this information would be lost in the final latent space. Overall, all these correction methods can therefore potentially lead to over-correction and statistical artifacts [Nygaard et al., 2016].

A.2.3 The approach taken by this manuscript

scVI and scANVI perform harmonization by learning a common generative model for a collection of gene expression probability distributions [p(x | z, s)]_{s ∈ {1, …, K}}, indexed by the dataset identifier s. The statistical richness of this collection of conditional distributions dictates the flexibility of our model for integrating datasets.
Another notable factor that substantially contributes to harmonization capabilities is the dataset-specific prior on the cell-specific scaling factor l_n. This helps probabilistically remove library-size-driven discrepancies in the measurements and is more principled than normalization of the raw data [Vallejos et al., 2017]. Another important element is the parametrization of the neural network that maps the latent variables to the expected frequencies E[w | z, s] = f_w(z, s). Since f_w is potentially non-linear, our model benefits from more flexibility and can fit batch-specific effects locally, for each cell type or phenotypic condition. In particular, depending on how one designs the neural architecture of f_w, it is possible to correct dataset-specific effects more or less flexibly. Specifically, we treat f_w as a feed-forward neural network in which, at each layer, we concatenate the batch identifier with the hidden activations. A consequence of this design is that with more hidden layers in f_w, fewer parameters are shared across datasets and the family of distributions [p(x | z, s)]_{s ∈ {1, …, K}} has more flexibility to fit batch-specific effects (Appendix Figure 2a-c). Conversely, when the latent dimension grows, the generative network has less incentive to use the dataset covariate and might mildly duplicate this information in its latent space (Appendix Figure 2d-f). Throughout the paper, we have fixed these parameters (Materials and Methods) and show competitive performance on all our datasets.
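To make this design concrete, the batch-injection idea can be sketched as follows (a minimal NumPy sketch under our own naming, not the actual scvi-tools implementation): the one-hot batch identifier is concatenated to the input of every hidden layer of the decoder, so that deeper networks share fewer parameters across batches.

```python
import numpy as np

def relu(h):
    return np.maximum(h, 0.0)

def one_hot(s, n_batch):
    v = np.zeros(n_batch)
    v[s] = 1.0
    return v

class BatchInjectedDecoder:
    """Feed-forward decoder f_w(z, s): the one-hot batch identifier s is
    concatenated to the input of *every* hidden layer, so that adding
    hidden layers reduces the number of parameters shared across batches."""

    def __init__(self, n_latent, n_hidden, n_genes, n_batch, n_layers=2, seed=0):
        rng = np.random.default_rng(seed)
        self.n_batch = n_batch
        self.weights = []
        d_in = n_latent
        for _ in range(n_layers):
            # each hidden layer sees its input plus the batch one-hot
            self.weights.append(rng.normal(0, 0.1, (d_in + n_batch, n_hidden)))
            d_in = n_hidden
        self.w_out = rng.normal(0, 0.1, (n_hidden + n_batch, n_genes))

    def forward(self, z, s):
        h = z
        b = one_hot(s, self.n_batch)
        for w in self.weights:
            h = relu(np.concatenate([h, b]) @ w)
        logits = np.concatenate([h, b]) @ self.w_out
        e = np.exp(logits - logits.max())
        return e / e.sum()  # expected frequencies sum to one
```

Because the batch one-hot enters every layer, the same latent coordinate z can be decoded into different expected frequencies in different batches, which is exactly what allows the model to absorb batch-specific effects in the decoder rather than in z.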
Another insight comes from how the variational distribution q(z | x, s) is parametrized. Our neural networks play the role of an explicit stochastic mapping from the gene expression x_n of a single cell n to a location in a latent space z_n (a standard theme in scRNA-seq analysis). With this map, we match an empirical distribution in gene expression space (i.e., a dataset), p_data(x, s) = (1/N) Σ_n δ_{(x_n, s_n)}(x, s), with a certain distribution on the latent space, p_data(z) = (1/N) Σ_n δ_{Φ(x_n, s_n)}(z). In certain cases, even though we designed this latent space to represent biological signal, the transformed dataset might still be confounded by technical effects [Lopez et al., 2018]. This effect is likely to be severe when the generative model is not flexible enough to fit the dataset-specific effects, or suffers from model misfit (as is, to some extent, the case with the CCA of [Butler et al., 2018], which assumes the data is log-normal). In this case, the go-to method is to empirically constrain the mapping (i.e., the variational network in our case, or the latent space itself in the case of Seurat) to match the collection of distributions [p_data(z | s)]_{s ∈ {1, …, K}}. This is what is done by all the methods presented above. Therefore, our method might be improved with respect to the entropy of batch mixing by adding specific penalties to the loss function, but at the price of risking over-correction. We have preferred not to add any penalties, which suffices for most applications (in particular, all those in this manuscript).
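For reference, the entropy of batch mixing used throughout this appendix can be estimated from k-nearest-neighbor frequencies in the latent space. This is a simplified sketch (the function name and the pooling scheme are ours and may differ in detail from the evaluation code used in the paper): for randomly sampled cells, we compute the entropy of the batch-label distribution within each cell's neighborhood and average.

```python
import numpy as np

def entropy_batch_mixing(latent, batches, k=50, n_pools=50, seed=0):
    """Average entropy of the batch-label frequencies in the k-NN
    neighborhood of randomly sampled cells (higher = better mixed)."""
    rng = np.random.default_rng(seed)
    n = latent.shape[0]
    batch_ids = np.unique(batches)
    entropies = []
    for i in rng.choice(n, size=min(n_pools, n), replace=False):
        d = np.linalg.norm(latent - latent[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]  # exclude the cell itself
        freqs = np.array([(batches[neighbors] == b).mean() for b in batch_ids])
        freqs = freqs[freqs > 0]
        entropies.append(-(freqs * np.log(freqs)).sum())
    return float(np.mean(entropies))
```

With two equally sized, perfectly mixed batches the score approaches log 2 ≈ 0.69; with fully separated batches it approaches 0.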

B Training Information
In order to train scANVI properly, several options are possible for training the parameters (θ from the generative model, φ from the variational distribution except for the labels' posterior, and φ_C for the labels' posterior exclusively). In all cases, the parameters θ and φ should minimize the evidence lower bound, decomposed over the labeled samples L and the unlabeled samples U. Furthermore, [Kingma et al., 2014] suggest jointly optimizing a modified objective that penalizes the ELBO with a classification loss C on the labeled data, so that the parameters φ_C also benefit from the labeled data. They introduce the modified objective function J_α = J + α·C, where α is a parameter set by cross-validation. This modified lower bound can be interpreted as placing a Dirac prior on c [Kingma et al., 2014] or as a correction term for noisy labels [Langevin et al., 2018]. In their procedure, J_α is optimized with respect to all parameters [θ, φ, φ_C] for a fixed number of epochs. This joint training procedure is appealing, as it shapes the latent space directly through the modification of the encoder's weights. However, we found this approach to have two limitations. First, joint training breaks down the mixing in the latent space in the case of transferring labels from one dataset to another; we attribute this to the loss C having contributions from a single dataset only. Second, we did not find it convenient to choose the optimal value of the parameter α, and concluded it may not be desirable for a practitioner either. In scANVI, we use an alternating training procedure that removes the need for α and has better performance in the setting of transferring annotations. We train the classifier aggressively and separately, updating the parameters φ_C for c ≥ 1 epochs for every single epoch of updating [θ, φ]. The total number of epochs is fixed and chosen high enough to guarantee convergence. In the case of a single dataset (resp. transfer of labels), we use c = 1 (resp. c = 100) epochs of classifier training between each variational update. This helps the classifier correctly identify cell types at the end of each epoch of updating [θ, φ]. This procedure is clearly advantageous, because it then indirectly improves the quality of the latent space through the subsequent steps of optimization.
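The alternating schedule above can be summarized in code (a sketch; `variational_step` and `classifier_step` are hypothetical placeholders standing in for one epoch of gradient updates on [θ, φ] and on φ_C, respectively):

```python
def train_scanvi(n_epochs, c, variational_step, classifier_step):
    """Alternating schedule: for every epoch of variational updates on
    [theta, phi], update the classifier parameters phi_C for c epochs.
    Returns the sequence of updates performed, for inspection."""
    schedule = []
    for _ in range(n_epochs):
        variational_step()            # one epoch updating theta, phi
        schedule.append("vae")
        for _ in range(c):            # c epochs updating phi_C only
            classifier_step()
            schedule.append("clf")
    return schedule
```

Following the text, one would use c = 1 when annotating a single dataset and a much larger value such as c = 100 when transferring labels across datasets.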

C Alternative model choices

C.1 Zero-inflation in the context of harmonization
In this manuscript, we mainly rely on a zero-inflated negative binomial (ZINB) distribution for the counts, which is a widely used model for single-cell transcriptomics data. Recent research, however, suggests that for some technologies, such as droplet-based UMI single-cell protocols, the abundance of zeros may be explained by limited sensitivity and subsampling effects alone [Svensson, 2019]. On the other hand, zero inflation might be a realistic addition to count distributions for describing data from full-length, plate-based technologies [Vieth et al., 2017]. Indeed, it has been hypothesized that PCR duplication or uneven fragment sampling may be responsible for zero inflation in plate-based technologies [Svensson, 2019]. Still, no definitive conclusion can be drawn about which distribution better fits UMI technologies or full-length methods. Although full-length methods seem more affected by technical noise, they also provide extra information at the transcript resolution that is lost in droplet-based UMI methods. Consequently, many consortia have chosen to obtain data with both full-length and UMI methods [Schaum et al., 2018, Ecker et al., 2017]. A method that can flexibly use different distributions and appropriately model these two data types is therefore extremely valuable for analyzing collections of single-cell transcriptomics data.
scVI and scANVI can both flexibly use either ZINB or NB likelihoods, and we show in Appendix Figure 20 that the benchmarking results are similar with both choices on all four datasets used in our benchmarking procedure. However, we observed that for the MarrowTM dataset, the ZINB version of scVI does perform better in terms of harmonization. Since in this case we are merging a 10x dataset with a Smart-Seq2 dataset, we compared the average zero-inflation parameter for each gene in each dataset and found that the differences between Smart-Seq2 and 10x are significantly skewed to the right (more zero inflation in Smart-Seq2; p = 6.3e-31, D'Agostino-Pearson K² test). Finally, we also performed differential expression with the negative binomial versions of scVI and scANVI and show that the results are similar (Appendix Figure 21). These results show that both the NB and ZINB models can be used in scVI and scANVI, and that in most instances downstream analyses might not be affected by this modeling choice. Interestingly, our model successfully learns the difference in the degree of zero inflation between datasets while merging them, and exploits this information when necessary.
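The difference between the two likelihoods is easy to illustrate by simulation (a sketch with arbitrary parameter values; the NB is sampled via its Gamma-Poisson representation, with `theta` denoting the inverse dispersion): zero inflation adds zeros on top of those already produced by NB sampling.

```python
import numpy as np

def sample_nb(rng, mean, theta, size):
    """Negative binomial via its Gamma-Poisson representation
    (mean `mean`, inverse dispersion `theta`)."""
    return rng.poisson(rng.gamma(shape=theta, scale=mean / theta, size=size))

def sample_zinb(rng, mean, theta, pi, size):
    """ZINB: NB counts set to zero independently with probability `pi`."""
    counts = sample_nb(rng, mean, theta, size)
    counts[rng.random(size) < pi] = 0
    return counts

rng = np.random.default_rng(0)
nb = sample_nb(rng, mean=2.0, theta=0.5, size=100_000)
zinb = sample_zinb(rng, mean=2.0, theta=0.5, pi=0.3, size=100_000)
# the ZINB zero fraction is pi + (1 - pi) * P_NB(0), i.e. strictly larger
```

A gene-level zero-inflation parameter of this kind is what the comparison between Smart-Seq2 and 10x above is probing: if `pi` differs systematically between the two protocols, an NB-only model has to distort its mean or dispersion to compensate.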

C.2 Library-size prior for scVI and scANVI
Besides zero inflation, another major difference between sequencing technologies is the sequencing depth. scVI and scANVI both make use of technical scaling factors with a batch-specific prior, and are therefore well suited to this setting. To evaluate the effectiveness of both the model and the prior choice, we compute the negative log-likelihood of the scVI model using ZINB with a batch-specific library-size prior, ZINB with a shared library-size prior, NB with a batch-specific library-size prior, and NB with a shared library-size prior. All four models are fit on two pairs of datasets: DentateGyrus (Fluidigm C1 and 10x) and MarrowTM (10x and Smart-Seq2). Since only the second pair compares a UMI and a full-length dataset, we expect the differences between the four model choices to be larger in the MarrowTM comparison, and ZINB with a batch-specific library-size prior to have the highest likelihood (reported in the table below).
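A batch-specific prior of this kind can be parametrized from the per-batch empirical moments of the log library size (a sketch under our own naming; the exact parametrization used by the implementation may differ):

```python
import numpy as np

def library_log_prior(counts, batch_ids):
    """Empirical per-batch mean and variance of the log library size,
    usable as batch-specific prior parameters for the scaling factor l_n.

    counts: (n_cells, n_genes) raw count matrix
    batch_ids: (n_cells,) integer batch labels
    Returns {batch: (mean, variance)} of log total counts per cell."""
    log_lib = np.log(counts.sum(axis=1))
    priors = {}
    for b in np.unique(batch_ids):
        mask = batch_ids == b
        priors[b] = (log_lib[mask].mean(), log_lib[mask].var())
    return priors
```

With a shared prior, a deeply sequenced batch and a shallow one are forced under a single library-size distribution; the batch-specific version lets each technology keep its own typical depth.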

D Hierarchical classification
In this note, we explain how we extended scANVI to handle a two-level hierarchical structure for the cell type annotations. This could in principle be adapted to an arbitrary tree representation of a cell type taxonomy, but we leave this for future work. In our setting, the taxonomy needs to be hard-coded and known a priori.
We do not modify the generative model, but only the structure of the variable c_n in the variational distribution. Formally, we pose c = (y, y^g), with y ∈ {1, …, C} the cell type and y^g ∈ {1, …, C_g} its group, where C denotes the number of cell types and C_g the number of cell type groups. The parametrization of the full variational distribution q(c | z) = q(y, y^g | z) must be further specified. For this, we notice that the prior taxonomy knowledge encapsulates whether an assignment (y^g, y) = (i, j) is biologically possible (i.e., whether cell type j is a sub-population of cell type group i). We encode this biological compatibility into a parent function π : {1, …, C} → {1, …, C_g} that maps each cell type to its parent in the hierarchy; for simplicity, we write π_i = π(i). We then use two neural networks, f and f^g (with softmax non-linearities), to map the latent space to the joint approximate posterior q(y, y^g | z): f^g parametrizes the group posterior q(y^g | z), and f parametrizes the within-group posterior q(y | y^g, z), restricted (and renormalized) to cell types compatible with the group y^g. Then, we can derive the marginal probability over the finer cell type classes using the chain rule and Bayes' rule:

q(y_i | z) = q(y_i | y_{π_i}, z) q(y_{π_i} | z) = [q(y_i, y_{π_i} | z) / q(y_{π_i} | z)] q(y_{π_i} | z) = [q(y_i, y_{π_i} | z) / Σ_{j ∈ c(π_i)} q(y_j, y_{π_j} | z)] q(y_{π_i} | z),

where c(π_i) denotes the set of children of node π_i.
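One way to realize such a parametrization operationally (a NumPy sketch of our own construction, not necessarily the exact rule used in the implementation): mask the fine-type softmax by group compatibility, renormalize within each group, and multiply by the group posterior.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hierarchical_posterior(f_logits, fg_logits, parent):
    """Joint posterior q(y, y^g | z) built from two softmax heads.

    f_logits: scores for the C fine cell types
    fg_logits: scores for the C_g groups
    parent[i]: group of fine cell type i (each group has >= 1 child)
    Incompatible (i, g) pairs get zero mass; within each group the
    fine-type scores are renormalized."""
    q_group = softmax(fg_logits)               # q(y^g | z)
    q_fine = softmax(f_logits)
    joint = np.zeros((len(f_logits), len(fg_logits)))
    for g in range(len(fg_logits)):
        children = np.where(parent == g)[0]
        within = q_fine[children] / q_fine[children].sum()
        joint[children, g] = q_group[g] * within   # q(y, y^g | z)
    return joint
```

By construction, summing the joint over fine types recovers the group posterior, which is what the chain-rule derivation above relies on.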

E Evidence Lower Bound decomposition
We drop the parameters Θ (resp. Φ) of the generative model (resp. the variational distribution) for notational convenience, as well as the conditioning on the batch identifier s. We report the evidence lower bound (ELBO) for a single sample (i.e., one cell) and drop the index notation by substituting {x_n, z_n, u_n, c_n, l_n} with {x, z, u, c, l}. This is without loss of generality, since all cells are independent and identically distributed under our model. We derive the ELBO in the case where c is not observed (almost identical calculations resolve the case where c is observed). Similar derivations can be found in the variational autoencoder literature (e.g., [Kingma et al., 2014]). We assume our variational distribution factorizes as:

q(c, z, u, l | x) = q(z | x) q(c | z) q(u | z, c) q(l | x). (10)

In this case, we may simply apply Jensen's inequality, weighted by the variational distribution q(z, u, c, l | x):

log p(x) ≥ E_{q(z,u,c,l | x)} [log p(x, z, u, c, l) / q(z, u, c, l | x)]
= E_{q(z,u,c,l | x)} [log (p(x | z, l) p(z | u, c) p(c) p(u) p(l)) / q(z, u, c, l | x)]
= E_{q(z,u,c,l | x)} [log p(x | z, l)] (i)
+ E_{q(z,u,c,l | x)} [log p(z | u, c) / q(z | x)] (ii)
+ E_{q(z,u,c,l | x)} [log p(c) / q(c | z)] (iii)
+ E_{q(z,u,c,l | x)} [log p(u) / q(u | z, c)] (iv)
+ E_{q(z,u,c,l | x)} [log p(l) / q(l | x)] (v) (11)

Then we simplify each individual term of the ELBO in (11) by recognizing KL divergence terms. In particular, we use the subscript notation KL_G(·||·) and KL_M(·||·) to denote Gaussian and multinomial KL divergences. For the reconstruction term,

E_{q(z,u,c,l | x)} [log p(x | z, l)] = E_{q(z | x) q(l | x)} [log p(x | z, l)]. (i)

Appendix Figure 16: Other methods of classifying T cell subsets of the PBMC-Pure dataset. Coordinates for the scatter plots are derived from a UMAP embedding based on the latent space of scANVI. (a) Ground truth labels from the purified PBMC populations. (b) k-nearest neighbors classification labels applied to the scVI latent space from the seed set of cells. (c) k-nearest neighbors classification labels applied to the Seurat Alignment latent space. (d) k-means clustering based labels applied to the scVI latent space. (e) DBSCAN clustering based labels applied to the scVI latent space; DBSCAN returns only one cluster but leaves some cells unclassified. (f) PhenoGraph clusters on the scVI latent space.

Appendix Figure 18: Presentation of the simulated dataset used for differential expression benchmarking. (a) The tree used to sample the cells in SymSim. We sample cells from the five leaf nodes, representing five different cell types derived from the same root node. (b) UMAP of the scVI latent space colored by cell type and batch identifier. (c) UMAP of the scVI latent space without batch correction, showing that the data is indeed subject to batch effects. (d) Entropy of batch mixing for all the algorithms. (e) Weighted accuracy using a k-nearest neighbors classifier on the latent space. (f) Per-cell-type accuracy for the label transfer.

Appendix Figure 19: The effect of the choice of the number of classes on the scANVI model likelihood (a), classification accuracy (b), and entropy of batch mixing (c). We trained scANVI using PBMC8K as the labelled dataset and varied the number of classes in scANVI from 6 (the true number of labelled cell types) to 14. The thicker line shows the mean of 9 replicates, while the colored shading shows the 95% confidence interval. We used a subsampled PBMC8K-CITE dataset, where NK cells are removed from the PBMC8K dataset and B cells are removed from the PBMC-CITE dataset. As expected, the two unique populations have low mixing in (c), while the other cell types have high mixing. Although there are no labelled B cells, scANVI does not cluster the B cells from the PBMC8K dataset with other cell types in PBMC-CITE. The three metrics we use to evaluate scANVI performance are minimally affected by the increase in the number of classes.

Appendix Figure 20: Performance of scVI and scANVI with a negative binomial (NB) distribution. (a) UMAP plot of the MarrowTM pair using an NB distribution for scVI. (b) Harmonization statistics and differences between regular scVI and the NB version (scVI-NB). Dotted lines represent results using scVI-NB.

Appendix Figure 21: Differential expression with a negative binomial version of scVI. We report all metrics on all pairs of cell types using the simulated dataset previously analyzed in Appendix Figure 18a.

Table: Assessing the model's fit with different configurations on two pairs of datasets.
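Returning to the ELBO derivation of this section: the Jensen bound can be checked numerically on a toy latent-variable model where the marginal likelihood is tractable (a self-contained sketch unrelated to scVI's actual likelihood: p(z) = N(0, 1), p(x | z) = N(z, 1), so p(x) = N(0, 2), and the exact posterior is N(x/2, 1/2), at which the bound is tight).

```python
import numpy as np

def log_normal(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def elbo(x, mu, sig, n_samples=100_000, seed=0):
    """Monte Carlo ELBO for the toy model p(z)=N(0,1), p(x|z)=N(z,1),
    with variational distribution q(z|x) = N(mu, sig^2)."""
    rng = np.random.default_rng(seed)
    z = mu + sig * rng.normal(size=n_samples)
    log_joint = log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
    log_q = log_normal(z, mu, sig ** 2)
    return (log_joint - log_q).mean()

x = 1.3
log_px = log_normal(x, 0.0, 2.0)                 # exact marginal likelihood
bound_bad = elbo(x, mu=-1.0, sig=0.3)            # a poor variational choice
bound_opt = elbo(x, mu=x / 2, sig=np.sqrt(0.5))  # the exact posterior
```

The gap between `log_px` and the bound is exactly the KL divergence from q to the true posterior, which is the quantity the KL terms in the decomposition above make explicit.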