scmFormer Integrates Large‐Scale Single‐Cell Proteomics and Transcriptomics Data by Multi‐Task Transformer

Abstract Transformer‐based models have revolutionized single‐cell RNA‐seq (scRNA‐seq) data analysis. However, their applicability is challenged by the complexity and scale of single‐cell multi‐omics data. Here, a novel single‐cell multi‐modal/multi‐task transformer (scmFormer) is proposed to fill the gap in integrating single‐cell proteomics with other omics data. Systematic benchmarking demonstrates that scmFormer excels at integrating large‐scale single‐cell multimodal data and heterogeneous multi‐batch paired multi‐omics data, while preserving both the information shared across batches and distinct biological information. scmFormer achieves a 54.5% higher average F1 score than the second‐best method in transferring cell‐type labels from single‐cell transcriptomics to proteomics data. On COVID‐19 datasets, scmFormer successfully integrates over 1.48 million cells on a personal computer. Moreover, scmFormer outperforms existing methods at generating the unmeasured modality and is well suited to spatial multi‐omics data. Thus, scmFormer is a powerful and comprehensive tool for analyzing single‐cell multi‐omics data.


Supplementary Figures
Figure S1: Performance of methods on integrating scRNA-seq and protein data.

Table S3. Hyper-parameters of scmFormer.
As noted, scmFormer has a total of nine parameters, which fall into three categories: data preprocessing parameters, model parameters, and training parameters.
Here is a breakdown of each along with the settings applied across various datasets:

Data Preprocessing Parameters:
o The number of top genes: 2000 for all datasets (default 2000).
o Length of sub-vectors, number of principal components: These vary with the specific integration task and, when integrating single-cell proteomic and transcriptomic data, with the number of proteins in the scProtein data. For integrating single-cell epigenomic and transcriptomic data, the length of sub-vectors is 128 and the number of heads is 8.
o The number of principal components for all data generation tasks is determined by the smaller number of features in the two modalities being integrated.

Model Parameters:
o λ: This parameter balances the two loss functions. It is set to 50 when two modalities are involved and 100 for three modalities (default 50).
o Number of heads: Adjusted based on the number of proteins or as specified in the preprocessing parameters.
o Drop rate: 0 for all integration tasks (default 0, 0.05, 0.1).

Training Parameters:
o Epochs: Set to 50 for most tasks, except for the inner-dataset generation task for 'Generating protein data from gene expression data', where it is set to 10 (default 50).
o Learning rate: 0.0001 for all integration tasks (default 0.0001 or 0.001).
o Batch size: 32 for most tasks, but reduced to 4 when integrating spatial scATAC and scRNA data (default 32).

A detailed table of parameter settings for different datasets is provided below. The table serves as a reference for researchers aiming to replicate our findings or extend the scmFormer methodology to new datasets. We believe that these details will contribute to the transparency and reproducibility of our work.
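The nine settings listed above can be collected into a single configuration object. The following is only a minimal sketch: the class name `ScmFormerConfig`, its field names, and the helper `n_components_for_generation` are hypothetical and are not taken from the scmFormer code base. The sub-vector length and number of heads are task-dependent, so the values shown for them are the ones quoted for epigenomic and transcriptomic integration, not universal defaults.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScmFormerConfig:
    """Hypothetical container for the nine hyperparameters described in the text."""
    # Data preprocessing parameters
    n_top_genes: int = 2000                 # number of top genes (default 2000)
    sub_vector_length: int = 128            # task-dependent; 128 quoted for epigenomic + transcriptomic data
    n_principal_components: Optional[int] = None  # task-dependent; left unset here
    # Model parameters
    lam: float = 50.0                       # balances the two loss functions (default 50)
    n_heads: int = 8                        # task-dependent; 8 quoted for epigenomic + transcriptomic data
    drop_rate: float = 0.0                  # 0 for all integration tasks
    # Training parameters
    epochs: int = 50
    learning_rate: float = 1e-4
    batch_size: int = 32


def n_components_for_generation(n_features_a: int, n_features_b: int) -> int:
    """Rule stated for data generation tasks: use the smaller feature count."""
    return min(n_features_a, n_features_b)


# Overrides mentioned in the text (object names are illustrative only).
three_modalities = ScmFormerConfig(lam=100.0)          # lambda = 100 for three modalities
spatial_atac_rna = ScmFormerConfig(batch_size=4)       # batch size 4 for spatial scATAC + scRNA
protein_generation = ScmFormerConfig(
    epochs=10,                                          # inner-dataset protein generation task
    n_principal_components=n_components_for_generation(2000, 224),  # feature counts are made up
)
```

Keeping the defaults in one place makes the per-dataset deviations explicit, which is the purpose the parameter table is meant to serve.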

Seurat
In this study, we employed Seurat v4. Its authors introduced a computational method called "weighted-nearest neighbor" (WNN) analysis to address the challenge of integrating multiple data types for defining cellular identity. This unsupervised framework learns the relative utility of each data type in each cell and allows for an integrative analysis of multimodal data. Here, we used Seurat to integrate the scATAC-seq, scRNA-seq, and protein datasets.

Harmony
Harmony is an algorithm commonly utilized for the integration and

online iNMF
Online iNMF (online integrative non-negative matrix factorization) is an algorithm for analyzing large-scale single-cell transcriptomic datasets. Its primary purpose is to identify biologically relevant patterns and sources of variation within the data by decomposing the gene expression matrix into non-negative factors. The algorithm operates incrementally, allowing efficient analysis of datasets that are too large to be processed as a whole. It processes the data in batches or subsets, updating the factorization iteratively to capture the underlying structure and variability of the dataset. This incremental approach enables the algorithm to handle datasets with millions of cells and thousands of genes, making it suitable for extensive single-cell transcriptomic data. Here, we used online iNMF to integrate the scATAC-seq, scRNA-seq, and protein datasets.
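The batch-wise updating described above can be illustrated with a deliberately simplified sketch. It is plain online NMF on a single dataset using multiplicative updates and accumulated statistics; it is not the online iNMF algorithm itself, which additionally models shared and dataset-specific factors and uses a different update scheme. The function name `online_nmf` and all parameter names are hypothetical.

```python
import numpy as np


def online_nmf(batches, n_genes, k=5, n_h_iters=50, eps=1e-10, seed=0):
    """Simplified online NMF: X_batch (genes x cells) ~ W @ H_batch, with W, H >= 0.

    Each mini-batch is seen once; W is refined from running statistics, so the
    full matrix never has to be held in memory.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((n_genes, k)) + eps      # gene loadings, shared across batches
    A = np.zeros((k, k))                    # running sum of H @ H.T
    B = np.zeros((n_genes, k))              # running sum of X @ H.T
    for X in batches:
        # Solve for this batch's cell factors with W held fixed.
        H = rng.random((k, X.shape[1])) + eps
        WtW, WtX = W.T @ W, W.T @ X
        for _ in range(n_h_iters):
            H *= WtX / (WtW @ H + eps)
        # Fold the batch into the running statistics, then refine W.
        A += H @ H.T
        B += X @ H.T
        W *= B / (W @ A + eps)
    return W


# Synthetic non-negative data, processed in four mini-batches of 50 cells.
rng = np.random.default_rng(1)
X_demo = rng.random((30, 5)) @ rng.random((5, 200))
demo_batches = [X_demo[:, i:i + 50] for i in range(0, 200, 50)]
W_demo = online_nmf(demo_batches, n_genes=30, k=5)
```

Because only the small matrices `A` and `B` persist between batches, memory use depends on the number of genes and factors rather than on the total number of cells.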

BindSC
BindSC is a computational method developed for the integration of single-cell multiomics profiles generated by different single-cell technologies from the same biological sample. The algorithm is based on a novel mathematical solution called bi-order canonical correlation analysis (bi-CCA), which extends the commonly used CCA approach to iteratively align both the rows and the columns of the data matrices.
Unlike existing integration methods that rely on shared features, BindSC utilizes full feature information to achieve precise alignment of cell subtypes and enables the discovery of novel gene-protein associations. Here, we used BindSC to integrate the scATAC-seq, scRNA-seq, and protein datasets.
(a) Performance of methods in terms of FOSCTTM. (b) Running time of methods. (c-d) Performance of scmFormer on integrating unpaired scRNA-seq and protein data in terms of accuracy and macro F1.
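The two metrics named in these panel descriptions, FOSCTTM (fraction of samples closer than the true match) and macro F1, can be computed as in the sketch below. This is an illustrative implementation under stated assumptions, not the evaluation code used in the study: it assumes paired cells share a row index, uses Euclidean distance, normalizes by n - 1 (some implementations divide by n), needs at least two cells, and averages F1 over the classes present in the true labels. The function names are hypothetical.

```python
import numpy as np


def foscttm(emb_a, emb_b):
    """Mean fraction of cells closer than the true match (lower is better).

    emb_a[i] and emb_b[i] are assumed to embed the same cell in two modalities.
    """
    # Squared Euclidean distance between every cell in A and every cell in B.
    d = ((emb_a[:, None, :] - emb_b[None, :, :]) ** 2).sum(axis=-1)
    true_d = np.diag(d)                     # distance to the true match
    n = d.shape[0]
    frac_a = (d < true_d[:, None]).sum(axis=1) / (n - 1)   # query from modality A
    frac_b = (d < true_d[None, :]).sum(axis=0) / (n - 1)   # query from modality B
    return float((frac_a.mean() + frac_b.mean()) / 2)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        scores.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(scores))


rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 4))
perfect = foscttm(emb, emb)                                # identical embeddings
noisy = foscttm(emb, emb + rng.normal(size=emb.shape))     # perturbed second modality
f1_demo = macro_f1(["B", "B", "T", "T"], ["B", "T", "T", "T"])
```

Macro averaging gives rare cell types the same weight as abundant ones, which is why it is often reported alongside accuracy for label transfer.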

Figure S2: Performance of methods on integrating the NeurIPS_CITE_90k dataset. (a) Confusion matrix heatmaps for cross-validation results of Harmony and scVI on

Figure S3: Performance of methods on integrating the Seurat_CITE_160k dataset. (a) UMAP visualizations of the cell embeddings of different modalities in the Seurat_CITE_160k dataset aligned with different integration methods. (b) tSNE visualizations of the cell embeddings of integrated modalities in the Seurat_CITE_160k dataset.

Figure S4: Performance of methods on integrating the Seurat_CITE_50k dataset. (a) UMAP visualizations of the cell embeddings of different modalities in the Seurat_CITE_50k dataset aligned with different integration methods. (b) tSNE visualizations of the cell embeddings of integrated modalities in the Seurat_CITE_50k dataset.

Figure S5: Performance of methods on integrating the Mimitou_CITE_5k dataset. (a) UMAP visualizations of the cell embeddings of different modalities in the Mimitou_CITE_5k dataset aligned with different integration methods. (b) tSNE visualizations of the cell embeddings of integrated modalities in the Mimitou_CITE_5k dataset.

Figure S6: Performance of methods on integrating the Mimitou_ASAP_5k dataset. (a) UMAP visualizations of the cell embeddings of different modalities in the Mimitou_ASAP_5k dataset aligned with different integration methods. (b) tSNE visualizations of the cell embeddings of integrated modalities in the Mimitou_ASAP_5k dataset.

Figure S7: Performance of methods on integrating the Peterson_REAP_7k dataset. (a) UMAP visualizations of the cell embeddings of different modalities in the Peterson_REAP_7k dataset aligned with different integration methods. (b) tSNE visualizations of the cell embeddings of integrated modalities in the Peterson_REAP_7k dataset.

Figure S8: Performance of methods on integrating the Mimitou_CITE_5k and Peterson_REAP_7k datasets. UMAP visualizations of the cell embeddings for all cells of Mimitou_CITE_5k (first column) and Peterson_REAP_7k (second column), colored by cell types. UMAP visualizations of the integrated cell embeddings for all cells of Mimitou_CITE_5k and Peterson_REAP_7k, colored by omics layers (third column) and cell types (fourth column).

Figure S9: Performance of methods on integrating the Mimitou_ASAP_5k and Mimitou_CITE_5k datasets. UMAP visualizations of the cell embeddings for all cells of Mimitou_ASAP_5k (first column) and Mimitou_CITE_5k (second column), colored by cell types. UMAP visualizations of the integrated cell embeddings for all cells of Mimitou_ASAP_5k and Mimitou_CITE_5k, colored by omics layers (third column) and cell types (fourth column).

Figure S12: Performance of scmFormer on generating protein data from gene expression data.

Figure S13: Performance of scmFormer on generating spatial protein data from spatial gene expression data.

Figure S14. Robustness of scmFormer. We conducted robustness testing on three hyperparameters (HVGs, length of sub-vector, number of heads) across six datasets. The parameter λ is utilized to balance the trade-off between two different loss functions. However, in the task of generating

Figure S15. Robustness of scmFormer. We conducted robustness testing on four hyperparameters (drop rate, epochs, learning rate, batch size) across six datasets.
component is based on the number of proteins measured in scProtein-seq data.
harmonization of single-cell RNA sequencing (scRNA-seq) datasets obtained from multiple experimental conditions or batches. The underlying principle of Harmony involves modeling the sources of technical variability and subsequently adjusting the data to account for these variations. It employs a linear algebra framework to identify and remove the batch-specific effects, revealing the shared biological signal across different datasets. By effectively reducing the confounding effects of technical variation, Harmony enhances the biological coherence and comparability of integrated scRNA-seq datasets. Here, we used Harmony to integrate the scATAC-seq, scRNA-seq, and protein datasets.

LIGER
LIGER (linked inference of genomic experimental relationships) is an algorithm used for integrating multiple experimental conditions or batches of single-cell multi-omics data. Its main purpose is to fuse information from different data modalities, such as scRNA-seq and scATAC-seq, to achieve cell type consistency and enable comparative analysis across multi-omics datasets. The implementation of LIGER is based on a joint low-rank model that embeds cell features from different data modalities into a shared low-dimensional space. By minimizing the distances between different data modalities in the embedding space, LIGER accomplishes the alignment and integration of multi-omics data. Here, we used LIGER to integrate the scATAC-seq and scRNA-seq datasets.

scJoint
scJoint is a transfer learning method designed to integrate large-scale and heterogeneous collections of scRNA-seq and scATAC-seq data in single-cell multiomics analysis. It utilizes a semisupervised framework and neural network-based techniques to train on labeled and unlabeled data simultaneously, enabling label transfer and joint visualization. The algorithm consists of three main steps: joint dimension reduction and modality alignment, label transfer via k-nearest neighbors, and improved mixing between modalities using metric learning. Here, we used scJoint to integrate the scATAC-seq and scRNA-seq datasets.

Pamona
Pamona is an algorithm designed for the integration of heterogeneous single-cell multi-omics sequencing data. It addresses the challenge of aligning and representing shared and dataset-specific cellular structures across different modalities. The algorithm formulates this task as a partial manifold alignment problem and utilizes a partial Gromov-Wasserstein optimal transport framework to solve it. Pamona identifies shared and dataset-specific cells based on probabilistic couplings and aligns the cellular modalities in a common low-dimensional space while preserving both shared and dataset-specific structures. It can incorporate prior information such as cell type annotations or cell-cell correspondence to improve alignment quality. Here, we used Pamona to integrate the scATAC-seq, scRNA-seq, and protein datasets.

UnionCom
UnionCom is an algorithm developed for the unsupervised topological alignment of single-cell multi-omics data. It addresses the challenge of integrating datasets consisting of unpaired cells measured with distinct unmatched features across modalities. The algorithm works by first embedding the intrinsic low-dimensional structure of each single-cell dataset into a distance matrix of cells within the same dataset. Then, it aligns the cells across datasets by matching the distance matrices through a matrix optimization method. Finally, UnionCom
projects the distinct unmatched features into a common embedding space for feature comparability of the aligned cells. The key advantages of UnionCom are its unsupervised and data-driven nature, its ability to handle non-linear intrinsic structures, and its capacity to accommodate samples with dataset-specific cell types. It does not require correspondence information among cells or features, making it suitable for integrating single-cell multi-omics datasets. Here, we used UnionCom to integrate the scATAC-seq, scRNA-seq, and protein datasets.

scVI
scVI (single-cell variational inference) is a scalable framework for the probabilistic representation and analysis of gene expression in single cells. It addresses the challenges of technical noise and bias in single-cell transcriptome measurements, providing a ready-to-use solution for downstream analyses. The main purpose of scVI is to model and account for the uncertainty in gene expression data at the single-cell level. It utilizes stochastic optimization and deep neural networks to aggregate information across similar cells and genes, while considering batch effects and limited sensitivity. By approximating the underlying distributions of observed expression values, scVI enables various analysis tasks such as batch correction, visualization, clustering, and differential expression. The algorithm models the observed gene expression of each cell as a sample drawn from a zero-inflated negative binomial (ZINB) distribution. It incorporates additional random variables to capture nuisance variation and biological differences between cells. A neural network is employed to map the latent variables to the parameters of the ZINB distribution, enabling efficient analysis and imputation of missing values. Here, we used scVI to integrate the scATAC-seq, scRNA-seq, and protein datasets.

TotalVI
TotalVI is a framework designed for the joint analysis of paired RNA and protein measurements in single cells using the cellular indexing of
transcriptomes and epitopes by sequencing (CITE-seq) technique. Its main purpose is to integrate these paired views into a unified representation of cell states, overcoming the technical challenges associated with each measurement. The totalVI algorithm employs total variational inference, a probabilistic latent variable model, to capture the uncertainty in observed RNA and protein counts. It represents the data as a composite of biological and technical factors, accounting for sources of variation such as protein background and batch effects. By optimizing the parameters of its components using the variational autoencoder (VAE) framework, totalVI achieves efficient and scalable analysis of CITE-seq data.

BABEL
BABEL is a deep learning algorithm that addresses the challenge of simultaneously profiling multiple modalities within a single cell. Its main purpose is to translate between chromatin, RNA, and protein profiles of single cells, allowing the computational synthesis of matched multiomic measurements when only one modality is experimentally available. The algorithm consists of four modular neural networks: two encoders and two decoders. The encoders project either RNA or ATAC (chromatin accessibility) profiles into a shared latent representation, while the decoders infer their corresponding profiles from this latent representation. This shared latent space serves as an abstract, integrated representation of cellular state, capturing major cellular variations. BABEL is trained using a loss function that requires both encoders to be interoperable with either decoder, enabling translation across modalities. It leverages paired data to learn a unified latent space without explicit alignment methods. The model predicts gene expression from chromatin accessibility and vice versa, using negative binomial and binary cross entropy loss functions, respectively.

sciPENN
sciPENN is a versatile deep learning algorithm designed to
address the challenges of integrating and analyzing CITE-seq and scRNA-seq data in single-cell multi-omics studies. Its main purpose is to support data integration, protein expression prediction, protein expression imputation, uncertainty quantification, and cell type label transfer. Its implementation involves a network structure comprising various layers and blocks, along with a censored loss function scheme.

scMM
scMM (single-cell Mixture-of-Experts Multiomics) is a deep generative model designed for integrated analysis of single-cell multiomics data. It addresses the challenge of analyzing complex and high-dimensional multimodal single-cell data by inferring interpretable joint representations and enabling crossmodal generation of single-cell data. The algorithmic implementation of scMM involves a mixture-of-experts framework, consisting of four neural networks with an encoder-decoder pair for each modality. The encoders are used to infer the variational posterior, from which latent variables are sampled. The decoders calculate the parameters of probability distributions (such as negative binomial or zero-inflated negative binomial) to model the characteristics of each modality's data. scMM is