Integration of single‐cell proteomic datasets through distinctive proteins in cell clusters

The use of mass spectrometry and antibody‐based sequencing technologies at the single‐cell level has led to an increase in single‐cell proteomic datasets. Integrating these datasets is crucial to eliminate the batch effect that often arises due to their limited sequencing molecules. Although methods for horizontally integrating high‐dimensional single‐cell transcriptomic datasets can also be applied to single‐cell proteomic datasets, a specialized approach explicitly tailored for low‐dimensional proteomic datasets may enhance the integration process. Here, we introduce SCPRO‐HI, an algorithm for the horizontal integration of antibody‐based single‐cell proteomic datasets. It utilizes a hierarchical cell anchoring technique to match cells based on the similarity of distinctive proteins for constituting cell clusters. A novel variational auto‐encoder model is employed for correcting batch effects on the protein abundances, eliminating the need for mapping them into a new domain. Moreover, we propose a technique for extending the algorithm to high‐dimensional datasets. The performance of the SCPRO‐HI algorithm is evaluated using simulated and real‐world single‐cell proteomic datasets. The findings demonstrate our algorithm outperforms state‐of‐the‐art methods, achieving a 75% higher silhouette score while preserving HVPs 13% better. Furthermore, the algorithm shows competitive performance in transcriptomic datasets, suggesting potential for integrating high‐dimensional mass‐spectrometry‐based proteomic datasets.

directly use multiple single-cell datasets obtained from different individuals or tissues due to biological and technical noise referred to as batch effect [1].
The batch effect typically occurs in a stochastic manner depending on the experimental environment used during cell line generation or the researcher conducting the experiment [2].Additionally, the use of different technologies to perform measurements can also lead to batch effects [3].Biological noise, on the other hand, arises from the comparison of data obtained from multiple tissues or individuals [4].
Furthermore, sequencing of omic data from the entire cell or just the cell nucleus also results in biological noise.The amount of biological noise increases depending on the differences between tissues, sequencing of the cell nucleus or whole cell, and data obtained from different individuals [5].The batch effect must be removed while preserving the biological meaning of datasets such as preserving highly variable proteins in order to increase the accuracy of downstream analysis.
Integrating the same type of omic datasets from different experiments by removing the batch effect is known as horizontal integration [1].In recent years, the integration of single-cell omic datasets has become a popular research area in bioinformatics [6].Integrated single-cell datasets are used to solve a variety of problems such as cell type classification [7], identification of unique cell groups [8], uncovering cell trajectories [9], detection of cellular aging [10], and identification of spatial biological pathways within tissues [11].
The increasing popularity of single-cell transcriptomics data has led to the development of various methods for integrating these datasets, with many of them using an anchor-based approach [12].These methods often involve identifying similar cells and using them as a reference for integration.The most popular anchor-based method, Mutual Nearest Neighbors (MNN) [13], is widely used for the detection of similar cells by state-of-the-art methods [14,15].The MNN algorithm identifies similar cells by finding pairs of cells that are each other's closest neighbors.However, these methods may fail when the datasets are huge and contain a few mutual nearest cell pairs [5].
One concern with the conventional MNN approach arises from the mismatching that can occur when assessing the similarity of individual cells rather than the cell groups to which they belong.Consequently, approaches such as scMerge [16] and SMNN [14] first cluster similar cells and then apply the MNN process to these clusters individually.
Alternatively, methods like CLAIRE [17] generate pseudo matchings using the limited MNN pairs and subsequently refine the identified MNN pairs to remove any false matches.
There are graph-based alternative methods to MNN that create similarity graphs based on the distances between omic vectors of cells.This similarity graph is used for identifying cell clusters through unsupervised clustering [18].The cells in the same cluster are determined as similar cells and used as a reference while removing the batch effect.
In situations where there are limited similar cells in datasets by only their measurements, graph-based methods show higher performance than anchor-based methods.On the other hand, performance analysis of integrated datasets has shown that anchor-based methods are more

Significance Statement
Integrating single-cell proteomic datasets is crucial for obtaining comprehensive and more reliable measurements.This integration facilitates more precise execution of various downstream analysis tasks such as aging, trajectory inference, or cell-type identification.The proposed method enhances the integration quality of antibody-based singlecell proteomic datasets by using a hierarchical cell matching algorithm and a novel variational auto-encoder model for batch effect removal.In contrast to focusing solely on intracluster features such as highly variable proteins, the cell matching algorithm incorporates the importance of proteins that significantly contribute to distinguishing cell clusters.
The proposed variational auto-encoder model effectively addresses batch effects by adjusting measurements concurrently based on positive and negative cell anchors, similar to the contrastive learning approach.Correcting batch effects on protein abundances instead of embedding them into a lower-dimensional space is particularly advantageous for downstream tasks that require biological analysis at the individual protein level.successful in preserving biological meaning compared to graph-based methods [5].
Deep learning-based approaches are becoming increasingly popular for horizontal integration [19][20][21][22].The motivation for these methods is based on their ability to perform the integration process in an unsupervised manner and the success of deep learning models in preserving the biological meaning of the obtained data [5].However, the integrated data is often transformed into a new domain, making it difficult to use the resulting data directly as omic data in downstream analysis.This can hinder the use of advanced analysis techniques that require all the measurements of the biological signals contained in the integrated data.Nevertheless, algorithms based on deep learning are widely preferred when the relative measurements of the signals in the integrated dataset are sufficient for downstream analysis.
The lack of methods for integrating single-cell proteomic datasets is largely caused by the fact that the technology for producing highthroughput single-cell proteomic data is not fully mature yet.However, with the advent of new sequencing technologies [23], the number of such datasets in public databases has been steadily growing over the past 6 months.
The methods developed for integrating transcriptomic datasets can be applied to proteomic datasets.Challenges like identifying similar cells in different datasets and avoiding the loss of biological meaning during integration due to overfitting are issues that need to be addressed for both types of omic data.Nevertheless, these methods are efficient in handling high-dimensional gene data, but their performance decreases when applied to single-cell proteomic datasets, which usually have a limited number of protein measurements.
In addition to the challenge of limited coverage, one must take into account the inherent low abundance of proteins and the sparsity in single-cell proteomic datasets.In comparison to RNA molecules, proteins are often found in lower quantities within cells, posing challenges for accurate detection and introducing higher levels of noise.Additionally, single-cell proteomic data often display sparsity, where not all proteins are measured for every cell.Hence, high-performance algorithms tailored for single-cell proteomic dataset integration are needed to address these challenges and bridge the existing gap in the field.
To the best of our knowledge, the MARIO [24] algorithm is the only method available for integrating single-cell proteomic datasets.It is specifically designed for the horizontal integration of single-cell proteomic datasets, rather than the integration of transcriptomic datasets.
The method calculates two distance matrices between cells in different datasets by using shared and all the features in both datasets.
Then, the interpolation of the distance matrices is used to establish cell anchors.The anchors are filtered if their clusters do not match in a mutual k-means clustering composition.The remaining anchors are used for integrating the datasets into a new domain using the Canonical Correlation Analysis (CCA) algorithm [7].
In this study, a novel method based on a variational auto-encoder (VAE) [25] model is presented for integrating antibody-based singlecell proteomic datasets by removing the batch effect.This methodology allows for precise cell matching, even within sparse datasets.The resulting cell groups offer an abundance profile for each protein, even when some individual cells lack abundance data for certain proteins.
Similar clusters are identified using distinctive proteins within clusters.These distinguishing proteins are more likely to exhibit higher differential expression between clusters, even in the presence of lowabundance proteins.By prioritizing these informative features, our method effectively addresses the issue of low protein abundance, allowing us to capture biologically relevant differences across datasets.
The VAE model follows a training strategy similar to the contrastive learning approach employed by CLAIRE [17], utilizing cells along with their positive and negative anchor cells.It is worth mentioning that both methods were independently developed through separate studies, albeit concurrently.However, in contrast to contrastive learning methods, our method utilizes a single variational auto-encoder in contrast to the use of two Siamese auto-encoder.This VAE model performs two simultaneous tasks: bringing positive cells closer together and differentiating them from the negative anchor.
Consequently, the model generates corrected protein measurements for each cell as output instead of embedding them into a latent space.Thus, we obtain more accurate and less sparse measurements for each protein.
The batch effect is modeled by training a VAE model for each programmatically detected cell type without using any label or prior biological knowledge, which is often not provided.As a result, this allows protein measurements of the cells to be corrected and integrated into a common space, improving the accuracy of downstream analyses with corrected measurements of proteins in cells.
The proposed algorithm is compared to state-of-the-art horizontal data integration algorithms: Harmony [8], Scanorama [26], fastMNN [13], scVI [22], and MARIO [24].The performance of these algorithms is evaluated computationally and biologically using synthetic and real-world single-cell proteomic datasets for integration.Additionally, a brain dataset consisting of four publicly available scRNA-seq mouse brain studies is utilized to assess the generalizability of the proposed model to other types of omic datasets.The experimental results demonstrate that the SCPRO-HI algorithm outperforms its competitors in integrating anti-body-based single-cell proteomic datasets while preserving biological meaning, demonstrating a significant 75% improvement in silhouette score metric and 13% better preservation of HVPs.Furthermore, the performance results on scRNA-seq and mass-spectrometry-based proteomic datasets affirm the algorithm's applicability for the horizontal integration of various high-dimensional omic datasets.

METHODS
The SCPRO-HI method comprises two main steps: cell matching (A to C) and batch effect correction (D and E), as illustrated in Figure 1.In the cell matching step, cells are clustered, and anchors are established between them.These clusters and anchors can be utilized for downstream tasks or as cell matchings in the pipeline of another horizontal integration algorithm, as well.The integration of batches is achieved then by correcting measurements without mapping them into a lowdimensional space using variational auto-encoders.The VAEs leverage the anchors generated in the cell matching step to correct the measurements by generating corresponding corrected measurements for similar cells across different batches.
The proposed anchoring algorithm clusters datasets individually and establishes anchors between similar clusters in different datasets similar to scMerge [16] or SMNN [14].However, instead of relying on intracluster features such as highly variable genes/proteins or stable expressed genes (scSEGs), we employ intercluster distinguishing features of proteins to detect similar clusters across datasets.For each cluster, we calculate the significance of each protein in differentiating the cluster from other clusters within its own dataset.It is hypothesized that if clusters in different datasets are differentiated from the rest by similar proteins, it indicates that they contain the same type of cells.
The anchors and clusters are represented as a graph in order to determine the connected components, with anchors depicted as edges and clusters as nodes.Each connected component represents a specific cell type, and positive anchors of a cell are selected from clusters within the same connected component as its cluster.We employ Haghverdi's method [13] MNN to establish intercluster individual cell anchors.
Furthermore, for each pair of anchored cells, a negative anchor is randomly selected from the cells that are excluded from the corresponding connected component.approach enables the separation of distinct cell types that may otherwise overlap within each other's k-nearest neighborhood.This two-step process contributes to more precise and reliable cell matching.
In Step D, Variational auto-encoder (VAE) models are trained to generate batch effect-corrected protein abundance vectors.These models are trained to remove batch effects using measurements from the positive and negative anchors of a cell, along with the cell's own measurement.This enables the model to generate new protein abundances that resemble both the cell and its positive anchor while avoiding similarities with the negative anchor.The VAE model for each connected component is then trained in an unsupervised manner using these cell triplets generated for the cells within that specific connected component.After the models are trained, the cells themselves are employed as positive anchors in order to generate their corrected protein abundances.
Unlike other VAE-based methods [22,27], our method cleanses batch effect on protein abundance measurements instead of mapping them into a new domain.This allows for the individual evaluation of every protein in the datasets during downstream analysis.The subsequent sections provide detailed explanations of both the cell matching and batch effect removal steps, including all the methods and techniques employed at each step.

Cell matching
In order to employ SCPRO-HI for horizontal integration, all datasets have to contain measurements for the same proteins.Therefore, all datasets are filtered using the common proteins that are present in all datasets.Then, the datasets are normalized before matching using zscore normalization (1), as the domain of the datasets may vary due to the technology used in sequencing.
In Equation ( 1), x i is the protein abundance of cell i, z i is the normalized abundance, μ and σ are the mean and standard deviation of the dataset, respectively.
The normalized measurements are embedded into a 2-dimensional space for two purposes: to smooth out batch effects and reduce computational resource requirements.We employ the UMAP algorithm [28] to generate cell embeddings, as it is a widely used manifold learning-based method in state-of-the-art algorithms [29].UMAP's optimization strategies prioritize preserving the neighborhood structure when mapping high-dimensional data to a lower-dimensional space.This approach effectively reduces noise in the measurements while maintaining the local relationships between cells.As a result, clustering cells of the same type becomes more straightforward in this 2D space.
The density-based DBSCAN algorithm [30] is used for clustering, as it is well-suited for UMAP embeddings.The DBSCAN algorithm determines clusters based on a radius and the minimum number of neighbors considered similar by the UMAP algorithm.The radius or epsilon parameter specifies how close to each other the points must be to be considered part of the same cluster.This parameter should be carefully tuned, as it has a strong effect on the clustering.Rahmah's strategy [31] is used to determine the optimal epsilon value.This strategy first calculates the distance between any pair of points and then selects the minimum of the uncommon distances as the epsilon value.
This ensures that cells with commonly the same distance between them are kept together.The minimum number of neighbors is set to 3, which is a widely used value as it provides for locally retrievable and transmitted data.
The datasets are clustered individually.Similar clusters in different datasets are anchored, as it is expected that integration will bring the same type of cell groups together.The similarity metric is based on the feature importance of proteins in cell clusters, which represents the importance level of a protein in separating its cluster from other clusters in the same dataset.
The protein importance is calculated using decision trees.For each cluster, a decision tree is trained as a binary classifier using the protein abundances.The decision trees are trained in a supervised manner using computationally determined cluster labels, where a label of 1 is assigned to cells belonging to the current cluster, while a label of 0 is assigned to the remaining cells in the dataset.The protein importance values are obtained from the trained models, with each importance representing the percentage of information retrieved by using this particular feature while training the model.Therefore, the sum of the importance values is always equal to 1 for each cluster.
The similarity between each pair of clusters in different datasets is calculated by measuring the cosine similarity between the protein importance vectors.Cosine similarity tends to increase when the pairwise similarity between corresponding important proteins of cluster pair is high.As a result, a score of 0 indicates that the paired clusters do not share any common important proteins, whereas a score of 1 suggests that their protein importance values are identical.
Cluster anchors are established between cluster pairs that have a similarity above a 60% threshold by default.This empirically determined threshold can be adjusted to balance between generalized integration and cell-type-sensitive integration.A lower similarity threshold leads to more generalized integration, modeling the batch effect in a unified manner.On the other hand, a higher threshold results in fewer anchors between clusters, reducing integration quality but increasing precision, as each unique cell cluster models its own batch effect.
The choice of threshold should depend on the specific needs of the downstream analysis.

Batch effect correction
The novel VAE model in Figure 2 Here, X ′ is the corrected protein abundance vector of a cell, which is the output of the model.X is the initial measurement of proteins in the cell.X p and X n are the abundance vectors of positive and negative anchor cells, respectively.The exp(.) function is used to avoid negative loss values resulting from the negative similarity effect of negative anchors.KL is the abbreviation for Kullback-Leibler divergence [32], which calculates the distance between the distribution of the sampling function in the latent space and the Normal distribution.Such a loss function forces the model to generate an output abundance vector that is similar to both the cell and its positive anchors.Thus, we have an adjusted domain for all the datasets in integration, which is purified from batch-specific noise while optimizing the mutual information across the datasets.The cells are given themselves as their positive anchors as input while generating the corrected abundances from the trained models.

The novel VAE model takes three protein abundance vectors and concatenates them, as shown in
The logic behind such an approach is to generate corrected measurements by using the two cells in different batches with the same batch effect.The negative anchors are also given to the model in the generation phase, as in the training process.The negative anchors for cells that were not included in the training process are randomly selected, as explained earlier.The corrected protein abundances are obtained after all the cells pass through the VAE models that were individually trained for their respective connected components.

Dealing with high-dimensional data
The SCPRO-HI algorithm is developed for the integration of singlecell proteomic datasets, which typically contain tens or hundreds of proteins.However, the proposed algorithm can be extended for the integration of higher-dimensional omic datasets, such as transcriptomic datasets that contain measurements of tens of thousands of genes.The proposed algorithm may be inoperative when the number of features exceeds a few hundred, as the learning-based components of the algorithm become computationally expensive.
To address this limitation, the algorithm focuses on utilizing n important genes during the training of the VAE models instead of using all measurements when integrating high-dimensional datasets.The important gene list for a connected component can be provided by the user or determined by employing the protein importance calculation procedure on the connected component to identify the top n most important genes that distinguish it from the rest of the dataset.By leveraging these key genes, the output of the VAE models captures the majority of information related to cell heterogeneity, making it suitable for various downstream analyses.However, it is important to note that the algorithm also allows for calculating corrected measurements for all genes.To accomplish this, the genes, excluding the important genes F I G U R E 3 Workflow for generating corrected measurements for single-cell transcriptomic datasets.Important genes within the connected component are used to train a variational auto-encoder model.The VAE model generates corrected measurements for these n important genes.The genes not in the important list are divided into sets of size n, and multilayer perceptron (MLP) regression models are trained to map the corrected measurements of important genes to corrected gene expressions for the genes within each set.Subsequently, these batch-corrected expressions are utilized as input for the regression models and generate corrected measurements for the cells in the respective connected components of models.
of each connected component, are randomly divided into m sets, each containing n genes.Multilayer perceptron (MLP) regression models are then to establish the relationship between the genes in each set and the important genes of the connected component, as depicted in Figure 3.
The regression models are trained using all the cells within their respective connected components.Each set of genes undergoes training with a regression model, where the input consists of the expressions of important genes in the cells, and the output corresponds to the expressions of genes in the set.Consequently, these regression models effectively capture the relationship between the important genes and the genes in the sets.As a result, the batch correction VAE models can be utilized as transformers, enabling the correction of not only the important genes but also all the genes in the dataset.
Once the regression models are trained, the algorithm is capable of correcting batch effects for all genes.Initially, the VAE model associated with each connected component is employed to calculate batch-corrected expressions for the important genes within each cell.
Subsequently, these batch-corrected expressions are utilized as input for the regression models specific to their respective connected components.The regression models then generate batch effect corrected expressions for all genes.Although this approach may result in uniform modeling of batch effects across all genes, we hypothesize that the multilayer perceptrons (MLPs) incorporated in the algorithm are adequate in providing gene-specific nonlinear corrections.

Synthetic datasets
The SCPRO-HI algorithm was tested in a controlled environment to verify the functionality of the proposed cell anchoring and batch effect removal methods.To achieve this, we created two synthetic datasets, each containing 100 protein measurements from 1000 cells.
Protein abundances were simulated using a three-component Gaus-

Integrating cross-species datasets
In this study, we assess the performance of the SCPRO-HI algorithm by comparing it against four state-of-the-art single-cell transcriptomic data integration tools: Harmony, Scanorama, fastMNN, and scVI.Additionally, we incorporate the MARIO algorithm into our comparison, which, to the best of our knowledge, is the only algorithm specifically designed for integrating single-cell proteomic datasets.For the evaluation, we utilize four publicly available datasets, each containing 120,000 blood cells categorized into seven cell types obtained from (i) 35 healthy humans challenged with the H1N1 virus [33], (ii) humans stimulated with IFNγ, (iii) rhesus macaques stimulated with IFNγ, and (iv) cynomolgus monkeys stimulated with IFNγ [34].
In the first dataset, mass cytometry was used to profile 40 pro- All evaluations were conducted using all proteins shared among the four datasets.For the MARIO algorithm, we utilized all proteins present in each dataset, as this method particularly leverages batch-specific proteins.Notably, there are 38 proteins common to all datasets, listed in Table S1.We excluded the CD235a protein from the integration process because it was not measured in rhesus macaques.

TA B L E 1
The performance results of algorithms on cross-species single-cell proteomic datasets.Similarly, the CD27 protein was omitted as it was exclusively measured in humans with H1N1.All algorithms are tested using their publicly available Python packages as described in their user manuals or tutorials.

Method
The SCPRO-HI algorithm clustered the datasets into 32 cell clusters in total.For each cluster, we assigned a cell type which is the majority of ground truth cell types of cells in the cluster.These labels were only used to evaluate the performance of the cluster matching, as we do not use any prior knowledge about the cells for integration, except for their protein abundances.The evaluation results demonstrate the efficiency of our cluster-matching algorithm, as 37 of 43 anchors were established between clusters of the same cell type.
The clusters formed 12 connected components.Seven of these components contain two or more clusters, while five of them consist of isolated clusters that do not exhibit sufficient similarity with any cluster within the connected component, even after the second chance.These isolated clusters may be analyzed further, as they contain relatively distinguished cell types in the datasets.
The performance results of the algorithms are evaluated using five widely used metrics: Silhouette Score (SS), Adjusted Rank Index (ARI), iLISI score, cLISI score, and highly variable protein (HVP) preservation score [5].The experimental performance results are presented in Table 1 and the integration results are presented in Figure 5.The best two scores for each metric are highlighted in bold in the table.
The Silhouette score is calculated in two different ways.First, we utilize the ground truth cell types to determine the cluster centers.
Second, we calculate the Silhouette score using the computationally detected cluster centers as the default approach.When considering the ground truth labels, our algorithm demonstrates a silhouette score performance that is almost twice as high as that of the second-best- The Adjusted Rand Index (ARI) is calculated between the ground truth cell labels and the calculated cell types.We use the ground truth cell labels of the three nearest neighbors in the integrated space of each cell to assign a new label to the cell as the majority of its neighbors.
The results of Scanorama, Harmony, and SCPRO-HI are competitive and demonstrate that all three algorithms are successful at grouping similar cells together.The iLISI and cLISI metrics measure the quality of the batch integration and the cell-type composition of clusters in the corrected datasets.Our algorithm's efficient integration capability of different datasets results in the best iLISI score.However, the presence of a limited number of cells belonging to different cell types compared to the majority of cell types within the clusters of the integrated data has a significant impact, resulting in a relatively low cLISI score.This is because these outlier cells influence the local cell type purity of multiple cells located at a distance from them.As a result, the cumulative effect of each outlier is magnified, leading to lower cLISI scores compared to methods that do not effectively integrate different batches.
We evaluate the preservation of biological meaning by using the HVP metric.We follow the same strategy for calculating HVG conservation as proposed by Luecken [5].Our algorithm preserves the vari-ability of proteins almost 13% better than Scanorama.The heatmaps in Figure 5B shows that the algorithm mostly preserves the variability of the proteins.The fastMNN algorithm shows great success in HVG conservation due to its almost-linear correction strategy.On the other hand, the metric cannot be calculated for scVI, Harmony, and MARIO, since they map the protein abundances into a new low-dimensional space.

Human PBMC datasets
The benchmark algorithms were tested using the peripheral blood mononuclear cell (PBMC) datasets proposed in [29].We conducted this experiment to demonstrate the algorithm's generalization capabilities against varying levels of batch effects within the datasets, considering that the magnitude of batch effect tends to be higher in cross-species datasets compared to datasets obtained from the same species.
The PBMC datasets were derived from samples collected from eight donors infected with HIV.The samples were divided into two batches.In total, the datasets comprised 161,764 cells and included measurements for 228 proteins, listed in Table S2.There are seven cell types in the datasets.We employed the same methodology as in the previous experiment to evaluate the performance of all benchmark The proposed algorithm exhibits a competitive performance in terms of the Adjusted Rand Index (ARI) score, although it is relatively lower compared to its competitors.We attribute this difference in performance to the presence of miss-clustered small cell groups that are integrated with larger cell groups of different types.This flaw in clustering can be caused by the mislabeling of cells, particularly when these small groups are situated closer to other types of cell groups within the noisy protein measurement space.
The remarkable success of our algorithm in the iLISI metric remains consistent throughout the experiments conducted on the PBMC datasets, surpassing Harmony, the second-best performing method, by 25% in terms of results.However, in the cLISI metric, our algorithm shares the second position with Scanorama, which deviates from the previous experimental outcomes.Notably, fastMNN and scVI algorithms exhibit exceptional performance in the cLISI metrics, but they fall short when it comes to effectively integrating the batches, as indicated by their iLISI scores of 0.04 and 0.03, respectively.
The experiment conducted on integrating the human PBMC datasets shows the robustness of our algorithm across datasets with varying levels of batch effect.However, it is worth noting that the MARIO algorithm, which exhibited promising performance in integrating cross-species datasets, could not maintain its performance in TA B L E 3 The performance results on single-cell transcriptomic datasets.the PBMC dataset.Notably, the overall performance of the MARIO remains promising for the integration of proteomic datasets.On the other hand, both the Harmony and Scanorama methods demonstrated their robustness by delivering similar performance on both datasets.

Method
The FastMNN algorithm demonstrates improved performance in certain metrics while maintaining its consistent ranking.Notably, the scVI algorithm showed significant performance enhancement in this experiment; however, it struggled to effectively mix the batches, as evidenced by its low iLISI score.
Overall, our algorithm consistently exhibited robustness across different datasets, proving its ability to handle varying levels of batch effects.While some methods showed improvements or limitations in specific aspects, our algorithm emerged as a strong solution method in integrating the PBMC datasets.

Scaling up to high-dimensional datasets
The Harmony and Scanorama algorithms are evaluated on scRNA-seq datasets against the SCPRO-HI algorithm to assess the generalization of our algorithm on high-dimensional datasets.We selected these algorithms because of their ability to integrate scRNA-seq datasets, as demonstrated in many studies.The MARIO algorithm wasn't evaluated as it is designed specifically for single-cell proteomic datasets and does not have any extensions for high-dimensional transcriptomic datasets.Additionally, we did not include the fastMNN algorithm in our tests, as the Harmony and Scanorama algorithms have previously shown superior performance in other studies.The scVI algorithm could not be tested with our test environment, which had a 16-core CPU and 64 GB of RAM.
The four datasets used for comparison contain 2000 gene expressions obtained from 56,399 mouse brain cells in 10 different cell types [35].We conducted our tests by setting the number of important genes to 100, which is coherent with the number of protein measurements in antibody-based single-cell proteomic datasets.As a result, there were to handle higher-dimensional datasets with thousands of genes could enhance its applicability.Additionally, further research is needed to optimize the relationship modeling between important genes and gene sets to improve the preservation of highly variable genes in high-dimensional transcriptomic datasets.
In conclusion, the SCPRO-HI algorithm presented in this study provides an effective approach for the integration of single-cell proteomic datasets.The novel hierarchical cell anchoring technique and variational auto-encoder model successfully address the challenges of cell matching and batch effect correction.The algorithm outperforms state-of-the-art methods in terms of computational metrics and biological meaning preservation, demonstrating its potential for enhancing downstream analysis tasks in single-cell proteomics.Further research and improvements could extend the algorithm's applicability to higher-dimensional datasets and provide even more accurate integration results.
The preliminary clustering of similar cells significantly enhances the effectiveness of the MNN algorithm.This is particularly valuable in mitigating the challenges posed by low-abundance proteins.By grouping cells with similar biological characteristics, our F I G U R E 1 Pipeline of the SCPRO-HI algorithm for integrating single-cell proteomic datasets.(A-C) Hierarchical cell anchoring steps.(A) Each dataset is clustered using the DBSCAN algorithm in an unsupervised manner.(B) Clusters across different datasets are anchored based on a cluster similarity metric, considering protein importance values for intercluster diversity.The connected components are identified in the resulting graph, where clusters are nodes and anchors are edges.(C) Connected components are considered distinct cell types, and positive cell anchors are established only within the same connected component.The MNN algorithm is employed for establishing positive anchors.The negative anchor for a positive cell pair is selected randomly from the clusters that are not in the same component with the corresponding cell pair.(D, E) The integration process of the single-cell proteomic datasets.(D) Batch effect correction is performed using novel VAE models.Each connected component has its own VAE model, which models the batch effect individually.Training is performed using abundance vectors of cell triplets, which include positive and negative anchors along with cells within the connected component.The concatenated vectors are input to the encoder, and the decoder generates the corrected abundance vector for the cell.(E) After training, each cell in a connected component is processed through its respective VAE model, using the cell itself as the positive anchor along with the negative anchor of the cell, to generate corrected abundance vectors for all cells within that connected component.

F I G U R E 2
The anchors and clusters are represented as a graph in order to determine the connected components, where anchors are depicted as edges and clusters as vertices.A connected component (CC) is a subgraph where every two vertices are connected by a path, and the subgraph is not connected to any other vertices outside of it.Since all the cell clusters in a connected component are distinguished from their datasets by sharing similar proteins, we hypothesized that cells of a specific cell type across different datasets form a connected component.The algorithm provides a second chance for unmatched clusters to attach to a connected component.This ensures the inclusion of clusters that closely resemble some clusters within the connected component.The procedure is achieved by treating the cells within the connected component as a unified cluster, enabling the successful attachment of these nearly missed clusters.The important proteins of the connected components are calculated using the same procedure as for individual clusters, and the similarity between each unmatched cluster and each connected component is computed.If the similarity exceeds the 60% similarity threshold, the unmatched cluster is anchored to the most similar connected component.Clusters that remain unmatched after this process are considered as connected components on their own.The positive anchors of cells are selected from within their own connected components to bring together cells of the same cell type in the integrated data.The anchors are calculated individually for each cell in a connected component using the MNN algorithm.In the MNN algorithm, an anchor is established between two cells in different cell clusters if they are in each other's k-nearest neighborhood in the other cluster.The algorithm considers all cells in a connected component except those in the cell's own cluster when determining the MNNs.The number of positive anchors established for each cell ranges from 0 to k.Cell triplets including cell, it is positive and negative anchors are generated for training the VAE models used to correct the batch effect.The number of triplets is equal to the total number of positive cell anchors since they are generated by appending a negative anchor to each positive anchor pair of a cell.The negative anchor in a triplet is randomly selected from the clusters that are not in the same component with the corresponding cell.Once the cell triplets are generated, the cell matching step is completed.The architecture of the novel variational auto-encoder (VAE) model used for batch effect correction.The VAE model contains dense layers in both the encoder and decoder components sized three times the number of proteins.The encoder takes concatenated vectors of cell triplets as input and generates a latent embedding with a size equal to the number of proteins.This latent embedding captures the essential information of the input data.The decoder takes the latent embedding and processes it through two additional dense layers to produce the corrected abundance vector.
is proposed for generating full protein abundances without batch effect.The model enforces the similarity between the positive anchored cells while decreasing the similarity to the negative anchored cells.The following loss function is used for training the model:

Figure 2 .
There is a dense layer in both the encoder and decoder, each with a size of 3N where N is the num-ber of proteins in the datasets.The output of the dense layer in the encoder is connected to both the σ and μ layers, which are used for sampling the latent representation.The latent representation X s is fed into the dense layer in the decoder.The first dense layer is then connected to a dense layer with a size of X, which learns the corrected measurements.The output of the final dense layer serves as the model's output.The proposed algorithm aims to model batch effects individually for each cell type, in order not to smooth cell-type-specific protein abundance peaks.Therefore, a VAE model is trained for each connected component.The VAE models are trained using the cell triplets identified from the cells in their connected components.After the training process is complete, the models are ready to generate corrected protein abundance vectors for not only the cells used in training but all cells in the datasets.The batch effect is modeled using the cell anchors established between every possible pair of clusters in the connected component.Therefore, the models have the potential to correct batch effects on cells that are not included in the training of the model.
sian mixture model, with each component representing a distinct cell type.Consequently, both datasets included three types of cell groups.In these datasets, three specific proteins in each cell group were set to different values while all other measurements utilized a shared mean () value across cells.To ensure differentiation of cell groups across datasets, a unique set of three arbitrary proteins was selected for each randomly paired cell cluster, distinguishing these cells within their groups from others based on these specific protein measurements.The number of distinctive proteins was set to three to maintain consistency with real-world datasets.The abundances of the distinctive proteins are generated using a new  value for all the cells.Thus, we have distinct cell groups across datasets that are distinguished from the remaining cells by the abundance of the same proteins, providing a necessary foundation to evaluate the effectiveness of our proposed anchoring algorithm.A Gaussian noise is randomly generated for each dataset and added to the protein measurements to simulate batch effects.Consequently, we obtained two datasets, each consisting of three identical cell types, as shown in Figure4.The cell groups within the same cell type have different centers but share the same distinctive proteins.The integration results demonstrated the efficacy of the SCPRO-HI algorithm in detecting cell groups with identical distinctive proteins across different datasets.Moreover, we observed that the anchored cell groups clustered together in the integrated dataset.Thus, the VAE models successfully integrated the anchored cells in the corrected measurement space.
in immune cell subsets from 35 individuals across 11 time points F I G U R E 4 Synthetic datasets and integration results.(A) Dataset One and (B) Dataset Two consist of three distinct cell groups, where protein abundances were simulated using a three-component Gaussian mixture model, with each component representing a unique cell type.(C) The joint projection of both datasets before integration shows the spatial arrangement of cells in 2D.The integrated cells are projected, demonstrating the effectiveness of the SCPRO-HI algorithm in establishing anchors between cell groups with identical distinctive proteins across distinct datasets.The homogeneous and well-separated cell clusters validate the capability of the proposed variational auto-encoder (VAE) model to eliminate batch effects, resulting in similar measurements for the anchored cells.during the H1N1 virus challenge.The authors reported plate-specific batch effects among sample cohorts in this dataset.The multispecies datasets were generated for 39 proteins.The measurements are acquired through phospho-flow immune signaling profiling of whole blood from 86 healthy humans, 32 rhesus macaques (Macaca mulatta), and 32 cynomolgus macaques (Macaca fascicularis).These datasets offer a robust framework for integration algorithms, challenging the preservation of biological variation among species while removing batch effects.Visual representations in Figure S1 highlight both internal and external biological variations among the datasets.Notably, the first dataset displays distinct separation from the others, even when considering human samples within the other dataset.Additionally, it is essential to note that human cells were segregated due to biological variation, whereas data samples from monkeys and macaques are interspersed in the figure.
performing method.The silhouette score of our algorithm indicates that the corrected measurements of cells within the same cell type are similar in the corrected space.Additionally, the SCPRO-HI algorithm shows superior performance when the cell cluster labels are used for calculating silhouette scores.The cluster labels are generated using the K-Means clustering algorithm.The number of clusters k is determined as 7 since there are seven types of cells in the batches.The F I G U R E 5 Integration results of algorithms on single-cell proteomic datasets.(A) UMAP projections of the integrated datasets are displayed, with color annotations representing batches (top) and cell types (bottom).The results highlight the effectiveness of the SCPRO-HI algorithm in integrating cells across different batches by aggregating cells of the same type together.The MARIO, Harmony, and Scanorama algorithms also exhibit favorable performance in terms of batch integration and conservation of cell types within clusters.In contrast, the fastMNN algorithm fails to properly integrate datasets, resulting in no mixing of the same type of cell clusters in different batches.The scVI algorithm struggles with batch mixing also, but it demonstrates potential in clustering cells of the same type within its transformed domain.(B) Heatmaps illustrate the similarity between the rank of proteins according to normalized variances in the noisy data and the corrected measurements.Only self-relations of proteins are depicted by the similarity of the importance ranks between raw and integrated data.Lighter cells along the diagonal indicate better preservation of protein variance.All three algorithms demonstrate satisfactory performance, resulting in predominantly pale heatmaps.Notably, the fastMNN algorithm demonstrates the least discrepancy in protein variances, possibly due to its nearly linear correction strategy.silhouette scores by the cluster labels indicate that the cell clusters are well-separated in the corrected domain.
Note: "-" denotes that the corresponding metric is not applicable.Bold values indicate the top two highest scores for each metric.
The experimental performance results are presented in Table2and the integration results are presented in FigureS2.The best two scores for each metric are highlighted in bold in the table.Our algorithm and scVI perform equally well in terms of the silhouette score when integrating the human PBMC datasets.Impressively, both algorithms achieve a 20% improvement in the silhouette score compared to the second-best performing method.The number of clusters k is determined as 7, corresponding to the seven distinct cell types present in the batches.Furthermore, the results reveal that our algorithm successfully groups cells of the same type together while effectively differentiating them from other cell types based on their corrected protein measurements.This achievement is reflected in a 15% improvement in the silhouette score, considering the predicted cell types.
Note: "-" denotes that the corresponding metric is not applicable.Bold values indicate the top two highest scores for each metric.