A Cell Cycle‐Aware Network for Data Integration and Label Transferring of Single‐Cell RNA‐Seq and ATAC‐Seq

Abstract In recent years, the integration of single‐cell multi‐omics data has provided a more comprehensive understanding of cell functions and internal regulatory mechanisms from a non‐single omics perspective, but it still suffers many challenges, such as omics‐variance, sparsity, cell heterogeneity, and confounding factors. As it is known, the cell cycle is regarded as a confounder when analyzing other factors in single‐cell RNA‐seq data, but it is not clear how it will work on the integrated single‐cell multi‐omics data. Here, a cell cycle‐aware network (CCAN) is developed to remove cell cycle effects from the integrated single‐cell multi‐omics data while keeping the cell type‐specific variations. This is the first computational model to study the cell‐cycle effects in the integration of single‐cell multi‐omics data. Validations on several benchmark datasets show the outstanding performance of CCAN in a variety of downstream analyses and applications, including removing cell cycle effects and batch effects of scRNA‐seq datasets from different protocols, integrating paired and unpaired scRNA‐seq and scATAC‐seq data, accurately transferring cell type labels from scRNA‐seq to scATAC‐seq data, and characterizing the differentiation process from hematopoietic stem cells to different lineages in the integration of differentiation data.


Introduction
In recent years, advances in single-cell RNA sequencing (scRNA-seq) technology have enabled us to generate high-throughput gene expression data through different sequencing methods at single-cell resolution. [1]1c,2] There exists a multitude of methods for generating scRNA-seq data, with over a dozen most commonly used scRNA-seq protocols accessible. [3]hese technologies generate scRNA-seq data derived from distinct experiments, encompassing variations in capture timing, handlers, reagent batches, equipment, and even technological platforms. [3,4]These inherent dissimilarities engender batch effects within scRNA-seq data, [2b,5] which become the priority and grand challenges in single-cell RNA-seq data analysis. [6]n addition, the emergence of single-cell multi-omics technologies has enabled insights into complex cellular microenvironments and biological processes, offering many exciting biological opportunities from perspectives other than transcriptomics, such as genomics, epigenomics, proteomics, metabolomics, spatial transcriptomics, etc. [7] Particularly, singlecell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) is an epigenomic profiling technology for studying the chromatin accessibility of individual cells. [8]8a] This enhances our comprehension of the epigenetic state, cell-type heterogeneity, and cell state. [9]8b,10] Meanwhile, cell type annotation for scATAC-seq data is challenging due to lack of specifically designed tools and use of unintuitive cis-and trans-regulatory elements in single-cell ATAC-seq data. [11]In recent years, advanced technologies have made it possible to simultaneously characterize gene expression and chromatin accessibility in the same cell, which we call the generated data paired data. [12]These techniques provide tools for the integrated analysis of scRNA-seq and scATAC-seq data, which can apply the information obtained from the large amount of annotated scRNA-seq data for the cell type annotation of scATAC-seq data. [13]However, single-omics data are more readily available than paired scRNA-seq and scATACseq data.1b] Therefore, in cases where paired data are not abundant, integrating these unpaired data is a better option for researchers to conduct broader studies.However, unpaired data frequently introduce complexity to subsequent analyses due to differences in features and sparsity level, so it is important to develop novel data integration methods that can be applied to unpaired scRNA-seq and scATAC-seq data.
In summary, the current integration of scRNA-seq and scATAC-seq data can be divided into three types: [14] 1) intramodality integration, the integration of the same omics data (scRNA-seq data) measured from different cells and different experiments; 2) paired inter-modality integration, that is, the integration of scRNA-seq and scATAC-seq data measured from the same cell; and 3) unpaired inter-modality integration, that is, the integration of scRNA-seq and scATAC-seq data generated from different cells, samples or experiments.Corresponding integration methods have also been developed for different integration types: 1) methods for intra-modality integration employ dimensionality reduction algorithms to reduce the complexity of the data and identify the common biological signal across datasets to align cells or cell populations to integrate scRNA-seq datasets from different sources or experiments, but most of them struggle with excessive data scale, run time or resource requirements. [15]) methods for paired inter-modality integration apply matrix factorization, [16] weighted nearest neighbor algorithm, [17] and neural networks [18] to integrate scRNA-seq and scATAC-seq data measured within a cell and to obtain a joint profile of cellular state.These methods are specially designed for paired data, making their application to other unpaired data challenging.3) unpaired data not only have different features but also frequently exhibit significant variations in cell count, thus giving rise to a distinct category of integration methods for unpaired data.Methods for unpaired inter-modality integration focus on finding solutions for the manifold learning and cell alignment in the embedding space using neural networks. [19]There is also a nonneural network approach that uses the non-negative matrix factorization approach and online learning algorithm to incorporate new data without recalculating from scratch. [20]However, despite the aforementioned approaches, most of them are specialized to address specific challenges within particular integration type of single-cell data.Currently, there is a lack of a comprehensive approach capable of simultaneously addressing all three types of integration issues outlined above.The single-cell data integration method to be developed needs to consider the sparsity, data scale, feature difference, high dimensionality, and other inherent disparities of scRNA-seq and scATAC-seq data.In addition, the cell cycle is often considered a confounding factor in the study of cell population and cell heterogeneity based on single-cell RNAsequencing data. [21]How it will work in the integrated analysis of scRNA-seq and scATAC-seq data is still unknown.
To address such concerns, we developed an advanced Cell Cycle-Aware Network (CCAN) with the aim of extracting intrinsic biological signals masked by context-specific patterns (i.e., cell type-specific heterogeneity) and confounding factors (i.e., cell cycle effects, batch effects, and noise).Notably, CCAN can integrate single-cell multi-omics data and remove cell cycle effects from the integrated data while maintaining heterogeneity between cell types.CCAN is based on a domain separation network, adding a periodic activation function to the private decoder to simulate the dynamic process of the cell cycle, and projecting single-cell data from different platforms or modalities into a common lowdimensional space through shared projection.The distribution constraint function and the class alignment loss function are added to the shared embedding space to make the distribution of different data as similar as possible and the difference between different types of data to be maximized.In addition to single-cell data integration, CCAN enables cell type prediction of scATACseq data via transferring the cell type annotation information of scRNA-seq data to scATAC-seq data.Validations based on multiple sets of data prove that CCAN can not only eliminate the batch effect between scRNA-seq data from different platforms, but also integrate paired and unpaired scRNA-seq data and scATAC-seq data well in the embedding space.Integration of unpaired data enables accurate cell type prediction for scATAC-seq data.Furthermore, CCAN can maintain cell differentiation trajectories when integrating single-cell differentiation data.

Overview of CCAN Approach
As illustrated in Figure 1, CCAN is a self-supervised approach using the labeled transcriptomic profile of scRNA-seq data (source domain) and unlabeled profile from same/different omics data (target domain), such as gene expression of scRNA-seq data and chromatin accessibility of scATAC-seq data.CCAN uses a domain separation network (DSN) to integrate data from source and target domains and transfer the annotations from source domain to target domain.Shared encoders and private encoders in DSN are three-layer perceptrons to learn noncircular and circular embeddings of both domains, separately.The shared embedding function projects a high-dimensional profile of each cell to a low-dimensional vector, which distinguishes biological meaningful signals from circular confounding factors (private embedding) and transforms the embeddings of cells from different domains into a similar distribution.In the decoder, we use sine and cosine as the activation functions specific for private embeddings, followed by a two-layer perceptron performing noncircular transformations mapping the embedded data to the original space.The training of CCAN has four main steps: 1) pretraining of the cell cycle-aware domain separation network; 2) label transferring from source domain to target domain; 3) refining CCAN by introducing a cluster alignment loss and 4) finalization and applications of CCAN model.We assessed CCAN using several real single-cell datasets, including two scRNA-seq datasets from different protocols, [15c] paired and unpaired scRNA-seq and scATAC-seq datasets, [12c,22] and single-cell differentiated datasets from different modalities, [23] etc. Evaluation results indicate that CCAN is a versatile method that can be used for multi-tasks, including single-cell multi-omics data integration, batch effect removal, cell cycle effects removal, label transferring, and cell type prediction (Figure 1).CCAN is an effective method and CCAN is based on a domain separation network, taking labeled scRNA-seq data as source domain and unlabeled scRNA-seq or scATAC-seq data as target domain.Both source and target domain are fed into a shared encoder and two private encoders.The shared encoder learns the common information from both domains and employs an alignment loss to constrain the distribution of source and target domain to be similar in the shared embedding space.Private encoders are specific to source or target domain to extract periodic information of the cell cycle effect.An orthogonal difference loss between the shared embedding and private embedding enforces features learned by the shared encoder and the private encoder to be as different as possible in the low-dimensional space.Both source and target domain have the same structure of decoder by contacting the shared embedding and private embedding to reconstruct the original data.A supervised classifier is trained on the labeled scRNA-seq data and can be used to predict the label of target domain.CCAN has four main training steps and can be applied to multi-tasks, including batch effect removal, single-cell multi-omics data integration, cell cycle effect removal, cell cycle prediction, label transferring and trajectory analysis, etc.
competitive in various applications compared with other existing methods.

Batch Effect Removal of scRNA-Seq Datasets from Different Protocols
The emergence of advanced technologies enabled comprehensive transcriptional characterization of cell-type heterogeneity across a variety of biological and clinical conditions, integrating these scRNA-seq datasets from different protocols while maintaining cell-type heterogeneity has become very challenging.15c] We denoted them as pbmc_6k and pbmc_8k respectively in this study.We first clustered the two scRNA-seq datasets separately using the Louvain clustering algorithm, [24] resulting 10 clusters of the pbmc_6k data and 13 clusters of the pbmc_8k data (Figure 2a).Then we used canonical cell type marker genes to annotate PBMC clusters before integration (Figure 2b; Figures S1 and S2, Supporting Information).The annotation resulted six major cell populations (Figure 2c): monocytes (CD14+/FCGR3A+/MS4A7+/LYZ+), dendritic cells (FCER1A+), B cells (MS4A1+), T cells (CD3D+), megakaryocytes (PPBP+), and natural killer (NK) cells (CD3D-/GNLY+). [17,25]hese marker genes were significantly expressed in the corresponding annotated cell types (Figure 2d).15d] As illustrated in Figure 2e, the two PBMC data were completely separated before integration, which showed the confident existence of the batch effect.These batch effects significantly impeded the accurate classification of cell types.Evidently, the same cell types were subdivided into two distinct categories prior to integration, including B cells, T cells, and monocytes.CCAN outperformed other methods by accomplishing a perfect integration, effectively eliminating datasetspecific variations and mixing all cells within each cluster in the projected space.The projection distribution on the two coordinate axes of UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) [26] visualization also illustrated that the distributions of pbmc_6k and pbmc_8k exhibited near-complete overlap, further affirming the successful integration of PBMCs using CCAN.To quantitatively measure the performance of CCAN, we calculated kBET (k-nearest neighbor Batch Effect Test), [27] a metric used to assess the batch effects in single-cell RNA sequencing data by comparing the distribution of nearest neighbors between cells from different batches.Compared to other methods, CCAN had the lowest kBET score when choosing different numbers of nearest neighbors (Figure 2f), indicating that cells from different batches had more similar nearest neighbors after integration using CCAN.What's more, a perfect integration not only required no significant variability between the two scRNA-seq data, but also needed to enhance the differences between cell types, which was valuable for subsequent downstream analyses.We applied k-means clustering to the integrated data, and compared the clustering results with the annotated cell types.By calculating the three clustering evaluation metrics of Rand Index (RI), Adjusted Rand Index (ARI), and Normalized Mutual Information (NMI), we observed that the results of CCAN were slightly inferior to those of BBKNN and Seurat V4, but always higher than Harmony, online iNMF, Conos and Scanorama (Figure 2g).However, considering the performance of both data mixing and cell type separation, CCAN was still a better choice for integration that had less impact by batch effects and was more reliable for downstream analyses.Details in the calculation of three clustering evaluation metrics are presented in Text S1 (Supporting Information).

Cell Cycle Identification and Cell Cycle Effect Removal of the Integrated Data
To evaluate the CCAN's performance in cell cycle identification, we utilized a mouse embryonic dataset [28] with known ground truth cell cycle information as one batch.4c,29] Subsequently, CCAN was applied to integrate two batches and estimate cell cycle pseudotime from cell cycle-specific variations.We leveraged a Gaussian Mixture Model (GMM) with three components (Figure S3, Supporting Information) to discretize the continuous pseudotime generated by CCAN into discrete cell cycle stages.21c] Seurat employs 43 S phase marker genes and 54 G2M phase marker genes to delineate cell cycle stages, [4a] while reCAT utilizes 378 cell cycle genes from Cyclebase3 [31] to estimate cell cycle stages. [30]To evaluate the performance of these models, we employed four classification metrics: Accuracy, Rand Index (RI), Adjusted Rand Index (ARI), and Normalized Mutual Information (NMI).As illustrated in Figure 3a, CCAN demonstrated outstanding performance in identifying cell cycle stages compared with marker gene-based methods.
Cell cycle dysregulation presents as changes in cell distribution across different stages of the cell cycle and variations in the expression of cell cycle regulatory genes. [32]To further assess the CCAN's performance in identifying cell cycles from different conditions.We used scRNA-seq tumor data of 176644 cells from breast cancer patients. [33]CCAN was utilized to integrat data with two conditions (letrozole alone and intermittent high-dose ribociclib) and predict cancer cell cycle transitions.We calculated the proportion of cancer cells in the mitotic (S/G2) phase in biopsies from each patient (Figure S4a, Supporting Information).During the combination therapy, there was an increase in the proportion of cancer cells in the mitotic and growth (S/G2) phases.We studied the expression fluctuations of CDK6 and CDKN2A genes throughout the cell cycle under combination therapy, a reduction in CDK inhibitor 2A and an increase in CDK6 expression from G1 to S/G2 phase were observed (Figure S4b, Supporting Information).In conclusion, CCAN accurately estimates cancer cell cycle transitions.
In addition to the impact of data batches on cell type heterogeneity, cell cycle is often seen as a confounding factor when studying differences between cell types.Due to the lack of real data with both cell cycle and cell type labels as ground truth, we use the simulated data in Cyclum [21c] to characterize the cell cycle effect in virtual samples.The simulated dataset comprises two subclones.The first subclone, designated as subclone 1, consists of samples from the mESC data, [28] which includes experimental cell cycle labels serving as ground truth.The second subclone is generated by doubling the expression levels of a randomly selected set of genes, encompassing varying numbers of known cell-cycle and non-cell-cycle genes.Cells from these two subclones are amalgamated into a virtual tumor sample.4c,29] This process was repeated 10 times, generating 10 sets of simulated data.We then assessed the effectiveness of CCAN alongside Cyclum, Seurat, and ccRemover in removing the cell cycle effect.Figure 3b presents UMAP visualizations of a representative sampled dataset.CCAN successfully segregates two subclones accurately, with a noticeable distinction between phenotypes compared to the data before removing the cell cycle effect.Cyclum separates cells into two groups, but they do not correspond to the expected subclones.Seurat and ccRemover fail to distinguish between the two subclones.Figures S5-S9 (Supporting Information) illustrate the UMAP visualizations of 10 simulated datasets before and after removing the cell cycle effect using four methods.To quantify the effectiveness of cell cycle effect removal, we employ the metric of separability to measure the degree to which the data points in different classes can be correctly separated or discriminated by the provided subclone clusters.A higher separability value indicates better discrimination between subclones.Figure 3c highlights CCAN's superior separability, underscoring its prowess in removing the cell cycle effect from integrated data.
For the PBMC data before integration, we meticulously selected a subset of variable genes from the PBMCs and proceeded to conduct Principal Component Analysis (PCA) based on these selected genes.Notably, some genes associated with the cell cycle prominently featured among the top 10 principal components.For instance, in the pbmc_6k dataset, genes such as TYMS, RRM2, BIRC5, PCNA, and HMGB2 displayed significant associations with principal components 7, 8, 9, and 10 (PC_7, PC_8, PC_9, and PC_10).Similarly, within the pbmc_8k dataset, genes TYMS, BIRC5, and MKI67 were prominent in principal component 10 (PC_10) (Figure 3d).Despite their minimal or low expression in the majority of cells, several cell cycle-related genes (such as PCNA and HMGB2 in pbmc_6k and BIRC5 in pbmc_8k) exhibited normal expression patterns, thereby corroborating the presence of cell cycle effects.Inspired by Cyclum, [21c] we developed a cell-cycle aware module in CCAN to remove cell cycle effects in the integration of scRNA-seq data.CCAN employed a distinct sinusoidal component in the private autoencoder to effectively capture the circular trajectory in the high-dimensional gene expression space.The private embedding space was formed by single cells sampled at various stages of a periodic process.In essence, the private encoder in CCAN was dedicated to pinpointing an optimal cell embedding within this circular space, which we denoted as cell cycle pseudotime.We compared the performance of CCAN with other existing cell cycle effect removal methods, including Seurat, Cyclum, and ccRemover.Before integration, we used the cell cycle marker genes in Seurat [17] to identify the cell cycle phases of cells in the two scRNA-seq datasets, and used the identified cell cycle as a benchmark label to evaluate the effectiveness of cell cycle effects removal.UMAP visualization of the integrated data after cell cycle effects removal showed that integration using CCAN was unable to clearly distinguish the three annotated cell cycle phases, as did Seurat, Cyclum, and ccRemover.However, the advantage of CCAN over other methods was that only CCAN did not introduce extra noise to the downstream analysis based on cell types after removing the cell cycle effects, which was reflected in the fact that CCAN can accurately distinguish six different cell types after removing the cell cycle effects (Figure 3e).We calculated the metric of separability to quantitatively measure the discrimination between cell types after removing the cell cycle effect using CCAN, Seurat, Cyclum, and ccRemover (Figure S10, Supporting Information).This also demonstrated the effectiveness and reliability of CCAN in removing cell cycle effects.

Integration of Joint Profiling of scRNA-Seq Data and scATAC-Seq Data
12c] We compared the performance of CCAN with eight existing integration methods, containing scMVP, [18a] scAI, [16] scMVAE, [18b] Seurat v4, [17] scJoint, [19b] GLUE, [19a] SCALEX, [19c] and online_iNMF. [20]Among these methods, scMVP, scAI, and scMVAE are specifically designed only for integrating paired data, so their generation of the integrated data is based on concatenation of features in the latent dimension (Figure S11, Supporting Information).While Seurat v4, scJoint, GlUE, SCALEX, and online_iNMF have the capability to integrate both paired and unpaired datasets.Their integration strategy for paired datasets in the comparison is to treat different modalities as two datasets from different experiments and integrate them together, so their integration is based on concatenation of cells (Figure S11, Supporting Information).The paired datasets are measured in the same cells, which provides ground truth labels that allow CCAN to integrate without predicting the pseudo label of the scATAC-seq data first (see Experimental from different protocols with annotated cell types.d) UMAP visualization of marker expression in pbmc_6k and pbmc_8k data, respectively.e) UMAP visualization of scRNA-seq datasets before integration and after integration using CCAN, Harmony, Seurat V4, online iNMF, Conos, Scanorama and BBKNN, respectively.The upper panel was colored by different data batches and the bottom panel colored by annotated cell types.f) Line plot of kBET scores of integrated data using different methods when choosing different numbers of nearest neighbors.k is neighborhood size.g) scatter plot of kBET score and three clustering metrics including RI, ARI, and NMI of different methods.Different methods were represented using different colors and different shapes.RI, Rand Index; ARI, Adjusted Rand Index; NMI, Normalized Mutual Information.Section).In addition, CCAN's integration for paired data is also the concatenation of learned features of scRNA-seq and scATACseq data in the latent space.UMAP visualization of the integrated data shows the excellent integration ability of CCAN, which enables the integrated data to distinguish 13 different cell types better than other methods.Following CCAN, GLUE exhibits the second-best performance.Notably, CCAN's integrated data showcases a tighter aggregation of cells of the same type, with cells of different types are more dispersed.Although scAI, scMVAE, Seurat V4, and SCALEX can also clearly distinguish the large clusters such as L2 /3T, L4, L5CT, and L6IT, different cell types are not far away from each other (Figure 4a).The results from three clustering evaluation metrics, including Rand Index (RI), Adjusted Rand Index (ARI), and Normalized Mutual Information (NMI), computed based on k-means clusters and ground truth, further corroborate that CCAN outperforms other methods in both integrating cells and accurately separating different cell types (Figure 4b).

Integration of Unpaired Datasets and Label Transferring from scRNA-Seq Data to scATAC-Seq Data
Although more and more technologies are able to measure gene expression and chromatin accessibility in the same cell, there are still unpaired datasets from the same tissue.There is no correspondence between cells in scRNA-seq data and cells in scATACseq data.The integration of unpaired datasets maps both modalities to a common space, which allows tools and analyses designed for scRNA-seq data to have the potential to be applied to scATACseq data.Variations between different modalities are different from batch effects between scRNA-seq datasets from different protocols, since scATAC-seq data characterize the chromatin accessibility instead of transcriptome profile and have more sparsity than scRNA-seq data.Even though the integration of unpaired datasets from different modalities has many challenges, CCAN still has excellent performance compared with other existing methods (Figure 5a).We assessed the performance of CCAN using scRNA-seq data from CITE-seq and scATAC-seq data from ASAP-seq.The CITE-seq generates filtered scRNA-seq data based on the condition of mitochondrial reads greater than 10%, a number of expressed genes fewer than 500, and total number of UMI fewer than 1000.For the scATAC-seq data from ASAP-seq, we filtered out cells with a number of peaks more than 1 00 000 and calculated the gene activity matrices for scATAC-seq data using Signac. [34]After that, 17668 overlapped genes are selected as the input features of CCAN.In comparison to the state before integration, CCAN effectively integrates data from two distinct modalities, leading to the clear identification of seven cell types in the integrated dataset.CCAN is comparable to Seurat v4, GLUE, SCALEX, and online_iNMF (Figure 5a).Although the integration performance of CCAN is slightly inferior to scJoint, its label transferring exhibits higher accuracy in identifying cell types in scATAC-seq data when compared to scJoint and Seurat v4.This is evident in the superior accuracy, F1 score, precision, and recall values of CCAN (Figure 5b; Text S1, Supporting Information).In terms of predicted cell types for scATAC-seq data, CCAN's results closely align with the golden standard, indicating a high level of accuracy (Figure 5c).This suggests that CCAN achieves a wellbalanced integration of data and label transferring, allowing it to accurately predict cell types for scATAC-seq data without compromising the performance of data integration for both modalities.

Trajectory Inference and Pseudotime Analysis on the Integrated scRNA-Seq Data and scATAC-Seq Data
We used the human hematopoiesis dataset [23] to evaluate the performance of the integration of differentiated datasets.The human hematopoiesis dataset profiles the chromatin accessibility and gene-expression data of single cells that undergo a differentiation path from hematopoietic stem cells (HSC) dividing into branches.One branch differentiates into plasmacytoid dendritic cells (pDC), the other goes through common myeloid progenitor (CMP) and differentiates into megakaryocyte erythroid progenitor (MEP) and granulocyte-monocyte progenitors (GMP).Granulocyte-monocyte progenitors (GMP) can further differentiate into Monocyte (mono) cells (Figure 6a).The integration of CCAN effectively merged the majority of cells from both scRNAseq and scATAC-seq data within the embedding space.However, a small subset of cells from the scATAC-seq data remained distinct.This observation shows CCAN's capability to mitigate modality-specific variations (Figure 6b). Figure 6c shows the visualization of inferred trajectories of the hematopoiesis stem cell differentiation process based on the joint embedding of scRNAseq and scATAC-seq data.Cells are colored with ground-truth cell types.Lineage 1 and Lineage 2 were inferred and smoothed using Slingshot. [35]In the inferred Lineage 1 of the hematopoiesis dataset, a clear trajectory emerges, with hematopoietic stem cells (HSC) transitioning through the common myeloid progenitor (CMP) stage and further differentiating into megakaryocyte erythroid progenitor (MEP) cells.Along Lineage 2, HSCs follow a similar path through the CMP stage, eventually leading to a mixed cluster comprising granulocyte-monocyte progenitors (GMP) and Monocyte (mono) cells.Notably, toward the end of Lineage 2, we observe a trend wherein GMP cells transition toward plasmacytoid dendritic cell (pDC) differentiation in the integrated data.The literature [36] provides evidence suggesting that GMP cells undergo a series of differentiation steps, including differentiation into earlier progenitor cells, followed by further differentiation into various types of immune cells such as dendritic cells.This finding potentially elucidates the observed trajectory from GMP cells to pDC cells.Based on the cell type annotations on the integrated HSC dataset, we can distinctly observe a lineage progression from hematopoietic stem cells (HSC) to plasmacytoid dendritic cells (pDC), a trajectory that is not readily inferred by Slingshot.For a more intuitive representation, we manually added this lineage (referred to as Lineage 3) to highlight this ccRemover.d) Heatmap of variable genes in principle components of PBMCs in two scRNA-seq data.Only the principal components related to cell cycle genes among the top ten principal components are shown.Cell-cycle genes are marked in red boxes.e) UMAP visualization of integrated PBMC data after removal of cell cycle effects using CCAN, Seurat, Cyclum, and ccRemover.Top panels are colored by annotated cell cycle and bottom panels are colored by annotated cell types.developmental pathway.Pseudotime analysis [37] based on the inferred trajectories confirms the ability of our model to integrate differentiated scRNA-seq data and scATAC-seq data (Figure 6d).
Differential gene analysis can identify differences in gene expression between different cell types during the differentiation process.By calculating the Pearson correlation between gene expression and the inferred cell differentiation pseudotime, we found the nine genes most related to pseudotime, are KLF1, IRF8, SPIB, IRF7, IRF4, ZNF683, STAT2, PRDM1, and BLC11A (Figure 6e).These genes are called transition genes. [38]They are highly expressed in leaf node cells on the cell differentiation trajectory (Figure S12, Supporting Information).According to some published literature studies, the expression of KLF1 is limited to erythrocytes and megakaryocytic-erythroid progenitor cells MEP; [39] IRF8, IRF7, IRF4, STAT2, PRDM1, and BLC11A are all marker genes for plasmacytoid dendritic cells and are highly expressed in pDCs. [40]In addition, we also used DESeq2 [41] to identify differentially expressed genes between different cell types.As shown in Figure S13 (Supporting Information), HOXA6, PRDM16, IRF8, GATA1, CEBPD, and CEBPE are differentially expressed genes in hematopoietic stem cells, common myeloid progenitor cells, common myeloid progenitors, plasmacytoid dendritic cells, megakaryocyte-erythroid progenitors, monocytes, and granulocyte-monocyte progenitors, respectively.40e,42] Enrichment analysis based on differential genes across all cell types showed that these differential genes were concentrated in pathways related to hematopoietic stem cell differentiation (Figure 6h).Differential gene analysis further confirmed the effectiveness and accuracy of CCAN in integrating differentiation data with multimodalities.

Discussion
Data integration of single-cell multi-omics has enhanced our investigation of cell functions and internal regulatory mechanisms beyond single omics viewpoints.However, single-cell multi-omics integration has numerous challenges, including issues such as omics-variance, sparsity, cell heterogeneity, and confounding factors.The cell cycle confounders in scRNA-seq data inspired us to think about the cell cycle effects in the integration of multi-omics data of cells, especially the integration of scRNA-seq and scATAC-seq data.In this study, we developed CCAN, a cell cycle-aware network for data integration of scRNAseq and scATAC-seq data and label transferring from scRNA-seq to scATAC-seq.CCAN is based on a domain separation network, which includes periodic activation functions (sine and cosine) in the private autoencoder to simulate and remove cell cycle effects from the integrated single-cell multi-omics data, and projects single-cell data from different platforms or different omics into a common low-dimensional space through shared autoencoder to integrate while maintaining heterogeneity between cell types.The domain adaptive network solves the problem of inconsistent distribution between single-cell data from different platforms or different omics.The class alignment loss is added to the hidden layer of the domain adaptive network to enhance the differences between different cell types in the integrated data.At the same time, by introducing sine and cosine activation functions into the network, the impact of the cell cycle on cell type heterogeneity can be eliminated while effectively integrating single-cell multiomics data, further improving the performance of data integration.
The design of the cell cycle-aware module within CCAN was inspired by Cyclum.The cell cycle-aware module within CCAN shares similarities with Cyclum's circular component.Both modules leverage nonlinear periodic functions to emulate the circular trajectory of the cell cycle.However, a key distinction lies in CCAN's integration of an orthogonal loss, absent in Cyclum.This additional loss function facilitates the separation of features between the private and shared embeddings, a crucial factor contributing to CCAN's superior performance in cell cycle effect removal.Furthermore, CCAN integrates the cell cycle module into the private decoder of a domain separation network, enhancing its adaptability to multi-omics data integration tasks.Conversely, Cyclum embeds the circular component within an autoencoder framework, thereby limiting its use in the integration of singlecell multi-omics data.
Through comprehensive downstream analyses across diverse data integration scenarios, it has been demonstrated that CCAN (Cell Cycle-Aware Network) possesses the capability to not only mitigate batch effects and cell cycle effects in single-cell RNA sequencing (scRNA-seq) data originating from different platforms but also seamlessly integrate both paired and unpaired scRNAseq data and single-cell ATAC-seq (scATAC-seq) data.The integration of unpaired scRNA-seq data and scATAC-seq data is particularly noteworthy, as CCAN facilitates accurate cell type prediction for scATAC-seq data by leveraging the transformation of annotation information gleaned from scRNA-seq data.This unique feature enhances the utility of CCAN in deciphering cellular heterogeneity and functional states across diverse data modalities.Furthermore, CCAN exhibits remarkable versatility by not only integrating differentiated data from various modalities but also capturing the intricacies of cell differentiation trajectories.This is exemplified in its ability to characterize the differentiation process from hematopoietic stem cells (HSC) to different branches, providing a holistic understanding of cell fate determination in the context of integrated differentiation data.In essence, CCAN emerges as a powerful tool for unraveling the complexities of cellular dynamics across heterogeneous datasets.
In CCAN, we use a domain separation network that requires the source domain data and the target domain data to have the same number of features.The label transferring in the domain separation network limits CCAN to be used only when the data in the source domain and the target domain have the same cell types.This limitation is not prone to over-correction, which is likely to occur especially when integrating collections of datasets with considerable differences in cellular composition.Currently, there are many methods for analyzing and removing cell cycle effects based on scRNA-seq data, but there are few methods for cell cycle analysis based on scATAC-seq data.Therefore, in this manuscript, the cell cycle effects analysis was focused on the integration of scRNA-seq data from different platforms.When integrating scRNA-seq and scATAC-seq data, we can regard the private-encoder embedding as a confounding factor that affects data integration and cell classification.Besides, we only used single-cell datasets with two modalities, including gene expression profiling from scRNA-seq data and chromatin accessibility from scATAC-seq data.However, with the ongoing advancements in high-throughput single-cell sequencing technologies, the accessibility to analyze various molecular components such as DNA, mRNA, and proteins at a single-cell resolution has expanded significantly.Recognizing the potential for a more nuanced understanding through the integration of diverse omics data types, we envision extending the capabilities of CCAN in future development.The objective is to broaden CCAN's scope to encompass the integration and comprehensive analysis of a more extensive array of single-cell omics data, thereby facilitating diverse analytical objectives.

Experimental Section
Basic Structure of Domain Separation Network in CCAN: Domain separation network (DSN) is a specific type of neural network architecture used for domain adaptation, it was designed based on the labeled sourcedomain data and unlabeled target-domain data. [43]In CCAN, sourcedomain data is gene expression profiles of scRNA-seq data and targetdomain data can be same/different single-cell omics data, such as gene expression of scRNA-seq data from different protocols, chromatin accessibility of scATAC-seq data, etc.In the basic structure of DSN, it contains one shared encoder and two domain-specific private encoders to extract the common and private representations from input data.The common and private components of the same domain should be totally split to make sure the independence of these parts.A shared decoder reconstructs the input domain by cascading the shared and private embeddings.Given a labeled dataset in a source domain and an unlabeled dataset in a target domain, cell type classification was mainly used as the cross-domain task, training on the shared embedding from the source domain that generalizes to the target domain, which requires the high invariance between shared embeddings of source and target domains.To achieve this goal, alignment was considered in the embedding space to eliminate differences between domains.Objectively, DSN is a model that produces a shared representation that is similar for both domains and a private representation that is different and transfers the classification label from source domain to target domain.
Cell Cycle-Aware Module in CCAN: The cell cycle has been recognized as a confounding factor in the analysis of cell type-dependent processes.In CCAN, to achieve better cell type transfer from the source domain to the target domain, it is necessary not only to account for differences in the shared embeddings between different domains, but also to remove cell cycle effects from the shared representations.Considering the mutual independence between shared and private components in DSN, a cell cycle-aware module was introduced in the private part of DSN to characterize the dynamic process of cell cycle.The high degree of differences between shared and private components can make shared components out of the influence of private components, that is, the prediction of cell types based on shared components can eliminate the effect of cell cycle modeling based on private components.Taking the source data as X s and the target data as X t , the objective of the cell cycle-aware module is to infer the cell cycle pseudotime Z •p for cells from their corresponding profiles X s or X t .DSN is an autoencoder-based network, in the encoder, a standard multi-layer perceptron with hyperbolic tangent activation functions was used in the private part (also as circular part) and selu activation functions in the shared part (also as acyclic part).The encoder of source data X s is as below where W's and b's are the weight matrices and bias vectors of the encoder.Z sp and Z ss are private embedding and shared embedding of the source domain.Z s represents the cascade of Z sp and Z ss in the hidden layer.In the decoder, cosine and sine were used as the activation functions in the first layer, followed by two layers performing linear transformations.The reconstruction of source data XS can be represented mathematically as where V's are the weight matrices of the decoder and XS is the reconstructed matrix of source domain generated by the decoder.The target domain has the same autoencoder structure of cell cycle-aware module in the private part as follows.
where Z tp and Z ts are private embedding and shared embedding of the target domain.Z t represents the cascade of Z tp and Z ts in the hidden layer.Xt is the reconstructed matrix of target domain.

Algorithm Design and Training of CCAN:
CCAN uses a domain separation network (DSN) to integrate data from source and target domains and transfer the annotations from source domain to target domain.Shared encoder and private encoders in DSN are three-layer perceptrons to learn circular and acyclic embeddings of both modalities, separately.The shared embedding function projects a high-dimensional profile of each cell to a low-dimensional vector, which distinguishes biological meaningful signals from circular confounding factors (private embedding) and transforms the embeddings of cells from different domains into a similar distribution.In the decoder, sine and cosine were used as the activation functions specific for private embeddings, followed by a two-layer perceptron performing noncircular transformations mapping the embedded data to the original space.CCAN used the labeled transcriptomic profile of scRNA-seq data (source domain) and unlabeled profile from same/different omics data (target domain) as input.Taking scRNA-seq data as source domain and scATAC-seq data as target domain as an example, the training of CCAN has four main steps: Step 1: Pretraining of the cell cycle-aware domain separation network.A cell-cycle aware domain separation network was used to perform joint embedding and modality alignment in a common embedding space through a multi-objective loss .First, CCAN trains the source and target autoencoders by minimizing the data reconstruction error.
where N s and N t are number of cells in source and target data, X s and X t are input source and target data, Xs and Xt are the decoder-reconstructed data.Second, it learns shared signals between the scRNA-seq data and scATAC-seq data as well as private signals that are unique to the scRNAseq data and scATAC-seq data.CCAN applies an orthogonal constrain  diff to push the features of the shared and the private embeddings apart from each other where Z ss and Z ts are the embedded source and target data from shared encoder, Z sp and Z tp are the embedded data from domain-specific private encoders.The rationale is to disentangle biological signals specific for cell type heterogeneity from cell cycle confounders.Third, CCAN regularizes the embeddings of scRNA-seq data and scATAC-seq data to make their distributions to be similar.Aligning the distributions across cells from different domains in the shared embedding space can alleviate the out-ofdistribution problem between different modalities.The Maximum Mean Discrepancy loss  MMD was used as the alignment loss  align to align the distribution of the scRNA-seq data and scATAC-seq data where k represents the kernel function.Finally, after the unsupervised pretraining of the autoencoder, a supervised cell type classification model can be trained from the aligned shared embedding using the labeled scRNAseq data, refer as a cross-entropy classification loss  class where y i is ground-truth cell type label of scRNA-seq data and p i is the predicted probability generated by the cell type classifier.In summary, the total loss in the pretraining of CCAN is Multiple losses will be given different weights during training.
Step 2: Label transferring from source domain to target domain.After pretraining of CCAN, the trained cell type classification network is applied to the unlabeled scATAC-seq data, transferring the cell type information of scRNA-seq data to annotate the scATAC-seq data, regarded as label transferring.CCAN makes it possible to remove circular confounders while performing accurate annotations for scATAC-seq data.After the label transferring in step 2, a pseudo-label of the scATAC-seq data was obtained.
Step 3: Refining CCAN by introducing a cluster alignment loss.The joint embedding and label prediction performance was improved using the ground truth of scRNA-seq data and the pseudo-label of the scATAC-seq data.A cluster alignment loss  ca was added to refine the neural networks in CCAN.))] where m is the distance threshold,  s and  t are centroids of embedded source and target data, K is the number of cell types. ca is a classconditional loss that forces the features from the same class to concentrate together and the features from different classes to be separated.In addition, it introduces a conditional feature matching loss to improve the alignment between two domains that aligns the clusters which correspond to the same class but come from different domains.Thus, the updated alignment loss is Step 4: Finalization of CCAN model.The last step generates the joint profile of two modalities and finalizes the annotation of scATAC-seq data.Besides, more downstream analyses can be operated on the joint embedding profile of scRNA-seq and scATAC-seq data when the shared embedding of scRNA-seq and scATAC-seq was cascaded.
Balanced Mini-Batch Training for Cluster Alignment: In the refining step, a cluster alignment loss in CCAN was introduced, which challenges the choice of batch size in neural network training.The batch size selection in the pretraining step is no longer suitable for the refining step to calculate the cluster alignment loss, because it cannot guarantee that each batch size of data can contain all cell types, and may cause extreme imbalance of batch-size data, which will affect the performance of class alignment.To overcome this problem, a balanced mini-batch training in the optimization of the cluster alignment loss was applied that can virtually balance the class ratio of training samples in CCAN.The numbers of samples for each class in a mini-batch are restricted to be the same.This method does not modify or discard cells in the input data, so it can avoid oversampling and under-sampling problems while improving cluster alignment performance, achieving a better integration of single-cell multi-omics data.
CCAN Parameters: By default, the following parameters were set for CCAN: batch size: 64, the hidden-layer dimensions in the encoder: [512, 256], the hidden-layer dimensions in the classifier: [32, 16], the dimension of latent space: 64, learning rate: 1e-3, number of training epochs: 1000.Parameters are optimized via grid search and may vary based on input data.
Datasets: CCAN is an effective tool for multitasking.In order to verify the performance of CCAN in different application scenarios, several real single-cell datasets were used (Table S1, Supporting Information), including scRNA-seq data of peripheral blood mononuclear cells from different protocols, [15c] breast cancer single-cell dataset, [33] paired scRNAseq and scATAC-seq data generated from SNARE-seq technology, [12c] unpaired scRNA-seq and scATAC-seq data of different cells from the same tissue [22] and single-cell multi-omics data of human hematopoietic differentiation. [23,44]Virtual datasets were also generated based on the real mESC data, as shown below [28] : 1) scRNA-seq data of PBMCs: The scRNA-seq data of PBMCs consisted of two scRNA-seq data, each assayed on the Chromium 10X platform, but using different library construction protocols: 3′ end v1 (3pV1) and 3′ end v2 (3pV2) chemistries.This dataset was pre-processed following the vignette in Seurat (https://satijalab.org/seurat/articles/pbmc3k_tutorial).The 3pV1 scRNA-seq data has 5356 cells after data pre-processing and the 3pV2 scRNA-seq data has 8806 cells after data pre-processing.

Figure 1 .
Figure1.Overview of CCAN approach.CCAN is based on a domain separation network, taking labeled scRNA-seq data as source domain and unlabeled scRNA-seq or scATAC-seq data as target domain.Both source and target domain are fed into a shared encoder and two private encoders.The shared encoder learns the common information from both domains and employs an alignment loss to constrain the distribution of source and target domain to be similar in the shared embedding space.Private encoders are specific to source or target domain to extract periodic information of the cell cycle effect.An orthogonal difference loss between the shared embedding and private embedding enforces features learned by the shared encoder and the private encoder to be as different as possible in the low-dimensional space.Both source and target domain have the same structure of decoder by contacting the shared embedding and private embedding to reconstruct the original data.A supervised classifier is trained on the labeled scRNA-seq data and can be used to predict the label of target domain.CCAN has four main training steps and can be applied to multi-tasks, including batch effect removal, single-cell multi-omics data integration, cell cycle effect removal, cell cycle prediction, label transferring and trajectory analysis, etc.

Figure 2 .
Figure 2. Annotation of PBMCs and batch effect removal of scRNA-seq datasets from different protocols.a) UMAP visualization of two single-cell PBMC data from different protocols.Different colors represent different clusters using Louvain clustering method.b) Annotation of different clusters of two PBMC data.Different cell clusters were annotated into six major cell types using marker genes.c) UMAP visualization of two single-cell PBMC data

Figure 3 .
Figure 3. Cell cycle identification and cell cycle effect removal of virtual mESC data and single-cell PBMC data.a) Bar plot of four clustering metrics including Accuracy RI, ARI and NMI of different methods in evaluating the performance of cell cycle identification.Different methods were represented using different colors.RI, Rand Index; ARI, Adjusted Rand Index; NMI, Normalized Mutual Information.b) UMAP visualization of integrated mESC data before and after removing cell cycle effects using CCAN, Seurat, Cyclum and ccRemover.Top panels are colored by subclones and bottom panels are colored by cell cycle groundtruth.c) The separability of mESC data before and after removing cell cycle effect using CCAN, Seurat, Cyclum and

Figure 4 .
Figure 4. Data integration of paired datasets of scRNA-seq data and scATAC-seq data from adult mouse brain using SNARE-seq technology.a) UMAP visualization of integrated SNARE-seq data using CCAN, scMVP, scAI, scMVAE, Seurat v4, scJoint, GLUE, SCALEX, and online_iNMF, respectively.Different colors represent different cell types annotated by Seurat.b) Bar plot of three clustering metrics including RI, ARI, and NMI of different methods.Different metrics were represented using different colors.RI, Rand Index; ARI, Adjusted Rand Index; NMI, Normalized Mutual Information.

Figure 5 .
Figure 5. Integration of unpaired datasets and label transferring from scRNA-seq data to scATAC-seq data.a) UMAP visualization of unpaired scRNA-seq data and scATAC-seq data before and after integration using CCAN, Seurat V4, scJoint, GLUE, SCALEX, and online_iNMF, respectively.Different colors in the upper panel represent different modalities and different colors in the lower panel represent different cell types.b) Label prediction comparison of CCAN with scJoint and Seurat V4 using four clustering metrics including Accuracy, F1 score, Precision and Recall.In multi-class prediction evaluation, Precision, Recall and F1 score represent Macro-Precision, Macro-Recall and Macro-F1 score, separately.Different methods are represented using different colors.c) UMAP visualization of scATAC-seq data labeled with ground-truth cell types and CCAN-predicted cell types, respectively.Different colors represent different cell types.

Figure 6 .
Figure 6.Trajectory inference and pseudotime analysis on the integrated scRNA-seq data and scATAC-seq data.a) Reference of human hematopoietic differentiation process from literature.hematopoietic stem cells (HSC) devided into branches, one branch differentiates in to plasmacytoid dendritic cells (pDC), the other goes through common myeloid progenitor (CMP) and differentiates into two different cell types, including megakaryocyte erythroid progenitor (MEP) and granulocyte-monocyte progenitors (GMP).Granulocyte-monocyte progenitors (GMP) can further differentiate into Monocyte (mono) cells.b) UMAP visualization of integrated scRNA-seq data and scATAC-seq data colored by different modalities.c) UMAP visualization of integrated cells along the inferred trajectory, cells are colored by different cell types.Lineage 1 and Lineage 2 were inferred and smoothed using Slingshot, Lineage 3 was manually added for a more intuitive representation.d) UMAP visualization of integrated scRNA-seq data and scATAC-seq data colored by differentiation pseudotime.e) scatter plot of gene expression along the pseodotime of nine highly-correlated genes.f) bubble plot of enriched pathways of differentially expressed genes.