DeCOOC Deconvoluted Hi‐C Map Characterizes the Chromatin Architecture of Cells in Physiologically Distinctive Tissues

Abstract Deciphering variations in chromosome conformations based on bulk three‐dimensional (3D) genomic data from heterogeneous tissues is key to understanding cell‐type specific genome architecture and dynamics. Surprisingly, computational deconvolution methods for high‐throughput chromosome conformation capture (Hi‐C) data remain very rare in the literature. Here, a deep convolutional neural network (CNN), deconvolve bulk Hi‐C data (deCOOC), is developed that remarkably outperforms all the state‐of‐the‐art tools in the deconvolution task. Interestingly, it is noticed that the chromatin accessibility or the Hi‐C contact frequency alone is insufficient to explain the power of deCOOC, suggesting the existence of a latent embedded layer of information pertaining to the cell type specific 3D genome architecture. By applying deCOOC to in‐house‐generated bulk Hi‐C data from visceral and subcutaneous adipose tissues, it is found that the characteristic chromatin features of M2 cells in the two anatomical loci are distinctively bound to different physiological functionalities. Taken together, deCOOC is both a reliable Hi‐C data deconvolution method and a powerful tool for functional extraction of 3D genome architecture.


Introduction
DOI: 10.1002/advs.202301058

The high-throughput chromosome conformation capture (Hi-C) [1] technique and its variants have greatly broadened our [...] [4][5] Single-cell Hi-C has further advanced our understanding of the dynamics of cells in a cell population. [6] It has now been widely acknowledged that 3D genome architecture is dynamic and varies substantially from cell to cell. [7] Thus, the bulk Hi-C map of tissue samples may merely represent an average profile over all the complicated cell types it comprises, making it necessary to characterize the 3D genome of each cell type therein. For example, distinguishing the key structural alterations that occur only in cancer cells, yet are seen in the bulk Hi-C map of a tumor sample, may be essential in identifying driver mutations in both cancer research and clinical applications. However, sorting cells from solid tissues into well-defined cell types remains challenging in most cases, [8] leaving bulk sequencing as the only applicable experimental option.

11] To the best of our knowledge, except for Thunder, [12] we have yet to find other algorithms published for Hi-C data deconvolution. Thunder requires a list of predefined marker genes specifically expressed in cell types, that is, cell type-specific genes, while such gene lists are not always available. Moreover, the accuracy of Thunder failed to compete with that of transcriptome-oriented methods (see the comparative assessment in this work).

A long list of deconvolution algorithms for transcriptome data has already been published in the literature. [9,13] Essentially, most transcriptome-oriented methods assume that the expression levels measured in bulk samples are a mixture, mostly linear, of cell type-specific expression. [9] One pioneering work proposed that the mixture profile may be decomposed into the product of two submatrices, representing cell type-specific expression profiles and cellular composition, by non-negative least squares. [14] Later, more data decomposition-based algorithms, such as the non-negative matrix factorization (NMF)-based methods ssKL [15] and DSA, [16] and the latent Dirichlet allocation-based CDSeq, [17] also appeared. Deconvolution methods were used to estimate the proportion of each cell type with given bulk and cell type-specific expression profiles, for example, non-negative least squares (NNLS), [18,19] the robust regression FARDEEP, [20] the ν-support vector regression (ν-SVR) CIBERSORT (CS), [21] DeconRNASeq with quadratic programming, [22] and dtangle [23] with input data in logarithmic scale. Last, one may also predict the proportion of cell types using cell type-specific genes, for example, with ssKL and DSA. [15,16] Our assessment suggested that direct application of these transcriptome-oriented methods to Hi-C data may not be reliable (see Section 2); thus, accurately deconvoluting bulk Hi-C data remains challenging.
Here, we present a convolutional neural network (CNN)-based algorithm to deconvolve bulk Hi-C data (deCOOC) to estimate the proportions of cell types in the sample. By comparison with state-of-the-art transcriptome-oriented tools and Thunder, we demonstrated that deCOOC remarkably outperformed all competitors. Finally, by applying deCOOC, we compared adipose samples between the upper layer of backfat (ULB) and the greater omentum (GOM). Intriguingly, we found that the characteristic chromatin interactions indicated by deCOOC in the two tissues were directly associated with their physiology. Thus, Hi-C deconvolution could be a powerful tool for the functional extraction of 3D genome architecture.

A Neural Network-Based Bulk Hi-C Matrix Deconvolution Model: deCOOC
We developed a CNN-based model, named deCOOC, to infer the proportions of cell types from bulk Hi-C data. Briefly, deCOOC consists of four convolution layers, three pooling layers, two fully connected layers, and a final output layer (Figure 1A). The deCOOC takes KR (Knight-Ruiz) [24] normalized bulk Hi-C data as input and outputs the predicted cell composition proportions. Considering the ultrasparse nature of the off-diagonal region in the Hi-C matrix [25] and the variations in chromosome lengths, the input data were formed with only submatrices in the diagonal region, instead of using the full matrix. That is, for a given chromosome, 30 × 30 submatrices were collected along the diagonal of the Hi-C matrix with sliding windows (Figure 1B). To avoid plausible variations that might exist between chromosomes, and to hedge against the selection of an inappropriate chromosome (Figure S2, Supporting Information), the final input data for deCOOC were the two concatenated submatrices collected from two different chromosomes. The detailed model description, training process, and parameter selection can be found in the Experimental Section.
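The input construction described above can be illustrated with a minimal sketch. It assumes a dense, already normalized contact matrix per chromosome; the function names, the non-overlapping step of 30 bins, and the side-by-side concatenation layout are illustrative assumptions, not details confirmed in this section.

```python
def diagonal_windows(matrix, size=30, step=30):
    """Collect size x size submatrices along the main diagonal."""
    n_bins = len(matrix)
    windows = []
    for start in range(0, n_bins - size + 1, step):
        rows = matrix[start:start + size]
        windows.append([row[start:start + size] for row in rows])
    return windows


def concatenate_pair(window_a, window_b):
    """Place two equally sized windows side by side (assumed layout)."""
    return [row_a + row_b for row_a, row_b in zip(window_a, window_b)]


# Toy stand-ins for two normalized chromosome matrices (90 and 60 bins).
chrom_a = [[1.0 / (1 + abs(i - j)) for j in range(90)] for i in range(90)]
chrom_b = [[1.0 / (1 + abs(i - j)) for j in range(60)] for i in range(60)]

windows_a = diagonal_windows(chrom_a)
windows_b = diagonal_windows(chrom_b)
model_input = concatenate_pair(windows_a[0], windows_b[0])  # 30 rows x 60 columns
```

Restricting the windows to the diagonal keeps every input the same shape regardless of chromosome length, which is what allows windows from two different chromosomes to be paired.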
To train and test deCOOC, we synthesized training and testing datasets from public single-cell Hi-C datasets and in-house bulk Hi-C of purified cell line samples. The synthesized data were composed of mixed cell populations with numerous cell composition proportions, obtained by randomly sampling from the single cells or purified cell line samples (Figure S1, Supporting Information and Experimental Section). The deCOOC is robust to parameter settings. For example, the size, sliding steps, or coordinates of the two matrices picked from the two chromosomes do not substantially affect performance (Figure S3A,B, Supporting Information). In the present work, chromosomes 9 and 11 were randomly taken as examples. However, the choice of chromosome has little effect on performance, as similar results could be seen with other random combinations of chromosomes (Table S1 and Figure S4, Supporting Information).
To assess deCOOC, we compared it with various state-of-the-art deconvolution algorithms. In addition to Thunder, the sole Hi-C deconvolution tool published, [12] we also compared eight transcriptome data-oriented methods, that is, CS, [21] CDSeq, [17] DeconRNASeq, [22] DSA, [16] dtangle, [23] FARDEEP, [20] NNLS, [19] and ssKL. [26] Some tools can take bulk Hi-C data as input directly, for example, CDSeq, while others require additional cell type-specific genes, for example, DSA and ssKL. To adapt these tools to Hi-C data, we constructed chromatin interaction profiles (CIPs) to mimic the transcriptome data. Differential interaction regions were identified to mimic cell type-specific gene markers (Figure S6, Supporting Information). The performance of deCOOC was quantified by the root mean square error (RMSE) and a modified version of Lin's concordance correlation coefficient (CCC, see Experimental Section). [27] We assessed deCOOC with two single-cell Hi-C datasets, that is, mouse cell cycle data (denoted as mCC) [7] and human frontal cortex datasets (denoted as HFC), [6] as well as in-house-generated bulk Hi-C data from pig fat tissue cell lines.
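For orientation, the two metrics can be sketched as follows. The CCC shown here is the standard form of Lin's coefficient computed with population moments; the modified version used in this work is described in the Experimental Section and is not reproduced here. Function names and example values are illustrative.

```python
import math


def rmse(truth, pred):
    """Root mean square error between true and predicted fractions."""
    n = len(truth)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / n)


def lin_ccc(truth, pred):
    """Standard Lin's CCC: 2*cov / (var_t + var_p + (mean_t - mean_p)^2).

    Undefined (division by zero) when both inputs are constant and equal.
    """
    n = len(truth)
    mean_t = sum(truth) / n
    mean_p = sum(pred) / n
    var_t = sum((t - mean_t) ** 2 for t in truth) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, pred)) / n
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


truth = [0.1, 0.2, 0.3, 0.4]
shifted = [0.2, 0.3, 0.4, 0.5]  # perfectly correlated but biased by +0.1

error = rmse(truth, shifted)
concordance = lin_ccc(truth, shifted)
```

Unlike Pearson's correlation, the CCC penalizes the constant offset in `shifted`, so a predictor that ranks cell types correctly but misestimates their absolute proportions does not score as perfect.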

The deCOOC Robustly Deconvolutes a Hi-C Map of Synthesized Bulk Samples with Simple Cell Composition
We assessed deCOOC with synthesized bulk Hi-C data from the experimental single-cell Hi-C dataset mCC, which contains only four cell cycle phases, that is, G1, early-S, later S-G2, and M, [7] so that all 11 combinations of two or more cell stages can be fully enumerated. We synthesized 100 samples for each combination with randomly generated proportions for each cell stage, resulting in 1100 samples. A sample is a cell population of 1000 single cells randomly sampled from mCC, representing a synthesized bulk Hi-C library. Two-thirds and one-third of the 1100 samples were taken as training and testing data, respectively. Performance is reported as the average from fivefold cross-validation. The deCOOC achieved the lowest average RMSE and the highest average CCC, followed by ssKL, CDSeq, and CS (Figure 2A). The predictions by deCOOC were nearly identical to the ground-truth proportions, as it yielded an average RMSE less than 0.008 and CCC larger than 0.990, while the average RMSE and CCC were larger than 0.10 and smaller than 0.54, respectively, for ssKL, CDSeq, and CS alike. Similar results were also observed for cell stage-specific performance (Figure 2B). Although the other tools generated relatively better CCC (≈0.90-0.94) in G1, S, and G2, it remained smaller than 0.83 in the M stage.
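A minimal sketch of how one such synthesized bulk library could be assembled is given below. It assumes each single cell is stored as a mapping from bin pairs to contact counts and that cells are drawn with replacement; both choices, and all names, are illustrative assumptions rather than the procedure actually used.

```python
import random
from collections import Counter


def cells_per_type(fractions, n_cells=1000):
    """Turn fractions into integer cell counts that sum to n_cells."""
    counts = {name: int(frac * n_cells) for name, frac in fractions.items()}
    # Hand out cells lost to truncation, largest remainder first.
    leftover = n_cells - sum(counts.values())
    by_remainder = sorted(
        fractions, key=lambda k: fractions[k] * n_cells - counts[k], reverse=True
    )
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


def synthesize_bulk(cells_by_type, fractions, n_cells=1000, rng=None):
    """Sum the contact maps of randomly drawn cells into one bulk map."""
    rng = rng or random.Random()
    bulk = Counter()
    for name, k in cells_per_type(fractions, n_cells).items():
        for cell in rng.choices(cells_by_type[name], k=k):
            bulk.update(cell)
    return bulk


# Toy data: one contact per cell, in a stage-specific bin pair.
toy_cells = {
    "G1": [Counter({(0, 1): 1})],
    "M": [Counter({(2, 3): 1})],
}
bulk = synthesize_bulk(toy_cells, {"G1": 0.25, "M": 0.75}, rng=random.Random(0))
```

The known `fractions` passed in serve as the ground-truth labels against which predicted proportions are later scored.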
Next, we assessed how performance depends on data volume by calculating Pearson's correlation between accuracy and the number of contacts in each dataset for deCOOC, ssKL, CDSeq, and CS (Figure 2C). We found that the CCCs were independent of data resolution for all four tools. For RMSE, independence was seen only in deCOOC and ssKL, while weak, but significant, correlations were observed in CS and CDSeq (Figure 2C); such dependencies can also be seen in other tools, for example, Thunder and DSA (Figure S7, Supporting Information). Together, deCOOC is the only tool that predicts cell composition from synthesized bulk Hi-C from mCC data robustly and accurately.

The deCOOC Robustly Deconvolutes the Hi-C Map of Synthesized Bulk Samples with Underrepresented Training Data
We assessed the tools with publicly available single-cell Hi-C data on the human frontal cortex (HFC), which contains 14 cortical cell types, as an example, [6] to test whether deCOOC can infer a heterogeneous cell population. As the number of combinations of all 14 cell types (16354, n ≥ 2) is too large to be simulated with limited computational resources, HFC is a perfect dataset to assess the generalization ability of deCOOC. In our experiment, we generated 1300 samples with 892 combinations. After removing 14 samples with only one cell type, we obtained 1286 samples, representing 5.5% of the whole combination space (892 out of 16354, Table S2, Supporting Information, see Experimental Section).
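The size of the combination space can be checked by counting the subsets of cell types that contain at least two members. The helper below is an illustrative sketch; its name is hypothetical.

```python
from math import comb


def n_combinations(n_types, min_size=2):
    """Number of subsets of n_types cell types with at least min_size members."""
    return sum(comb(n_types, k) for k in range(min_size, n_types + 1))


mcc_space = n_combinations(4)    # four cell cycle stages
hfc_space = n_combinations(14)   # fourteen cortical cell types
coverage = 892 / hfc_space       # share of the space covered by 892 combinations
```

For four stages this gives the 11 combinations used with mCC. For 14 cell types a direct count gives 16 369, slightly more than the 16 354 quoted above; the difference may reflect exclusions not detailed in this section, and the sampled coverage is about 5.4 to 5.5% under either figure.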
The deCOOC performed well in these synthesized heterogeneous cell populations. The deCOOC obtained the highest average CCC and the lowest average RMSE, followed by CS and DeconRNASeq (Figure 3A). Notably, except for deCOOC, the order of performance was substantially different between mCC and HFC for all tools. For example, ssKL and CDSeq were the top two tools with mCC data, while the top two were CS and DeconRNASeq with HFC. Next, we assessed the accuracy of the prediction for each cell type. The deCOOC achieved nearly optimal predictive power for all cell types in terms of CCC and RMSE (Figure 3B), followed, again, by CS and DeconRNASeq. For robustness results, all tested tools showed little dependency on data resolution (Figure 3C and Figure S8, Supporting Information). Given the correlation observed in mCC data, we think that those transcriptome-based methods could only occasionally be affected by sequencing depth.

Figure 2. deCOOC performs better (lower RMSE and higher CCC [27]) on simulated mouse data than other methods. A) Boxplots of RMSE and CCC values over all test bulk samples from deCOOC and other deconvolution algorithms for the simulated mCC test dataset. B) Lineplots of RMSE and CCC values for each cell type. Each symbol represents the RMSE or CCC value between ground-truth and predicted cell fractions for one cell type. C) Scatterplots of RMSE (CCC in bottom row) values and the number of Hi-C contacts for simulated mouse data with deCOOC, ssKL, CDSeq, and CS. Pearson correlation coefficients and p values are given above the plots. Low RMSE and high CCC values represent good prediction performance of the method. For all algorithms, the number of test samples n = 363.
Next, we assessed the generalization ability of deCOOC, that is, how well it works with unseen combinations far from the training samples in the cell type combination space. Underrepresentative sampling may introduce local combinatory bias caused by stochastic fluctuation. To obtain such distal samples, we employed Euclidean distance to define the distance between samples. By randomly generating combinations with random cell proportions, we calculated the minimal distances to the existing samples and noticed a periodic distribution (Figure S9B, Supporting Information). Then, the distal unseen samples were defined as the samples with minimal distances larger than the third peak, representing approximately the top 10% most distal newly generated samples (red line in Figure S9B, Supporting Information). We randomly selected 200 samples from the unseen data for further verification. Remarkably, even for those distal unseen samples, deCOOC achieved good performance with a mean RMSE of 0.011 and CCC of 0.993 (Figure S9C, Supporting Information). Notably, deCOOC converges quickly to this level of predictive power for unseen samples, as it reached saturated performance with only ≈700-800 training samples (Figure S9D,E, Supporting Information). Taken together, deCOOC is a sensitive network that can be easily generalized to the whole sample space, even with underrepresented training data for heterogeneous cell populations.
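The selection of distal samples can be sketched as a nearest-neighbor distance filter over composition vectors. The threshold is left as a parameter because the cutoff used here (the third peak of the distance distribution) is read from Figure S9B; function names and toy values are illustrative.

```python
import math


def min_distance(candidate, existing):
    """Euclidean distance from a composition vector to its nearest existing sample."""
    return min(math.dist(candidate, sample) for sample in existing)


def distal_candidates(candidates, existing, threshold):
    """Keep candidates whose nearest existing sample is farther than threshold."""
    return [c for c in candidates if min_distance(c, existing) > threshold]


# Toy three-cell-type compositions (each vector sums to 1).
training = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
generated = [(0.0, 0.0, 1.0), (0.9, 0.1, 0.0)]
unseen = distal_candidates(generated, training, threshold=0.5)
```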

Predictive Chromatin Structural Features Utilized by deCOOC Might Be Latent
To assess the contributions of chromatin features to deconvolution, we performed SHapley Additive exPlanations (SHAP) analysis. [28] SHAP assigns a number to each feature, representing the importance of the feature in machine learning models, [29,30] that is, a positive SHAP value indicates a valid feature in inferring cell types. We calculated SHAP values for all samples in both mCC and HFC datasets and asked what type of chromatin features may be associated with high SHAP values. We presented the results on chromosomes 9 and 11 from mCC and chromosomes 3 and 6 from HFC as examples (Figure 4).
We found that neither normalized Hi-C contact frequency nor chromatin accessibility significantly contributes to the predictive power of deCOOC. To investigate the potential relationship between chromatin accessibility and the predictive capability of deCOOC, we calculated the correlation coefficient between them. However, the correlation between the Hi-C map and SHAP values was extremely weak, with an average Pearson correlation coefficient (PCC) of 0.034 and 0.046, and standard deviations of 0.023 and 0.019 for mCC and HFC, respectively (Figure 4A). Specifically, we sought to determine whether the chromatin accessibility of a cell line exhibited a positive correlation with the importance of the features used to predict the proportions of the cell types. To explore this, we compared the SHAP values with bulk ATAC-seq data in four cell lines (Astro, L23, MG, and ODC) for which ATAC-seq data were publicly available. Interestingly, the relationship between SHAP values and chromatin accessibility appeared to vary among different cell types. In ODC and Astro, a weak positive association was observed when SHAP values were extremely high, occurring in approximately the top 3-5% range (0.01, 0.1). However, in L23, the opposite pattern was observed, where the association shifted when the SHAP values were low (0.0001, 0.001) compared to (0.001, 0.01) (see Figure 4B). These findings indicate that the chromatin features utilized by deCOOC are not simply contact frequency or chromatin accessibility.
The features that deCOOC utilizes for prediction may, instead, largely be associated with latent functional chromatin structure that distinguishes different cell types. Although Hi-C map-based clustering can group the cells into four known categories in the frontal cortex, the borders between cell types within a category remain fuzzy. [6,31] However, even with nearly indistinguishable Hi-C maps within the categories, SHAP profiles are nearly mutually exclusive between them (Figure 4C,D; Figure S10A, Supporting Information), implying a strong association between the latent chromatin structure captured by deCOOC and cell identity. This association might be functional, as evidenced by GO analysis, that is, the genes located in bins with high SHAP values are enriched for GO terms functionally relevant to cells. For example, genes in bins of high SHAP values (top 10%) for G1 were enriched for the regulation of cilium assembly and nitric oxide biosynthetic processes [32][33][34] (Figure S10B, Supporting Information). In another example, the regulation of TOR signaling [35,36] and actin cytoskeleton reorganization were found to be enriched in astrocyte-specific high-SHAP genes [37] (Figure S10C, Supporting Information). Together, instead of relying on cell-type-specific chromatin contacts, deCOOC successfully characterized cell type specificity from a seemingly identical Hi-C map (Figure 4D; Figure S10D, Supporting Information). In addition, we conducted model training using input matrices of various sizes (Figure S11, Supporting Information). Our findings revealed that the bins exhibiting high SHAP values were predominantly shared across different matrix sizes. This observation strongly suggests that the SHAP analysis remained largely unaffected by the size of the input matrix.

Fine-Tuned deCOOC Deconvolutes Bulk Hi-C Data of Real Tissue Samples
In most common practical scenarios, single-cell Hi-C data are rarely accessible, particularly for solid tissue samples. Therefore, we wondered whether deCOOC could deconvolute bulk Hi-C without being trained on single-cell data. By assuming that Hi-C profiles of a cell type do not substantially differ in situ between the tissue and cultured cell lines, we assessed this performance using pig adipose tissue as a model. Four major cell types appear in most adipose tissues, [38] that is, Adi (adipocytes), VEC (vascular endothelial cells), M1 (M1-type macrophages), and M2 (M2-type macrophages). We generated high-quality in situ Hi-C data for the four cell types, of which Adi, M1, and M2 were derived from stem cells and VEC was isolated from the pulmonary aorta (Figure S12 and Table S3, Supporting Information). Similar to what we did with the mCC and HFC datasets, deCOOC was trained and tested with synthesized bulk data from cell lines (Experimental Section). We compared deCOOC with the two best-performing peer predictors, CDSeq and CS, according to the assessment we showed above.
The deCOOC outperformed CS and CDSeq, as assessed by RMSE and CCC (Figure 5A). Although the tools were trained on data generated from cultured cell lines, the performance of all tools followed a similar trend and was comparable to that obtained when training on data from single cells. The RMSE of deCOOC was comparable to that of HFC and significantly larger than that of mCC (Figure 5A). The RMSEs of CS and CDSeq were both similar to those with single-cell datasets (Figure 5A). The CCCs of deCOOC and CS were also comparable to those with both single-cell datasets, while the CCC of CDSeq was nearly zero, on average, similar to that with HFC.
Next, we applied deCOOC to bulk Hi-C data generated from two pig adipose tissues, that is, the ULB at subcutaneous adipose tissue (SAT) and the GOM at abdominal visceral adipose tissue (VAT), representing physiologically distinguishable subcutaneous and visceral fat, respectively. In total, 12 and 10 samples were from ULB and GOM, respectively, and the corresponding high-quality in situ Hi-C and RNA-seq data were generated (Table S4, Supporting Information). As the deconvolution problem has long been addressed with transcriptome data, [9,39] we used the CIBERSORTx-inferred cell composition as the reference. [40,41] By transfer learning, deCOOC, which was pretrained on the above synthesized data, was fine-tuned with 17 out of 22 adipose samples (12 ULB + 10 GOM) (Experimental Section; Figure S13, Supporting Information). Fine-tuned deCOOC was then tested on the remaining five samples. Compared to CS and CDSeq, deCOOC had the highest average CCC and lowest RMSE (CCC: 0.937, RMSE: 0.052, Figure 5B). For all cell proportions, deCOOC achieved the highest CCC and lowest RMSE (CCC: 0.975, 95% CI: 0.963-0.983, RMSE: 0.070) (Figure 5C). To further validate our approach, we presented an additional example using the HFC dataset (see the Supporting Information and Figure S14, Supporting Information).
The number and quality of samples used for fine-tuning may have had a minor effect on the performance of deCOOC. Generally, the more samples used, the better the performance one may achieve. For both RMSE and CCC, four out of five testing samples showed better performance when deCOOC was fine-tuned with more samples (Figure 5D). However, the difference remains minor; for example, the largest differences were 0.026 and 0.012 in RMSE and CCC, respectively, when fine-tuned by 8 and 17 samples, respectively. One sample (GOM_CC3) was rather stable, irrespective of the number of fine-tuned samples, which raises the question of whether sample quality may have an effect on the fine-tuning process. However, when we used different numbers of samples for fine-tuning, we found that neither CCC nor RMSE was affected to any noticeable degree (Figure S15A, Supporting Information). Nevertheless, for all sample numbers in the random sampling we tested, the CCC and RMSE of deCOOC were larger than 0.980 and less than 0.06, respectively, substantially better values than those of both CS and CDSeq (Figure 5D; Figure S15A, Supporting Information). This result suggests that deCOOC is suitable for performing the deconvolution task on real bulk Hi-C of heterogeneous tissues.

deCOOC Deconvoluted the Chromatin Architecture in M2 Cells, Reflecting Inflammation- and Energy Metabolism-Related Functionality of Visceral and Subcutaneous Fat, Respectively
Finally, we asked if the chromatin feature that deCOOC utilizes to define a cell type is functionally relevant. Because visceral fat is more inflammation-related than subcutaneous fat, [42] we used M2 cells in GOM tissue as an example to examine this question. M2 macrophages are anti-inflammatory macrophages that promote the resolution of inflammation, coordinate tissue integrity, and release anti-inflammatory mediators; [43,44] importantly, they are specifically active and enriched in visceral fat. Interestingly, within the positive SHAP regions for M2 cells in GOM, we found more inflammation-related GO terms significantly enriched for genes in GOM than in ULB (Figure S15B, Supporting Information). 47] On the other hand, it is known that subcutaneous fat is more involved in energy metabolism than visceral fat. We found that energy-related metabolism GO terms (i.e., mitochondrion organization and positive regulation of angiogenesis) were significantly enriched with genes in the positive SHAP regions for M2 cells in the ULB (Figure S15B, Supporting Information). Similarly, energy metabolism-related genes, for example, MT-ND3 (crucial for mitochondrial function) [48] and CXCL2 (regulation of angiogenesis), [49] were found to be more highly expressed in SAT than in VAT (Figure 5E; Figure S15F, Supporting Information). Together, these examples suggest that deCOOC is not only a powerful tool for deconvoluting bulk Hi-C into cell type components, but also capable of detecting potential functional relevance in chromatin features.

Discussion
In this study, we proposed deCOOC, a CNN-based cellular deconvolution model, to map Hi-C of tissues to cellular compositions. Our assessment demonstrated that deCOOC outperformed existing methods in prediction accuracy and robustness.
To the best of our knowledge, Thunder is the only published method for Hi-C deconvolution. [12] As an unsupervised model, Thunder does not involve a training process, which allows it to perform Hi-C deconvolution with a predefined number of cell types. This is critical in cases where the cellular composition is complicated or unknown. However, this advantage comes at the price of relatively low predictive accuracy, that is, even when the number of cell types is given, Thunder can only achieve performance similar to that of transcriptome-oriented methods (Figure 2).
We employed a CNN instead of canonical feature selection-based methods to perform Hi-C deconvolution based on the following rationale. First, Hi-C, among cell types, is known to be largely conserved, [50] and the cell type-specific features are not always easily detectable and are sensitive to sequencing depth. Second, CNN-based methods are known to be capable of capturing latent characteristics from data. These latent characteristics are important not only for prediction, but they also hint at functionality that may lead to novel discovery. Finally, the CNN-based method is more robust to experimental variations (Figure 2).
Several aspects of deCOOC need to be improved in the future. First, cultured or single-cell Hi-C data from purified cells remain essential in deCOOC. However, in real practice, neither of these two types of training data may be easily accessible. Furthermore, fine-tuning is required because deCOOC was trained completely on simulated data, which do not account for unavoidable and unpredictable experimental variations. That is why we employed transfer learning in fine-tuning on our pig adipose data. However, this fine-tuning may be sensitive to sample and data quality. Second, the bin size we used in the present work was 500 kb. This bin size prevented us from examining the contributions of high-resolution chromatin structures, for example, TADs and loops, to deconvolution. However, at this resolution, we demonstrated that deCOOC was rather robust to data volume. Moreover, we tried to associate the chromatin compartment with SHAP values, with negative results (data not shown), which may imply that features beyond simple chromatin structures contribute to deconvolution.
53][54] However, the highly dynamic nature of chromatin loops between cell types adds another dimension of complexity to this canonical genotype-phenotype association problem. [55,56] That is, to associate a noncoding variant with its target through spatial approximation of chromatin, it is essential to refer to the chromatin structure in the target cell type instead of bulk data, which are commonly available in most conditions. With the increased categories of cell types and cell states interrogated by the scHi-C approach, [57] we anticipate that future integrative analysis of bulk data with matched scHi-C data of heterogeneous samples will generate biologically informative estimates of cell type- and state-specific interactomes and help interpret findings of genome-wide association studies (i.e., variant-to-function) by assigning a previously unrecognized function to the noncoding variants at single-cell resolution in population scales. However, deCOOC serves as a starting point to address the problem with its power to deconvolute bulk Hi-C of tissue samples with heterogeneous cell components. Potential future directions, including Hi-C profile predictions and differential chromatin structure identifications between samples, should be in the range of the canon for computational genomics in the field.

Experimental Section
Animals: The seven-day-old Bama minipig used in this study was obtained from the experimental farm of Sichuan Agricultural University for primary cell isolation and culture. Pigs were treated humanely and sacrificed by anesthesia until death, and the bone marrow and pulmonary aorta were separated and used for cell culture. The animal experiment was approved by the Experimental Animal Ethics Committee of Sichuan Agricultural University (Approval No. 20200176) and performed following the guidelines for the management and use of laboratory animals.
[60] Briefly, bone marrow cells were obtained by puncture and passed through a 40 μm cell strainer (FALCON, USA, 352340). After removing erythrocytes with an ACK lysate kit (Thermo Fisher Scientific, USA, A1049201), bone marrow cells were seeded and cultured with DMEM/F12 (Thermo Fisher Scientific, USA, 11330-0320) supplied with 10% heat-inactivated fetal bovine serum (FBS) (Thermo Fisher Scientific, USA, 10099141C), 100 U mL⁻¹ penicillin, and 100 mg mL⁻¹ streptomycin (Thermo Fisher Scientific, USA, 030311B) (DMEM/F12 10% FBS) at 37 °C and 5% CO₂. After 4 h, the unattached cells were enriched and inoculated in a new flask. 60]

Cell Culture-Isolation and Culture of Adipocytes: Primary adipocytes were obtained following a bone marrow stem cell (BMSC) differentiation protocol using the OriCell Human Umbilical Cord Blood Mesenchymal Stem Cell Adipose Differentiation Induction Medium Kit (Cyagen, USA, HUXUB-90031) as previously described. [61]

Cell Culture-Isolation and Culture of Vascular Endothelial Cells (VECs): In this study, VECs were isolated from the pulmonary aorta as previously described. [62] Briefly, the two ends of blood vessels were ligated with sutures and digested with 0.1% type I collagenase at 37 °C for 20 min. After digestion, collected VECs were seeded in flasks and cultured with DMEM (Thermo Fisher Scientific, USA, 11330-0320) supplemented with 10% heat-inactivated fetal bovine serum (FBS) (Thermo Fisher Scientific, USA, 10099141C), 100 U mL⁻¹ penicillin, and 100 mg mL⁻¹ streptomycin (Thermo Fisher Scientific, USA, 030311B) (DMEM 10% FBS) at 5% CO₂ and 37 °C. [62]

In Situ Hi-C Library Preparation: Hi-C libraries for the indicated cells were generated according to a previously published protocol with some minor modifications. [50] The obtained cells were incubated with 4% formaldehyde at room temperature (20-25 °C) for 30 min for chromatin crosslinking, and then glycine was added to a final concentration of 0.25 mol L⁻¹ to quench the formaldehyde. The mixture was then centrifuged at 3000 × g for 5 min at 4 °C and suspended in 100 μL of cold 1 × PBS. Approximately 100 000 cells were pipetted into a new tube for subsequent experiments and spun for 5 min at 3000 × g at 4 °C. Nuclei of formaldehyde-fixed cells were permeabilized, and DNA was digested with ten units of MboI (a 4-cutter restriction enzyme) for 5 h at 37 °C. The restriction fragment overhangs were filled, labeled with biotinylated nucleotides, and then ligated in a small volume. After cross-link reversal, the ligated DNA was purified and sheared to a length of 300-500 bp, at which point ligation junctions were pulled down with streptavidin beads and prepped for Illumina NovaSeq 6000 sequencing.
Preprocessing and Quality Control for Sequencing Reads of Pure Cell Lines: The quality of all libraries was assessed using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Raw fastq files were first processed in paired-end mode with trim_galore (v.0.6.6). For Hi-C libraries, the first 15 base pairs (bp) were trimmed from the 5′-end of both read 1 and read 2 owing to their low complexity. Reads with a mean quality score less than, or equal to, 30 were removed, and fragments with lengths shorter than 30 bp were discarded. Then, the trimmed reads were converted into ".hic" files by Juicer tools. [63]

Cell Culture-Reproducibility Analysis for These Four Cell Lines: GenomeDisco [64] scores between replicates were greater than 0.83 among all chromosomes for all cell types with a bin size ≥ 100 kb. A resolution of 500 kb was used in this study.
In Silico Generation of Extra scHi-C Data: When sufficient single-cell Hi-C data were available for a cell type, cells were simply sampled at random from the existing dataset. However, cell type proportions became imbalanced when a cell type dominated some bulk samples but had only limited real single-cell data available. To address this challenge and ensure a more balanced representation of cell types in the simulated bulk samples, simulated single-cell Hi-C data were generated for specific cell types from the existing scHi-C data. This strategy aimed to enhance the diversity and complexity of the scHi-C data, thereby improving the overall diversity of the simulated bulk samples.
For the mCC and HFC datasets, Hi-C contacts were aggregated from all single cells belonging to the same cell type to construct the bulk Hi-C profile of each cell type. The number of contacts in a simulated single cell of a given cell type was set to the median number of scHi-C contacts across all single cells of that cell type. Subsequently, in silico scHi-C contacts were generated by downsampling this predetermined number of contacts from the pooled bulk Hi-C data (Figure S16, Supporting Information).
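The downsampling step can be sketched as follows. This is a minimal illustration, not the deCOOC implementation: it assumes the pooled contacts are held as an array of bin-index pairs (one row per contact) and that sampling is done without replacement. The function name `simulate_sc_contacts` and the toy data are hypothetical.

```python
import numpy as np

def simulate_sc_contacts(pooled_contacts, cell_contact_counts, rng):
    """Draw one in silico single cell from pooled (pseudo-bulk) contacts.

    pooled_contacts: (n_contacts, 2) array of bin-index pairs, one row per contact.
    cell_contact_counts: contact counts of the real single cells of this cell type.
    """
    # Target depth: median contact count across the real single cells.
    n_target = int(np.median(cell_contact_counts))
    # Assumption: contacts are drawn uniformly at random without replacement.
    idx = rng.choice(len(pooled_contacts), size=n_target, replace=False)
    return pooled_contacts[idx]

rng = np.random.default_rng(0)
pooled = rng.integers(0, 200, size=(10_000, 2))  # toy pooled contacts
counts = [800, 1200, 1000, 950, 1100]            # toy per-cell contact counts
sim_cell = simulate_sc_contacts(pooled, counts, rng)
print(sim_cell.shape)
```

Repeating the call with the same pooled profile yields as many simulated cells as are needed to balance an under-represented cell type.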
Simulation of Bulk Hi-C Samples: To generate simulated bulk Hi-C samples for training, two subsets of samples were created. One subset contained all cell types, while the other included only some of the cell types (e.g., two or three randomly selected out of the four cell types).
Generation of Cell Type Compositions: The fractions of each cell type in the simulated samples were determined using the random.rand() function from the Python NumPy package. This approach allows for the generation of random proportions without prior knowledge of cellular compositions within tissues.
For each cell type (i), a random number ($r_i$) was chosen from a uniform distribution between 0 and 1. The random fraction of the cell type ($f_i$, rounded to three decimal places) was then calculated as

$$f_i = \frac{r_i}{\sum_{j=1}^{K} r_j}$$

where K represents the number of cell types contained in the sample.
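A minimal sketch of this composition sampling is given below. It assumes that each fraction is the random number divided by the sum of all random numbers in the sample, and it uses NumPy's `Generator.random` in place of `numpy.random.rand()` for reproducibility. The helper names `random_fractions` and `sparse_fractions` are hypothetical.

```python
import numpy as np

def random_fractions(k, rng):
    """Random composition over k cell types, rounded to three decimals."""
    r = rng.random(k)          # r_i ~ Uniform(0, 1)
    return np.round(r / r.sum(), 3)

def sparse_fractions(n_types, n_present, rng):
    """Composition in which only n_present of n_types cell types occur."""
    f = np.zeros(n_types)
    present = rng.choice(n_types, size=n_present, replace=False)
    f[present] = random_fractions(n_present, rng)
    return f

rng = np.random.default_rng(42)
full = random_fractions(4, rng)        # sample containing all four cell types
sparse = sparse_fractions(4, 2, rng)   # sample containing two of four cell types
print(full, sparse)
```

Because of the rounding, the fractions may deviate from a sum of exactly one by a few thousandths.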
Determining the Number of Single Cells for Simulated Bulk Samples: For the mCC and HFC datasets, a total of $N_{\mathrm{sum}}$ = 1000 single cells were composed in silico to mimic a bulk Hi-C sample. Hence, the number of cells required for sampling from the scHi-C dataset for each cell type i can be obtained using the equation

$$N_i = f_i \times N_{\mathrm{sum}}$$

Generating Simulated Bulk Samples for Bama Miniature Pig Data: Due to the varying number of sequencing contacts in each bulk Hi-C sample, the number of contacts for each simulated bulk sample was sampled from a uniform distribution within a specified numerical range. Specifically, the range was set between 148 054 505 (minimum) and 235 282 624 (maximum), based on the actual number of sequencing contacts in pig tissues.
Let $M_{\mathrm{sum}}$ denote the total number of contacts in a simulated bulk sample. Then, the number of contacts required to be sampled from the bulk contacts of the pure cell line for cell type i could be calculated using the equation

$$M_i = f_i \times M_{\mathrm{sum}}$$

All sampled contacts from each cell line were merged together to form the contacts of a simulated bulk sample. These contacts were then converted into the ".hic" format, which is compatible with Juicer tools.
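The two allocation rules can be illustrated with a short sketch. It assumes that the number of cells (or contacts) per cell type is the fraction multiplied by the sample total and rounded to the nearest integer; the rounding rule and the function names are assumptions, while the totals and the contact range are taken from the text.

```python
import numpy as np

N_SUM = 1000                              # single cells per simulated bulk sample
M_MIN, M_MAX = 148_054_505, 235_282_624   # contact range observed in pig tissues

def cells_per_type(fractions, n_sum=N_SUM):
    """Number of single cells to draw for each cell type (mCC/HFC)."""
    return np.rint(np.asarray(fractions) * n_sum).astype(int)

def contacts_per_type(fractions, rng, m_min=M_MIN, m_max=M_MAX):
    """Total contacts of one simulated pig sample and its split by cell line."""
    m_sum = rng.integers(m_min, m_max, endpoint=True)  # uniform over the range
    m_i = np.rint(np.asarray(fractions) * m_sum).astype(np.int64)
    return int(m_sum), m_i

rng = np.random.default_rng(1)
fractions = np.array([0.4, 0.3, 0.2, 0.1])
n_cells = cells_per_type(fractions)
m_sum, m_contacts = contacts_per_type(fractions, rng)
print(n_cells, m_sum, m_contacts)
```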
Generation of Reference Matrices for the Deconvolution: Given the Hi-C interactions of any two cell types, HiCcompare [65] (v1.8.0) was employed to derive differential chromatin interactions (e.g., chromatin region1 region2: chr1 500000 55000000; adjusted p value < 0.05) (Figure S3, Supporting Information). These differential interactions were integrated from all chromosomes for related cell types as chromatin interaction profiles (CIPs). For example, for a bulk sample mixed from three cell types (A, B, and C), the differential chromatin interactions between A & B, A & C, and B & C were integrated to make the CIP.
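Algorithm Comparison: The root mean square error (RMSE), Lin's CCC, [27] and Pearson correlation coefficient r between real and predicted cell fractions were measured. These metrics are defined by

$$\mathrm{RMSE}(y, \hat{y}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

$$\mathrm{CCC}(y, \hat{y}) = \frac{2\,\mathrm{cov}(y, \hat{y})}{\sigma_y^2 + \sigma_{\hat{y}}^2 + \left(\mu_y - \mu_{\hat{y}}\right)^2}$$

$$r(y, \hat{y}) = \frac{\mathrm{cov}(y, \hat{y})}{\sigma_y\,\sigma_{\hat{y}}}$$

where $y$ is the vector of ground-truth fractions, $\hat{y}$ is the vector of predicted fractions, $n$ is the number of elements in each vector, $\sigma_y$ and $\sigma_{\hat{y}}$ are the standard deviations (SDs) of $y$ and $\hat{y}$, $\mathrm{cov}(y, \hat{y})$ is the covariance of $y$ and $\hat{y}$, and $\mu_y$ and $\mu_{\hat{y}}$ are the means of the ground-truth and predicted fractions, respectively. This version of CCC was used to evaluate the prediction concordance for each cell type.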
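For illustration, the three metrics can be computed as in the sketch below. It uses the standard textbook definitions with population (1/n) moments, which is an assumption, since the normalization is not stated in the text; the function names are hypothetical.

```python
import numpy as np

def rmse(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sqrt(np.mean((y - y_hat) ** 2))

def _cov(y, y_hat):
    # Population covariance (1/n normalization, an assumption).
    return np.mean((y - y.mean()) * (y_hat - y_hat.mean()))

def lin_ccc(y, y_hat):
    """Lin's concordance correlation coefficient."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    mean_gap = (y.mean() - y_hat.mean()) ** 2
    return 2 * _cov(y, y_hat) / (y.var() + y_hat.var() + mean_gap)

def pearson_r(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return _cov(y, y_hat) / (y.std() * y_hat.std())

truth = np.array([0.1, 0.2, 0.3, 0.4])
pred = truth + 0.1   # perfectly correlated but systematically shifted
print(rmse(truth, pred), lin_ccc(truth, pred), pearson_r(truth, pred))
```

In this toy case r equals one while CCC is clearly lower, which shows why CCC, unlike Pearson's r, penalizes a systematic offset between predictions and ground truth.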
As the cell type proportions of each bulk Hi-C sample sum to one, $\mu_y$ and $\mu_{\hat{y}}$ will always be equal (i.e., $1/n$), which could increase the CCC values for each bulk sample. Therefore, CCC was slightly modified by replacing $(\mu_y - \mu_{\hat{y}})^2$ with the mean square error (MSE) between the predicted fractions and ground truth to assess the prediction concordance for each Hi-C bulk sample

$$\text{Modified CCC}(y, \hat{y}) = \frac{2\,\mathrm{cov}(y, \hat{y})}{\sigma_y^2 + \sigma_{\hat{y}}^2 + \mathrm{MSE}(y, \hat{y})}$$

deCOOC Workflow: Step 1: Data preprocessing: First, the KR-normalized Hi-C matrix was scaled into a probability-matrix-like form, in which each row and column sums to one. Let $V_{ij}$ denote the (i, j) element in the matrix; it can be scaled as

$$\tilde{V}_{ij} = \frac{V_{ij}}{S_i}$$

where $S_i$ denotes the sum of row i.
The input matrices for deCOOC are rectangular matrices concatenated by rows from two square submatrices. The two submatrices were taken from the diagonals of the Hi-C contact matrices of two different chromosomes (Figure 1B). Unless otherwise stated, the submatrices have a size of 30 bins (Figure 1B).
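The preprocessing and input construction can be sketched as below. This is an illustration under stated assumptions: scaling divides each element by its row sum, windows of 30 bins are taken along the diagonal with a step of 20 bins (the example values given in the Figure 1B caption), and "by rows" is read as stacking along the row axis, which gives a 60 × 30 input. The function names and the random toy matrices are hypothetical.

```python
import numpy as np

def row_scale(mat):
    """Divide every element by its row sum so that each row sums to one."""
    s = mat.sum(axis=1, keepdims=True)
    s[s == 0] = 1.0   # leave empty rows as zeros instead of dividing by zero
    return mat / s

def diagonal_windows(mat, size=30, step=20):
    """Square sub-matrices of fixed size taken along the main diagonal."""
    n = mat.shape[0]
    return [mat[i:i + size, i:i + size] for i in range(0, n - size + 1, step)]

def paired_input(win_a, win_b):
    """Stack two diagonal windows from different chromosomes along rows."""
    return np.concatenate([win_a, win_b], axis=0)

rng = np.random.default_rng(2)

def toy_chrom(n_bins):
    m = rng.random((n_bins, n_bins))
    return (m + m.T) / 2   # symmetric stand-in for a Hi-C contact matrix

chr_a = row_scale(toy_chrom(100))
chr_b = row_scale(toy_chrom(80))
wins_a = diagonal_windows(chr_a)
wins_b = diagonal_windows(chr_b)
x = paired_input(wins_a[0], wins_b[0])
print(len(wins_a), len(wins_b), x.shape)
```

Note that row scaling alone only guarantees unit row sums; columns also sum to one only when the input is a balanced (e.g., KR-normalized) symmetric matrix, which the toy matrices here are not.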
Step 3: Training and prediction: The model was trained with fivefold cross-validation. Parameters were optimized using Adam [67] with a learning rate of 0.001. Mean square error (MSE) was chosen as the loss function and RMSE as the metric during the training stage; the optimization objective was to minimize MSE. During training, the ReduceLROnPlateau callback from the Keras.callbacks package was also employed to monitor RMSE and adjust the learning rate (the parameter monitor was set to "val_root_mean_squared_error", patience to four, and factor to 0.98). Early stopping regularization was used to prevent overfitting (Figure S5, Supporting Information).
Fine-Tuning for the Pig Dataset: The model was fine-tuned based on transfer learning. [68] deCOOC was first pretrained using simulated pig bulk Hi-C data. During fine-tuning, all layers of the trained model except the fully connected dense layers were frozen so that their weights could not be updated. The network was then retrained on a small number of real pig samples using a very small learning rate (0.0004), so that only the parameters of the fully connected layers were updated. Because fivefold cross-validation was adopted in the model development stage, five sets of parameters for deCOOC were produced on the simulated pig bulk samples; each of these five models, with different initializations, was fine-tuned using real samples (Figure S13, Supporting Information). Predictions on real pig samples were averaged over the five fine-tuned models.
SHAP-Based Interpretability Analysis: SHAP (SHapley Additive exPlanations) [28] encompasses a collection of techniques for explaining the output of deep learning and machine learning models. To identify the key input features influencing each possible output (cell type composition) of deCOOC, the GradientExplainer method was utilized. [28] This approach made it possible to identify the most significant features associated with each input.
SHAP values were determined within the 5×1-fold cross-validation used during model training, which produced five separate models, each trained on a unique subset of the data.
For each model, a GradientExplainer object was created from the model itself and the training data specific to that model, which was then used to compute SHAP values.
Next, the respective GradientExplainer objects were used to calculate SHAP values on the testing data. This process was repeated for all five models, resulting in five distinct sets of SHAP values. The average of these five sets was then computed, yielding the final SHAP values used for further analysis and interpretation.
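The final averaging step can be sketched as follows. Only the averaging is shown: random arrays stand in for the SHAP values that each GradientExplainer would return for one output (cell type), and the array shape (test samples × 60 × 30) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_models, n_test, height, width = 5, 8, 60, 30

# Placeholder for the five per-model SHAP arrays of one cell type;
# in practice each would come from that model's GradientExplainer.
shap_per_model = [rng.normal(size=(n_test, height, width))
                  for _ in range(n_models)]

# Element-wise mean across the five models gives the final SHAP values.
final_shap = np.mean(np.stack(shap_per_model, axis=0), axis=0)
print(final_shap.shape)
```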

Correlation of SHAP Values and Values in the Hi-C Matrix:
To show the direct relationship between SHAP values and Hi-C, the Pearson correlation between the normalized (observed/expected) Hi-C values and SHAP values was calculated with the function "scipy.stats.pearsonr()". Twenty bulk samples were randomly selected for each of the mCC and HFC datasets. The bulk Hi-C of each cell line was generated according to its composition in the bulk samples. For each sample, the Pearson correlation between Hi-C (obs/exp) and positive SHAP values was calculated for all cell types present in the sample. In addition, the correlation coefficients of the 20 samples were averaged for mCC and HFC separately.
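A minimal sketch of this correlation analysis is shown below, assuming that the observed/expected matrix and the SHAP matrix of a sample have the same shape and that only positions with positive SHAP values enter the correlation. The toy matrices are random and the function name is hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

def positive_shap_correlation(obs_exp, shap_values):
    """Pearson r between Hi-C (obs/exp) and SHAP at positive-SHAP positions."""
    mask = shap_values > 0
    r, p = pearsonr(obs_exp[mask], shap_values[mask])
    return r, p

rng = np.random.default_rng(4)
r_values = []
for _ in range(20):                        # twenty bulk samples
    obs_exp = rng.random((60, 30))         # toy obs/exp matrix
    shap_map = rng.normal(size=(60, 30))   # toy SHAP matrix
    r, _p = positive_shap_correlation(obs_exp, shap_map)
    r_values.append(r)

mean_r = float(np.mean(r_values))          # averaged over the 20 samples
print(mean_r)
```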
Analysis of SHAP Values Based on ATAC-seq: For HFC, ATAC-seq data [69] of four cell lines (Table S7 where i and j represent the row and column numbers of the element located in the SHAP matrix, and O f i denotes the open fraction of the ith bin in the SHAP values matrix.

GO Enrichment Analysis: RefSeq genes for mm9 and hg19 (for the mCC and HFC datasets, respectively) were obtained from the UCSC Table Browser, and genes were defined by the region including 2.5 kb upstream and downstream of the transcription start site (TSS). Bedtools [70] was used to obtain the list of genes located in the genomic regions of interest identified through SHAP value analysis. Then, Metascape was used, and GO biological processes in the Pathway option were chosen for enrichment analysis.
Single-Cell RNA-seq Data Analysis: Processed data files were downloaded from Gene Expression Omnibus (GEO) with accession number GSE129363. [71] Seurat v.4.2.0 was then used for downstream filtering, normalization, clustering, dimensionality reduction, and differential gene expression analysis. Cells with low gene counts (<200), low unique molecular identifier counts (UMI, <200), or high mitochondrial gene expression (>20%) were removed. To exclude doublets, cells with high gene counts (>2500) and high UMI counts (>13750) were also removed. In addition, only nondiabetic samples, which contained 14 370 cells, were included for further analysis in this article. The top 20 dimensions were used to generate the final clusters using principal component analysis (PCA) and graph-based clustering (Figure S15C, Supporting Information). Genes expressed in a minimum of 20% of cells in either of the test populations were considered for analysis. Differential gene expression analysis between cell populations was performed using the Wilcoxon rank sum test in the "FindMarkers" function. The top 100 genes in each cluster were identified in comparison with all other cells using the function "FindAllMarkers" in Seurat, with a cutoff of p_adj ≤ 0.01. Following the original study, [71] immune clusters (1, 4, 7, 8, 12, and 15) were identified by overlapping the expressed markers of each cluster (Figure S15C, Supporting Information) with the immune markers provided in that study. The immune cell populations consisted of 5000 (34.7%) cells. Clustering was performed again for the immune cells, producing subclusters for SAT and VAT (Figure S15D, Supporting Information). Cluster 6 was defined as the M2 cell type, which was marked by high expression of FOLR2 and KLF4 [71][72][73][74] (Figure S15E, Supporting Information).
Statistical Analysis: To test the correlation between the algorithms' performance and the number of contacts of samples, the Pearson correlation coefficient was derived with the function scipy.stats.pearsonr(). The one-sided Wilcoxon signed-rank test was used to compare the metrics (i.e., RMSE and CCC) of different algorithms evaluated on exactly the same data, with p values calculated by the scipy.stats.wilcoxon() function in Python. Statistical significance for the model's input was assessed using a two-sided Wilcoxon signed-rank test (scipy.stats.wilcoxon() function in Python). For analyses combining SHAP values and ATAC-seq data, one-sided Mann-Whitney U tests were conducted using the scipy.stats.mannwhitneyu() function in Python. (Significant differences: *P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001).
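The tests named above can be called as in the following sketch. The data are synthetic, the direction of each one-sided alternative is an assumption chosen for the example, and the helper `stars`, which maps p values to the significance marks listed above, is hypothetical.

```python
import numpy as np
from scipy.stats import mannwhitneyu, wilcoxon

rng = np.random.default_rng(5)

# Paired comparison: RMSE of two algorithms on exactly the same samples.
rmse_a = rng.uniform(0.01, 0.05, size=50)
rmse_b = rmse_a + rng.uniform(0.005, 0.03, size=50)
# One-sided alternative (assumed direction): algorithm A has lower RMSE.
_, p_paired = wilcoxon(rmse_a, rmse_b, alternative="less")

# Unpaired comparison: open fraction in a high- versus a low-SHAP group.
high_group = rng.normal(0.30, 0.05, size=40)
low_group = rng.normal(0.20, 0.05, size=60)
_, p_mwu = mannwhitneyu(high_group, low_group, alternative="greater")

def stars(p):
    """Significance marks following the thresholds in the text."""
    for threshold, mark in [(1e-4, "****"), (1e-3, "***"),
                            (1e-2, "**"), (5e-2, "*")]:
        if p < threshold:
            return mark
    return ""   # not significant

print(p_paired, stars(p_paired), p_mwu, stars(p_mwu))
```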

Figure 1. Overview of the deCOOC model. A) Model architecture. The model consists of four convolution layers (each of the first three coupled with one max-pooling layer), two fully connected neural network layers, and a final output layer. The input is two square-like interaction matrices derived from the Hi-C matrices of two different chromosomes. The last fully connected layer outputs the predicted cell type proportions. Model training and parameter optimization based on Hi-C data were carried out by minimizing the sum of squared residuals between predicted and ground-truth cell fractions. B) Input design for the model. From the complete Hi-C matrix (e.g., with a resolution of 500 kb) of one chromosome, multiple square-like sub-interaction matrices with fixed sizes (e.g., 30 bins) and steps (e.g., 20 bins) are derived diagonally. Two diagonal sub-interaction matrices from two chromosomes are stitched together along the row axis.

Figure 3. deCOOC behaves more robustly on simulated HFC data than the other methods. A) Boxplots of RMSE and CCC values over all test bulk samples from deCOOC and other deconvolution algorithms for the simulated HFC test dataset. B) Lineplots of RMSE and CCC values for each cell type. Each symbol represents the RMSE or CCC value between ground-truth and predicted cell fractions for one cell type. C) Scatterplots of RMSE (CCC in bottom row) values and the number of Hi-C interaction contacts of simulated HFC bulk data with deCOOC, CS, and DeconRNASeq. Pearson correlation coefficients and p values are given above the plots. For HFC data, the number of test samples n = 486.

Figure 4. SHAP analysis for model interpretation. A) Scatterplots show weak correlation between SHAP values and Hi-C (observed/expected) for mCC and HFC examples (the examples are the same as those shown in Figure 4C). Pearson correlation coefficients and p values are given above the plots. B) Correlation analysis between chromatin accessibility and SHAP values based on the HFC dataset. Chromatin accessibility was significantly higher in the group with higher SHAP values (i.e., (0.01, 0.1)) than in the other two groups for the Astro and ODC cell types. For the L23 cell type, only the median SHAP values group (0.001, 0.01) showed markedly greater chromatin openness than the lower SHAP values group. The correlation between SHAP values and chromatin accessibility therefore depends on the cell type. P values were calculated using a one-sided Wilcoxon signed-rank test. C) Examples of paired Hi-C matrices (lower left) and SHAP value maps (upper right) for each cell type of mCC and HFC. The regions of 13-28 Mb and 10-25 Mb of the two chromosomes for the mCC and HFC bulk examples are shown. Cell types and chromosome numbers are labeled at the top and left of the plots, respectively, while the fraction of each cell type is presented in parentheses. For the HFC example, the fourteen cell types were sorted into four categories (labeled by four rectangles of different colors) according to the clustering of cell-type specific chromatin interactions. [6] D) Heatmap of SHAP values (example shown in C) for each cell type prediction (left for mCC example, right for HFC example). Each row in the heatmap indicates the SHAP values for an interaction site. (Significant differences: *P < 0.05, **P < 0.01, ***P < 0.001).

Figure 5. deCOOC performs better than CS and CDSeq (lower RMSE and higher CCC) on pig tissue Hi-C data. A) Boxplots of RMSE and CCC values from deCOOC, CS, and CDSeq for simulated pig bulk Hi-C data (randomly sampled experimental Hi-C contacts of four pig cell lines with artificially produced cell fractions). B) Boxplots of RMSE and CCC for assessing the deconvolution performance of the three algorithms on real pig tissues. The deconvolution of CS and CDSeq was conducted on all 22 real adipose samples, whereas the deconvolution of deCOOC was performed on five samples randomly selected from the 22 adipose samples (the remaining 17 samples were used to fine-tune the model); this procedure was repeated five times. C) Scatterplots of CIBERSORTx-predicted cell fractions (x-axis) and deconvoluted cell fractions (y-axis) from fine-tuned deCOOC, CS, and CDSeq on real samples. The corresponding CCC values for the three methods are presented above the plots. D) RMSE and CCC values of deconvolution on five real test tissues (x-axis) from the three deconvolution methods. deCOOC was fine-tuned using different numbers of real tissues. E) Differential expression on a log2 scale of five genes for subcutaneous adipose tissue (SAT) and visceral adipose tissue (VAT). (Significant differences: ****P < 0.0001).
) from UCSC was obtained. The open fraction ($O_f$) of each bin was determined by

$$O_f = \frac{\sum L_r}{500\ \mathrm{kb}} \tag{8}$$

where $L_r$ indicates the length of the open chromatin regions (at the bin), and 500 kb represents the total length of the bin. Bins with $O_f$ less than 0.008 were filtered. The open fraction for each element off the diagonal in the SHAP values matrix was defined by