DeDoc2 Identifies and Characterizes the Hierarchy and Dynamics of Chromatin TAD‐Like Domains in the Single Cells

Abstract Topologically associating domains (TADs) are functional chromatin units with hierarchical structure. However, the existence, prevalence, and dynamics of such hierarchy in single cells remain unexplored. Here, a new generation TAD‐like domain (TLD) detection algorithm, named deDoc2, to decode the hierarchy of TLDs in single cells, is reported. With dynamic programming, deDoc2 seeks genome partitions with global minimal structure entropy for both whole and local contact matrix. Notably, deDoc2 outperforms state‐of‐the‐art tools and is one of only two tools able to identify the hierarchy of TLDs in single cells. By applying deDoc2, it is showed that the hierarchy of TLDs in single cells is highly dynamic during cell cycle, as well as among human brain cortex cells, and that it is associated with cellular identity and functions. Thus, the results demonstrate the abundance of information potentially encoded by TLD hierarchy for functional regulation. The deDoc2 can be freely accessed at https://github.com/zengguangjie/deDoc2.

Using the sample rate at 0.025% with the dataset of Rao et al. as an example 7 , we examined the performance of 6 TAD predictors. First, we found that no data imputation methods we tested were universal for TAD detection. For example, Higashi can remarkably improve the performance of SpectralTAD, but it introduces a negative effect on most other TAD predictors ( Fig. S2a-b). We found that the data imputation methods RWR and scHiCluster could slightly improve the performance of deDoc2, as measured by AMI and WS (P < 0.001, paired sample t-test) ( Fig. S2a-b). However, such improvement was barely seen in the enrichment of structural protein and histone marks in the boundaries (Fig. S3a). As for other TLD predictors, we found that RWR and scHiCluster imputation could improve AMI, WS and the enrichment of CTCF.
No evident improvement could be seen for all other predictors under different imputation methods.
We then tested the effect of data imputation for simulated scHi-C data 8 (Fig.   S2c-d, S3b-c). We noted slight improvement of WS between simulated scHi-C TLDs and simulated bulk TLDs for deDoc2, IS and deDoc under RWR and scHiCluster imputation for threshold=500 (Fig. S3b-c) and threshold=750 ( Fig.S2c-d).
We also tested the effect of data imputation for experimental scHi-C data 3 .
While the improvement of modularity under different imputation methods can be seen for most predictors, the improvement of ChIP-seq peak signals enrichment is not remarkable (Fig. S2e, S3d). We also tested the improvement of ChIP-seq peak signals enrichment and CROC of TLD embedding for Nagano's data 9 and the improvement of AMI of TLD embedding for Lee's data 10 . No evident improvement can be seen for predictors for either one (Fig.   S2f, S3e-f). However, we did find that Higashi imputation showed improvement for most predictors for Nagano's data (Fig. S2f, S3e), and scHiCluster imputation showed improvement for most predictors for Lee's data (Fig. S3f).
Together, at least for the data and tools we tested, the improvement from data imputation was minor and highly method-dependent. Among all these imputation methods and predictors, we found that RWR may be the best choice for deDoc2, and in most cases we tested, it benefited deDoc2 with little defect.
Next, we examined the improvement of the identification for nested and TAD-cliques by data imputation in Tan's dataset. For nested TADs, the modularity of native deDoc2 was larger than that of SpectralTAD with data imputation (Fig. S2g), deDoc couldn't identify nested TLD at this dataset, so we didn't test this method. For TLD-adjR 2 , however, data imputation, in general, had a negative effect on deDoc2 (Fig.S2h), i.e., native deDoc2-generated TLDs had the highest TLD-adjR 2 for all genomic distances.
SpectralTAD-generated TLDs did, indeed, have a slightly higher TLD-adjR 2 when the data were imputed by Higashi. However, this improved TLD-adjR 2 remains smaller than that of native deDoc2. As for TLD-cliques, both deDoc2.w and deTOKI are more or less improved by certain data imputation methods; however, such improvements were minor, as well. For example, slightly more and cleaner TLD-cliques overlapped with bulk could be identified with RWR for deDoc2.w and with scHiCluster for deTOKI ( Fig. S2i-j). Once again, the improved number of TLD-cliques remains smaller, even after data imputation for the other predictors. Altogether, our data suggest that data imputation leads to no substantial improvement in TLD prediction with single-cell Hi-C data.

Performance of deDoc2 in bulk Hi-C data
We assessed the performance of deDoc2 in bulk Hi-C data by comparing it with IS and deDoc using 6 cell types from the dataset of Rao et al. (Fig. S8).
deDoc is a hierarchical TAD detector designed for bulk Hi-C data containing two predictors, deDoc(E) and deDoc(M). In bulk Hi-C data, we assessed both deDoc(E) and deDoc(M). In 6 cell types and 5 algorithms, all predicted a sufficient number of TADs, and the lengths of predicted TADs were all reasonable ( Fig. S8a-b). deDoc2.w and deDoc(E) predicted TADs in comparable number and length, and deDoc2.s and IS predicted TADs in comparable number and length, while deDoc(M) predicted more TADs with smaller TAD length. We calculated fold change between ChIP-seq peaks found at TAD boundaries and those at adjacent flanking regions to evaluate the enrichment of structural protein and histones (CTCF, H3K4me3 and H3K36me3) on TAD boundaries. All algorithms have some degree of enrichment on the TAD boundaries they predicted (Fig. S8c). The enrichment of CTCF on TAD boundaries predicted by deDoc2.s and deDoc2.w ranked first and second, respectively, and the enrichment of H3K4me3 and H3K36me3 on TAD boundaries predicted by deDoc2.s and deDoc2.w ranked second and first, respectively, for most cell types. This indicates that deDoc2 performed best among the three algorithms in predicting TADs. We calculated modularity and structural entropy on Hi-C contact matrices of 6 cell types and calculated the contact density on GM12878 Hi-C contact matrices. We found that TADs predicted by deDoc2.w and deDoc(E) had the highest modularity, lowest structural entropy and highest contact density and that these indices were comparable to each other ( Fig. S8d-f). Modularity, structural entropy and contact density of TADs predicted by deDoc2.s and IS were also comparable.

Difference between deDoc2 and SuperTAD
Both deDoc2 and SuperTAD 11 use dynamic programming algorithm to minimize the structural entropy of Hi-C contact matrices to predict domains, but they differ in several respects. First, deDoc2 minimizes two-dimensional structural entropy and generates partitions as results, while SuperTAD minimizes high-dimensional structural entropy and generates hierarchical TADs as results directly. SuperTAD is not suitable for super sparse scHi-C data, since there are not enough contacts supporting high-dimensional structural entropy minimization. Multilevel hierarchical TAD structure is the advantage of SuperTAD; however, it has greater time and space complexity (Fig S9a). The To compare the performance of deDoc2 and SuperTAD on sparse Hi-C data, we tested these two algorithms on simulated scHi-C data, downsampled Hi-C data and experimental scHi-C data. Since SuperTAD outputs hierarchical TADs, we used the smallest TADs in the hierarchical structure in the following analysis to compare with deDoc2.
We first tested SuperTAD and its variant SuperTAD(F) (SuperTAD with filter), SuperTAD(2) (SuperTAD with height constraint ℎ = 2 ), and deDoc2 on simulated single-cell and bulk Hi-C data (see Methods). These Hi-C data are simulations of 5Mb chromosomes with binsize=40kb, which means that the Hi-C graph contains 125 vertices. We found that SuperTAD could predict TLDs of these short Hi-C fragments using time comparable to that of deDoc2.
SuperTAD(2) spent about 1000-fold more CPU time than deDoc2 and SuperTAD (Fig. S9b). To evaluate the accuracy of predictions, we calculated the similarity between TLDs of simulated single-cell Hi-C and TLDs of simulated bulk Hi-C, and we found that the similarity of TLDs predicted by deDoc2 was clearly higher than that of TLDs predicted by SuperTAD, SuperTAD(F) and SuperTAD (2). For D=750, D=1000, and D=500, the similarity of TLDs predicted by deDoc2 was higher than that of SuperTAD (2) ( Fig. S9c-d).
Second, we tested SuperTAD, SuperTAD(F) and deDoc2 on down-sampled IMR90 Hi-C data from Rao's dataset at the sampling rate of 0.025% and on experimental GM12878 scHi-C data from Tan's dataset. In the 9 shortest chromosomes (chr14 to chr22), we found that SuperTAD spent about 10000-fold more time than deDoc2 to predict TLDs ( Fig. S9a and e) and that TLD boundaries of downsampled Hi-C data predicted by SuperTAD and SuperTAD(F) had only slightly higher enrichment than those of deDoc2.w and deDoc2.s. The enrichment of TLD boundaries of experimental single-cell Hi-C data predicted by deDoc2.w was comparable to that of SuperTAD and SuperTAD(F).

Treatment of isolated vertices
Taking each Hi-C contact matrix as a graph, isolated vertices are bins that have no interactions with other bins. These isolated vertices could be bins in unmappable regions or bins with no reads mapped. Isolated vertices were either dropped or put into a TLD according to their genomic context. All bins in unmappable regions were simply dropped (Fig. S10a). The remaining isolated vertices will also be dropped in cases where only a few Hi-C reads connect the flanking region of bins (Fig. S10b). We add a self-loop value = 2 0 , where 0 is the number of non-isolated vertices in the matrix, to increase the aversion of these nodes to form communities, which helps deDoc2 to determine whether drop isolated vertices or not.  a. Improvement of AMI between TLDs from downsampled Hi-C at downsample rate of 0.025% with Rao's 7 data.
b. Improvement of WS between TLDs from downsampled Hi-C at downsample rate of 0.025% with Rao's data.
c. Improvement of AMI between TLDs from simulated single-cell Hi-C at threshold 750 and bulk Hi-C.
d. Improvement of WS between TLDs from simulated single-cell Hi-C at threshold 750 and bulk Hi-C.
e. Improvement of the modularity of predicted TLDs from GM12878 experimental single-cell Hi-C from Tan's data.
f. Improvement of CROCs of embedding of mES cells in Nagano's data (400 cells). g. Improvement of TLD-modularity of predicted nested TLDs. h. Improvement of TLD-adjR2 of predicted nested TLDs. Data are mean ± SEM (shadow), n = 16 cells.
i. Improvement of enrichment of inter-TLD contacts in TLD-cliques of predicted TLDs. j. Improvement of credible TLD-clique (size>5) percent of predicted TLDs. # means P < 0.05, * means P < 0.001, paired t-test. Figure S3. Improvement of performance with data imputation.
a. Improvement of ChIP-seq peak signal (CTCF, H3K4me3 and H3K36me3) enrichment at TLD boundaries. TLDs are called from downsampled Hi-C at downsample rate of 0.025% in Rao's data.
b. Improvement of AMI between TLDs from simulated single-cell Hi-C at threshold 500 and bulk Hi-C.
c. Improvement of WS between TLDs from simulated single-cell Hi-C at threshold 500 and bulk Hi-C.
d. Improvement of ChIP-seq peak signal (CTCF, H3K4me3 and H3K36me3) enrichment at TLD boundaries. TLDs are called from experimental GM12878 single-cell Hi-C in Tan's data.
e. Improvement of ChIP-seq peak signal (CTCF, H3K4me3 and H3K36me3) enrichment at TLD boundaries. TLDs are called from experimental mES single-cell Hi-C in Nagano's 9 data.
f. Improvement of AMIs of embedding of TLDs from humanPFC cells in Lee's 10 data (560 cells). # means P < 0.05, * means P < 0.001, paired t-test for samples. Figure S4. Assessment of deDoc2 with simulated data, and experimental scHi-C data.
a. Number of TLDs from simulated scHi-C data with threshold D=500 and simulated bulk Hi-C data.
b. Number of TLDs from simulated scHi-C data with threshold D=500 after imputation. c. Structural entropy of predicted TLDs in Nagano's data. d. Modularity of predicted TLDs in Nagano's data (5 cells randomly selected from each cell type, 20 cells in total).
e. Fold change of ChIP-seq peak signals (CTCF, H3K4me3 and H3K36me3) between TLD boundaries and background (regions away from boundaries). TLDs are called from Nagano's data (5 cells randomly selected from each cell type, 20 cells in total). Data are mean ± SEM, n = 20 × 20 chromosomes. Figure S5. Assessment of deDoc2, HiCS, deDoc, and SpectralTAD with downsampled data.
b. Number of nested TLDs predicted by deDoc2, HiCS, deDoc, and SpectralTAD in three sampling rates.
c. Length of nested TLDs predicted by deDoc2, HiCS, deDoc, and SpectralTAD in three sampling rates.
d. Length of nested subTLDs predicted by deDoc2, HiCS, deDoc, and SpectralTAD in three sampling rates. Figure S6. Assessment of the robustness of the hierarchical TLD across single cells with Tan's data. a. WS of TLDs structure across single cell data, bulk data, pooled data and shuffled-control. b. WS of subTLDs structure across single cell data, bulk data, pooled data and shuffled-control. c. WS of nested TLDs structure across single cell data, bulk data, pooled data and shuffled-control. d. WS of subTLDs structure within nested TLDs across single cell data, bulk data, pooled data and shuffled-control. Mann-Whitney U test ***p<0.001, **p<0.01, *p<0.05.   a. Running time used for predicting TLDs of downsampled Hi-C from Rao's data and experimental scHi-C from Tan's data, 9 chromosomes total (chr14 to chr22).
b. Running time used for predicting TLDs from simulated data. c. AMI between TLDs from simulated single-cell Hi-C at different simulation thresholds and bulk Hi-C.
d. WS between TLDs from simulated single-cell Hi-C at different simulation thresholds and bulk Hi-C.
e. Fold change of ChIP-seq peak signals (CTCF, H3K4me3 and H3K36me3) between TLD boundaries and background (regions away from boundaries). TLDs are called from downsampled Hi-C from Rao's data and experimental single-cell Hi-C in Tan's data in 9 chromosomes total (chr14 to chr22). Data are mean ± SEM, n = 9 chromosomes. Figure S10. Treatment of isolated vertices in Hi-C data.
a. An example of unmappable regions potentially recognized as gap regions.
b. An example of isolated vertices potentially placed inside TLDs.