Circadian transcriptome processing and analysis: a workflow for muscle stem cells

Circadian rhythms coordinate biological processes with Earth's 24‐h daily light/dark cycle. In the last years, efforts in the field of chronobiology have sought to understand the ways in which the circadian clock controls transcription across tissues and cells. This has been supported by the development of different bioinformatic approaches that allow the identification of 24‐h oscillating transcripts. This workflow aims to describe how to isolate muscle stem cells for RNA sequencing analysis from a typical circadian experiment and introduces bioinformatic tools suitable for the analysis of circadian transcriptomes.

Circadian rhythms coordinate biological processes with Earth's 24-h daily light/dark cycle. In the last years, efforts in the field of chronobiology have sought to understand the ways in which the circadian clock controls transcription across tissues and cells. This has been supported by the development of different bioinformatic approaches that allow the identification of 24-h oscillating transcripts. This workflow aims to describe how to isolate muscle stem cells for RNA sequencing analysis from a typical circadian experiment and introduces bioinformatic tools suitable for the analysis of circadian transcriptomes.
Animals, plants, fungi, and bacteria have all evolved circadian timing mechanisms to align and adapt physiological processes and behaviors in response to daily environmental cues. Circadian, from the Latin 'circa diem' ('about a day'), refers to a period of roughly 24 h, the time that Earth takes to perform a single rotation on its axis. Virtually every cell within the body possesses a molecular circadian clock that is capable of selfsustaining oscillations yet can adapt in response to exogenous zeitgebers (time givers), such as light, time of food intake, and exercise. At the molecular level, oscillations of circadian genes are regulated by a complex feedback loop controlled by a core set of transcription factors and their regulators. The essential regulator in mammals is BMAL1 (brain and muscle ARNT-like 1), which dimerizes with CLOCK (circadian locomotor output cycles kaput) and binds to E-box elements to activate the transcription of controlled clock genes (CCGs). Among the CCGs regulated by the BMAL1/ CLOCK complex are period (Per) and cryptochrome (Cry), whose encoded proteins also dimerize and act as inhibitors of BMAL1/CLOCK transcriptional activity.
Proteasomal degradation of PER and CRY subsequently releases the repression of BMAL1/CLOCK. A series of auxiliary loops also exist, including nuclear receptor subfamily 1 group D member 1 (Nr1d1 or REV-ERB-a) and member 2 (Nr1d2 or REV-ERBb), which inhibit BMAL1 transcription, and retinoic acid receptor (RAR)-related orphan receptor (ROR), which conversely activates BMAL1 transcription [1]. The transcriptional effects of BMAL1 regulation help to drive the 24-h rhythms of transcription of between 10% and 30% of genes in the genome, depending on the tissue [2]. Indeed, a key feature of circadian rhythms is that the circadian transcriptional program, that is, the 24-h oscillating genes, can vary greatly from one tissue to another due, in part, to the contribution of cell typespecific transcription and epigenetic regulators. Recent findings also reveal that communication both between tissues [3,4] and within a tissue type [5,6] is critical in defining the circadian transcriptional program within each organ [7]. For tissues in the periphery, such communication may also include local clock-independent effects, driven by systemic effects from SCN-generated behavior cycles such as feeding-fasting [3,6,8]. These conceptual advances have been supported by bioinformatic analyses that have classified rhythmic genes in each tissue.

Bioinformatic tools to define circadian transcriptomes
Since the late 1970s, a variety of methods for rhythmic gene detection have been developed and evaluated by chronobiology researchers, including parametric methods, such as Lomb-Scargle [9] and ARSER [10], as well as numerous nonparametric methods, of which the Jonckheere-Terpstra-Kendall algorithm for rhythmic gene detection (JTK_CYCLE) [11] is most widely used. These important tools have been invaluable in driving the field, and the bioinformatic tools are currently evolving to better capture circadian programs in a quantitative manner. Each tool has its own limitations. For example, the JTK_CYCLE detection scope is limited to symmetric cosine waveforms and excludes any asymmetric rhythms, such as sawtooth-shaped. As such, it will not capture the full profile of day/night differences in a given dataset, and this limitation should be considered accordingly. However, a number of other nonparametric algorithms, such as RAIN [12] (Rhythmicity Analysis Incorporating Nonparametric methods) and empirical-JTK_CYCLE-with-asymmetry [13], have been developed to fill this gap by detecting both symmetric and asymmetric waveforms. Among the recently developed methods, BioCycle utilizes a deep neural network trained with both real-world and synthetic data, which enables the detection of symmetric, asymmetric, periodic, and aperiodic waveforms [14].
Another challenge the field has become aware of in recent years concerns difficulties in determining whether the rhythmicity of a particular gene is different between two or more conditions. For example, it is now well established that the classically used 'Venn diagram' approach (overlapping 24-h oscillating genes identified using algorithmic detection performed separately for two or more groups, often compared regardless of their rhythmic phase) can lead to an overestimation of differences [15]. In parallel, phase set enrichment analysis (PSEA) [16] can be used for phasespecific enrichment of gene sets for each group and, thus, for further comparison of rhythmic processes and their phase alignment between the groups. However, the definition of rhythmic processes for PSEA within individual groups has the same weaknesses as the above algorithms for the definition of rhythmic genes, and only the most significant gene sets can be implicitly trusted. In response to such limitations, specific algorithms for differential rhythmicity analyses have been developed that focus on differences in rhythmic expression based on linear regression with rhythmic category classification into the amplitude change (gain or loss of rhythmicity), phase change, or unaltered rhythms [15]. Earlier methods in this direction, such as DODR [17] (Detection Of Differential Rhythmicity) and LimoRhyde [18], test the hypothesis of whether two rhythms are statistically different, while CircaCompare [19] provides a means to quantify differences specific to the desired rhythmic characteristic (mesor, amplitude, and phase). However, prior to hypothesis testing, a set of prefiltered rhythmic transcripts is defined using nonparametric (e.g., JTK_CYCLE or RAIN) or parametric (e.g., limma [20]) methods. By contrast, model selection frameworks, such as dryR [21] and compar-eRhythms [15], do not require prior rhythmic transcriptome definition and fit different linear models for each rhythmic category (including arrhythmic genes), with a subsequent choice of the best model based on an information-theoretic criterion. Interestingly, besides the model selection approach, compareRhythms also allows various traditional tools to be tried for hypothesis testing, while dryR allows multiple groups to be compared at a time and to include more than one categorical co-variate for batch correction. It must be noted that the problem of arbitrary thresholds remains to an extent, despite advancements in novel differential rhythmicity analysis methods. As such, each method must also be validated by manual inspection of data (see Tips below).

Circadian rhythms in stem cells
One understudied aspect of circadian regulation is the role of the circadian clock in stem cells. For instance, the circadian transcriptome of muscle stem cells (satellite cells) is subject to extensive rewiring during aging, which interestingly can be rescued in part through caloric restriction [22,23]. However, the role that circadian clocks themselves play in satellite cells is poorly understood. To facilitate such research, we have recently optimized a protocol and workflow (from isolation to bioinformatic analyses) for determining the rhythmic transcriptome in this technically challenging cell type. PE-conjugated anti-a7-integrin (AbLab, Vancouver, Canada, 53-0010-05).

Methods
Sample preparation: from harvesting muscles to muscle stem cell isolation To generate a circadian transcriptome, samples are collected at equal intervals over the circadian cycle (for full guidelines on guidelines for circadian experimental setups, see Hughes et al. [24]) Typically, a minimum of 6-time points is collected with a biological number of replicates of 3 or more. Special care must be taken to account for sex differences, as these have been reported to exert a significant effect on the identity of circadian output genes [25].
To isolate satellite cells from mice: 1 Euthanize mice by cervical dislocation and excise skeletal muscles. Dissect muscles from fore and hind limbs (abdominal muscle can also be included if necessary; gluteus and diaphragm are not usually collected) and place them in a 50 mL Falcon tube containing about 20 mL of DMEM plus 1% Penicillin-streptomycin (P-S), preferably placed on ice. 2 Harvest and mince muscles manually.

Tip 1
To ensure a rapid and clean filter step, gently centrifuge (2 min 50 g 4°C) to separate the remnants of tissues. Filter first the supernatant and then the resuspended pellet (in 10 mL of the DMEM, 1% P-S, and 10% FBS).

Tip 2
Cells can be frozen after antibody incubation by adding FACS buffer, centrifuging 10 min at 670 g 4°C, and covering the pellet in a freezing medium (FBS/10% DMSO).
14 Once the incubation with antibodies is completed, add FACS buffer up to 30 mL, centrifuge 10 min 670 g 4°C, and then resuspend the cell pellet in 450 mL of FACS buffer for satellite cell sorting.

Tip 3
The number of satellite cells sorted per mouse can vary (from about 50 000 to 200 000 cells), depending on the mouse model and age. The number of cells required can be sorted directly in the RLT buffer for RNA extraction, taking into account the efficiency of the buffer.

Tip 4
The day of the RNA extraction can produce a batch effect to be corrected in the bioinformatic analyses. If possible, it is recommended to perform the RNA extraction on the same day or randomize the samples in order to reduce the introduction of biases.

RNA extraction and quality control
Given the paucity of satellite cells, they can be first sorted via fluorescence-activated cell sorting (FACS), and RNA can then be extracted using an RNeasy micro kit (Qiagencat. no./ID: 74004) following the manufacturer's instructions. It is highly recommended to perform the DNase incubation step, even though it is suggested as an optional step in the protocol. Clean RNA is essential for the following assessments. After RNA extraction, RNA quality and concentration are analyzed using a 2100 Bioanalyzer or a Fragment Analyzer and an Agilent RNA 6000 pico kit suitable for RNA of an estimated concentration of 50-50 000 pgÁmL À1 . To be suitable for sequencing, the RNA should have a RIN (RNA Integrity Number) of at least 7/ 10 and an rRNA Ratio (28S/18S) of 2.

Quality check of the reads and bioinformatic tools for circadian analysis
The Nextflow nf-core/rnaseq v.3.2 pipeline [26] is used (a) to map the raw FastQ reads to the reference genome using STAR ALIGNER [27], (b) to project the alignments onto the transcriptome, and (c) to perform the downstream transcript-level quantification with Salmon [28]. Gene-level summarization of read counts and transcript per million (TPM) abundances is done using the R package TXIMPORT [29].
The following sample attributes are considered to filter high-quality samples: the number of reads mapped, the percentage of guanine-cytosine (GC) content, the percentage of reads mapped to the mitochondrial genome, and the distance from other samples of the group/ZT (Zeitgeber time) in principal component analysis (PCA). Correction for batch effects is applied: (a) to read counts at the stage of linear modeling by R packages DESEQ2 [30] or by DRYR for differential expression and differential rhythmicity analyses, or (b) directly to TPM using the 'ComBat' function of the SVA R package [31] prior to identifying rhythmic genes in individual groups. Batch-corrected TPM matrices are also used as an input to the gene set enrichment analysis (GSEA) [32].
DESeq2 can be used with default parameters and can include samples from all time points into the model, to define the lists of genes considered as differentially expressed between two groups, if the P-adjusted threshold is < 0.05. To evaluate biological functions differentially affected regardless of their rhythmicity, GSEA software is used with the following specific parameters: 'gene_set' permutation type and the 'T-test' metric for ranking genes using the median for class metrics instead of the mean. An FDR q-value threshold of 0.25 is used to delineate significant gene set enrichment.
Many algorithms are available for differential rhythmicity analysis of circadian datasets. Classically used methods in the field include JTK_CYCLE [24] (symmetric waveforms) and RAIN [12] (asymmetric waveforms). Caution must be taken when using such algorithms for comparing differential gene expression between groups, due to reported overestimation of differences [15]. In response, the field has developed algorithms more suitable for differential comparison such as LIMORHYDE [18] (linear models for rhythmicity, design) and DRYR [21]. We have still found the classic algorithmic approaches to be of descriptive use and biologically useful information may still be extracted by using an additional q-value cutoff for the PSEA (such as described below for JTK_CYCLE in combination with PSEA). Using JTK_CYCLE, which provides phase and amplitude outputs for detected rhythmic genes with cosine expression waveform over a period of 24 h, rhythmic genes are first selected by the adoption of a threshold (See Tips below) such as adjusted P-value < 0.05 as has been used previously for muscle stem cells [23]. To map rhythmic pathways and processes of muscle stem cells to the temporal scale, PSEA is performed on JTK_CYCLE-identified genes with the following parameters: domain from 0 to 24 h, a minimum of 10 genes per gene set, maximum 1 million simulations, with a Kuiper q-value <0.05. The enrichment is tested against a uniform background distribution to summarize any overall synchronization of peak phases within gene sets. To account for asymmetric waveforms, as an alternative method to call genes rhythmic in each genotype/condition, we choose RAIN, which can capture not only sinusoidal oscillations but also 'sawtooth' or 'spiky' patterns of gene expression [12]. As for JTK, caution must be taken during downstream analyses on gene sets identified using RAIN if comparing between conditions. Indeed, due to the multi-group nature of our analyses, we now use the dryR R package to define rhythmic classes of genes; additional details on its use can be found at this link https://github.com/naef-lab/dryR. For pairwise comparisons using DRYR R package, it is also necessary to apply cut-offs during data processing-here termed Bayesian information criterion weight (BICW). The choice of BICW that is appropriate to use changes according to the number of groups. For example, for two group analyses, BICW > 0.95 has been used [21] whereas BICW > 0.4 was used for four group comparisons. In our experience, we found a BICW > 0.6 to also be appropriate for four group comparisons. In accordance with the original publication, we use these BICW cut-offs with the following: amplitude > 0.25 and Cook's distance < 1. Enrichment of MsigDB gene sets with rhythmic genes corresponding to gain, loss, phase change, and same rhythm categories are performed using a hypergeometric test, with significance defined by Benjamini-Hochberg (BH) adjusted P-value < 0.05. Network representation and clustering of differentially rhythmic gene sets are performed based on common genes and semantic similarity using EnrichmentMap [37] and Auto-Annotate [38] plugins for Cytoscape [39]. References corresponding to the methodological tools discussed in this paragraph are summarized in Table 1.

Tips & tricks/troubleshooting
Quality control

Tip 1
Algorithms used to define the rhythmic transcriptome are often sensitive to outliers. Considering the high number of samples used in the circadian transcriptome analysis and the low amounts of RNA from muscle stem cells, the chance of having an outlier among the replicates of a single or a few time points is high. Furthermore, cell isolation and FACS sorting procedures could considerably affect the viability of cells, which may further skew the circadian transcriptome toward the markers of cell damage. Thus, it is necessary to include spare replicates and to perform extensive quality control of the sequenced samples prior to identifying rhythmic genes.
is essential for decreasing the intra-group variability of the gene expression peaks [40]. It is worth mentioning that a higher number of rhythmic genes could be detected in sexually homogeneous than in sexually heterogeneous experiments.

Tip 3
At the end of quality control, cleaning, and batch correction procedures, the arrangement of high-quality samples in the 2D PCA space might resemble a clock face (see Fig. 1).

Tip 1
There is no gold standard for the P-value threshold in chronobiology due to intrinsic differences between the algorithms [24]. For example, the P-adjusted threshold < 0.05 will be very permissive for BH-corrected RAIN results and extremely conservative for filtering JTK results by 'BH.Q', or BioCycle results by 'Q_VALUE'. In order to define a relevant P-value threshold, we utilize the heatmaps of scaled and zero-centered rhythmic gene expression (see Fig. 2) and visual inspection of the expression plots for individual genes.

Tip 2
Amplitude size (Fig. 3) is also an important parameter when evaluating the differences in circadian rhythmicity between conditions. The density of log-transformed amplitude distribution serves as a good means for visual comparison, while two-sided t-test statistics applied to the nontransformed amplitude distributions help to quantitatively estimate the difference in pairwise comparisons.

Tip 1
While high numbers of simulations are necessary to decrease the number of gene sets with P-values equal to zero, setting this parameter to 1 million may result in an excessively long computational time proportional to the length of the input gene list. The maximum number of simulations can be decreased if the input files contain high numbers of rhythmic genes.

Tip 2
As an alternative to using all rhythmic genes as an input for PSEA, one may try taking the top N rhythmic genes ranked by significance for each ZT. This could be useful for flattening the distribution of rhythmic genes (by cutting down the biggest peak) to give PSEA less chance to center the phase of all pathways at the biggest peak, thus spreading out phase distribution more evenly. This may also help to standardize the number of gene sets enriched across the experiments if the same N is used. However, the pitfall is that this threshold is very arbitrary, and many relevant genes (and thus gene sets) may be missed.

Tip 3
We are using a semi-automated approach to aggregate individual gene sets into broad muscle stem celloriented functional categories based on semantic similarity, positions in hierarchical trees of corresponding databases, and the analysis of gene set descriptions. This allows us to get distributions of peak phases (vector-average values) within the main rhythmic functions and to visualize their span and temporal synchronization across the diurnal cycle using R package CIRCLIZE [41] (see Fig. 4).

Tip 1
An increase in the number of group sizes also can lead to an increase in the number of rhythmic categories for each newly-included group; BICW values may have to be adjusted to account for this.

Tip 2
To facilitate the definition of the threshold for the BICW, we recommend plotting its distribution density for each comparison. A schematic workflow is provided in Fig. 5.