Aging clock based on nucleosome reorganisation derived from cell‐free DNA

Abstract Aging induces systematic changes in the distribution of nucleosomes, which affect gene expression programs. Here we reconstructed nucleosome maps based on cell‐free DNA (cfDNA) extracted from blood plasma using four cohorts of people of different ages. We show that nucleosomes tend to be separated by larger genomic distances in older people, and that age correlates with the nucleosome repeat length (NRL). Furthermore, we developed the first aging clock based on cfDNA nucleosomics. Machine learning based on cfDNA distance distributions allowed predicting a person's age with a median absolute error of 3–3.5 years.

the Cristiano cohort, including samples with low sequencing coverage to create a more robust classifier (Supplementary Table S6). The latter regions were located outside of annotated GRCh37 genes. The manipulations with BED files were performed using BedTools (Quinlan & Hall, 2010). PCA was calculated using R.
Age prediction based on nucleosome-nucleosome distances. Two types of machine learning (ML) models were constructed, as represented by Figures 2C and 2D. The age prediction model was developed by training a linear regression algorithm on 80% of the dataset. This approach assumes a linear relationship between the biological age of the donors and selected features based either on the distribution of nucleosome-nucleosome distances (Figure 2D) or on cfDNA fragment sizes (Figure 2C). For the model based on nucleosome-nucleosome distances (Figure 2D), the distributions as in Figure 2B were calculated for each individual sample in the range [50 bp, 2000 bp], and each sample was normalised by subtracting the mean and dividing by the maximum value of the Y-coordinate within that sample. To address the high dimensionality of the dataset, we employed a feature selection strategy based on identifying the peaks of the nucleosome-nucleosome distance distributions. The median of the density values at each nucleosome-nucleosome distance was calculated across all Cristiano et al. samples with high sequencing coverage used in the analysis, and a Savitzky-Golay filter with a polynomial order of 2 and a window size of 123 was applied to smooth the distribution. After smoothing, we identified local maxima using the find_peaks method from the SciPy Python library, with a threshold of 0.000001 for the relative difference in the normalised Y-coordinate values and a minimal distance between two features of 90 bp. This yielded a set of 10 peaks, which were further refined with the SelectKBest feature selection method from the sklearn Python library based on the ANOVA F-statistic, upon which one of the features was excluded. The linear regression function of the sklearn library was then applied using the remaining 9 features.
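The peak-based feature extraction described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the synthetic damped-oscillation profiles, the 1-bp grid, the helper names, the reading of the normalisation as "subtract the mean, then divide by the maximum of the raw profile", and the mapping of the reported threshold and minimal distance onto the `threshold` and `distance` arguments of `find_peaks` are all assumptions.

```python
import numpy as np
from scipy.signal import find_peaks, savgol_filter

# 1-bp grid covering the range [50 bp, 2000 bp] (grid step is an assumption)
DISTANCES = np.arange(50, 2001)

def normalise_profile(y):
    """Subtract the mean, then divide by the maximum of the raw profile (assumed order)."""
    y = np.asarray(y, dtype=float)
    return (y - y.mean()) / y.max()

def select_peak_indices(profiles, window=123, polyorder=2,
                        threshold=1e-6, min_distance=90):
    """Smooth the per-distance median profile and return indices of its local maxima."""
    median_profile = np.median(profiles, axis=0)
    smoothed = savgol_filter(median_profile, window_length=window, polyorder=polyorder)
    peaks, _ = find_peaks(smoothed, threshold=threshold, distance=min_distance)
    return peaks

def synthetic_profile(amplitude):
    """Hypothetical stand-in for a real distance distribution: damped ~187 bp oscillation."""
    return np.exp(-DISTANCES / 800.0) * (
        1.0 + amplitude * np.cos(2 * np.pi * DISTANCES / 187.3)
    )

profiles = np.array([normalise_profile(synthetic_profile(a)) for a in (0.4, 0.5, 0.6)])
peak_idx = select_peak_indices(profiles)
peak_positions = DISTANCES[peak_idx]   # peak locations in bp
features = profiles[:, peak_idx]       # one row per sample, one column per peak

print(peak_positions)
print(features.shape)
```

Each column of `features` holds the normalised Y-coordinate of one peak, which is the kind of per-sample feature vector that would then be passed to feature selection and regression.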
Age prediction based on cfDNA fragment size distributions. For the ML model based on cfDNA fragment sizes (Figure 2C), the procedure described above was applied with the following modifications. The input data were in the form of the density of fragment sizes as in Figure 2B, calculated separately for each sample using data from 85 healthy people reported by Cristiano et al. The mean of the density values at each fragment size was smoothed with a Gaussian filter, and the local maxima on the resulting curve were determined using the find_peaks method from the SciPy Python library, requiring a minimal distance between two peaks of 45 bp and a threshold of 0.0000001 for the relative difference in Y-coordinate values. This resulted in 33 features, which were further refined with the SelectKBest feature selection method from the sklearn Python library based on the ANOVA F-statistic, upon which one of the features was excluded. The linear regression function of sklearn was then applied using these features.
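A comparable sketch for the fragment-size variant is given below. It is illustrative only: the fragment-size range, the synthetic three-peak densities and the helper names are hypothetical, and the smoothing width `sigma=2.2` is borrowed from the value reported for the classification model, since no value is stated for the regression model.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks

# Fragment sizes in bp; the range is an assumption, not taken from the source
SIZES = np.arange(50, 701)

def fragment_size_peaks(densities, sigma=2.2, threshold=1e-7, min_distance=45):
    """Gaussian-smooth the per-size mean density and return indices of its local maxima."""
    mean_density = np.mean(densities, axis=0)
    smoothed = gaussian_filter1d(mean_density, sigma=sigma)
    peaks, _ = find_peaks(smoothed, threshold=threshold, distance=min_distance)
    return peaks

def synthetic_density(scale):
    """Hypothetical fragment-size density with mono-, di- and tri-nucleosome peaks."""
    centres = (167, 334, 501)
    heights = (1.0, 0.3 * scale, 0.1 * scale)
    widths = (15.0, 20.0, 25.0)
    y = sum(h * np.exp(-0.5 * ((SIZES - c) / w) ** 2)
            for c, h, w in zip(centres, heights, widths))
    return y / y.sum()  # normalise so the values sum to 1

densities = np.array([synthetic_density(s) for s in (0.8, 1.0, 1.2)])
peak_idx = fragment_size_peaks(densities)
features = densities[:, peak_idx]  # density at each peak, one row per sample

print(SIZES[peak_idx])
```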
For both ML models, the data were randomly split into training and testing sets of 80% and 20%, respectively. The performance of the ML models was evaluated using the mean squared error (MSE) and the Pearson correlation coefficient (r).
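The split, feature selection, regression and evaluation steps might look like the sketch below. The data are synthetic, with 85 samples and 10 hypothetical peak features that depend linearly on age. Using `f_regression` as the SelectKBest score function is an assumption for a continuous target: the source only names the ANOVA F-statistic. Fitting the selector on the training split alone is also a design choice of this sketch, not something the source states.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, median_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: peak heights that vary linearly with age plus noise
rng = np.random.default_rng(0)
n_samples, n_peaks = 85, 10
ages = rng.uniform(20, 80, size=n_samples)
slopes = np.linspace(0.005, 0.015, n_peaks)
X = ages[:, None] * slopes + rng.normal(0.0, 0.05, size=(n_samples, n_peaks))

# Random 80% / 20% split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, ages, test_size=0.2, random_state=0
)

# Keep all but one feature, ranked by a univariate F-statistic (assumed score function)
selector = SelectKBest(score_func=f_regression, k=n_peaks - 1).fit(X_train, y_train)
model = LinearRegression().fit(selector.transform(X_train), y_train)
predicted = model.predict(selector.transform(X_test))

mse = mean_squared_error(y_test, predicted)
r, _ = pearsonr(y_test, predicted)
med_ae = median_absolute_error(y_test, predicted)  # error measure quoted in the abstract

print(f"MSE={mse:.2f}, Pearson r={r:.3f}, median absolute error={med_ae:.2f} years")
```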
Age classification using ML. We created ML models that classify age into the following two groups: (1) ≤ 55 years old and (2) > 55 years old. This age threshold was chosen in order to have a balanced number of samples in each age group, based on the available age distribution in the Cristiano et al. dataset. Similar to age prediction using linear regression, we constructed two models, one based on cfDNA fragment sizes and one based on nucleosome-nucleosome distances. The data preprocessing was done in the same way as for the linear regression ML models described above. For the classification model based on nucleosome-nucleosome distances, we identified 9 significant peaks of the distribution. The normalised Y-coordinates of these peaks were used as features in a logistic regression algorithm. We employed the logistic regression algorithm from the scikit-learn library with the following parameters: regularization strength (C) set to 1000, the solver set to 'newton-cg', the penalty set to 'l2', and the maximum number of iterations (max_iter) set to 1000. This model achieved F1 = 0.9 and AUC = 0.96. For the predictor based on cfDNA fragment sizes, the classification model retained the same data preprocessing, but a different smoothing parameter (sigma = 2.2) was applied, which yielded 6 peaks of the fragment size distribution. Moreover, we deployed the Random Forest algorithm implemented in the scikit-learn library, with hyperparameters configured as follows: min_samples_leaf=1, min_samples_split=2, max_depth set to unlimited, and a total of 100 trees (n_estimators=100).
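The two classifiers with the reported hyperparameters can be sketched as follows. The feature matrices are synthetic and hypothetical (9 features for the distance-based model, 6 for the fragment-size model), and the stratified 80/20 split, the random seeds and the helper names are assumptions. The L2 penalty is scikit-learn's default for logistic regression, so it is not passed explicitly here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
ages = rng.uniform(20, 80, size=85)
labels = (ages > 55).astype(int)  # 0: <= 55 years old, 1: > 55 years old

def synthetic_features(n_features):
    """Hypothetical peak heights that vary linearly with age plus noise."""
    slopes = np.linspace(0.005, 0.015, n_features)
    return ages[:, None] * slopes + rng.normal(0.0, 0.05, size=(ages.size, n_features))

def evaluate(model, X):
    """Fit on 80% of the samples and report F1 and ROC AUC on the remaining 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.2, random_state=0, stratify=labels
    )
    model.fit(X_tr, y_tr)
    f1 = f1_score(y_te, model.predict(X_te))
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    return f1, auc

# Logistic regression for the 9 distance-based peak features (L2 penalty is the default)
log_reg = LogisticRegression(C=1000, solver="newton-cg", max_iter=1000)
f1_lr, auc_lr = evaluate(log_reg, synthetic_features(9))

# Random Forest for the 6 fragment-size peak features
forest = RandomForestClassifier(
    n_estimators=100, max_depth=None, min_samples_leaf=1,
    min_samples_split=2, random_state=0,
)
f1_rf, auc_rf = evaluate(forest, synthetic_features(6))

print(f"Logistic regression: F1={f1_lr:.2f}, AUC={auc_lr:.2f}")
print(f"Random Forest:       F1={f1_rf:.2f}, AUC={auc_rf:.2f}")
```

The scores printed here reflect only the synthetic data and are not comparable with the values reported for the real cohort.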
A) The distribution of distances between centers of cfDNA fragments calculated with NucTools for a cfDNA sample from a 25-year-old person (SRA accession number SRR7170698), shown separately for each chromosome. B) Linear regression of the locations of the peak summits of the average genome-wide profile of the distribution of distances between centers of cfDNA fragments from (A). The slope of the linear fit line is equal to the nucleosome repeat length value.
A) The distribution of distances between centers of cfDNA fragments calculated with NucTools for a cfDNA sample from a 25-year-old person (SRA accession number SRR7170699), averaged across all chromosomes. B) Linear regression of the locations of the peak summits of the average genome-wide profile of the distribution of distances between centers of cfDNA fragments from (A). The slope of the linear fit line is equal to the nucleosome repeat length value.
A) The distribution of distances between centers of cfDNA fragments calculated with NucTools for a cfDNA sample from a 25-year-old person (SRA accession number SRR7170700), averaged across all chromosomes. B) Linear regression of the locations of the peak summits of the average genome-wide profile of the distribution of distances between centers of cfDNA fragments from (A). The slope of the linear fit line is equal to the nucleosome repeat length value.
A) The distribution of distances between centers of cfDNA fragments calculated with NucTools for a cfDNA sample from a 75-year-old person (SRA accession number SRR7170701), averaged across all chromosomes. B) Linear regression of the locations of the peak summits of the average genome-wide profile of the distribution of distances between centers of cfDNA fragments from (A). The slope of the linear fit line is equal to the nucleosome repeat length value.
A) The distribution of distances between centers of cfDNA fragments calculated with NucTools for a cfDNA sample from a 75-year-old person (SRA accession number SRR7170702), averaged across all chromosomes. B) Linear regression of the locations of the peak summits of the average genome-wide profile of the distribution of distances between centers of cfDNA fragments from (A). The slope of the linear fit line is equal to the nucleosome repeat length value.
A) The distribution of distances between centers of cfDNA fragments calculated with NucTools for a cfDNA sample from a 75-year-old person (SRA accession number SRR7170703), averaged across all chromosomes. B) Linear regression of the locations of the peak summits of the average genome-wide profile of the distribution of distances between centers of cfDNA fragments from (A). The slope of the linear fit line is equal to the nucleosome repeat length value.
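The NRL estimate described in panel B of these captions, the slope of a linear fit of peak summit location against peak number, can be illustrated with a short sketch. The peak positions below are hypothetical values chosen for illustration, not measurements from any of the listed samples, and the function name is an assumption.

```python
import numpy as np

def nucleosome_repeat_length(peak_positions_bp):
    """Fit peak position against peak number; the slope is the NRL estimate in bp."""
    positions = np.asarray(peak_positions_bp, dtype=float)
    peak_numbers = np.arange(1, positions.size + 1)
    slope, intercept = np.polyfit(peak_numbers, positions, deg=1)
    return slope, intercept

# Hypothetical peak summit locations (bp) of a distance distribution
example_peaks = [186, 377, 565, 757, 946]
nrl, offset = nucleosome_repeat_length(example_peaks)
print(f"NRL = {nrl:.1f} bp, intercept = {offset:.1f} bp")
```

Fitting all summits jointly, rather than averaging the gaps between neighbouring peaks, makes the estimate less sensitive to the placement of any single summit.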