LDJump: Estimating variable recombination rates from population genetic data

Abstract As recombination plays an important role in evolution, its estimation and the identification of hotspot positions are of considerable interest. We propose a novel approach for estimating population recombination rates based on genotyping or sequence data that involves a sequential multiscale change point estimator. Our method also permits demography to be taken into account. It uses several summary statistics within a regression model fitted on suitable scenarios. Our proposed method is accurate, computationally fast, and provides a parsimonious solution by ensuring a type I error control against too many changes in the recombination rate. An application to human genome data suggests a good congruence between our estimated and experimentally identified hotspots. Our method is implemented in the R‐package LDJump, which is freely available at https://github.com/PhHermann/LDJump.

This transformation performed best among the transformations we considered, which also included logarithmic and exponential transformations. To tune the model with respect to homogeneity and normality of the residuals as well as high prediction accuracy, we compared the performance under different combinations of parameters. The considered grids of values for γ and ϵ in the Box-Cox transformation (1) were {0.001, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1} and {0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.15}, respectively. To assess the quality of fit, the plots of Figure 2 are produced with the chosen model and the training data. The left plot of Figure 2 shows the scatter plot of the predicted values (y-axis) and [...] and maximum deviation to the mean of 20%. The right plot of Figure 2 shows the QQ-plot for the residuals of the model. [...] points to the same choice of the parameters γ and ϵ. We finally chose γ = 0.5 and ϵ = 0 because of the much better performance on the variance homogeneity measures, at the cost of a slightly smaller value of R² and with a very similar value of the Shapiro-Wilk statistic.
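
The transformation and the tuning grid can be sketched as follows. This is a minimal illustration, not the LDJump implementation: it assumes that transformation (1) has the standard shifted Box-Cox form ((ρ + ϵ)^γ − 1)/γ, which is consistent with the statement in the bias-correction section that the value -2 maps back to a rate of zero for γ = 0.5 and ϵ = 0. The function names are hypothetical.

```python
import itertools

def box_cox(rho, gamma=0.5, eps=0.0):
    """Assumed form of transformation (1): ((rho + eps)**gamma - 1) / gamma."""
    return ((rho + eps) ** gamma - 1.0) / gamma

def inverse_box_cox(y, gamma=0.5, eps=0.0):
    """Back-transformation to the recombination-rate scale."""
    return (gamma * y + 1.0) ** (1.0 / gamma) - eps

# Parameter grids as listed in the text.
GAMMA_GRID = [0.001, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1]
EPS_GRID = [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06,
            0.07, 0.08, 0.09, 0.10, 0.15]

# Every (gamma, eps) combination that would be compared during tuning.
PARAMETER_GRID = list(itertools.product(GAMMA_GRID, EPS_GRID))
```

With the chosen values γ = 0.5 and ϵ = 0, `box_cox(0.0)` evaluates to -2.0, and the grid contains 8 × 12 = 96 candidate combinations.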

Bias Correction and Homoscedasticity Check
We applied a simulation-based bias correction because of an observable bias, especially for setups with small background rates. To this end, we simulated recombination maps of length 1000 kb (1 Mb) with 15 hotspots in total, each of length 1 kb [...] (7) Values smaller than -2 after bias correction are set to -2, so that they equal zero after the back-transformation.
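
The truncation step can be illustrated with a short sketch. It assumes the chosen parameters γ = 0.5 and ϵ = 0 and a back-transformation of the form (0.5·y + 1)², which is an assumption consistent with -2 mapping to zero; the function name is hypothetical and the bias correction itself is not reproduced here.

```python
FLOOR = -2.0  # transformed value that corresponds to a rate of zero

def truncate_and_back_transform(corrected_values):
    """Clamp bias-corrected values at -2, then map back to the rate scale.

    Assumes gamma = 0.5 and eps = 0, so the inverse is (0.5 * y + 1) ** 2.
    """
    rates = []
    for y in corrected_values:
        y = max(y, FLOOR)                 # values below -2 are set to -2
        rates.append((0.5 * y + 1.0) ** 2)
    return rates
```

Without the clamp, a value such as -2.3 would back-transform to a small positive rate, because squaring discards the sign; setting it to -2 first yields exactly zero.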

Detailed Quality Assessment for Simple Setups
In Table 2 we provide a detailed quality assessment of the considered methods for simple setups. More specifically, we computed the mean, median, and standard deviation (across [...] We compare the performance of FastEPRR under the simple setups with respect to segment lengths in Figure 9. The variation of the estimates increases with segment length. In contrast, the median per group decreases with segment length.

[Figure: Comparison with respect to RMSE between segment lengths using FastEPRR]

Detailed Quality Assessment for Natural Setups
Figure 10 shows our considered quality measures depending on the background recombination rates. We provide the average performance over 20 replicates. LDhat has constant PCB and decreasing PCH as the background rate increases. LDJump shows constant values for PCH and slightly increasing PCB for higher background rates. With increasing background rates, the overall measure AP slightly increases for LDJump and decreases for LDhat. The weighted RMSE is also plotted. LDhat leads to a slightly smaller weighted RMSE, with the differences decreasing for larger ρ.

Quality Assessment for Natural Setups with FastEPRR
Here we compare the results of LDJump with FastEPRR based on the natural setups. Note that, because of the very high error share of 88% for FastEPRR with segment lengths of 1 kb, we compare only the recombination maps that were actually estimated. For the sake of visibility, we assess LDJump using our recommended quantile of 0.35 in the bias correction and compare across 500, 1000, 1500, and 2000 segments. Figure 11 shows that LDJump estimates recombination maps with smaller WRMSE, irrespective of the segment lengths considered, and has a much higher share of correctly identified hotspots (PCH) but a lower share of correctly identified background rates (PCB).

Runtime under Simple Setups
Table 3 compares the runtimes of the considered software packages using the mean (top), median (middle), and SD (bottom) of our measured runtimes.

Effect on Runtime by Increasing Sample Size and Sequence Length
In Table 4 we explore the effects of sample size and sequence length on the runtime. We compared the aforementioned methods with respect to their mean and median runtimes, again for our simple setups. The runtimes of LDhat and LDhelmet are strongly affected by sequence length and sample size. Interestingly, LDhat seems to have more problems dealing with longer sequences, whereas LDhelmet shows an especially large increase in runtime when the sample size increases. The runtime of LDJump (using segments of length 500 and 1000 bp) seems to be less sensitive to such increases. Doubling the sequence length adds only 16% to the average runtime. Increasing the sample size has almost no effect on the runtime of LDJump. FastEPRR (using a segment length of 1 kb) behaves similarly, although the effects are more pronounced and its initial runtime for the smallest sample size and sequence length is about twice as large.
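
To put the 16% figure in perspective, one can compute the scaling exponent it would imply. The sketch below assumes, purely for illustration, that runtime follows a power law in sequence length (runtime ∝ length^k); this model is not claimed in the text, and the function name is hypothetical.

```python
import math

def implied_scaling_exponent(runtime_ratio, size_ratio=2.0):
    """Exponent k such that size_ratio ** k == runtime_ratio.

    Assumes a power-law relation between problem size and runtime.
    """
    return math.log(runtime_ratio) / math.log(size_ratio)

# A 16% runtime increase when the sequence length is doubled.
k_ldjump = implied_scaling_exponent(1.16)
```

Under this assumption the exponent is roughly 0.21, well below the value of 1 that linear scaling in sequence length would give.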