Utilizing random regression models for genomic prediction of a longitudinal trait derived from high‐throughput phenotyping

Abstract The accessibility of high‐throughput phenotyping platforms in both the greenhouse and field, as well as the relatively low cost of unmanned aerial vehicles, has provided researchers with an effective means to characterize large populations throughout the growing season. These longitudinal phenotypes can provide important insight into plant development and responses to the environment. Despite the growing use of these new phenotyping approaches in plant breeding, the use of genomic prediction models for longitudinal phenotypes is limited in major crop species. The objective of this study was to demonstrate the utility of random regression (RR) models using Legendre polynomials for genomic prediction of shoot growth trajectories in rice (Oryza sativa). An estimate of shoot biomass, projected shoot area (PSA), was recorded over a period of 20 days for a panel of 357 diverse rice accessions using an image‐based greenhouse phenotyping platform. A RR that included a fixed second‐order Legendre polynomial, a random second‐order Legendre polynomial for the additive genetic effect, a first‐order Legendre polynomial for the environmental effect, and heterogeneous residual variances was used to model PSA trajectories. The utility of the RR model over a single time point (TP) approach, where PSA is fit at each time point independently, is shown through four prediction scenarios. In the first scenario, the RR and TP approaches were used to predict PSA for a set of lines lacking phenotypic data. The RR approach showed a 11.6% increase in prediction accuracy over the TP approach. Much of this improvement could be attributed to the greater additive genetic variance captured by the RR approach. The remaining scenarios focused forecasting future phenotypes using a subset of early time points for known lines with phenotypic data, as well new lines lacking phenotypic data. In all cases, PSA could be predicted with high accuracy (r: 0.79 to 0.89 and 0.55 to 0.58 for known and unknown lines, respectively). This study provides the first application of RR models for genomic prediction of a longitudinal trait in rice and demonstrates that RR models can be effectively used to improve the accuracy of genomic prediction for complex traits compared to a TP approach.


41
With the advent of next-generation sequencing technologies, the biology community has metric has been shown to be an accurate representation of shoot biomass (Campbell et al., 137 2015;Golzarian et al., 2011;Knecht et al., 2016). Prior to downstream analyses, outlier 138 plants at each time point were detected for each trait using the 1.5(IQR) rule. Plants that 139 were flagged as potential outliers were plotted and inspected visually. Those that exhibited 140 abnormal growth patterns were removed. A total of 32 plants were removed, leaving a total 141 of 2,604 plants for downstream analyses. PSA was modeled across all twenty time points using several RR models. Following the 144 notation of Mrode (2014), the RR models can be summarized as Here β is the fixed second-order Legendre polynomial to model the overall trend in the trait 146 overtime, u jk and s jk are the k th random regression coefficients for additive genetic effect and 147 random experiment of line j, nr is the order of polynomial for the random effects, and e tjk 148 is the random residual. The order of β was selected based on visual inspection of the trends.

149
Various polynomial functions and residual variance structures were evaluated for line and 150 experiment, and residuals, respectively. A complete description of the models is provided 151 in Table 1. For each trait, the models were ranked based on goodness-of-prediction using 152 Akaike's information criterion (AIC) scores (Akaike, 1974).
Here, y is the PSA at time t; Z and Q are incidence matrices corresponding to the random additive genetic effect (u), and random experimental effect (s), respectively; and e is the random residual error. For the random terms we assume u ∼ N (0, Gσ 2 g ), s ∼ N (0, Iσ 2 s ), and e ∼ N (0, Iσ 2 e ). Here, σ 2 g is the additive genetic variance; σ 2 s is an environmental variance associated with experiment; and σ 2 e is the residual variance. A genomic relationship matrix (G) was calculated using VanRaden (2008).
Here, Z cs is a centered and scaled n × m matrix, where m is 33,674 SNPs and n is the 357 157 genotyped rice lines. 158 2.6 Genomic selection using random regression 159 For each trait, the "best" random regression model was used to predict gBLUPs. The 160 following mixed model was used to predict gBLUPs The variables are the same as in Selection of random regression models, however note that nr 162 has been replaced with 2 and 1 for the additive genetic and experiment effect, respectively.

163
Thus the random additive genetic effects are described using a second-order Legendre poly-164 nomial, while a first-order Legendre polynomial is used to describe the experiment effects 165 across time points.
In matrix notation, the model is with all vectors and matrices defined as above. However here u is now a vector of random 167 regression coefficients for the additive genetic effects. For the random terms we assume  where shoot biomass increases nearly exponentially ( Figure 2A, Figure S1).

239
These models are particularly advantageous in that differences in the shape of the curve 240 can be accounted for, and can be solved using the conventional mixed model framework.

241
Thus, in the scope of genetics, these models allow for inter-individual variation in the mean 242 trend to be estimated. Here, the overall mean growth trend was modeled using a second-243 order Legendre polynomial. A total of eight models were evaluated to identify a model 244 that adequately described the data and could be used for GS. Each model included a fixed 245 second-order Legendre polynomial to describe the overall mean growth trend, while several 246 Legendre polynomials ranging from zero to second-order Legendre polynomials were fitted for 247 random genetic and experimental effects. The residual effects were assumed to be constant 248 or heterogeneous across time points using an identity or diagonal matrix, respectively. The 249 "best" model was selected based on the smallest AIC value. Table S1 provides an overall 250 summary of the models and the corresponding AIC values. The "best" model (Model 8) was 251 one that included a fixed second-order polynomial to model the mean trend in shoot growth, 252 a second-order Legendre polynomial for the random additive genetic effect, a first-order

253
Legendre polynomial for the experimental effect, and the residual variance was assumed to 254 be heterogeneous over time points. Figure 2B shows the predicted PSA obtained with model 255 8 for two lines with contrasting contrasting genetic values for the RR coefficients.  Figure 3B). Interestingly, a strong genetic correlation was observed between day 264 1 and day 20 (r = 0.91), indicating that shoot growth (e.g. PSA) may be driven by similar 265 genetic mechanisms at the early seedling and active tillering stage in rice.

266
To evaluate the ability of the longitudinal RR approach to capture additive genetic vari-267 ance, the narrow sense heritability of PSA was estimated using the RR model described identical h 2 estimates on day 1, however at later time points h 2 of RR was considerably 276 higher than TP. These results suggest that the RR approach captures more additive genetic 277 variance for PSA than the TP approach.  Here, the the objective is to predict future phenotypes for lines with phenotypic trajectories 327 recorded earlier in the growing season or development. To this end, the dataset was separated 328 into two, with the first ten time points serving as a training set to predict the phenotypes 329 for the last ten days. This approach is described in Figure 1B. The RR model described above was fit to the data. To assess the accuracy of prediction, two-fold cross validation was 331 performed in which 50% of the lines were randomly selected for training and prediction, and 332 the resampling process was repeated ten times. The accuracy of prediction was very high, 333 ranging from 0.79 to 0.82 for the last ten time points without phenotypic records ( Figure   334 5B). A slight decline in prediction accuracy was observed after day 10, with day 11 exhibiting 335 the highest accuracy (r = 0.82) and the lowest accuracy on day 20 (r = 0.79). This trend in 336 prediction accuracy is expected, given that the phenotypic records at day 11 should be very 337 highly correlated with those at day 10, with the correlation declining as time progresses.

338
The high predictive ability observed indicates that the first ten time points is sufficient to 339 accurately predict future phenotypes for known lines. Here, the aim is to predict future phenotypes for new lines with no phenotypic records using 347 early phenotypic records for existing lines. To this end, the dataset was partitioned into 348 two temporal datasets, with the first ten time points serving as a training set to predict the 349 phenotypes for the last ten days ( Figure 1C). As above, a two-fold cross validation approach 350 was used to assess prediction accuracy. Half the lines were randomly assigned to each fold, 351 and the first ten time points from the first fold were used to predict the phenotypes at the 352 last ten time points in the second fold. The prediction accuracies for scenario C were very 353 similar to those observed for scenario A. Accuracies ranged from 0.48 to 0.57, with the pre-354 diction accuracy ranging from 0.55 to 0.57 in the last ten days ( Figure 5C). The prediction 355 accuracies showed a slight increase from day 1 to day 9. The highest prediction accuracy 356 was observed at day 15, while the lowest accuracy was observed at day 1. These results

357
suggest that future phenotypes can be forecast for new lines with reasonable accuracy using 358 phenotypic records from earlier time points for a set of known lines.

379
The prediction accuracy was high with r values ranging from 0.51 to 0.59 ( Figure 5D).

380
The prediction accuracy was relatively constant, but showed a slight increase in accuracy  approaches were nearly equivalent ( Figure S2). Thus, the higher prediction accuracy is due 410 to the higher h 2 of the RR approach relative to the TP approach. it is important to note that the current study utilized a diversity panel with considerable 415 population stratification and the prediction models did not account for population structure.

416
Accounting for population structure is important in genome wide association studies to   and were used to predict the phenotypes at the last ten time points for the same 179 lines.

574
In (C), a forecasting approach was again used, however the lines were randomly split in two, 575 and the first ten time points were used to predict phenotypes in the last ten time points for 576 a group of new lines. In (D) the first 20 time points was used to predict gBLUPs at a later 577 time points in an independent study. Here, a publicly available dataset was used as a testing 578 set in which 357 lines were phenotyped from 20 to 40 days after transplant, thus a 13 day 579 overlap was available for the two datasets. Here, the independent dataset is indicated with 580 P SA LaterV eg. . Excluded indicates that these data points were not included for analyses. included a fixed second-order Legendre polynomial, the random additive genetic effect were 607 modeled using a second-order Legendre polynomial, a first-order random effect was used for 608 experiment, and the residual variance was assumed to be heterogeneous over time points.

609
For both models, the experimental term was considered as an environmental effect.      Table S1: Random regression model selection. Each of the four random regression models included a fixed second-order polynomial to model the mean trend in PSA over the twenty time points, indicated by the column f . G refers to the random additive genetic effect, Exp the random experimental effect, and e error term. Models with Diag assumed heterogeneous residual variance over time points, while those with I assumed the residual variance was constant. pol n refers to a Legendre polynomial of order n.  Figure S2: Predictive ability of the random regression (RR) and single time point (TP) approaches expressed as a function of heritability: The analysis followed the same approach as that for scenario A, however for each fold the correlation between gBLUP and observed PSA was divided by the square root of heritability. The error bars represent the standard deviation where n = 20.