Metabolomic prediction of yield in hybrid rice
Summary
Rice (Oryza sativa) provides a staple food source for more than 50% of the world's population. An increase in yield can significantly contribute to global food security. Hybrid breeding can potentially help to meet this goal because hybrid rice often shows a considerable increase in yield when compared with pure‐bred cultivars. We recently developed a marker‐guided prediction method for hybrid yield and showed a substantial increase in yield through genomic hybrid breeding. We now have transcriptomic and metabolomic data as potential resources for prediction. Using six prediction methods, including least absolute shrinkage and selection operator (LASSO), best linear unbiased prediction (BLUP), stochastic search variable selection, partial least squares, and support vector machines using the radial basis function and polynomial kernel function, we found that the predictability of hybrid yield can be further increased using these omic data. LASSO and BLUP are the most efficient methods for yield prediction. For high heritability traits, genomic data remain the most efficient predictors. When metabolomic data are used, the predictability of hybrid yield is almost doubled compared with genomic prediction. Of the 21 945 potential hybrids derived from 210 recombinant inbred lines, selection of the top 10 hybrids predicted from metabolites would lead to a ~30% increase in yield. We hypothesize that each metabolite represents a biologically built‐in genetic network for yield; thus, using metabolites for prediction is equivalent to using information integrated from these hidden genetic networks for yield prediction.
Introduction
Rice breeders have been struggling to improve yield via selection due to low heritability of the trait. Fortunately, such traits often demonstrate great heterosis (Zhou et al., 2012). Therefore, hybrid breeding is key to increasing yield in rice by taking advantage of heterosis. Although superior hybrids have been identified and are being used in rice production, there are about 10 000 rice cultivars available in the world and only a small proportion of all potential crosses have been evaluated in the field. Experimental evaluation of all crosses would allow us to identify the most valuable hybrids. However, any experiment requires temporal and spatial replications, and such a large‐scale experiment is impractical due to limited resources. Genomic hybrid breeding (Bernardo, 1994) provides a solution for predicting all potential hybrids. Theoretically, we can predict all potential hybrids of a given set of parents using a subset of crosses.
Advanced molecular technologies allow us to measure the expression of many metabolites and transcripts. Metabolome and transcriptome provide new sources of data for hybrid prediction. Previously, microarray data were analyzed largely in relation to genomic data, called eQTL mapping (Bing and Hoeschele, 2005; Rockman and Kruglyak, 2006; Keurentjes et al., 2007; West et al., 2007; Wang et al., 2014). Additionally, metabolites can be detected and quantified by chromatographic mass spectrometry (Dunn and Ellis, 2005). These metabolomic profiling data have also been analyzed in relation to genomic data, called mQTL mapping (Keurentjes et al., 2006; Schauer and Fernie, 2006; Rowe et al., 2008; Keurentjes, 2009; Lisec et al., 2009; Carreno‐Quintero et al., 2013; Gong et al., 2013; Chen et al., 2014). An issue with these types of analysis is that the genetic networks inferred from eQTL and mQTL mapping are generic because they are not necessarily associated with any traits.
For the first time, Frisch et al. (2010) predicted hybrid performance of maize using transcriptomic data measured from the parents, and showed that transcriptome‐based prediction was more accurate than prediction from DNA markers. Fu et al. (2012) further compared different methods of hybrid prediction in maize yield using parental transcriptomic data. Use of metabolites to predict plant phenotypes has been reported (Meyer et al., 2007; Gärtner et al., 2009; Steinfath et al., 2010; Riedelsheimer et al., 2012; Feher et al., 2014). Three of these reports addressed hybrid prediction using both genomic and metabolomic data. Gärtner et al. (2009) predicted biomass of Arabidopsis thaliana using two backcross populations from 359 recombinant inbred lines (359 × 2 = 718 hybrids). They found that the predictability of biomass was 0.17 from genome and 0.16 from metabolome, where the predictability was defined as the squared correlation between the observed and the predicted phenotypes. Riedelsheimer et al. (2012) predicted traits in hybrid maize using 285 inbred lines crossed with two testers (285 × 2 = 570 hybrids), and found that the average predictability of seven traits was 0.53 from genome and 0.32 from metabolome. We converted their correlations into squared correlations to comply with our definition of predictability. They claimed that the predictabilities were very similar from the two predictors (Appendix S1). The important message from the above studies is that metabolites are useful predictors for quantitative traits. Feher et al. (2014) also proposed to use metabolites to predict hybrid biomass and they demonstrated the method using a toy example of maize, with only four parents and their hybrids.
One characteristic of these experiments is that all hybrids from the current parents have been evaluated, leaving no future crosses for prediction. To predict future crosses, we need a cross‐experiment with a portion of the hybrids evaluated in the field to predict the remaining portion. The hybrid experiment of Hua et al. (2003) set an example of this kind. This experiment involved 210 inbred parents with 21 945 potential hybrids. Of all the hybrids, only 278 were evaluated in the field and the remaining 21 667 have yet to be tested. We proposed to use the 278 hybrids as a training sample and predict all potential hybrids. This technology is directly transferrable to commercial hybrid breeding. Recently, we predicted yields of all hybrids using SNP data with a predictability of 0.20 (Xu et al., 2014). We predicted that if the top 10 crosses were selected for hybrid breeding, the yield would increase by 16%. The predictability for a diverse panel would be higher given the much increased genetic variation. The parents initiating the cross‐experiment of Riedelsheimer et al. (2012) represent a diversity panel with a wider inference space. Unfortunately, all potential hybrids were evaluated in that study and thus no future hybrids were predicted.
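The counts quoted above follow from simple combinatorics. The short check below is only an illustration; it assumes that reciprocal crosses are not counted separately, so that the number of potential hybrids is the number of unordered pairs of the 210 parents.

```python
from math import comb

n_parents = 210      # inbred parents (RILs)
n_evaluated = 278    # hybrids evaluated in the field (training sample)

# Unordered pairs of distinct parents; reciprocal crosses are assumed
# not to be counted separately.
n_potential = comb(n_parents, 2)
n_untested = n_potential - n_evaluated

# Fraction of all potential hybrids that serves as the training sample.
training_fraction = n_evaluated / n_potential

print(n_potential, n_untested, round(training_fraction, 4))
```

The training sample therefore covers only a little over 1% of all potential crosses.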
Results
Predictability of yield and its component traits in hybrids
The predictabilities drawn from 10‐fold cross‐validation are illustrated in Figure 1, from which we conclude the following. (i) The predictability is highly correlated to the heritability of the trait, with KGW having the highest predictability (0.58; average across all methods and omic data) followed by GRAIN (0.31) and YIELD (0.20), and TILLER being the least predictable trait (0.16). The corresponding heritability for each of the four traits (on the plot mean basis) was 0.79, 0.62, 0.43 and 0.31, respectively (Table S1). The correlation between heritability and predictability is 0.9524 (P = 0.047). (ii) For YIELD, metabolome had the highest predictability followed by transcriptome and genome. With the least absolute shrinkage and selection operator (LASSO) method, the predictabilities from metabolome, transcriptome and genome were 0.35, 0.23 and 0.20, respectively. Metabolomic prediction for YIELD was almost twice as efficient as genomic prediction. (iii) For KGW, genomic prediction was most efficient followed by transcriptomic and metabolomic predictions. KGW is a high heritability trait, and genomic prediction remained dominant over other omic predictions.
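The reported correlation between heritability and predictability can be checked approximately from the rounded values quoted above. The sketch below is illustrative only: the helper `pearson` is not from the paper, and because the inputs are rounded to two decimals the result (about 0.95) differs slightly from the reported 0.9524. With four data points the t statistic has 2 degrees of freedom, in which case the two‐sided P‐value simplifies to 1 − |r|.

```python
from math import sqrt

# Values quoted in the text, ordered KGW, GRAIN, YIELD, TILLER.
heritability = [0.79, 0.62, 0.43, 0.31]
predictability = [0.58, 0.31, 0.20, 0.16]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson(heritability, predictability)

# For n = 4 (2 degrees of freedom) the two-sided P-value of the
# t-test on r reduces to 1 - |r|.
p_two_sided = 1 - abs(r)

print(round(r, 3), round(p_two_sided, 3))
```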

Figure 1. Predictabilities of four traits from three omic data and six methods in the IMF2 population.
The four traits are labeled as YIELD, KGW, GRAIN and TILLER. The three omic data are genomic, transcriptomic and metabolomic data. The six statistical methods are least absolute shrinkage and selection operator (LASSO), best linear unbiased prediction (BLUP), stochastic search variable selection (SSVS), support vector machine using the radial basis function (SVM‐RBF), support vector machine using the polynomial kernel function (SVM‐POLY) and partial least squares (PLS).
Analysis of variances for predictabilities
Table 1 is the ANOVA table for predictability. All main effects and two of the three interaction effects (method × predictor and predictor × trait) are significant. We also performed multiple comparisons for the main effects, and the results are depicted in Figure 2. Overall, transcriptomic prediction is better than genomic prediction, while metabolomic prediction is not different from either one (Figure 2a). Predictions of the four traits are significantly different, KGW > GRAIN > YIELD > TILLER (Figure 2b). The six methods are classified into three levels of predictability, with best linear unbiased prediction (BLUP) being the best, stochastic search variable selection (SSVS) the worst, and the other methods ranging between the two (Figure 2c). Overall, BLUP and LASSO are the best methods for prediction. Details of the ANOVA in terms of main and interaction effects are given in Data S1.
| Source | d.f. | Sum of squares | Mean square | F‐value | P‐value |
|---|---|---|---|---|---|
| Predictor | 2 | 0.0174 | 0.0087 | 4.81 | 0.0155 |
| Trait | 3 | 1.9265 | 0.6421 | 354.64 | <0.0001 |
| Method | 5 | 0.0508 | 0.0102 | 5.62 | 0.0009 |
| Method × Predictor | 10 | 0.0648 | 0.0065 | 3.58 | 0.0032 |
| Method × Trait | 15 | 0.0321 | 0.0021 | 1.18 | 0.3350 |
| Predictor × Trait | 6 | 0.1130 | 0.0188 | 10.41 | <0.0001 |
| Residual (a) | 30 | 0.0543 | 0.0018 | | |
- a Residual is the predictor × method × trait interaction, whose mean square is the denominator for all the F‐tests.
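The F‐values in Table 1 can be recomputed from the degrees of freedom and sums of squares, using the residual mean square as the common denominator as stated in the footnote. The snippet below is a plain arithmetic check; small differences from the table come from the rounding of the printed sums of squares.

```python
# (source, degrees of freedom, sum of squares) as printed in Table 1
rows = [
    ("Predictor", 2, 0.0174),
    ("Trait", 3, 1.9265),
    ("Method", 5, 0.0508),
    ("Method x Predictor", 10, 0.0648),
    ("Method x Trait", 15, 0.0321),
    ("Predictor x Trait", 6, 0.1130),
]
residual_df, residual_ss = 30, 0.0543

# The residual (three-way interaction) mean square is the denominator
# of every F-test.
ms_residual = residual_ss / residual_df

f_values = {}
for source, df, ss in rows:
    mean_square = ss / df
    f_values[source] = mean_square / ms_residual
    print(f"{source:20s} MS = {mean_square:.4f}  F = {f_values[source]:.2f}")
```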

Figure 2. Multiple comparisons illustrated by box‐plots.
In each box‐plot, the line in the middle of the box represents the median denoted by Q2 (50%). The open diamond in each box indicates the mean. The upper and lower edges of the box represent Q3 (75%) and Q1 (25%) of the sample, respectively. The whiskers mark Q3 + 1.5 × IQR and Q1 − 1.5 × IQR, where IQR is the interquartile range. The small open circles represent outliers.
(a) Comparison of the means of predictability for the three omic data on average over all four traits and six methods. The capital letters above the group labels represent the test results, with different letters representing significant difference between groups. For example, transcriptomic prediction (A) is significantly better than genomic prediction (B), but metabolomic prediction (AB) is not significantly different from either of the other two predictions.
(b) Comparison of the mean predictabilities of the four traits on average across all three omic data and six methods.
(c) Comparison of the mean predictabilities of all six methods on average over all three omic data and four traits.
Significance test of predictability
We performed significance tests for the predictabilities of the four traits under the LASSO method. The empirical distributions of the predictabilities drawn from 1000 permuted samples are presented in Figure 3. The empirical P‐value is zero (no permuted sample reached the observed predictability, i.e. P < 0.001) in every case except for TILLER, where the P‐value is 2/1000 = 0.002.

Figure 3. The null distributions of predictabilities obtained from the least absolute shrinkage and selection operator (LASSO) prediction.
The dark triangle in each panel represents the observed predictability, which is far beyond the null distribution. The null distribution was obtained from 1000 permuted samples by randomly shuffling the phenotype. Each column of the figure represents a trait (Yield, KGW, Grain or Tiller) and each row of the figure represents a data source (genome, transcriptome or metabolome).
Comparison of leaf and seed metabolites
We also compared predictabilities of 683 metabolites from leaves and 317 metabolites from seeds separately. On average, across all traits and all methods, the predictability was 0.278 from leaves, 0.216 from seeds and 0.308 from all metabolites. To test whether the higher predictability of leaves is due to the larger number of metabolites, we randomly selected 317 metabolites from leaves to match the number from seeds. Averaged over 20 replicated samples, the predictability of leaves was 0.242, still higher than that of the seeds (Table S2). We concluded that the higher predictability of leaves was not entirely due to the larger number of metabolites. This agrees with expectation, as flag leaves play a crucial role in yield production. As suggested by a reviewer, we reanalyzed the data for metabolomic prediction using the 100 metabolites that are measured from both leaves and seeds, using all six methods for all four traits. The results are shown in Table S2 (last two columns). The general conclusions were that: (i) the 100 metabolites predicted poorly compared with the predictions using all 1000 metabolites; (ii) leaf metabolites predicted better than seed metabolites for yield, grain number and tiller number, but worse for KGW.
Combined prediction using all sources of omic data
The combined prediction rarely outperformed the best single data prediction (Figure S1). This result is consistent with Riedelsheimer et al. (2012) who did not find any benefit from combining genomic and metabolomic data. However, Gärtner et al. (2009) found a considerable increase in predictability by combining the two predictors. A possible reason for the benefit observed by Gärtner et al. may be the small number of SNPs (110) used in that study.
Predicting untested crosses
The 278 crosses are a small subset of all 21 945 crosses. Using parameters estimated from this training sample, we predicted all potential hybrids for YIELD. Data S2 gives the predicted yields of all crosses from all omic data using the LASSO method. This dataset also shows the predicted yields of all crosses sorted by the predicted yield using each of the three omic data. Shanyou 63 (the original hybrid of Zhenshan 97 and Minghui 63) had an average yield of 52.60. The metabolomic data predict that there would be 160 potential hybrids with yield higher than Shanyou 63, with a proportion of 0.7%. According to transcriptomic prediction, there would be 72 potential crosses with yield higher than Shanyou 63. Based on genomic prediction, none of the potential crosses would outperform Shanyou 63. Among the 278 field‐evaluated IMF2 crosses, 13 of them had yield greater than 52.6 (the yield of Shanyou 63), with a proportion of 4.7%. This proportion is larger than 0.7% in the whole predicted population. Unless the predictability is 100%, the predicted and observed yields have different distributions, with the predicted yield having a much smaller variance (as demonstrated in Figure S2). Both the predicted and observed yields had means of 43.48, but the standard deviations were 3.73 and 5.84, respectively. Because of the smaller standard deviation for the predicted yield, the tail above 52.6 covers a much smaller proportion of the sample. For yield, the predictability was 0.35, far less than 100%; therefore, the upper tail above 52.6 was only 0.7%.
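The effect of the smaller standard deviation on the upper tail can be illustrated with a normal approximation. This is an assumption made here for illustration only (the proportions in the text are empirical counts, not normal‐theory values), and the helper `upper_tail` is hypothetical. Under this approximation the tail above 52.6 is about 0.7% for the predicted yields and roughly 6% for the observed yields, in line with the reported 0.7% and 4.7%.

```python
from math import erfc, sqrt


def upper_tail(threshold, mean, sd):
    """P(X > threshold) for X ~ Normal(mean, sd^2)."""
    z = (threshold - mean) / sd
    return 0.5 * erfc(z / sqrt(2))


shanyou63_yield = 52.6
mean_yield = 43.48

# Predicted yields have the smaller spread (sd = 3.73),
# observed yields the larger one (sd = 5.84).
tail_predicted = upper_tail(shanyou63_yield, mean_yield, 3.73)
tail_observed = upper_tail(shanyou63_yield, mean_yield, 5.84)

# Expected number of the 21 945 potential hybrids above the threshold
# if predicted yields were exactly normal.
expected_count = tail_predicted * 21945

print(round(tail_predicted, 4), round(tail_observed, 4), round(expected_count))
```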
Figure 4 shows the average predicted yield and the percent gain when selecting the top crosses for hybrid breeding. For example, if the top 10 crosses predicted from metabolites are used for hybrid breeding, the average predicted yield of the 10 crosses would be 56.38, which represents a 29.6% gain in yield. The 29.6% gain appears to be exaggerated for such a low heritability trait; however, this high gain is largely due to the high selection intensity of the crosses represented by the extremely small proportion selected (10/21 945 = 0.000456).
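The gain from selecting the top crosses is the mean of the selected predictions relative to the overall mean. The sketch below shows this calculation with a hypothetical helper `selection_gain` and a toy list of predicted values; the last lines redo the arithmetic with the rounded figures quoted in the text, which gives a gain close to the reported 29.6%.

```python
def selection_gain(predicted, k):
    """Mean of the k highest predicted values and its relative gain
    over the mean of all predicted values."""
    overall_mean = sum(predicted) / len(predicted)
    top = sorted(predicted, reverse=True)[:k]
    top_mean = sum(top) / k
    return top_mean, (top_mean - overall_mean) / overall_mean


# Toy example (made-up numbers, not data from the study).
toy_top_mean, toy_gain = selection_gain([40, 42, 44, 46, 48], k=2)

# Arithmetic with the rounded figures quoted in the text.
reported_gain = (56.38 - 43.48) / 43.48
selected_fraction = 10 / 21945

print(toy_top_mean, round(toy_gain, 4))
print(round(reported_gain, 3), round(selected_fraction, 6))
```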

We examined yield prediction using the LASSO method for metabolites and transcripts for each year separately. For metabolomic prediction, the predictabilities are 0.15 and 0.29 for the years 1998 and 1999, respectively, and both are lower than the 2‐year combined prediction (0.35). Similarly, for transcriptomic prediction, the predictabilities are 0.03 and 0.15 for the years 1998 and 1999, respectively. Again, both are lower than the combined prediction (0.23). We also predicted yields for 1999 using data from 1998 and vice versa for both metabolomic and transcriptomic predictions using the LASSO method. The average predictabilities from the cross‐year prediction are (0.27 + 0.23)/2 = 0.25 for metabolomic prediction and (0.19 + 0.20)/2 = 0.195 for transcriptomic prediction. These cross‐year predictions are lower than the 2‐year combined predictions, reflecting the environmental effects on the predictions.
Metabolites significantly affecting yield
The LASSO method allows us to detect metabolites with significant effects on yield. We chose a P < 0.01 criterion to declare significant metabolites. Among the 1000 metabolites, 76 are significant and they are listed in Data S3. Among the 76 significant metabolites, 46 are from leaves and 30 from seeds. The top 22 metabolites with the smallest P‐values are all from leaves. We then used only the 76 metabolites to predict hybrid yield. The predictability drawn from cross‐validation is 0.206, which is smaller than 0.35, the predictability using all 1000 metabolites. The conclusion was that the small‐effect metabolites that failed to reach the significance level do contribute to prediction. In other words, variable selection can reduce predictability.
Discussion
We found that metabolomic prediction for hybrid yield is more effective than genomic prediction. Yield is a trait with low heritability but is the most important trait in rice breeding. Even a slight improvement in prediction will translate into a significant increase in yield. We performed similar prediction for the 210 inbred lines (RIL) and observed the same trend of metabolomic prediction being consistently better than genomic prediction (Figure S3). We also observed that the predictability in RILs was higher than that in hybrids, which may be explained by the fact that the metabolites were directly measured in the parents. Metabolites and transcripts are intermediate phenotypes correlated to yield. They are biologically and developmentally closer to yield than SNPs, and this may partially explain why their prediction is higher than genomic prediction.
Looking back to hybrid rice, on average across all traits and methods, the predictability was 0.31 from metabolomic data and 0.30 from genomic data. In the maize hybrid prediction of Riedelsheimer et al. (2012), the average predictability across traits was 0.53 from genomic data and 0.32 from metabolomic data (Appendix S1). This discovery came as a surprise to the authors. The fact that metabolomic prediction is better than genomic prediction for hybrid rice is even more surprising, considering: (i) the small number of metabolites relative to the large number of SNPs; and (ii) the metabolic profile being a snapshot at a specific moment in time (Riedelsheimer et al., 2012). Heritability describes the linear relationship between trait and genotype (Falconer and Mackay, 1996), and genomic prediction is able to capture this relationship. The expression level of each metabolite may be a complicated non‐linear function of the SNP genotypes, but this complicated function is captured by the metabolite biologically, not mathematically (Gärtner et al., 2009). The unknown complex functions are biologically built‐in, and using information integrated from these hidden functions for prediction is much more powerful than using any mathematical functions inferred by our models. Although it is biologically interesting to find the functional relationships, breeders are perhaps more interested in the results: as long as it helps prediction, the technology can be adopted. Breeders may not want to wait years to find the biological reasons before applying the technology.
There are about 200 000 different metabolites in the plant kingdom (Bino et al., 2004) but, in each experiment, only a few hundred of them can be measured. This small proportion already predicts the yield better than genomic data. If we can increase the number of metabolites to a few thousand, the predictability may be further improved. In addition to metabolites, we found that transcripts also improved yield prediction, although not as efficiently as metabolites. Proteomic data have been analyzed in a hybrid rice and its parents (Xiang et al., 2013). Such data may also be used to predict hybrid yield, although no such study has been reported. By the same token, small RNA data or any other molecular data can be used for prediction of yield (Zhang et al., 2014). Advanced technologies are available to measure a large array of phenotypes, and the data are called phenomes (Houle et al., 2010). In general, metabolome, transcriptome and proteome also belong to phenomes, which are quantitative traits and can be used as secondary traits for indirect selection of yield. These data are even closer to yield and may be significantly better predictors of hybrid yield. One obvious objection is that ‘yield’ itself is just one of the phenomic data, and using yields of the parents to predict the yield of hybrids is no better than using any other phenotypic trait. We emphasize the multivariate nature of the phenomes. If we can measure thousands of phenotypes from a single plant, then these phenotypes may be included in a single model for prediction. Therefore, phenomic prediction can be performed in the same way as metabolomic prediction.
The narrow genetic background of the materials in this study may limit the application of the result to hybrid breeding. The parameters reported can only be used to predict hybrids from the same 210 lines. However, this study provides a proof of concept for prediction of hybrid yield. In practical hybrid breeding, a balanced random partial rectangle cross‐design (BRPRCD) may be implemented using the majority of available rice cultivars. A toy example of crosses using eight male and 14 female lines is demonstrated in Table S3, which can be extended to any number of male and female lines. Imagine that we choose 500 male and 1000 female lines from all rice cultivars in the world to create a BRPRCD experiment. Although it is impossible to evaluate all 500 000 crosses in the field, using 800 crosses for field evaluation, for example, would allow us to predict all potential crosses. The predicted top crosses, not yet evaluated, would then be field evaluated. This omic‐data guided hybrid breeding can more effectively identify the best hybrids in the world, leading to improved yields and global food security.
Experimental procedures
Material collection
We analyzed a hybrid population of rice (Oryza sativa) derived from the cross between Zhenshan 97 and Minghui 63 (Hua et al., 2002, 2003; Xing et al., 2002). This hybrid (Shanyou 63) is the most widely grown hybrid in China. A total of 210 RILs were derived by single‐seed descent from this hybrid. We created 278 crosses by randomly pairing the 210 RILs. This population is called an immortalized F2 (IMF2) because it mimics an F2 population with a 1:2:1 ratio of the three genotypes (Hua et al., 2003). We analyzed four traits to evaluate the efficacy of hybrid prediction: (i) yield (YIELD); (ii) 1000‐grain weight (KGW); (iii) grain number per plant (GRAIN); and (iv) tiller number per plant (TILLER). For the RIL population, each trait was measured from four replicated experiments (1997 and 1998 from one location, 1998 and 1999 from another location). In each replicate, eight plants from each line were sampled and the average phenotype represented the phenotypic value of the line (Xing et al., 2002; Yu et al., 2011). For the IMF2 population, each trait was measured in two consecutive years (1998 and 1999). Each year, eight plants from each cross were measured and the average yield of the eight plants was treated as the original data point. Both experiments were conducted under a randomized complete block design with replicates (years and locations) as the blocks.
Three omic (genomic, transcriptomic and metabolomic) data collected from the 210 RILs were used for prediction. The genomic data are represented by 1619 bins inferred from approximately 270 000 SNPs of the rice genome (Yu et al., 2011). All SNPs within a bin have exactly the same segregation pattern (perfect LD). The bin genotypes of the 210 RILs were coded as 1 for the Zhenshan 97 genotype and 0 for the Minghui 63 genotype. Genotypes of the hybrids were deduced from genotypes of the two parents. The transcriptomic data consisted of 24 994 gene expression traits measured in tissues sampled from flag leaves for all the 210 RILs in 2008 (Wang et al., 2014). Each line had two biological replicates, but RNA extracted from the two replicates was mixed in a 1:1 ratio before microarray expression profiling was conducted. The original expression levels were log2 transformed before analysis. The metabolomic data consisted of 683 metabolites measured from flag leaves and 317 metabolites measured from germinated seeds (Gong et al., 2013). The metabolomic data were collected in 2009 and 2010 (two replicates). For metabolic profiling, germinated seeds were sampled in one biological replicate in 2009 and one in 2010, and flag leaves were sampled in two biological replicates in 2009. In both tissues, the expression level of each metabolite was log2 transformed. For each line, we took the average of expression levels measured from the two replicates as the measurement of the metabolites.
Methods of prediction
We used six statistical methods for prediction: (i) LASSO developed by Tibshirani (1996) and implemented in the GlmNet/R program (Friedman et al., 2010); (ii) Henderson's (1975) BLUP adapted to genomic data analysis (VanRaden, 2008) and implemented in our own R program (Xu, 2013); (iii) SSVS developed by George and McCulloch (1993); (iv) support vector machine using the radial basis function (SVM‐RBF); (v) support vector machine using the polynomial kernel function (SVM‐POLY); and (vi) partial least squares (PLS). The SSVS method is also called Bayes B (Meuwissen et al., 2001) and was implemented using an R package called BGLR (Perez and de los Campos, 2014). The two SVM methods were implemented in an R package called kernlab (Karatzoglou et al., 2004). The PLS was implemented using an R package called pls (Mevik and Wehrens, 2007).
Models of prediction
Let y be an n × 1 vector of the phenotypic values for the trait of interest, where n is the sample size (n = 210 in RIL and n = 278 in IMF2). Let X be an n × m matrix of predictors used to predict y, where m is the number of predictors in the model and depends on the source of data and the model. The first three methods (LASSO, BLUP and SSVS) use a random model, as shown by y = Xβ + ɛ, where β is an m × 1 vector of model effects and ɛ is an n × 1 vector of residual errors. The model effect βk was treated as a random effect with either a normal distribution or a mixture of two normal distributions. The LASSO method can be reformulated as a Bayesian hierarchical model,
$$\beta_k \mid \varphi_k^2 \sim N(0, \varphi_k^2)$$
and
$$\varphi_k^2 \sim \mathrm{Exponential}(\lambda^2/2)$$
for all k = 1,…,m, where λ is a shrinkage parameter (Tibshirani, 1996), although the original LASSO was not formulated this way. The BLUP method assumes
$$\beta_k \sim N(0, \varphi^2)$$
for all k = 1,…,m, where φ² is called the ‘polygenic variance’. The SSVS method assumes that βk is sampled from one of two normal distributions with an unknown label of the two distributions. Mathematically, it is described as βk ~ ηkN(0, Δ) + (1 − ηk)N(0, δ), where Δ is the variance of the first normal distribution sampled along with other parameters, δ = 10⁻⁵ is the variance of the second normal distribution, and ηk is the cluster label having a Bernoulli distribution with probability ρ. The missing cluster label ηk takes a value of 1 if βk belongs to the first distribution and 0 otherwise. The probability of the missing label ρ was modeled by a Beta(1,1) distribution. All parameters were sampled from their posterior distributions. The above three models (LASSO, BLUP and SSVS) are all linear. The two SVM methods use a non‐linear relationship between y and X described as y = f(X|β) + ɛ, where
$$f(X \mid \beta) = \beta_0 + \sum_{j=1}^{n} \beta_j K_h(X, X_j)$$
and Kh(X, Xj) is a kernel chosen by the users. We chose the Gaussian kernel (SVM‐RBF) and the polynomial kernel (SVM‐POLY) functions (Karatzoglou et al., 2004). The PLS method is a hybrid method between principal component analysis (PCA) and multiple regression analysis. It uses the first few latent scores of the X matrix as predictors to predict the phenotype. However, it differs from PCA in that the weights of the latent scores are calculated by maximizing the covariance between y and the scores (Geladi and Kowalski, 1986). The number of latent components was determined by a 10‐fold cross‐validation to have a minimum prediction error.
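For readers unfamiliar with the two kernels named above, the following sketch writes them out in their commonly used forms, exp(−σ‖x − z‖²) for the Gaussian (RBF) kernel and (scale·⟨x, z⟩ + offset)^degree for the polynomial kernel. The hyperparameter values are placeholders, not the settings used in the study, and `kernel_predict` is a hypothetical helper that assumes the fitted function is an intercept plus a weighted sum of kernel evaluations against the training individuals.

```python
from math import exp


def rbf_kernel(x, z, sigma=1.0):
    """Gaussian (radial basis function) kernel between two predictor vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return exp(-sigma * squared_distance)


def poly_kernel(x, z, degree=2, scale=1.0, offset=1.0):
    """Polynomial kernel between two predictor vectors."""
    dot = sum(a * b for a, b in zip(x, z))
    return (scale * dot + offset) ** degree


def kernel_predict(x_new, train_x, weights, kernel, intercept=0.0):
    """Assumed prediction form: intercept + sum_j weight_j * K(x_new, x_j)."""
    return intercept + sum(w * kernel(x_new, xj) for w, xj in zip(weights, train_x))


# Toy usage with made-up numbers.
same_point = rbf_kernel([0.0, 0.0], [0.0, 0.0])
poly_value = poly_kernel([1.0, 2.0], [3.0, 4.0])
toy_prediction = kernel_predict([0.0], [[0.0], [1.0]], [1.0, 2.0], rbf_kernel)
print(same_point, poly_value, round(toy_prediction, 4))
```

Both kernels depend on the data only through pairwise comparisons of individuals, which is what allows the SVM to capture non‐linear relationships between y and X.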
Predictability drawn from cross‐validation
The predicted trait is denoted by $\hat{y} = X\hat{\beta}$ for LASSO, PLS, BLUP and SSVS, and $\hat{y} = f(X \mid \hat{\beta})$ for SVM‐RBF and SVM‐POLY. We used a 10‐fold cross‐validation to evaluate the predictability of each method, where individuals predicted do not contribute to parameter estimation. The predictability is defined as the squared correlation coefficient between y and $\hat{y}$ (Appendix S1). The predictability depends on how the sample is partitioned into the 10 folds. It also depends on the number of folds (Figure S4). Therefore, we replicated the cross‐validation analysis 10 times to monitor the variation among the replicates. The predictability increases as the number of folds increases, but often reaches a plateau at fold 10, while further increase of the fold number only improves the predictability slightly. Figure S4 shows the plots of predictability against the number of folds from 10 replicates for YIELD from the metabolomic data using the LASSO method in both the RIL and IMF2 populations.
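The cross‐validation procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: a one‐predictor least‐squares line stands in for the six prediction methods, the data are simulated, and all function names are hypothetical. The key points it shows are that each individual is predicted from a model fitted without it, that predictability is the squared correlation between observed values and these out‐of‐fold predictions, and that repeating the analysis with different random partitions gives slightly different values.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def squared_correlation(a, b):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    return sab * sab / (saa * sbb)


def cv_predictability(x, y, n_folds=10, seed=0):
    """Predictability from one random n_folds-fold partition."""
    order = list(range(len(y)))
    random.Random(seed).shuffle(order)
    y_hat = [0.0] * len(y)
    for fold in range(n_folds):
        test = order[fold::n_folds]
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        # Parameters are estimated without the individuals being predicted.
        intercept, slope = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in test:
            y_hat[i] = intercept + slope * x[i]
    return squared_correlation(y, y_hat)


# Simulated data: 278 "crosses", one predictor, moderate signal.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(278)]
y = [0.6 * xi + rng.gauss(0, 1) for xi in x]

# Ten replicates with different random partitions.
replicates = [cv_predictability(x, y, seed=s) for s in range(10)]
mean_predictability = sum(replicates) / len(replicates)
print(round(mean_predictability, 3), round(max(replicates) - min(replicates), 3))
```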
Defining the X matrix
For the RIL population,
is a n × m matrix, where n = 210, m = 1000 for the metabolomic data, m = 24 994 for the transcriptomic data and m = 1619 for the genomic data. The jth row and the kth column of matrix X is defined as the gene expression level for the transcriptomic data and the level of the kth metabolite for the metabolomic data. For the genomic data, Xjk = 1 for the Zhenshan 97 genotype and Xjk = 0 for the Minghui 63 genotype.
matrix has n = 278 rows and m = 2 × 24 994 columns for the transcriptomic data, m = 2 × 1000 columns for the metabolomic data and m = 2 × 1619 columns for the genomic data. The Xjk value for each IMF2 cross is a function of the corresponding predictors of their RIL parents. Let
and
be the predictors of the male and female RIL parents, respectively. For the genomic data,
for the Zhenshan 97 genotype and
for the Minghui 63 genotype of the RIL parents. For the transcriptomic and metabolomics data,
and
are the corresponding measurements of the expression levels of the two parents. We first defined
and
for the IMF2 cross, where Zjk is a predictor for the ‘additive’ effect and Wjk is a predictor for the ‘dominance’ effect. Take the genomic data for example, let A1 be the Zhenshan 97 allele and A2 be the Minghui 63 allele. The predictors of an IMF2 cross are defined as
(1)Please refer to Table S4 for a summary of the genomic coding system. This particular coding system for transcriptomic and metabolomic data is consistent with the classical coding for hybrid genotypes. The biological justification for the ‘dominance’ is that it may capture information about the difference between the two parents. As the difference can be positive or negative, the absolute value will make the difference positive in either case.
There are two matrices, Z and W, for the IMF2 crosses. Corresponding to the two matrices, there are two types of effects: α represents the additive effects and δ represents the dominance effects. The X matrix for the IMF2 crosses is the horizontal concatenation of the two matrices, X = [Z‖W]. Let β = [α//δ] be the vertical concatenation of α and δ. The model for prediction takes one of two forms, depending on the prediction method. We centered and scaled the predictors by calling the scale() function in R. For the BLUP method, two kinship matrices were fitted to the model, one for the additive variance and one for the dominance variance. The kinship matrices were calculated using the method of Xu (2013).
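As a rough illustration of the concatenation and standardization steps, the following Python sketch joins a toy Z and W matrix column-wise and then centers and scales each column. The study itself used R's scale(); the helper names `hconcat` and `scale_columns` are hypothetical, the matrices are made-up numbers, and the sketch assumes the default behaviour of scale() (column means subtracted, then division by the sample standard deviation with an n − 1 denominator).

```python
import math


def hconcat(z, w):
    """Join two matrices with the same number of rows side by side."""
    if len(z) != len(w):
        raise ValueError("Z and W must have the same number of rows")
    return [list(z_row) + list(w_row) for z_row, w_row in zip(z, w)]


def scale_columns(matrix):
    """Center each column to mean 0 and scale it to sample SD 1."""
    n = len(matrix)
    scaled_cols = []
    for col in zip(*matrix):
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        if sd == 0:
            # A constant column carries no information and cannot be scaled.
            raise ValueError("constant column cannot be scaled")
        scaled_cols.append([(v - mean) / sd for v in col])
    # Transpose back to row-major order.
    return [list(row) for row in zip(*scaled_cols)]


# Toy predictors for four crosses and two markers (illustrative values only).
Z = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, -1.0]]
W = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

X = hconcat(Z, W)            # 4 rows, 2 + 2 = 4 columns
X_scaled = scale_columns(X)  # each column now has mean 0 and sample SD 1
```

Standardizing after concatenation puts the additive and dominance columns on a common scale, which matters for penalized methods such as LASSO whose penalty is applied uniformly across predictors.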
Combining three sources of data
Equations (2) and (3) give the combined model, in which X = [XSNP‖XEXP‖XMET] is the column concatenation of the three predictor matrices and β = [βSNP//βEXP//βMET] is the row concatenation of the effects of the three predictors. The BLUP method is based on this linear model only implicitly; what it explicitly requires is three covariance structures (called kinship matrices), one for each data source (dominance covariance structures were ignored).
Analysis of variance of predictability
(4)
Permutation test for predictability
To test whether or not the observed predictability is significantly different from zero, we randomly shuffled the phenotypes and conducted a prediction under each scenario with the LASSO method. The shuffling process was replicated 1000 times so that the predictabilities formed a null distribution. We then compared the observed predictability against the null distribution to calculate an empirical P‐value for each predictability.
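The permutation procedure can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the analysis used in the study: a single-predictor least-squares fit stands in for LASSO, predictability is taken to be the squared correlation between held-out predictions and observations (set to zero when the correlation is not positive), the hold-out split by even and odd indices is arbitrary, and the empirical P-value uses the common (count + 1)/(permutations + 1) convention. All function names and data are hypothetical.

```python
import random


def holdout_predictability(x, y):
    """Fit on even-indexed samples, score on odd-indexed samples."""
    x_tr, y_tr = x[0::2], y[0::2]
    x_te, y_te = x[1::2], y[1::2]
    mx, my = sum(x_tr) / len(x_tr), sum(y_tr) / len(y_tr)
    sxx = sum((v - mx) ** 2 for v in x_tr)
    sxy = sum((v - mx) * (u - my) for v, u in zip(x_tr, y_tr))
    slope = sxy / sxx if sxx else 0.0
    pred = [my + slope * (v - mx) for v in x_te]
    # Correlation between predictions and held-out observations.
    mp, mo = sum(pred) / len(pred), sum(y_te) / len(y_te)
    spo = sum((p - mp) * (o - mo) for p, o in zip(pred, y_te))
    spp = sum((p - mp) ** 2 for p in pred)
    soo = sum((o - mo) ** 2 for o in y_te)
    if spp == 0 or soo == 0 or spo <= 0:
        return 0.0  # no positive association: treated as zero predictability
    return spo * spo / (spp * soo)


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Empirical P-value of the observed predictability under shuffled phenotypes."""
    rng = random.Random(seed)
    observed = holdout_predictability(x, y)
    shuffled = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks the predictor-phenotype link
        if holdout_predictability(x, shuffled) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


# Toy data: phenotype depends linearly on the predictor plus noise.
data_rng = random.Random(42)
x_toy = [float(i) for i in range(20)]
y_toy = [2.0 * v + data_rng.gauss(0.0, 1.0) for v in x_toy]

obs_toy, p_toy = permutation_p_value(x_toy, y_toy, n_perm=200, seed=1)
```

Adding one to both numerator and denominator keeps the estimated P-value strictly above zero, so its smallest attainable value is set by the number of permutations (about 0.001 for 1000 shuffles).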
Accession codes
These data have been deposited in figshare (figshare.com/s/0773080c122d11e58b6306ec4bbcf141).
Author contributions
S.X. and Q.Z. designed the research; S.X. and Q.Z. performed the research; S.X., Y.X. and L.G. analyzed the data; S.X. and Q.Z. wrote the paper.
Acknowledgements
This project was supported by National Science Foundation Collaborative Research Grant DBI‐1458515 to S.X., Chinese National Natural Science Foundation Grant 31330039 to Q.Z., and China's 111 Project Grant B07041 to Q.Z.
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
References