Phenomic versus genomic prediction—A comparison of prediction accuracies for grain yield in hard winter wheat lines

Common bread wheat (Triticum aestivum L.) is a key component of global diets, but the genetic improvement of wheat is not keeping pace with the growing demands of the world's population. To increase efficiency and reduce costs, breeding programs are rapidly adopting the use of unoccupied aerial vehicles to conduct high‐throughput spectral analyses. This study examined the effectiveness of multispectral indices in predicting grain yield compared to genomic prediction. Multispectral data were collected on advanced generation yield nursery trials during the 2019–2021 growing seasons in the Colorado State University Wheat Breeding Program. Genome‐wide genotyping was performed on these advanced generations and all plots were harvested to measure grain yield. Two methods were used to predict grain yield: genomic estimated breeding values (GEBVs) generated by a genomic best linear unbiased prediction (gBLUP) model and phenomic phenotypic estimates (PPEs) using only spectral indices via multiple linear regression (MLR), k‐nearest neighbors (KNNs), and random forest (RF) models. In cross‐validation, PPEs produced by MLR, KNN, and RF models had higher prediction accuracy ( r¯:0.41≤r¯≤0.48$\bar r:\;0.41 \le \bar r \le 0.48$ ) than GEBVs produced by gBLUP ( r¯=0.35$\bar r = \;0.35$ ). In leave‐one‐year‐out forward validation using only multispectral data for 2020 and 2021, PPEs from MLR and KNN models had higher prediction accuracy of grain yield than GEBVs of those same lines. These findings suggest that a limited number of spectra may produce PPEs that are more accurate than or equivalently accurate as GEBVs derived from gBLUP, and this method should be evaluated in earlier development material where sequencing is not feasible.


INTRODUCTION
Common bread wheat (Triticum aestivum L.) is a diet staple across cultures and plays a key role in international food security.The global population is expected to rise to nearly 10 billion people by the year 2050, yet annual genetic gain in wheat yield is not on track to meet this demand, which implies that a biocapacity deficit is inevitable unless significant action is taken (Alexandratos & Bruinsma, 2012;Berners-Lee et al., 2018;Ray et al., 2013).Genomic prediction is the process of using historical genomic and phenotypic data to produce genomic estimated breeding values (GEBVs) for lines that have only been sequenced and have not been evaluated via phenotyping (Meuwissen et al., 2001).Since the advent of high-throughput genomic technologies, breeders have been using these predictions to implement genomic selection (GS) to increase selection intensity and heritability of selection in earlier generation material.
While GS greatly assists breeders in making selections, it is not applied across all generations due to cost and labor limitations.High-throughput spectral analysis via unoccupied aerial vehicles (UAVs) is becoming more common in modern breeding programs, generating large datasets of spectral reflectance and other phenotypic data.While the cost of collecting genome-wide markers on lines becomes prohibitive as the number of lines increases exponentially in earlier generation trials, spectral data may be taken on all planted lines in all fields in a program at a fraction of the cost.
Recent research has suggested the utility of "phenomic selection" (PS) in breeding.Phenomic selection is the process of using electromagnetic spectral reflectance of organisms, captured by multispectral or hyperspectral sensors, to produce phenomic phenotypic estimates (PPEs) for traits of interest (Krause et al., 2019;Lane et al., 2020;Rincent et al., 2018;Robert et al., 2022;Sandhu et al., 2022).Recent publications have illustrated that spectral reflectance can create relationship matrices among individuals; furthermore, it has been shown that spectral reflectance matrices perform at or near the predictive ability of additive relationship matrices derived from either molecular markers or pedigree information when predicting grain yield (Krause et al., 2019).This may imply that PS could stand in place of GS for selection candidates in earlier generations than those lines in the sequencing pipeline.
The potential application of PS in programs will be directly related to access of relevant technologies.Many of the publications on PS either rely on hyperspectral imaging derived from UAVs or near-infrared (NIR) spectrometry taken in the field or in a lab setting (Krause et al., 2019;Lane et al., 2020;Rincent et al., 2018;Sandhu et al., 2022).These methods provide many different bands of reflectance to use for the creation of relationship matrices.However, their use within breeding programs may be limited due to the cost, labor, time, and the processing pipelines required for these techniques.Inves-

Core Ideas
• Phenomic prediction accuracy is comparable to genomic when spectra are heritable and correlated to yield.• Accurate phenomic phenotypic estimates can be made in unobserved years using prior collected spectral data.• A limited number of spectral bands collected from multispectral drones can be used in phenomic prediction.
tigating if phenomic predictions may be made using a limited number of spectra as predictors on more accessible equipment may address this gap in application.
In the 2019, 2020, and 2021 growing seasons, the Colorado State University (CSU) Wheat Breeding Program flew UAVs in cooperation with the CSU Drone Center (https:// www.research.colostate.edu/csudronecenter/)over fields containing late generation lines (F 3:5 , F 3:6 , and F 3:7+ ) at the Agricultural Research, Development and Education Center (ARDEC, Fort Collins, CO).Concurrently, those late development lines were sequenced for genome-wide markers.Using this unbalanced dataset which models a true breeding system, the objectives of this study were to (1) investigate the use of limited spectral information across years to produce PPEs, (2) compare calculated PPEs and GEBVs to adjusted means of yield, and (3) test several scenarios to observe which models produced optimal results for calculations of PPEs.

Germplasm
The CSU Wheat Breeding Program pipeline is a modified bulk-pedigree method (Figure 1).All lines which occur in the later nurseries of the CSU breeding program are representative of the contemporary hard red and white winter wheat germplasm adapted to the targeted breeding areas of CSU; these areas include eastern Colorado, western Nebraska, western Kansas, and northwestern Oklahoma.
The final single-head selection occurs in an F 3 bulk and selected heads are evaluated in rows in the following year.Final among-row selection occurs in the F 3:4 and preliminary yield trials begin with F 3:5 lines.The preliminary yield nursery (PYN; F 3:5 ) is then followed by the advanced yield nursery (AYN; F 3:6 ).All lines which make it past the PYN and AYN are then placed in the CSU Elite Trial (ELITE; F 5:7+ ), where they remain until they are culled or released as a cultivar.Since 2013, doubled haploids (DHs) have been routinely generated by the CSU Wheat Breeding Program via the F 1 corn (Zea maize L.) pollen hybridization method (Santra et al., 2017).After DH seeds have been harvested from greenhouse plants, they are planted in rows and subjected to selection based on visual ratings and genomic predictions.DH lines which are advanced from the row stage are then treated as AYN lines.These nurseries are conventionally titled AYND1 or AYND2 ("D" for DH) within a given year.
Spectral data were collected for 3 years of trials planted at the ARDEC facility in Fort Collins, CO, through collaboration with the CSU Drone Center (https://www.research.colostate.edu/csudronecenter/).Data from the following trials were used in the current work: the PYN from 2021; the AYN from 2019, 2020, and 2021;the AYND1 and AYND2 from 2019, 2020, and 2021, and the ELITE from 2019, 2020, and 2021.These trials resulted in a total dataset of 4557 individual observations tied to a yield plot within a trial, within a year.A detailed description of the number of individuals in each nursery for the 2019, 2020, and 2021 season is provided (Table 1).

Field management and phenotyping
All nurseries used in this study were planted at the ARDEC facility located in Fort Collins, CO, at approximately 1525 m above sea level.Within each growing season, all grain yield plots were planted in mid-September using a cable guided no-till drill seeder at 1.5-m wide with 4.9-m centers.During green up in the spring, plots were end trimmed using a 1.2m wide hooded sprayer and glyphosate (Monsanto), leaving a harvestable area of 1.5 × 3.7 m.All nurseries were planted in a statistical design which used row-column effects as blocking factors.The PYN, AYN, AYND1, and AYND2 trials were planted in an augmented block design and the ELITE trials were planted in a partially replicated design across all years.Each year, soil samples were taken across fields to inform pre-plant fertilization regiments.In 2019, 64.61 kg ha −1 of 11-52-0 monoammonium phosphate (MAP) granular fertilizer with ammoniacal nitrogen was applied along with 349.83 kg ha −1 46-0-0 urea granular nitrogen on September 6, 2018, prior to planting.Post planting, 0.95 L ha −1 of Husky (BASF) was applied on April 24, 2019, to control weeds.In 2020, 18.52 kg ha −1 of 0-0-62 high potassium potash was applied along with 32.63 kg ha −1 of 11-52-0 high phosphate MAP granular fertilizer with ammoniacal nitrogen and 358.99 kg ha −1 46-0-0 urea granular nitrogen on September 13, 2019, prior to planting.Post planting, 1.04 L ha −1 of Husky was applied on April 28, 2020, to control weeds.In 2021, after sampling to determine a proper fertilizing regiment, 32.62 kg ha −1 of 11-52-0 high phosphate MAP granular fertilizer with ammoniacal nitrogen was applied along with 358.18 kg ha −1 46-0-0 urea granular nitrogen and 15.16 kg ha −1 of granular 90% sulfur on September 6, 2020, prior to planting.Post planting, 1.04 L ha −1 of Husky was applied on May 1, 2021, to control weeds.Irrigation was applied, as needed, as soon as temperatures allowed for irrigation lines to thaw in the spring.Prescriptive irrigation continued into grain fill and was terminated prior to physiological maturity and harvest.Detailed information on irrigation regiments is provided (Table S1).
Within each growing season, at physiological maturity, lines were harvested using a Zürn 150 (Zürn Harvesting GmbH and Company) research plot combine and measured for test weight, grain moisture, and grain yield using a Har-vestMaster H2 Classic Graingage (Juniper Systems).Grain yield was adjusted by moisture content so that the grain yields reflected 12% grain moisture, and yield was calculated using plot size metrics and reported in metric tons per hectare on a per-plot basis (t ha −1 ).

Sequencing
For each line sequenced, 10 seeds were planted, and a 2to 3-cm sample of leaf tissue was taken from each plant and bulked together.Genomic quality deoxyribonucleic acid (DNA) was extracted using MagMax (ThermoFisher Scientific) plant DNA kits.Extracted DNA was quantified using PicoGreen (ThermoFisher Scientific) kits and normalized to a concentration of 20 ng μL −1 using an automated liquid handling system.Sequencing libraries were prepared according to Poland et al. (2012).
All multiplexed libraries were sequenced on a NovaSeq 6000 (Illumina) at 384-plex density per lane.Reads were aligned to the RefSeq Chinese Spring wheat reference sequence v2.0 using the Burrow-Wheeler Aligner version 0.7.17 (Appels et al., 2018;Li & Durbin, 2009).Reads obtained were processed using the TASSEL 2.0 standalone pipeline and single nucleotide polymorphisms (SNPs) were organized into compressed variant calling format files (Danecek et al., 2011;Glaubitz et al., 2014).
Variant calling format files were filtered using the following parameters: all SNPs which aligned to an unknown chromosome were removed; SNPs with a read depth of greater than one and less than 100 were retained; all monomorphic SNPs, insertions, and deletions were removed; all SNPs with a minor allele frequency of 5% or less were removed; all SNPs with a heterozygous frequency of 10% or higher were removed; and all SNPs with a missing data frequency of 85% or higher were removed.To produce full-rank genotypic matrices for genomic relationship matrix calculation, missing data were imputed using the Beagle algorithm V5.4 and the recombination distance-based map used for imputation was derived from a synthetic wheat biparental cross between "W7984" and "Opata" (Browning et al., 2018;Gutierrez-Gonzalez et al., 2019).

Spectral reflectance acquisition
The processing pipeline for multispectral data consists of three distinct phases: spectral data collection via UAVmounted cameras, image stitching and processing, and data extraction from individual plots.Between each year, flight parameters were changed at the discretion of the current drone technician to increase data quality and this resulted in an unbalanced method of sampling from year-to-year.
While individual processes and settings may have changed from year-to-year to improve the quality and efficiency of the pipeline over time, the general framework remained similar.Drone flights were conducted at a height of approximately 120 m above ground level (AGL) in 2019, 100 m AGL in 2020, and 60 m AGL in 2021.This resulted in a resolution of 8.2 px cm −1 in 2019, 7.1 px cm −1 in 2020, and 4.2 px cm −1 in 2021 for the multispectral flights.The increased resolution of the images allowed for more data points per plot when calculating the averages for various spectral indices.In 2019 and 2020, ground control points (GCPs) in the form of plastic bucket lids were placed throughout the field including one at each corner, one at the midpoint of each edge, and at least one in the middle of the field.In 2021, no GCPs were used, which significantly increased the amount of manual post-flight processing.
Flights were conducted at the discretion of the current leader of the CSU Wheat Breeding Program, and flights were taken when scheduling was available through the CSU Drone Center, leading to a wide variance among sampling dates and times.Over several flights taken between 2019 and 2022, the curated flight data have the narrowest window possible.A list of flight dates utilized in this study by year is provided (Table 2).
All UAVs utilized real time kinematics positioning for accurate geotagging of images.Images were collected in the nadir direction at ±1 h to solar noon.Immediately before and after each flight, the multispectral camera was calibrated using its accompanying calibration panel.The calibration panel used for each flight was a MicaSense RP04 model panel, which had albedo values of 51.3 for blue at 475 nm, 51.3 for green at 560 nm, 51.3 for red at 668 nm, 51.2 for red edge at 717 nm, and 51 for NIR at 840 nm.
Pix4Dmapper (Pix4D SA) was used to stitch the images together to create an orthomosaic for each of the spectral bands, including the visual bands collected with the RGB camera.These orthomosaics were then uploaded to a geodatabase in ArcGIS Pro (Esri) for further editing and orthorectifying.A single RGB image was selected as the image to georeference all other orthomosaics, which minimizes the geospatial error between orthomosaics and flights over the course of the growing season; this was done by georeferencing each orthomosaic's GCPs to those visible in the reference RGB.In 2021, since no GCPs were placed in the field, other permanent, physical landmarks had to be used for reference.
Starting in 2021, excess green index (ExGI) was used as the source dataset for the ArcGIS machine learning classification function.The goal of using this feature was to remove the potential inaccuracy of the data in previous years by including soil surrounding the plot in the average spectral data calculations.The algorithm was run unsupervised with a high focus on grouping pixels based on both spatial proximity and spectral similarity.The algorithm was tasked with grouping the field into two self-identified categories using the ExGI orthomosaic for reference.
With only plants and soil present in the field, the algorithm successfully classified the field into these two groups.The result of this process was an additional raster layer in the geodatabase that specified boundaries around the plant matter in the field.Using these boundaries, the soil was cropped out of the orthomosaic images, ensuring that the only data remaining in the orthomosaics were related to plant matter.While retroactive processing of orthomosaics in 2019 and 2020 in this way is possible, the resolution of previous flights was not conducive to separating soil from plant material, thus this procedure was only applied to 2021.
The average reflectance of each band was extracted per plot and ArcGIS Pro was used to calculate spectral indices including normalized difference vegetation index (NDVI).Sampling areas were defined using a Python script that retrieves field coordinates and dimensions from a CSV file and draws a polygon for each plot directly into the ArcGIS geodatabase that hosts the orthomosaics.Using the plot polygons as boundaries for the calculation of the averages, data tables containing the mean reflectance for each plot, date, band, and index were exported for use in further analysis.

Validation scenarios
Three separate scenarios were explored to assess how PPEs calculated for each line from plot-level spectral data compared to GEBVs and adjusted mean yield.In all scenarios presented, no genotypes were replicated between the training and test sets to avoid bias in the calculation of GEBVs or PPEs.Scenario one (S1) involved taking the total available data across years, partitioning 80% of the available yield data to training across years and trials, and validating the remaining 20% while including all spectral data in both training and validation sets.This scenario most closely resembles a classic cross-validation scheme in genomic prediction.This process was replicated 30 times to obtain distributions of cross-validated accuracies.
Scenario two (S2) involved calculating adjusted means on a nursery-by-nursery basis and predicting yield in a trial where yield and spectral data from the same environment are in the training data.In S2, we assume that we have historical spectral and yield information on fields, and we have spectral and yield data on plot-level observations of other nurseries in the same field as a nursery which was not measured for yield.In S2, a single trial is left out of the dataset and predicted using the rest of the available data.This may be applicable in a scenario where harvest of a trial is not feasible and grain yields must instead be predicted.This may be denoted as "hold-one-trialout forward validation." Scenario three (S3) involved taking the total available data within 2 of the 3 available years and using those data to train a model to predict the remaining year.For example, to predict the yield of all genotypes in 2021, both yield and spectral data from 2019 and 2020 and only spectral data from 2021 were used to predict yield data from 2021.This method implies that the program would have historical spectral data on a specific field, and that historical data could be used to train a model to predict yield in an unobserved year for an unobserved genotype with spectral data.This scenario more closely resembles forward validation in GS and may be denoted as "hold-one-year-out forward validation."

Statistical analysis-Phenotypic analysis and genomic prediction
All data management and statistical analysis was performed in R statistical software version 4.2.2 (R Core Team, 2022).All distributions for spectra and yield were checked for quality and normality using the qqplots function via the Base and Stats packages in R (R Core Team, 2022).All Q-Q plots exhibited near normality, warranting no transformation of any spectral indices or grain yield to fit normality.All mixed linear and genomic prediction models were performed using the package ASReml-R (Butler et al., 2009).
For calculation of adjusted means for both spectral indices and yield, the following univariate mixed linear model was used in every trial: where   is the response, μ is the intercept,   is the fixed genotypic response,   is the random row effect where  ∼ N(0, σ 2  ),   is the random column effect where   ∼ N(0, σ 2  ), and ε  is the residual where ε ∼ N(0, σ 2 ε ).To calculate adjusted means across trials within a year or across the total dataset, the following mixed linear model was employed using the adjusted means calculated on a per-trial basis: where   is the response, μ is the intercept,   is the fixed genotype effect,   is the random environment effect where  ∼ N(0, σ 2  ), and ε  is the residual where ε  ∼ N(0, σ 2 ε ).An "environment" in this case is the year that the trial was planted, concatenated to the name of the trial, and the location of that trial (e.g., "ARDEC_2019_AYN").The adjusted means obtained across the entire dataset were used in S1, the adjusted means calculated with-in each trial were used in S2, and the adjusted means calculated within each year were used in S3.
For calculation of GEBVs in either S1, S2, or S3, the following genomic best linear unbiased prediction (gBLUP) model was employed: where y is the response, X is the design matrix for fixed effects, b is the vector of fixed effects, Z is the design matrix for random genetic effects, and u is the vector of additive genetic effects defined by the genomic relationship matrix G ( ∼ (0, σ 2  )) (VanRaden, 2008).Relationship matrix G was calculated from markers and weighted by allele frequency.
Genomic, narrow-sense, per-plot, heritabilities (h 2 g ) were estimated for each trial in each year using the error and genotypic variances derived from gBLUP model.The gBLUP model used to estimate heritability is the same as presented above; however, no random effects were fit for either row or column effect.Derived variances were then used to calculate ℎ 2  as such: where ℎ 2  is genomic, narrow sense, per plot heritability; σ 2  is genetic variance; and σ 2 ε is the residual variance.This estimate of genomic, narrow sense, per plot heritability is derived from the genetic variance of a gBLUP model divided by the total variance.The genotypic variance was defined by the additive relationship matrix among individuals, and this estimate of heritability is an approximation of narrow-sense heritability, similar to estimates derived by population design when all markers in linkage with causal effects are present in the genetic dataset (de Los Campos et al., 2015).Heritabilities were calculated using the vpredict function in ASReml-R (Butler et al., 2009).

Statistical analysis-Phenomic prediction
PPEs were calculated via multiple linear regression (MLR), k-nearest neighbors (KNNs), and random forest (RF).All models used in calculation of PPEs were implemented using the package caret in R (Kuhn, 2008(Kuhn, , 2022)).In MLR, spectral indices were used as direct regressors on the response (adjusted means for grain yield).The MLR model may be visualized as: where   is the response, β 0 is the intercept multiplied by a constant of one, subsequent β  are the regression coefficients on spectral variables 1 through ,  represents a spectral variable (a spectral index), and the residual error is identically and independently distributed (ε  ∼ N(0, σ 2 ε )).The estimate derived from the linear model on observations, which only have spectral indices is interpreted as the PPE.
In the case of the nonparametric machine learning algorithms to follow, a seed was set for reproducibility in the case of S2 and S3, and 1000 permutations of fivefold crossvalidation in the training set was used to select the optimal hyper-parameters based on lowest achieved residual mean square error.When KNN models are used for regression of a continuous response, the data in a training set are used to construct a rule on how many nearest neighbors (k) to use in estimating the response in a test set.Estimation of the response is done directly by taking the k number of individuals in the training set most like the unobserved individual and directly averaging them to estimate the unobserved individual (Kramer, 2013;Taunk et al., 2019).
In the KNN algorithm used in this study, tested values of k ranged from 1 to 50 in groups of five (e.g.,  = 1, 5, 10, … , 50).Based on cross-validation hyper-parameter tuning of the training data, we chose the value of k which minimized the residual mean square error.The average among the KNNs used for estimating the unobserved individual in a validation set is interpreted as the PPE in this case.
RF algorithms are based on binary recursive partitioning to generate randomly designed decision trees.RF models can predict categorization by using these randomly generated decision trees to "vote" on categorical responses.However, in RF models used for regression, decision trees cannot take a direct voting approach when the response is continuous.Rather, RF models which are used for regression create cut-off points in the continuous response and use predictor variables (in this case, spectral reflectance data) to define an estimate of the response (grain yield) by averaging observations in the training data set that meet the same criterion.
For continuous predictor variables, finding the best possible split involves organizing the values of the predictors and identifying unique splits between every distinct pair of consecutive values.Normally, the midpoint of the interval between continuous response points is used as the point of division.These splits are then used to bifurcate the tree.In practice, many hundreds of trees are made using this method, and unobserved individuals are passed down these regression trees to form an estimation of the response.The prediction of the continuous response of the unobserved individual is equivalent to the summation of each estimation derived from each random regression tree in the random forest (Biau & Scornet, 2016;Breiman, 2001).This summation is interpreted as the PPE.
For the RF model used in this study, defining variables (spectra) used for randomly splitting nodes in regression trees ranged from 1 to 20 in groups of five (e.g., 1, 5, 10, . . . , 20).The number of random decision trees in the RF was optimized using the caret R package.Based on cross-validation hyper-parameter tuning, we selected the n number of random splitting nodes that resulted in lowest residual mean square error in the training set.
To compare the predictive ability of GEBVs against PPEs, the GEBVs and PPEs were regressed on the adjusted means.Genomic and phenomic prediction accuracies were equivalent to the correlation of the corresponding GEBV or PPE to the adjusted means of grain yield.Both the Pearson's correlation (r) and the coefficient of determination (r 2 ) were reported as well as the residual mean square error (RMSE).The RMSE of any model mentioned was calculated as such:

𝑁
where   is the ith prediction,   is the ith observations, and  is the total number of observations.In the current work, predictive ability is defined as the Pearson's correlation coefficient between either the GEBVs or PPEs and the adjusted mean relevant to the scenario proposed.

Heritabilities and correlations of spectra and grain yield
In general, the genomic, narrow-sense, per-plot heritability of spectra was lower in 2019 and 2020 in comparison to the 2021 growing season (Figure 2).In the 2021 growing season, most of the ranges for estimated heritability of spectra did not include zero in their interval and had smaller ranges than in previous years.The mean, minimum, maximum, and standard deviation of all heritabilities observed within a year for grain yield and all spectra are provided (Table S2).
Correlation of spectral adjusted means to grain yield across trials within each year indicated that spectra were significantly correlated with grain yield in 2021 (Figure 3).However, correlations were, for the most part, sparsely correlated or near zero for the 2019 and 2020 growing seasons.In 2021, most spectral data (RGB, RE, and NIR) collected from later flights had significant negative correlations with grain yield, while NDVI alone had significant positive correlations to yield.The significant positive correlation of grain yield to NDVI implies that plots with higher photosynthetic activity tended to yield higher, a trend consistent with other research in wheat (Freeman et al., 2003;Hassan et al., 2019).

Scenario 1: Cross-validation
Randomly partitioned 80:20 training-test cross-validation across 30 permutations revealed that gBLUP models, which only use additive genetic relationship information derived from molecular markers, underperformed in comparison to MLR, KNN, and RF models using only the five spectral indices recorded over four flights (Figure 4a).The average cross-validated prediction accuracy of GEBVs derived from gBLUP was r = 0.35.By comparison, the average crossvalidated prediction accuracies of PPEs derived from the MLR, KNN, and RF models were r = 0.41, r = 0.44, and r = 0.48, respectively.The standard deviations of the gBLUP, MLR, KNN, and RF predictions were σgBLUP = 0.028, σMLR = 0.027, σKNN = 0.028, and σRF = 0.024, respectively.A visualization of the prediction accuracy per permutation is provided (Figure 4b).The RMSEs for the gBLUP, MLR, KNN, and RF predictions were RSM E gBLUP = 266.3,RSM E MLR = 358.8,RSM E KNN = 317.9,and RSM E RF = 332.7,respectively.None of the models had noticeable differences in the standard deviation of their prediction accuracies, indicating that the range in prediction accuracies across permutations were similar.Of note, there was one prediction accuracy that significantly deviated in permutation seven of the MLR model (Figure 4b).A tabular summary of parameters from cross-validation is also provided (Table S3).

Scenario 2: Leave-one-trial-out forward validation
In the 2019 season, PPEs underperformed in terms of prediction accuracy (r) in comparison to GEBVs (Figure 5).In the 2020 trials, the only cases where the prediction accuracies for PPEs were comparable to or higher than those derived from GEBVs were in the AYND1 and AYND2.The gBLUP prediction accuracy for the AYND1 in 2020 was  = 0.33 as compared to  = 0.41 for the MLR model and  = 0.35 for the KNN model.Likewise, in the AYND2 in 2020, the prediction accuracy for gBLUP was  = 0.26, while the accuracies for MLR and KNN were  = 0.35 and  = 0.29, respectively.
The trends observed in the 2019 and 2020 data were not apparent in the 2021 data.Overall, PPE prediction accuracies outperformed GEBV prediction accuracies.In the PYN of 2021, the predictive accuracy of the gBLUP model was  = 0.21; by comparison, MLR, KNN, and RF had substantially higher prediction accuracies ( = 0.56,  = 0.45, and  = 0.57, respectively).In the AYN, AYND1, AYND2, and ELITE, the prediction accuracies derived from PPE estimates either performed on-par with or better than the prediction accuracy of GEBV in most cases (Figure 5).A table of all model statistics is provided as well (Table S4).

DISCUSSION
In the present work, GEBVs were compared to predictions from spectral data (denoted as PPEs) through comparisons of prediction accuracies across several scenarios.In S1, we performed a cross-validation study across 30 permutations and observed that models that produce PPEs using only five spectral indices over four flights were more accurate in terms of correlation with the adjusted mean than GEBVs produced from gBLUP models.In S2, we saw that PPEs were more accurate than GEBVs when the heritability of the spectral indices was moderate to high and when spectra were significantly correlated with grain yield.In S3, we saw that using data from different years to produce PPEs for an unobserved year produced more accurate results than GEBVs when heritability of spectral indices was moderate to high.Rincent et al. (2018) andRoberts et al. (2022) suggest that spectral reflectance captures the endophenotypes of organisms and it is this phenomenon that allows for accurate predictions in PS.Furthermore, spectral reflectance can give estimates of plant health, disease pressure, and environmental stress (Du & Noguchi, 2017;Francesconi et al., 2021;Montesinos-López et al., 2017;Qin et al., 2022).This may suggest that PPEs include nonadditive effects like environmental variation in their estimation, which may improve PPE model predictive accuracy.Krause et al. (2019) illustrated that relationship matrices can be calculated using hyperspectral reflectance and that those relationships derived from those matrices can be just as effective at predicting grain yield as relationship matrices derived from genetic data.Rincent et al. (2018) found that wheat grain yield and heading date could be accurately predicted within and among environments using NIR spectroscopy to produce PPEs.Moreover, Krause et al. (2019) reported that the level of prediction accuracy was highly correlated with the strength of relationship between hyperspectral reflectance and grain yield.Krause et al. (2019) also found that heritability was heterogenous across timepoints and that this affected prediction accuracy, which is consistent with the current work's results.Montesinos-López et al. (2017) found that removing "noisy bands" of hyper-spectral reflectance which had heritability lower than  = 0.50 increased the predictive accuracy of their models for grain yield in wheat.Furthermore, Montesinos-López et al. (2017) found that prediction accuracies were higher when they used all bands of spectra rather than one or a few vegetative indices.Rincent et al. (2018) illustrated that accurate predictions for poplar tree (Genus Populus) bud flesh could also be achieved through phenomic prediction, meaning that this method could have practical application outside wheat.In support of that point, it was found by Lane et al. (2020) that NIR spectroscopy readings from corn can accurately predict grain yield in environments under water stress and well-watered conditions.
In the present work, it was observed that the 2021 data showed higher heritability, higher correlations with grain yield, and improved prediction accuracies compared to those in 2019 and 2020.Two major differences in drone flights and data processing pipelines may be attributed to 2021 being the highest quality data in our dataset.First, the reduction in flight height from 2019 to 2020, and then again in 2020 to 2021, resulted in higher image resolution with each subsequent year, which increased the amount of data points per plot in the F I G U R E 5 Bar graphs of leave-one-trial-out forward validation depicting the prediction accuracy (r) for each trial within each year.The y-axis depicts the prediction accuracy as an expression of the Pearson's correlation coefficient between the estimated breeding value and the best linear unbiased estimate for that trial.Each sub-graph is labeled by the trial to which the contents belong and the year that the trial was planted.Trial abbreviations are as follows: preliminary yield nursery (PYN), advanced yield nursery (AYN), advanced doubled haploid yield nursery one (AYND1), advanced doubled haploid yield nursery two (AYND2), and CSU Elite Trial (ELITE).Bars are colored to denote the model used.The number displayed above each bar represents the prediction accuracy of that model, in that year, for that trial.Subgraphs which do not contain content symbolize the unavailability of that data within that year (e.g., PYN 2019 and 2020).gBLUP, genomic best linear unbiased prediction.calculation of average reflectance values.This most likely increased the accuracy of spectral averages derived from plots due to an increase in number of data points (pixels) per plot.
The second pipeline improvement which most-likely contributed to 2021 being the most informative data was the implementation of the ExGI to categorize and remove soil from the analysis in 2021.This process implemented in 2021 eliminated the risk of soil within the plot area skewing the averages away from representative plant reflectance values.While this process of classifying and removing soil could have been retroactively performed in previous years for a direct comparison, the flight heights in previous years led to a reduction in resolution, which hinders the programs' ability to discriminate between soil and plant matter accurately.Therefore, we recommend reduction of flight heights to approximately 60 m or less to increase resolution and the use of a similar machine learning method to remove nonplant material from images prior to extraction of spectra from images.
Another contributing factor for varying results across years may be related to flight dates (Table 2).In the current analysis, we took existing drone data taken within the CSU breeding program and aligned flight dates that were closest to each other in time.These data, while apparently informative, are not balanced.Moreover, growth stage in wheat is known to affect spectral reflectance (Sun et al., 2010;Xie et al., 2020).Therefore, to reduce variability in the dataset and potentially improve reproducibility of results, it is suggested to assess fields at the same growth stage across years.
In S1, where we compared 80:20 split cross-validation of GEBVs versus PPEs, we observed that all methods of estimation of PPEs were superior in their prediction accuracies in comparison to GEBVs.Thus, it appears that PPEs have similar predictive ability to GEBVs across years.However, it is important to note that the GEBVs estimated by our gBLUP models are based on additive effects, whereas the PPEs most-likely include environmental and nonadditive effects in their calculation.Therefore, it would be important to conduct follow-up studies on response to selection for PPEs versus GEBVs in early generation material.
While S1 demonstrates the predictability of the dataset upon itself, S2, which involved predicting yield for plots where harvest is not possible, may have some direct utility in breeding programs.These trials in S2, which would be otherwise abandoned due to harvest conditions or labor constraints, may be flown with drones and predictions can be made without genetic information.This information could be used for making selections in limited datasets as well as supplementing training population data in genomic prediction by increasing the number of total observations for the phenotype of interest, yet these applications must be vetted in further studies.
Although S1 and S2 yielded fascinating results, the scenario with the highest utility is S3, where we attempted to predict observations in an unobserved year with historical training data.While it appears that performing this procedure in years where spectral data collection is suboptimal (e.g., spectral data collected in 2019) leads to underperfor-mance in comparison to GEBVs (Figure 6), we did observe that in following years spectral data were more heritable and better correlated to yield, that PPEs were comparable to or better than GEBVs.This may indicate that PPEs may be made on unobserved genotypes in unobserved years so long as the criteria of correlation and heritability are met.
A major component of the cost and labor of phenotyping trials for yield data involves the planting and harvesting of plots.Calculation of PPEs for S3 may allow breeders to plant larger trials of early-generation material and selectively harvest a subset of lines based on collection of spectral data.Although it is possible to use spectral data from plots prior to harvest to produce PPEs, this method only saves effort by allowing for selective harvest, and the same amount of financial and work resources would still need to be invested in planting.In contrast, GS offers a way to reduce planting by using genotyping out of field and prediction, which can be more cost effective over time.
Selective harvest of plots can be challenging with mechanized harvest methods, and this may be less than optimal for programs which do not selectively harvest preliminary yield trials.In lieu of applying this method in the yield plot stage, perhaps the method proposed in S3 would be best applied in the earlier generations where the number of individuals to genotype would be far too large for GS and plots are harvested by hand.
Hypothetically, single plants or headrows of selection candidates could be planted with a subset of lines (a training population), which have replicated yield observations over years.The spectral information and historical yield data of the training population could then be used to produce PPEs for single plants or headrows which only have spectral data.This method is similar to the one proposed by Roberts et al. (2022).Regardless, this method has yet to be tested for within-family prediction/selection, and the response to PS in comparison to visual or GS has not been reported in the literature.
The results of the current work, in correspondence with previously written literature, support the hypothesis that PPEs can be produced from spectra and that PPE accuracies are comparable to or better than GEBV accuracies, if the spectra are heritable and correlated with the trait of interest.More interestingly, we have demonstrated that a limited number of spectral bands, rather than the wide range of spectra derived from hyperspectral analysis, may be used to create comparable predictions.
The presented results indicate that the 2019 spectral data were suboptimal due to sampling methods; however, models in S3, which included the 2019 and 2020 spectral data to predict grain yield in 2021, managed to produce prediction accuracies higher than that of gBLUP (Figure 6), which seems to suggest that large datasets over time can compensate for suboptimal years of data collection.Thus, with further investigation and restructuring of breeding methods, phe-nomic prediction may be an effective alternative to genomic prediction in earlier generation material.

F
I G U R E 1 A diagram of the Colorado State University (CSU) breeding program.Boxes represent germplasm/trials associated with the first few generations of inbreeding (F 1 to F 3:4 ), doubled haploids (DHs), preliminary yield nursery (PYN), advanced yield nursery (AYN), double haploid advanced yield nursery one (AYND1) and two (AYND2), CSU Elite Trial (ELITE), and individuals that stay in the CSU Elite Trial for 3-4 years (ELITE+).Colors represent the plot type of each germplasm/trial.All generations after the F 3:4 are F 3 -derived lines (e.g., PYN = F 3:5 and AYN = F 3:6 ).

F I G U R E 2
Bar-chart of average heritability across trials with each growing cycle.On the y-axis is the narrow-sense, per-plot, and genomic heritability (h 2 ).The x-axis depicts grain yield, red-green-blue (RGB) spectral reflectance, red edge (RE), near-infrared (NIR), and normalized difference vegetation index (NDVI).The flight from which each spectral index was collected and is noted next to the spectrum name (e.g., Red [Flight 1]).Each bar represents the average heritability of the corresponding trait on the x-axis.The black error lines surrounding each bar represent the observed range for that heritability across nurseries.

F
I G U R E 3 Bar-chart of Pearson's correlation coefficient (r) of spectra versus grain yield.The x-axis depicts different combinations of spectra and flights.Each bar represents the Pearson's correlation coefficient of the corresponding trait on the x-axis to grain yield.The color of each bar represents the significance of that correlation.The figure legend denoting significance can be found to the right of the graphs.NDVI, normalized difference vegetation index; NIR, near-infrared, NS, not significant; RE, red edge.F I G U R E 4 Boxplots of cross-validation prediction accuracies (a) and a visualization of prediction accuracy (r) over 30 permutations (b).On the y-axis of panel (a) is the name of the model.On the x-axis of panel (a) is the prediction accuracy.The sub-header of all graphs in panel (b) displays the name of the model.On the y-axis of panel (b) is the prediction accuracy.On the x-axis of panel (b) is the index for each permutation.The figure legend on the far right displays the corresponding colors for each model.gBLUP, genomic best linear unbiased prediction.

F
I G U R E 6 Bar graphs of leave-one-year-out forward validation prediction accuracies.The y-axis depicts the prediction accuracy (r).Each subplot has a subtitle above it indicating the year to which the accuracy belongs.Each bar within each graph has been filled with a color that corresponds to what model that bar represents.The figure legend to the right denotes what model is represented by each color.Above each bar, in plain text, is the prediction accuracy.gBLUP, genomic best linear unbiased prediction.
Flight date within year.
T A B L E 2