Sparse multi-trait genomic prediction under balanced incomplete block design

Sparse testing is essential to increase the efficiency of the genomic selection methodology, as the same prediction power can be obtained while evaluating fewer genotypes in the field. For this reason, it is important to evaluate the existing methods for allocating lines to environments. With this goal, four methods (M1–M4) for allocating lines to environments were evaluated in the context of a multi-trait genomic prediction problem: M1 denotes the allocation of a fraction (subset) of lines in all locations; M2 denotes the allocation of a fraction of lines with some shared lines in locations; M3 denotes the random allocation of lines to locations; and M4 denotes the allocation of lines to locations using a balanced incomplete block design (BIBD).

For the reason above, many approaches have been evaluated to increase the efficiency of the GS methodology, some of which are enumerated hereafter. Approach 1 is centered on evaluating many statistical machine learning methods, which have been used in GS to identify robust methods that guarantee high accuracy; for this reason, mixed models, penalized methods such as ridge regression and lasso regression, Bayesian methods (BayesA, BayesB, BayesC, Bayesian lasso, GBLUP, etc.), random forest, gradient boosting machines, artificial deep neural networks, and others have been tested in GS (Montesinos-López, Montesinos-López, & Crossa, 2022). Approach 2 consists of guaranteeing high-quality phenotypic and genotypic data to obtain high-quality outputs. Approach 3 focuses on optimizing the training and testing sets to guarantee predictions in the testing set according to the complexity of the traits; many publications corroborate that the composition of the training sets is essential in improving prediction accuracy (Rio et al., 2022). Under this third approach, many strategies for selecting a training set given a testing set have been proposed, while many other strategies for creating a testing set given a training set have also been studied and evaluated (Lopez-Cruz et al., 2020; Lopez-Cruz & de los Campos, 2021). Also under the third approach, when the genotypes are evaluated in multi-environment trials (METs), candidate genotypes are evaluated under different environmental conditions, and breeders can select genotypes that are stable across environments and genotypes suited to specific environments, since the genotype × environment (GE) interaction is modeled. Nevertheless, evaluating all genotypes in all the environments is financially costly, as it requires extensive field-testing evaluations (Smith, Butler, et al., 2015; Smith, Ganesalingam, et al., 2015).
Core Ideas
• Genomic selection increases genetic gain in plant breeding programs.
• Sparse testing is essential to increase genomic prediction accuracy.
• An incomplete block design for sparse testing improves genomic-enabled prediction accuracy.

Consequently, to reduce plant breeding costs in the implementation of the GS methodology in METs, "sparse testing" strategies have been proposed. Sparse testing means that some genotypes are evaluated in some environments but not in others. As such, sparse testing has the potential to increase the number of environments under evaluation in early-stage yield testing without a significant increase in the cost of phenotyping. Since sparse testing does not evaluate all lines in each environment and only a fraction of lines is allocated to each environment, it is very attractive for achieving the following two goals: (a) increasing the number of lines tested across multiple, diverse environments and (b) increasing the number of testing environments while maintaining the same selection intensity (Atanda et al., 2021; Burgueño et al., 2012; Jarquin et al., 2020). For sparse testing, different approaches could be followed for distributing lines across environments. Let us suppose 10 lines and four locations, with half of the lines (5) for training and half (5) for testing. The simplest sparse testing method, M1, consists of allocating a subset of genotypes to all locations (e.g., the same five lines tested in the four locations); M2 allocates a subset of lines in such a way that some lines are shared between some locations (e.g., four locations with the same three lines, three locations with two different lines, and one line tested in only one location); M3 randomly allocates lines to locations such that all five lines are randomly distributed in each location (e.g., some lines are in three locations, other lines are in two locations, and others in one location); and M4 allocates a subset of lines to locations following a balanced incomplete block design (BIBD) scheme. Thus, the difference between M2 and M4 is the arrangement of the subset of lines (BIBD for M4 vs. random for M2).
The BIBD method for sparse allocation of lines to locations is very attractive since it exploits all the virtues of experimental designs, which are powerful tools that have long been used in plant breeding programs to improve the accuracy of parameter estimates. Using five real data sets, Montesinos-López, Montesinos-Lopez, Acosta, et al. (2022) showed that the prediction accuracy of the BIBD sparse testing method (M4) is better than that of the random allocation of lines to locations (M3); however, this method was proposed and evaluated under a univariate (single-trait) framework and was only compared with M3. It should be noted that these four sparse testing methods (M1, M2, M3, and M4) have not yet been compared in the same study.
Therefore, based on the previous considerations, the main objective of this study is to evaluate the four methods of sparse testing under a multivariate (multi-trait) context, that is, when we are interested in predicting more than one response variable simultaneously. To compare the genomic prediction ability of the four sparse testing methods (including the BIBD, M4), we study seven real multi-trait, multi-environment data sets.

Data sets
In this section, we present the seven data sets used for benchmarking the four sparse allocation methods that are described in the next section. The quality control for the genomic data consisted of removing markers with more than 15% missing values and those with a minor allele frequency (MAF) < 0.05.
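The quality-control rule just described can be sketched as follows (an illustrative Python sketch; the function name and the 0/1/2 coding with NaN for missing calls are assumptions, not the authors' pipeline):

```python
import numpy as np

def qc_markers(M, max_missing=0.15, min_maf=0.05):
    """Drop markers (columns) with more than 15% missing values or a
    minor allele frequency below 0.05; M is lines x markers, coded
    0/1/2 with np.nan marking missing calls."""
    missing_rate = np.isnan(M).mean(axis=0)
    p = np.nanmean(M, axis=0) / 2.0          # frequency of the counted allele
    maf = np.minimum(p, 1.0 - p)
    keep = (missing_rate <= max_missing) & (maf >= min_maf)
    return M[:, keep], keep

# Toy example: the third marker is monomorphic (MAF = 0) and is removed.
M = np.array([[0.0, 1.0, 2.0],
              [1.0, 2.0, 2.0],
              [2.0, 1.0, 2.0],
              [1.0, 0.0, 2.0]])
M_qc, keep = qc_markers(M)
```

Imputation of the surviving markers would follow this filter, as done with LinkImpute for the wheat data sets.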

Data set 1. Disease
In this data set with 438 wheat lines, three traits (diseases) were measured: Pyrenophora tritici-repentis, which causes a disease known as yellow spot, also known as yellow leaf spot, tan spot, yellow leaf blotch, or helminthosporiosis; Parastagonospora nodorum, a major wheat fungal pathogen that affects the leaves and other parts of the plants; and Bipolaris sorokiniana, the cause of seedling diseases, common root rot, and spot blotch in several crops such as barley and wheat. Over a long period during the same year, these 438 lines were evaluated in the greenhouse in several replicates, which were subsequently considered environments (Env1, Env2, Env3, Env4, Env5, and Env6), as the experiments were established in different greenhouses and on different planting dates. For the three traits measured, the total number of observations was 438 × 6 = 2628. It should be noted that this data set was also used by Montesinos-Lopez et al. (2020). The response was measured on a continuous scale from 1 (resistant) to 5 (susceptible). DNA samples were genotyped using 67,436 single nucleotide polymorphisms (SNPs). For a given marker, the genotype for each line was coded as the number of copies of a designated marker-specific allele carried by the line (absence = zero and presence = one). The markers that had more than 15% missing values were removed, as well as markers with MAF < 0.05. A total of 11,617 SNPs were still available for analysis after quality control and imputation.
Data sets 2–5. EYT_1–EYT_4
The experimental design used was an alpha-lattice design, and the lines were sown in 39 trials, each covering 28 lines and two checks in six blocks with three replications. Several traits were available for some environments and lines in each data set. In this study, we included four traits measured for each line in each environment: days to heading (DTHD, number of days from germination to 50% spike emergence), days to maturity (DTMT, number of days from germination to 50% physiological maturity or the loss of the green color in 50% of the spikes), plant height (PH), and grain yield (GY). For full details on the experimental design and how the best linear unbiased estimates were computed, see Juliana et al. (2018). Since the wheat lines included in data sets 2–5 are different (as they come from different years), we decided to treat each cycle (year) separately.
The lines examined in these data sets were evaluated in four environments. For data set 2 (EYT_1), the environments were bed planting with five irrigations (Bed5IR), early heat (EHT), flat planting with five irrigations (Flat5IR), and late heat (LHT). For data set 3 (EYT_2), the environments were bed planting with two irrigations (Bed2IR), Bed5IR, EHT, and Flat5IR. For data set 4 (EYT_3), the environments evaluated were Bed2IR, Bed5IR, Flat5IR, and flat planting with drip irrigation (FlatDrip). For data set 5 (EYT_4), the four environments were Bed5IR, EHT, Flat5IR, and FlatDrip. Genome-wide markers for the 3495 (776 + 775 + 964 + 980) lines in the four data sets were obtained through genotyping-by-sequencing (Elshire et al., 2011) at Kansas State University using an Illumina HiSeq2500. After filtering, 2038 markers were obtained from an initial set of 34,900 markers. The imputation of missing marker data was carried out using LinkImpute (Money et al., 2015), as implemented in TASSEL (Trait Analysis by Association Evolution and Linkage) version 5 (Bradbury et al., 2007). Lines that had more than 50% missing data were removed, and 3495 lines were used in this study (776 lines in the second data set, 775 lines in the third data set, 964 lines in the fourth data set, and 980 lines in the fifth data set).

Data set 6. Groundnut
The phenotypic data set reported by Pandey et al. (2020) includes information on the phenotypic performance of 318 groundnut lines for various traits in four locations. We assessed genomic-enabled predictions for the following four traits: pods per plant (NPP), pod yield per plant (PYPP) measured in grams, seed yield per plant (SYPP) measured in grams, and yield per hectare (YPH) measured in kilograms. The locations are denoted as Aliyarnagar_R15, Jalgoan_R15, ICRISAT_R2015, and ICRISAT_PR15-16. The data set is balanced, giving a total of 1272 assessments with each line included once in each location. Marker data were available for all lines, and 8268 single nucleotide polymorphism (SNP) markers remained after quality control (with each marker coded as 0, 1, or 2).
Data set 7. Japonica
Monteverde et al. (2019) also reported a rice data set, belonging to the tropical Japonica population, with four traits (GY; percentage of head rice recovery; percentage of chalky grain; and plant height, PH), as for the Indica population, but containing 4 years (2010, 2011, 2012, and 2013). A total of 127 lines were evaluated each year, and 16,383 SNP markers remained after quality control, with each marker coded as 0, 1, or 2. Thus, a total of 508 assessments were evaluated across the four environments (years).

Sparse allocation methods of lines to environments
In this section, we will describe the four sparse testing methods used to allocate lines to locations.

Method 1 (M1): Allocation of a fraction (subset) of lines in all locations
This is the simplest allocation method where a fraction (subset) of lines is taken, then allocated in all locations as a training set, and the remaining lines are used as the testing set. In Figure 1a, we can see how the partition is formed with this method. The green color represents the lines used as the training set, and the white color represents the lines used as the testing set. Training lines are grouped at the beginning of Figure 1a, but lines are not in numerical order, thus showing that they were randomly selected.
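A minimal sketch of M1 in Python (illustrative only; the line and location labels are hypothetical):

```python
import random

def allocate_m1(lines, locations, train_frac=0.5, seed=1):
    """M1: one random fraction of lines forms the training set in every
    location; the remaining lines form the testing set everywhere."""
    rng = random.Random(seed)
    n_train = round(len(lines) * train_frac)
    train = sorted(rng.sample(lines, n_train))
    # The same training subset is replicated across all locations.
    return {loc: train for loc in locations}

alloc = allocate_m1(list(range(1, 11)), ["Loc1", "Loc2", "Loc3", "Loc4"])
```

With 10 lines and four locations, every location trains on the identical five lines, which is exactly why M1 samples the fewest distinct line-environment combinations.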

Method 2 (M2): Allocation of a fraction (subset) of lines with some shared lines in locations
Similar to M1, M2 takes a fraction (subset) of lines to be used as the training set and the remaining lines as the testing set in one location. For the other locations, the testing lines are divided into as many parts as there are locations, each ideally of the same size, and one part is interchanged from testing to training for each location. In this way, the locations share most of their training lines, but each contains some lines that are used for training only there, as shown in Figure 1b.
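One way to realize M2 in code (a sketch under assumptions: a shared core plus equal-sized interchange parts, since the text leaves the exact split open):

```python
import random

def allocate_m2(lines, locations, train_frac=0.5, seed=1):
    """M2 sketch: a core of lines is trained in every location, and each
    location fills its remaining training slot(s) with its own part of
    the testing lines, so locations share most training lines."""
    rng = random.Random(seed)
    n_train = round(len(lines) * train_frac)
    shuffled = lines[:]
    rng.shuffle(shuffled)
    n_swap = max(1, n_train // len(locations))   # lines interchanged per location
    core = shuffled[: n_train - n_swap]          # shared by every location
    pool = shuffled[n_train - n_swap:]           # candidates for interchange
    parts = [pool[j * n_swap:(j + 1) * n_swap] for j in range(len(locations))]
    return {loc: sorted(core + parts[j]) for j, loc in enumerate(locations)}

alloc = allocate_m2(list(range(1, 11)), ["Loc1", "Loc2", "Loc3", "Loc4"])
```

With 10 lines and four locations, each location trains on the same four core lines plus one location-specific line, so most, but not all, training lines are shared.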

Method 3 (M3): Random allocation of a fraction of lines to locations
Starting from a balanced data set with J lines and b locations, the random allocation of lines to locations was done in such a way that each line is repeated in approximately r out of the b locations, and all locations are of the same size (k). The algorithm (Montesinos-López, Montesinos-Lopez, Acosta, et al., 2022) for this random allocation is as follows:
1. First, we compute k = ⌈(J × r)/b⌉, the least integer greater than or equal to (J × r)/b, where r is the number of replications of each line. Then, k lines out of the J lines are randomly allocated to the first location.
2. Next, for the second location, k out of the J lines are once again randomly allocated.
3. This process is repeated until the bth location is completed, with the caveat that the lines allocated to a particular location must be present in no more than r locations, ideally in exactly r locations. The lines that do not satisfy this restriction are not candidates for being allocated to a particular location.
In Figure 1c, one example of this method is shown with 10 lines and four locations. As can be seen, some lines appear in up to three locations, while others appear in only two or one, since it is a random process.
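The random-allocation algorithm can be sketched in Python (illustrative; the top-up rule for when too few eligible lines remain is an assumption needed to keep all location sizes equal, and it is what lets some lines end up in three locations):

```python
import math
import random

def allocate_m3(n_lines, n_locs, reps=2, seed=7):
    """M3 sketch: each location receives k = ceil(n_lines * reps / n_locs)
    lines drawn at random; a line stops being a candidate once it already
    appears in `reps` locations."""
    rng = random.Random(seed)
    k = math.ceil(n_lines * reps / n_locs)
    count = {line: 0 for line in range(1, n_lines + 1)}
    design = {}
    for loc in range(1, n_locs + 1):
        candidates = [l for l, c in count.items() if c < reps]
        # If too few eligible lines remain, top up from the least-used ones.
        if len(candidates) < k:
            spare = sorted(set(count) - set(candidates), key=count.get)
            candidates += spare[: k - len(candidates)]
        chosen = rng.sample(candidates, k)
        for l in chosen:
            count[l] += 1
        design[loc] = sorted(chosen)
    return design, count

design, count = allocate_m3(10, 4)
```

Each location ends up with k = 5 lines and the 20 training plots are spread unevenly over lines, mirroring the behavior described for Figure 1c.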

Method 4 (M4): Allocation of lines to locations using the BIBD
A BIBD is a design in which all pairs of lines occur together within a location an equal number of times; in general, we denote by λ the number of times each pair of lines occurs together within a location. To generate this sparse allocation of lines to locations (Montesinos-López, Montesinos-Lopez, Acosta, et al., 2022), we can use the function find.BIB() from the R package crossdes (R Core Team, 2022). Suppose there are J = 10 lines and b = 4 locations; this means that we would need 40 plots to allocate the 10 lines to the four locations under a complete block design. However, we use a BIBD with a training set of size N_Trn = 20, which accounts for 50% of the total plots required under a randomized complete block design. Therefore, the number of lines per location can be obtained by solving b × k = N_Trn for k, which results in k = N_Trn/b = 20/4 = 5 lines per location. Then, the corresponding elements for the training set can be obtained with the function find.BIB(10, 4, 5) of the package crossdes, whose arguments denote the number of lines, the number of locations, and the number of lines per location, respectively. Figure 1d shows how a particular allocation may look using this method. One important aspect is that all lines are replicated the same number of times (r = 2) and all locations contain the same number of lines (k = 5). This is an important difference with respect to method M3, where some lines are tested in three locations, other lines are tested in two locations, and one line is evaluated in only one location.

FIGURE 1: Allocation methods with 10 lines and four locations for a partition with 50% of lines as training and 50% of lines as testing. M1 denotes the allocation of some lines in all locations, M2 denotes the allocation of a fraction (subset) of lines with some shared lines in locations, M3 denotes the random allocation of some lines to locations, and M4 denotes the allocation of a fraction of lines to locations using the balanced incomplete block design method.
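Since find.BIB() is an R function, a small Python checker is sketched here instead to verify the balance properties that distinguish M4 (the example blocks are hand-made for J = 10, b = 4, k = 5, not actual find.BIB output):

```python
from collections import Counter
from itertools import combinations

def balance_summary(blocks):
    """Count replications per line and pairwise concurrences (how often
    each pair of lines meets within the same location)."""
    reps = Counter(line for block in blocks for line in block)
    pairs = Counter(p for block in blocks
                    for p in combinations(sorted(block), 2))
    return reps, pairs

# Hand-made allocation in the spirit of find.BIB(10, 4, 5): each line is
# replicated r = 2 times and each location holds k = 5 lines.
blocks = [[1, 2, 3, 4, 5],
          [6, 7, 8, 9, 10],
          [1, 2, 6, 7, 8],
          [3, 4, 5, 9, 10]]
reps, pairs = balance_summary(blocks)
```

Here every line is replicated exactly twice and every location has five lines, unlike M3. Note that with J = 10, k = 5, and r = 2, the relation λ = r(k − 1)/(J − 1) = 8/9 is not an integer, so the pairwise concurrences cannot all be equal; in such cases find.BIB() searches for a design that is as close to balanced as possible.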

Statistical model
This model is given by

Y = 1μ' + X_E β_E + Z_1 b_1 + Z_2 b_2 + ε

where Y is the matrix of phenotypic response variables of order n × t, ordered first by environments and then by lines, with t denoting the number of traits; 1 is a vector of ones of order n × 1; μ is a vector of intercepts for each trait of length t, and ' denotes the transpose of a vector or matrix; X_E is the design matrix of environments of order n × I, where I denotes the number of environments (locations); β_E is the matrix of beta coefficients for environments with a dimension of I × t; Z_1 is the design matrix of lines of order n × J, where J denotes the number of lines; b_1 is the matrix of random effects of lines of order J × t, distributed as b_1 ~ NM_{J×t}(0, G, Σ_t), that is, with a matrix-variate normal distribution with mean matrix 0, row covariance G, and column covariance Σ_t; G is the genomic relationship matrix, as proposed by VanRaden (2008), built with marker data and of order J × J; Σ_t is the variance-covariance matrix of traits of order t × t; Z_2 is the design matrix of the GE interaction of order n × IJ; b_2 is the matrix of the GE interaction random effects, distributed as b_2 ~ NM_{IJ×t}(0, Σ_E ⊗ G, Σ_t), where Σ_E is a diagonal variance-covariance matrix of environments of order I × I and Σ_E ⊗ G is the Kronecker product of the environmental relationship matrix and the genomic relationship matrix of lines; and ε is the residual matrix of dimension n × t, distributed as ε ~ NM_{n×t}(0, I_n, R), where R is the residual variance-covariance matrix of order t × t. The implementation of this model was carried out in the BGLR library of Pérez and de los Campos (2014).
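For concreteness, the genomic relationship matrix G used in the model can be built from markers following VanRaden (2008); a minimal Python sketch (the model itself was fitted with the BGLR R library, not this code):

```python
import numpy as np

def vanraden_grm(M):
    """Genomic relationship matrix G = WW' / (2 * sum_j p_j (1 - p_j))
    (VanRaden, 2008), where W centers the 0/1/2 marker matrix M
    (lines x markers) by twice the allele frequencies p_j."""
    p = M.mean(axis=0) / 2.0        # allele frequency per marker
    W = M - 2.0 * p                 # centered marker matrix
    denom = 2.0 * np.sum(p * (1.0 - p))
    return W @ W.T / denom

# Toy example: 3 lines x 4 markers.
M = np.array([[0.0, 1.0, 2.0, 1.0],
              [1.0, 2.0, 0.0, 1.0],
              [2.0, 1.0, 1.0, 0.0]])
G = vanraden_grm(M)
```

The resulting G is symmetric and of order J × J (here 3 × 3), one row and column per line.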
Under the four types of allocation methods, we use the notation J for the number of lines, k for the number of lines per location, b for the number of locations, and r for the number of replications of each line in the entire design. It should be noted that in BIBDs k will be less than J, since not all the lines can be assigned to each location. An equal number of replications per entry is the best way to ensure minimum variance when making all possible pairwise comparisons. Therefore, since the number of replications is equal to r for all lines, the total number of observations in the experiment is N, where N = b × k = J × r.

Cross-validation strategy
To evaluate and compare the predictive performance of the allocation methods, we used cross-validation with 10 partitions and 50% of the data for training and 50% for testing. The average normalized root mean squared error (NRMSE) over the 10 random partitions was computed (Montesinos-López, Montesinos-López, & Crossa, 2022), and this metric was used to assess the predictive performance in each data set for each allocation method. NRMSE is defined as

NRMSE = sqrt[(1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²] / ȳ

where y_i is the ith observed value, ŷ_i is the ith predicted value, n is the total number of observations, and ȳ is the mean of the observed values. Additionally, the average Pearson's correlation (APC) between observed and predicted values was reported as a metric of prediction accuracy. Across locations, the NRMSE and Pearson's correlation in each partition were computed between the averages of the true and predicted phenotypic values over locations; subsequently, the average of the NRMSEs and the APC over the 10 partitions were reported as the prediction performance in each data set. Since the allocation methods were evaluated with a multi-trait model, the NRMSE was computed for each trait separately and then averaged within each fold.
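The two metrics can be sketched in Python (illustrative; normalizing the RMSE by the mean of the observed values is the convention assumed here):

```python
import numpy as np

def nrmse(y_obs, y_pred):
    """Root mean squared error normalized by the mean of observed values."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_obs - y_pred) ** 2)) / np.mean(y_obs)

def apc(folds):
    """Average Pearson's correlation over (observed, predicted) folds."""
    return np.mean([np.corrcoef(o, p)[0, 1] for o, p in folds])

obs, pred = [2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5]
print(round(nrmse(obs, pred), 3))   # errors of ±0.5 around a mean of 5 -> 0.1
```

In the study, these quantities would be computed per trait and per partition and then averaged, as described above.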

RESULTS
The results are given in seven sections, one for each data set, and in each section we compare the prediction performance obtained with the four methods of sparse allocation of lines to locations. This comparison was made in terms of the NRMSE and the APC.

FIGURE 2: Prediction performance in terms of normalized root mean squared error (NRMSE) (A) and average Pearson's correlation (APC) (B) for data set 1 (disease). M1 denotes the allocation of lines in all locations, M2 denotes the allocation with some shared lines in locations, M3 denotes the random allocation of lines to locations, and M4 denotes the allocation of lines to locations using the balanced incomplete block design (BIBD) method. Ten partitions were implemented with 50% of the data for training and 50% for testing.

Data set 1 (disease)
For this data set, the best results in terms of NRMSE were obtained with method M4 (the BIBD method), according to the results of Table A1 and Figure 2A. This best method outperformed M1, M2, and M3 by 14.05%, 2.92%, and 0.45%, respectively. In terms of APC, the same pattern is observed: M4 gives the greatest APC (0.4545), while M1 gives the lowest (0.1509) (Figure 2B).

Data set 2. EYT_1
Like the results in data set 1, the best prediction performance was obtained with M4, followed by M3, M2, and finally M1. M4 was superior to M1, M2, and M3 by 16.46%, 5.1%, and 2.1%, respectively ( Figure 3A). For APC, the same pattern is observed since M4 produces the highest APC followed by methods M3, M2, and M1, respectively ( Figure 3B). Details are given in Table A1.

FIGURE 3: Prediction performance in terms of normalized root mean squared error (NRMSE) (A) and average Pearson's correlation (APC) (B) for data set 2 (EYT_1). M1 denotes the allocation of lines in all locations, M2 denotes the allocation with some shared lines in locations, M3 denotes the random allocation of lines to locations, and M4 denotes the allocation of lines to locations using the balanced incomplete block design (BIBD) method. Ten partitions were implemented with 50% of the data for training and 50% for testing.

Data set 3. EYT_2
This data set contains four traits (DTHD, DTMT, GY, and Height), and the results show that M4 (BIBD) gives the lowest error in terms of NRMSE across traits. Comparing the M4 method with all other methods, M4 is more efficient than M1 by 19.85%, M2 by 5.08%, and M3 by 2.6% ( Figure 4A and Table A1). As expected, the methods with lowest errors are the ones with higher APC, with M4 having an APC = 0.8444, M3 with an APC = 0.8322, M2 with an APC = 0.8156, and finally M1 with an APC = 0.6994 ( Figure 4B).

Data set 4. EYT_3
In Figure 5A, we present the results for data set 4 in terms of NRMSE where we can observe that M4 and M3 were superior to M1 and M2 since they present lower prediction errors. Specifically, M4 is superior to M3 by 1.71%, to M2 by 5.55%, and to M1 by 14.52%, according to Figure 5A and Table A1.
In terms of APC, the results in Figure 5B show that APC is higher for those methods with lower NRMSE. For example, M4 gives the lowest NRMSE (0.0459) and the highest APC (0.9363); on the other hand, M1 gives the highest NRMSE (0.0537) and the lowest APC (0.8587) (Figure 5). See details in Table A1.

FIGURE 4: Prediction performance in terms of normalized root mean squared error (NRMSE) (A) and average Pearson's correlation (APC) (B) for data set 3 (EYT_2). M1 denotes the allocation of lines in all locations, M2 denotes the allocation with some shared lines in locations, M3 denotes the random allocation of lines to locations, and M4 denotes the allocation of lines to locations using the balanced incomplete block design (BIBD) method. Ten partitions were implemented with 50% of the data for training and 50% for testing.

Data set 5. EYT_4
This is the largest of all the data sets evaluated. In this data set, the gain of methods M2, M3, and M4 with respect to M1 is considerably large (Figure 6A). M2, M3, and M4 give similar results to each other, but their gains in terms of NRMSE over M1 are 15.94% (M2), 16.69% (M3), and 17.89% (M4); once again, the largest gain was observed with method M4. In terms of APC, the gain of M2, M3, and M4 is also displayed in Figure 6B, where we can observe that these three methods have similar APC, while that of M1 is considerably lower. Details are given in Table A1.

Data set 6. Groundnut
The results for the groundnut data set follow the same patterns described for the previous data sets but with smaller differences. As in the other data sets, M4 gives the best NRMSE (0.366), superior to M1 by 3.37%, M2 by 2.19%, and M3 by 0.57% (Figure 7A). As can be seen in Figure 7A, the NRMSEs are similar, but M4 is still the best. In terms of APC, the gain of M4 with regard to M1 was 4.92%; with respect to M2, the gain was 4.6%; and compared with M3, the gain was 2.12% (Figure 7B). See details in Table A1.

FIGURE 5: Prediction performance in terms of normalized root mean squared error (NRMSE) (A) and average Pearson's correlation (APC) (B) for data set 4 (EYT_3). M1 denotes the allocation of lines in all locations, M2 denotes the allocation with some shared lines in locations, M3 denotes the random allocation of lines to locations, and M4 denotes the allocation of lines to locations using the balanced incomplete block design (BIBD) method. Ten partitions were implemented with 50% of the data for training and 50% for testing.

Data set 7. Japonica
In this data set, we can observe that M4 is superior, in terms of NRMSE, to M1, M2, and M3 by 7.84%, 1.67%, and 1.53%, respectively (Figure 8A). The difference can be seen in Figure 8, where M2 and M3 give similar results. In terms of APC (Figure 8B), while M2 and M3 again provide similar results, the highest APC is observed with M4 and the lowest with M1, as expected. Details are in Table A1.

DISCUSSION
A successful implementation of GS strongly depends on its ability to guarantee high prediction accuracy of the genetic merit of candidate genotypes. However, the quality of prediction accuracy strongly depends on the number of replications (or environments) evaluated in plant breeding programs. As such, sparse testing methods offer the opportunity to increase the selection intensity in early yield breeding programs even with the same number of testing environments. For this reason, it is important to evaluate the existing sparse testing strategies in order to use the most optimal one for selecting candidate genotypes with high accuracy. Therefore, we performed a benchmarking study to compare the following four sparse testing methods: M1, where we allocated some lines in all locations; M2, where we allocated a fraction of lines with some shared lines in locations; M3, where the allocation of some lines to locations was random; and M4, where the allocation of a subset of lines to locations was performed using a balanced incomplete block experimental design. See Figure 1 for details of these four methods.

FIGURE 6: Prediction performance in terms of normalized root mean squared error (NRMSE) (A) and average Pearson's correlation (APC) (B) for data set 5 (EYT_4). M1 denotes the allocation of lines in all locations, M2 denotes the allocation with some shared lines in locations, M3 denotes the random allocation of lines to locations, and M4 denotes the allocation of lines to locations using the BIBD method. Ten partitions were implemented with 50% of the data for training and 50% for testing.
We found that the allocation of lines to locations (or environments) using the BIBD method (M4) is superior to the allocation of lines in all locations (M1), the allocation with some shared lines in locations (M2), and the random allocation of lines to locations (M3) in all the evaluated data sets. With the biggest data sets, the superiority of methods M4 and M3 over the conventional method (M1) is clearer. For example, in terms of NRMSE, in data set 5 (EYT_4), which is the biggest data set, method M4 outperformed M1 by 17.89%, while for data set 7 (Japonica), the smallest data set, M4 outperformed M1 by only 7.84%. In terms of APC, the same pattern is observed: M4 gives the greatest APC in all data sets, while M1 gives the lowest. In general, we observed that methods M3 and M4 are not largely affected by the size of the data set.

The Plant Genome

FIGURE 7: Prediction performance in terms of normalized root mean squared error (NRMSE) (A) and average Pearson's correlation (APC) (B) for data set 6 (Groundnut). M1 denotes the allocation of lines in all locations, M2 denotes the allocation with some shared lines in locations, M3 denotes the random allocation of lines to locations, and M4 denotes the allocation of lines to locations using the balanced incomplete block design (BIBD) method. Ten partitions were implemented with 50% of the data for training and 50% for testing.

FIGURE 8: Prediction performance in terms of normalized root mean squared error (NRMSE) (A) and average Pearson's correlation (APC) (B) for data set 7 (Japonica). M1 denotes the allocation of lines in all locations, M2 denotes the allocation with some shared lines in locations, M3 denotes the random allocation of lines to locations, and M4 denotes the allocation of lines to locations using the balanced incomplete block design (BIBD) method. Ten partitions were implemented with 50% of the data for training and 50% for testing.
Method M4 was initially proposed under a univariate (single-trait) setting (Montesinos-López, Montesinos-Lopez, Acosta, et al., 2022), that is, for univariate genomic prediction, and in that publication it was only compared with method M3; the authors reported that M4 was superior to M3 in terms of prediction accuracy. However, to study whether M4 can be extrapolated to the multivariate (multi-trait) context, that is, when the goal is to predict more than one trait (which is very important in genomic prediction, since it can help increase the prediction accuracy by taking advantage of the degree of correlation between traits), we compared not only M3 and M4 but also M1 and M2. This study shows that the single-trait findings of Montesinos-López, Montesinos-Lopez, Acosta, et al. (2022) are also valid under a multi-trait genomic prediction framework, although here we did not find very large differences between M3 and M4. Since these methods (M3 and M4) were also compared to methods M1 and M2, we found that the worst method was M1, followed by M2, while the best method was M4, followed by M3.

Optimizing sparse testing by random allocation of lines in environments
For genomic-assisted breeding, it is of great importance to optimize marker and field evaluation resources to increase genetic gains. Breeders define the cost of genotyping in terms of field plot units and then set different strategies for defining the number of lines to genotype and evaluate in the field using a sparse field allocation scheme. Although several strategies could be studied, the main goal of a sparse testing scheme is to increase the selection intensity by increasing the number of tested lines. This strategy requires investigating how many candidate lines could be repeatedly tested across environments and how many could be planted in only one or a few environments.
Adequate testing strategies are fundamental for accurately recycling parents, and sparse testing allows a higher selection intensity and a larger environmental sampling, as more lines can be tested in more environments. Jarquin et al. (2020) in maize and Crespo-Herrera et al. (2021) in wheat multi-environment trials assessed the effects on genomic predictive ability of allocation designs in which a certain number of different maize or wheat lines are distributed across different environments while another set of lines is repeatedly observed in all environments.
Results of the maize and wheat multi-environment trials for genomic prediction accuracy indicate that important savings could be achieved by testing a small number of lines (∼30-50) in all environments and allocating the rest of the lines in an overlapping design across different environments. Results also showed that statistical models that include the GE interaction component capture a larger genetic variability than the main genomic effects model and increase the final prediction accuracy by leveraging the interaction via other genetically related cultivars tested in those environments.
Sometimes the size of the testing populations (for all allocation designs) decreases the genomic prediction accuracy, which is recovered when more lines are tested in all environments. Interestingly, the model that includes the GE interaction maintains the prediction accuracy in both extreme situations: having no repeated lines in all environments or having all lines repeated in all environments.

4.2
Optimizing sparse testing by a balanced incomplete block design-Single trait

Atanda et al. (2021) concluded that for implementing sparse testing genomic prediction (GP) for multi-environment yield trials, it is important to consider the proportion of lines that overlap across environments in the early yield testing stages of population improvement. The authors suggested including a substantial number of common lines across environments to reach an accurate estimation of the genetic correlation between environments and improve the genomic prediction accuracy of the GE interaction. Jarquin et al. (2020) reported improved predictive ability by increasing the number of lines that overlapped across three environments. These previous studies considered a random allocation of tested lines to environments, guaranteeing a proportion of overlapping and non-overlapping lines randomly allocated across environments.
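The random-allocation scheme described above can be sketched in a few lines of Python. This is a minimal illustration of the idea (a fraction of overlapping lines tested everywhere, the rest placed in a single random environment), not the allocation code used in the cited studies; the function and parameter names are hypothetical.

```python
import random

def random_sparse_allocation(lines, environments, p_overlap, seed=0):
    # A fraction `p_overlap` of the lines is tested in every environment
    # (the "overlapping" lines); each remaining line is planted in a
    # single randomly chosen environment.
    # Illustrative sketch only; not the allocation code of the cited studies.
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_overlap = round(p_overlap * len(shuffled))
    overlap, rest = shuffled[:n_overlap], shuffled[n_overlap:]
    allocation = {env: list(overlap) for env in environments}
    for line in rest:
        allocation[rng.choice(environments)].append(line)
    return allocation

# Example: 100 lines, 3 environments, 30% overlapping lines.
alloc = random_sparse_allocation([f"L{i:03d}" for i in range(100)],
                                 ["E1", "E2", "E3"], p_overlap=0.3)
```

With this sketch, exactly the overlapping lines appear in all environments, which is what allows the genetic correlation between environments to be estimated.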
Recently, Montesinos-López et al. (2020) proposed using balanced incomplete block designs (BIBDs) for the allocation of lines to environments. The authors compared the random allocation of lines to environments with the proposed single-trait BIBD allocation and found that allocation under the BIBD principle outperformed random allocation by between 1.4% and 26.5% across environments, traits, and data sets in terms of mean square error. This was the first evidence that using a BIBD for the allocation of lines to environments can help improve genomic predictive performance compared to random allocation.
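To make the contrast with random allocation concrete, the following is one simple way to build a BIBD-like allocation: each line is assigned to exactly k environments by cycling through all k-subsets of environments, so every environment receives the same number of lines and every pair of environments shares the same number of lines. This is an illustrative construction under those assumptions, not the algorithm of Montesinos-López et al. (2020).

```python
from itertools import combinations, cycle

def bibd_like_allocation(lines, environments, k):
    # Assign each line to exactly k environments by cycling through all
    # k-subsets of environments. When the number of lines is a multiple
    # of C(|environments|, k), every environment receives the same number
    # of lines and every pair of environments shares the same number of
    # lines, mimicking a balanced incomplete block design.
    # Illustrative construction; not the authors' algorithm.
    allocation = {env: [] for env in environments}
    for line, subset in zip(lines, cycle(combinations(environments, k))):
        for env in subset:
            allocation[env].append(line)
    return allocation

# Example: 30 lines, 3 environments, each line tested in 2 environments.
alloc = bibd_like_allocation([f"L{i:02d}" for i in range(30)],
                             ["E1", "E2", "E3"], k=2)
```

In this example every environment tests 20 lines and every pair of environments shares 10 lines, so the pairwise connectivity that random allocation only achieves on average holds exactly.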

4.3
Optimizing sparse testing by a balanced incomplete block design-Multi-trait

As mentioned above, sparse methods are helpful for increasing selection intensity (i.e., testing more genotypes at the same level of prediction accuracy) at population improvement stage one, or even at preliminary yield trials, where testcross hybrids are evaluated in three to five environments and several experimental hybrids are obtained by crossing double haploid lines (or lines developed using the pedigree scheme) to a tester from a complementary heterotic group (Atanda et al., 2021).
In this case, sparse testing methods increase the number of hybrids that can be evaluated in the same number of locations without increasing breeding costs. For this reason, given that in most breeding programs there are many populations with phenotypic and genomic records in different environments, the sparse testing method presented in this study (multi-trait BIBD) could be an attractive alternative for creating a robust historical data set that can borrow information across environments, thus resulting in improved prediction accuracy compared to the test-half-predict-half strategy (Atanda et al., 2021; Burgueño et al., 2012; Santantonio et al., 2020).
In the product profile stage (stage three, advanced yield trials), breeders can maintain selection intensity and increase the prediction accuracy of the GS methodology without a significant increase in breeding cost; for example, breeders can increase the number of replications of the hybrids at the same locations or increase the number of testing environments.
One possible explanation for why the gain in prediction accuracy using multi-trait BIBD sparse testing can be equal to or slightly lower than that of single-trait BIBD sparse testing is that the multi-trait BIBD sparse design is built under a univariate framework and is therefore not optimal for the multi-trait case. Further research is required to develop an optimization method for multi-trait BIBD sparse testing.

CONCLUSIONS
Due to the need to improve the practical implementation of the GS methodology, sparse testing methods have been proposed to reduce the cost of phenotyping in GS without a significant reduction in prediction accuracy. Therefore, we performed a benchmarking study to compare four existing methods. The comparisons were made using seven real data sets, where we found similar results. Furthermore, we found that the best prediction performance was obtained under the balanced incomplete block design (M4), followed by methods M3 and M2. However, our comparison was done under a multi-trait framework for which the sparse testing methods were not originally proposed. Nevertheless, we found that even in the multi-trait context these methods work, although they may not be as efficient as in a single-trait framework. Our findings provide empirical evidence that sparse testing in the context of multi-trait prediction is effective for improving the efficiency of the GS methodology. However, not all sparse testing methods are equally efficient: the best and second-best methods were M4 (where lines were allocated to locations under a balanced incomplete block design) and M3 (where some lines were allocated to locations randomly), and the worst was method M1 (where some lines are allocated to all locations).

ACKNOWLEDGMENTS
We thank all scientists, field workers, and lab assistants from the National Programs, CIMMYT, and ICRISAT who collected the data used in this study. We are thankful for the financial support provided by the

CONFLICT OF INTEREST
The authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT
The phenotypic and genomic data used for the seven data sets employed in this study can be downloaded from https://hdl.handle.net/11529/10548787.

APPENDIX

TABLE A1 Prediction performance in terms of normalized root mean square error (NRMSE) and average Pearson's correlation (APC) for the four sparse methods for allocating lines to locations. M1 denotes the allocation of lines in all locations, M2 denotes the allocation with some shared lines in locations, M3 denotes the random allocation of lines to locations, and M4 denotes the allocation of lines to locations using the balanced incomplete block design (BIBD) method. NRMSE_SE denotes the standard error of the NRMSE estimates and APC_SE denotes the standard error of the APC estimates. Ten partitions were implemented with 50% of the data for training and 50% for testing.
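For reference, the two metrics reported in Table A1 can be sketched as follows. Note that the normalization of the RMSE by the mean of the observed values is an assumption (papers differ on the normalizing constant), and the function names are illustrative, not from the study's code.

```python
import numpy as np

def nrmse(y_true, y_pred):
    # Root mean square error normalized by the mean of the observed
    # values; this normalization choice is an assumption, as different
    # papers normalize by the mean, range, or standard deviation.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)) / np.mean(y_true))

def apc(true_by_trait, pred_by_trait):
    # Average Pearson correlation across traits (multi-trait setting):
    # one observed/predicted correlation per trait, then averaged.
    return float(np.mean([np.corrcoef(t, p)[0, 1]
                          for t, p in zip(true_by_trait, pred_by_trait)]))
```

Lower NRMSE and higher APC indicate better predictive performance, which is the sense in which M4 outperformed the other allocation methods.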