Phenotyping and predicting wheat spike characteristics using image analysis and machine learning

Improvements in trait phenotyping are needed to increase the quantity and quality of data available for genetic improvement of crops. In this study, we used moderate-throughput image analysis and machine learning as a pipeline for phenotyping a key wheat spike characteristic: spikelet number per spike. A population of 594 soft red winter wheat inbred lines was evaluated in the field for 2 years, and images of wheat spikes were taken and used to train deep-learning algorithms to predict spikelet number. A total of 12,717 images were used to train, test, and validate a basic regression convolutional neural network (CNN), a visual geometry group regression model (VGG16), the ResNet152V2 model, and the EfficientNetV2L model. The EfficientNetV2L model was the most accurate, having the lowest mean absolute error, second lowest root mean square error, and highest coefficient of determination (mean absolute error [MAE] = 0.60, root mean square error [RMSE] = 0.79, and R2 = 0.90). The ResNet152V2 model was slightly less accurate with a slightly better fit (MAE = 0.61, RMSE = 0.78, and R2 = 0.87), followed by the basic CNN (MAE = 0.75, RMSE = 1.00, and R2 = 0.74) and finally by the VGG16 (MAE = 1.51, RMSE = 1.29, and R2 = 0.076). With an average error of just above one half of a spikelet, image analysis and machine learning counting methods could be used for multiple breeding applications, including direct selection for spikelet number, providing data to identify quantitative trait loci, or training whole-genome selection models.

Wheat production exceeds 700 million tons globally (FAO, 2022). Wheat has a wide cultivation range (Shewry, 2009) due to its capacity for adaptation (Blake et al., 2009), particularly within the flowering pathway. The two largest components of the flowering pathway are the photoperiod response (Ppd), which contributes 20%-25% of the variation in heading time, and the vernalization response (Vrn), which contributes 70%-75% of the variation in heading time (Stelmakh, 1998).

The Plant Phenome J. 2023;6:e20087.
There is evidence supporting the influence of heading time on spike architecture traits such as spikelets per spike (SPS) and spikelet differentiation (Chen et al., 2020; Miura & Worland, 1994). Spikelets are groups of three florets arranged in an alternating pattern on rachis nodes along the length of the spike (Koppolu & Schnurbusch, 2019). Spikelets are the reproductive part of the spike, and SPS is positively correlated with grain yield, making it an important trait for increasing total grain yield (Chen et al., 2020). Understanding the genetic components underlying morphological traits, such as SPS, is difficult due to the tedious nature of phenotypic data collection. A pipeline to accurately and efficiently determine SPS, as well as other spike characteristics, would improve our understanding of these traits and aid in their genetic improvement. Phenotypes, such as SPS, are the resulting characteristics of a genotype and genotype-by-environment interaction. Accurate phenotyping is vital to genetic improvement through breeding and for discovery of marker-trait associations, genome-wide association studies, marker-assisted selection, and genomic selection (Furbank & Tester, 2011; Minervini et al., 2015). While advancements in technology for generating high-density genotypic data have increased the efficiency of genomic analysis (Koboldt et al., 2013), phenotyping methods are still lagging (Houle et al., 2010; Jin et al., 2021; Song et al., 2021). Recently, there has been increased emphasis on the development of high-throughput phenotyping (HTP) pipelines to increase the rate at which we can measure traits such as grain yield. Multiple nondestructive HTP approaches are now being evaluated and deployed in plant science research, including sensors (Milella et al., 2019), occupied or unoccupied aerial vehicles (UAVs) (Krause et al., 2020; Rutkoski et al., 2016), and imaging (Fitzgibbon et al., 2013; Yang et al., 2014).
Imaging in a controlled setting (as opposed to in field) allows for flexibility in terms of equipment cost, timing, and resolution. In general, these images are highly reproducible and can be utilized across a variety of applications ranging from microscopic to macroscopic subjects (Furbank & Tester, 2011; Minervini et al., 2015), capturing a single maturity point in time, or creating time-lapse data throughout a life cycle (Fahlgren et al., 2015; Gehan & Kellogg, 2017). While collection of the image data is important, an efficient analysis pipeline is also critical to maximize output and not just input of data.
Core Ideas

• High-throughput phenotyping methods are needed to make larger quantities of phenotypic data accessible to researchers and plant breeders.
• Imaging is valuable for high-throughput phenotyping and can be used for collecting quantitative plant traits.
• Machine learning algorithms can be used as an efficient analysis pipeline to create high-throughput phenotyping methods for collecting trait data from wheat spikes.

Machine learning (ML) is a computerized model that can learn patterns from data and make decisions (Singh et al., 2016). While multiple types of models exist for ML, the main idea behind computer vision is the detection of similarities and dissimilarities between the provided images. ML has been used in conjunction with HTP in plant breeding programs due to its ability to automate the analysis process and increase efficiency (Gehan & Kellogg, 2017; Tsaftaris et al., 2016). ML models can be trained using supervised methods, with labels provided for images at the time of training, or unsupervised methods, which allow the model to decide the most meaningful features of an image (Singh et al., 2016). For training, the input dataset is partitioned into three parts: (1) a training set, where the model learns to identify relevant aspects of the data based on its task; (2) a validation set, used to test the accuracy of the model and fine-tune its hyperparameters to the desired level; and (3) a test set that the model did not utilize during training or validation, used for the final analysis (Singh et al., 2016). For supervised learning, training sets labeled with the associated information are input to the model so that it can learn the associations between each provided image and its label.
Deep learning is a type of ML that does not rely on "feature engineering" (LeCun et al., 2015). Instead, deep learning models have layers of nonlinearity that transform data into a representation that makes it easy to classify. This structure allows deep learning models to excel with larger datasets and more complex data analytics, and it is becoming more common in plant phenotyping and image analyses (LeCun et al., 2015; Ubbens & Stavness, 2017). Convolutional neural networks (CNNs) are a class of deep learning methods whose primary components are convolutional and pooling layers (LeCun et al., 2015). Convolutional layers scan an image with a collection of small matrices called "kernels" or "filters" to find locations in the image that match the patterns learned by each filter. Following a convolutional layer, a pooling layer is typically employed that computes summary statistics of the output of its preceding convolutional layer. This serves to downsample the data and build locational invariance into the learned model (Montesinos López et al., 2022). These processing layers are able to extract various features from an input image, assign importance to them, and differentiate them. The final layer of a CNN is the fully connected layer, which takes the output from the processing layers and reshapes it back into a single column for processing based on the type of model being used, such as classification or regression (Ubbens & Stavness, 2017). An example of a CNN architecture with each of these different layers is depicted in Figure 1.

FIGURE 1 Example structure of convolutional, pooling, flattening, and dense layers in a convolutional neural network (CNN). The above example is the architecture of a VGG16 model. In CNNs, convolutional layers and pooling layers, shown in orange and blue, respectively, are alternated to extract features from an input image. The outputs from the final feature-extraction layer are then passed on to the flattening layer, shown in purple. Finally, the outputs from the flattening layer are passed on to the dense layers, which compute the predicted spikelet count.
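The alternating convolution/pooling structure described above, followed by flattening and dense layers, can be sketched as a small Keras regression model. The filter counts, kernel sizes, and dense-layer widths below are illustrative assumptions, not the configuration of the study's basic CNN.

```python
# Minimal regression CNN sketch: alternating convolutional (feature
# extraction) and max-pooling (downsampling) layers, then a flattening
# layer and dense layers that output a single predicted count.
# Layer sizes are illustrative, not those used in the study.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_basic_cnn(input_shape=(600, 140, 3)):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),  # convolutional layer
        layers.MaxPooling2D((2, 2)),                   # pooling layer
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),                              # reshape to one column
        layers.Dense(64, activation="relu"),           # fully connected layer
        layers.Dense(1),                               # predicted spikelet count
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model
```

Because the final dense layer has a single linear unit, the network performs regression (a count estimate) rather than classification.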
Convolutional neural networks have the ability to find edges and motifs in images, which improves their suitability for certain plant phenotyping tasks (Tsaftaris et al., 2016) such as counting (Khaki et al., 2020; Ubbens & Stavness, 2017). Deep learning can be used for several image-based tasks including segmentation, regression, classification, and detection (LeCun et al., 2015). Regression can be preferred over other methods, such as classification, for counting tasks because it indicates how close the predicted values are to the true values of an image, providing an error rate that can help assess model accuracy. Some CNN models are known for their use in vision problems and are readily available. Well-known architectures include VGG16, ResNet models, and EfficientNet models. Many studies utilize these complex models in lieu of generating a new model from scratch (Kanchanadevi & Sandhia, 2023; Khaki et al., 2020; Nigam et al., 2023; Rao et al., 2022).
The objectives of this study were to: (1) evaluate soft red winter wheat (SRWW) genotypes using imaging to determine SPS and (2) develop deep learning models for a high-throughput analysis pipeline of wheat spike images. These results will aid in further genetic analysis and improvement of wheat spike characteristics and provide a pipeline of imaging and image analysis that could be adapted to other crop species.

Plant materials
The genetic material used in this study was the Historical Gulf Atlantic Wheat Nursery (HGAWN), a population consisting of 594 SRWW lines from public breeding institutions located in the southeastern United States. The HGAWN includes varieties from the University of Arkansas (n = 103), Louisiana State University (n = 109), University of Georgia (n = 105), North Carolina State University (n = 104), Texas A&M University (n = 60), Virginia Polytechnic Institute (n = 44), Clemson University (n = 19), and the United States Department of Agriculture Agricultural Research Service (USDA-ARS, n = 9).

Experimental design
The HGAWN was evaluated during the 2019 and 2020 growing seasons at the Milo J. Shult Agricultural Research & Extension Center in Fayetteville, AR. Plots were drill seeded at a rate of 250 seeds m−2 in a randomized complete block design with two replications. Each plot consisted of a single 1.20-m long row with 0.38 m horizontally between rows and 0.60 m vertically between each range of plots. During both seasons, pre-plant recommendations were followed for phosphorus and potassium, and 100 kg ha−1 of nitrogen in the form of urea was applied in a split application in the spring (February and March). Harmony Extra (0.28 kg ha−1) and Axial (0.6 kg ha−1) were applied for control of annual ryegrass (Lolium multiflorum L.) and other weed species.

Imaging
An imaging device was used to achieve consistent lighting and camera height across wheat spike image capture as described by Winn et al. (2021) (Figure 2A). The imaging device was designed for a Canon Powershot G1 X Mark II camera. A 4.5-cm circular hole was drilled into the bottom of a 0.46-L bucket using a 4.5-cm drill bit on an electric drill. The hole was drilled into the center of the bottom of the bucket with the mouth of the bucket facing downward. After drilling, the edges were filed down to protect the camera lens from scratches. A 2-cm hole was then drilled into the side of the bucket approximately 10 cm above the base of the mouth while the mouth of the bucket faced downward. The inside of the bucket was sprayed with RUST-OLEUM Camouflage Ultra Flat Black paint so that the interior was fully coated. Each coat of paint was cured for 5 min, and coats were applied until exterior light was no longer visible through the walls of the bucket. After the final coat of paint had cured, tape lights from a 1.8-m GoodEarth Self-Adhesive Tape Lighting Kit were cut to fit the circumference of the bucket. The protective paper strip was removed from the back of the lights, allowing them to be directly applied to the inside of the bucket. The lights were adhered to the interior of the bucket starting at the 2-cm hole, making sure they remained parallel to the mouth of the bucket. Any necessary paint touch-ups were performed using an artistic brush.
For an imaging surface, a 1.29-m by 0.81-m marker board with a smooth particleboard back was used. The board was painted red using Classic Red Valspar Ultra Interior Flat Paint until the color was opaque and texture was no longer apparent. The red background was chosen for visibility of the heads during imaging. Each image was taken on the marker board surface using the imaging cap and the Canon Powershot G1 X Mark II camera. Each image was 20.1 megapixels and had a focal length of 9 mm, F-stop of f/4.0, and a shutter speed of 1/60 s at an ISO of 200 with a white balance of 3000 K. This allowed for clarity in the images and definition of the spike for analysis.

Image analysis: ImageJ

For determination of SPS, six heads were collected from each plot 30-33 days after the plot was fully headed but before the onset of senescence. All images were taken the same day as sampling. The peduncle of each of the six heads was inserted into a piece of red Van Aken Plastalina Modeling Clay, ensuring visibility of the heads in the image. Each head was placed in the clay in a uniform manner so that the rachis faced upward toward the lens to ensure each individual spikelet could be observed (Figure 2B). SPS was manually measured using ImageJ version 1.52o (Schneider et al., 2012). Each image of six wheat heads was repeatedly cropped to create six separate photos of individual spikes. The cropped photos were padded to the same size of 140 by 600 pixels (Figure 2C) using the Python package Pillow version 8.4.0 (Clark, 2015). The spikelet number in each image was counted using the multi-point tool in ImageJ, recorded in a comma-separated values (CSV) file, and used as the associated label for the image. After preparation, 12,717 total images were used for model development.
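The padding step described above can be sketched with Pillow. The 140-by-600-pixel target size is from the study; the centering behavior and the red fill color (chosen here to match the study's red background) are assumptions.

```python
# Pad a cropped spike image onto a fixed 140 x 600 (width x height) canvas
# so all model inputs share identical dimensions. The red fill color is an
# assumption chosen to resemble the study's red imaging background.
from PIL import Image

TARGET_W, TARGET_H = 140, 600

def pad_to_target(img, fill=(200, 30, 30)):
    canvas = Image.new("RGB", (TARGET_W, TARGET_H), fill)
    # Paste the cropped spike centered on the canvas.
    x = (TARGET_W - img.width) // 2
    y = (TARGET_H - img.height) // 2
    canvas.paste(img, (x, y))
    return canvas
```

Running each cropped spike photo through `pad_to_target` yields uniformly sized arrays ready for the deep learning models.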

Deep learning
Images were divided randomly into three different groups, with 70% used as a training set, 20% used for the test set, and the remaining 10% as a validation set. The training, validation, and test sets remained the same for all models, and images of heads originating from the same plot were grouped so that they did not appear in more than one of the three datasets. The VGG16, ResNet152V2, and EfficientNetV2L models were made available through Keras (Chollet, 2015). Image arrays for the first model were normalized by dividing them by 255, the maximum value for the red, green, and blue channels of an image, to put vector values between 0 and 1, while images for the VGG16, ResNet152V2, and EfficientNetV2L models were preprocessed using their respective TensorFlow Keras preprocessing functions.
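A grouped split like the one described (all images from a plot stay in one set) can be sketched as follows; the plot-ID naming scheme and helper function are hypothetical.

```python
# Split image paths 70/20/10 into train/test/validation sets while keeping
# all images from the same plot together, so no plot spans two sets.
import random

def grouped_split(paths, plot_of, seed=42):
    """paths: list of image paths; plot_of: maps a path to its plot ID."""
    plots = sorted({plot_of(p) for p in paths})
    random.Random(seed).shuffle(plots)
    n = len(plots)
    train_plots = set(plots[: int(0.7 * n)])
    test_plots = set(plots[int(0.7 * n): int(0.9 * n)])
    train = [p for p in paths if plot_of(p) in train_plots]
    test = [p for p in paths if plot_of(p) in test_plots]
    val = [p for p in paths if plot_of(p) not in train_plots | test_plots]
    return train, test, val
```

Splitting by plot rather than by image avoids leakage, since six near-identical spikes from one plot would otherwise be shared between training and evaluation sets.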
The MSE was used as the loss function for all models and was calculated using the following equation:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2,$$

where $N$ is the number of samples, $y_i$ is the true value of the label, and $\hat{y}_i$ is the predicted label. Mean absolute error (MAE), which gives the average absolute error between the predicted value and true value for spikelet number in an image, was the second metric used to evaluate model performance and accuracy. The MAE was calculated using the following equation:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|.$$

MAE and root mean square error (RMSE), which is the square root of MSE, were used to evaluate the average deviation of estimated spikelet numbers from the true value across images.
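The three evaluation metrics defined above can be computed directly from paired true and predicted counts:

```python
# Compute MSE, MAE, and RMSE between true and predicted spikelet counts.
import numpy as np

def regression_metrics(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mse = float(np.mean(err ** 2))       # loss minimized during training
    mae = float(np.mean(np.abs(err)))    # average absolute error, in spikelets
    rmse = float(np.sqrt(mse))           # square root of MSE, in spikelets
    return mse, mae, rmse
```

Because RMSE is in the same units as the trait (spikelets), it can be read directly as an average deviation per image.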

RESULTS
The mean spikelets per image was 17.5 with a range of 11-32 spikelets for all images. The basic CNN model had an MAE of 0.75 and an RMSE of 1.00 (Table 1). The VGG16 model had an MAE of 1.51 and an RMSE of 1.29. The ResNet152V2 model had an MAE of 0.61 and an RMSE of 0.78. The EfficientNetV2L model had an MAE of 0.60 and an RMSE of 0.79. The MAE values indicated that the EfficientNetV2L had the lowest average error between the true spikelet number in the image and the spikelet number predicted by the model, deviating 0.60 spikelets from the true value on average. The VGG16 model had an MAE over twice that of both the EfficientNetV2L and ResNet152V2 models. The VGG16 also had a much larger RMSE than each of the other models, indicating it is not as well fit. Estimated spikelet values and actual spikelet values for each image were correlated for each model to compare the predicted spikelet number to the true value for each image in the test set. The regressions for the three best models, the basic CNN, ResNet152V2, and EfficientNetV2L, are depicted in Figure 3. The coefficient of determination (R2) was 0.74 for the basic CNN (Figure 3A), 0.0076 for the VGG16, 0.87 for the ResNet152V2 (Figure 3B), and 0.90 for the EfficientNetV2L model (Figure 3C). To evaluate the error of each model, the error for each prediction was calculated by subtracting the true values from the predicted value for each test set image. The error distributions of the three best models are depicted in Figure 4. The mean and standard deviation of error for the basic CNN were −0.19 and 1.75, respectively. The range in errors for the model was −7.80 to 5.30 spikelets (Figure 4A). The average and standard deviation of errors for the VGG16 model were −0.17 and 1.89, with a range of −6.80 to 5.00 spikelets. The average and standard deviation of errors for the ResNet152V2 model were −0.35 and 0.69, respectively, with a range of −5.11 to 3.65 spikelets (Figure 4B). Finally, the average and standard deviation of errors for the EfficientNetV2L were −0.06 and 0.79, respectively, with a range of −4.36 to 4.05 spikelets (Figure 4C).
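The per-model error summaries reported above (mean, standard deviation, and range of predicted minus true values) follow directly from the test-set predictions, as in this small helper:

```python
# Summarize prediction error (predicted minus true) for a test set:
# mean, standard deviation, and range, as reported for each model.
import numpy as np

def error_summary(y_true, y_pred):
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return {
        "mean": float(err.mean()),
        "sd": float(err.std(ddof=1)),  # sample standard deviation
        "min": float(err.min()),
        "max": float(err.max()),
    }
```

A mean near zero with a small standard deviation, as for the EfficientNetV2L model, indicates errors are roughly centered and tightly clustered rather than systematically biased.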

DISCUSSION
Developing phenotyping pipelines that allow for greater data collection is crucial to improving the efficiency of plant breeding programs. Imaging could serve as an affordable, accurate, and easily reproducible method for high-to-medium throughput phenotyping. The use of deep learning for image analysis is also promising, but further development and fine-tuning of hyperparameters is needed to generate optimal models. While it has been shown that the amount of data produced is more important than the accuracy of the data, acceptable thresholds for analyses in plant breeding programs must be determined prior to implementation (Lane & Murray, 2021).
The objective of this study was to demonstrate the capability of imaging for high-throughput counting of a critical wheat spike characteristic, SPS, using deep learning models. Counting has a multitude of potential uses in agriculture and is a common task for ML models designed in other studies, such as counting leaves (Miao et al., 2021; Ubbens & Stavness, 2017) and corn stalks in fields (Khaki et al., 2020). Generating a model for SPS not only provides a resource for phenotyping wheat spike characteristics but can also contribute to the development of models designed for similar tasks in agriculture. Both regression (Miao et al., 2021; Ubbens & Stavness, 2017) and detection-based methods (Khaki et al., 2020) have been implemented using sets of annotated images. Miao et al. (2021) had an agreement rate, or proportion of exact predictions, ranging from 0.33 to 0.45 and MSEs ranging from 0.92 to 1.72 for leaf counting. Ubbens and Stavness (2017) used regression to count Arabidopsis and tobacco leaves and had MAEs of 0.41 and 0.61. These results are comparable to the accuracy of the basic CNN, ResNet152V2, and EfficientNetV2L models used for counting spikelets in this study, which had MAEs of 0.75, 0.61, and 0.60, respectively.
One of the pre-trained models utilized in this study, VGG16, performed much more poorly than each of the other models. This may indicate that some models with more complex architecture, such as VGG16, may not be best suited for a task such as counting spikelets, or that the model was not optimally utilized. Counter to this, one of the most complex pre-trained models, EfficientNetV2L, had the lowest MAE of all of the models, closely followed by the pre-trained ResNet152V2 and the basic CNN model. Even though the ResNet152V2 model had a slightly better fit than the EfficientNetV2L model, the EfficientNetV2L model still had the highest R2 and lowest MAE of all four models. This could be due to slight overfitting of the ResNet152V2 model, or due to the fact that the EfficientNetV2L had a smaller range of errors. The difference in fit between these two models is negligible since the RMSE values are almost equal (0.78 and 0.79). Since the basic CNN architecture is so simple, it also had the fastest runtime and required less computing power. The increased efficiency of the basic CNN model due to decreased runtime and memory requirements makes the basic CNN the more "user-friendly" option.
For all four models, the largest cause for overestimating spikelets appeared to be spikes that were slightly turned in the image so that the entire side view of the rachis was not perfectly aligned and visible (Figure 5A,B). For all three models, the largest errors for underestimating the number of spikelets were images with awns from neighboring spikes still present in the image (Figure 5C) or spikes containing supernumerary spikelets (Figure 5D). For the VGG16 model, spikes that were slightly turned in the image resulted in underestimations. Occasionally, larger error margins were identified for images that contained larger amounts of padding and smaller spikes (Figure 5E) compared to the other two models. The ResNet152V2 model did not show any patterns for large margins of error associated with abnormal spikes in images, and the EfficientNetV2L had the most trouble with excessive awns and slightly turned spikes.
While the errors found in this study were comparable to results from other studies, there is still room to improve accuracy of the models. Manual inspection found that large errors consistently occurred in images with patterns that may have reduced the accuracy of the model. While some images appeared to be outliers, others that produced high error were very similar to other images in the set. For many images with high error, no issues could be identified with the image or the spike present in the image.
For this method of imaging and phenotyping to be optimally used, the images would need to be gathered in a more efficient way, such as in the field. This would impose other challenges or potential sources of error. Taking images in the field would not allow for the same consistent lighting across spikes and images that is possible in a controlled environment. Time of day, the angle of lighting, and cloud cover could all vary either day to day, or multiple times within one data collection session. There would also be many spikes close together in the field, with each being turned in varying directions or occluding others. Analysis pipelines used for HTP of images taken in the field would need to be able to handle variability between images with regard to image quality and spike positioning to accurately phenotype spikelets. This would require much larger training sets in which these sources of variability can be observed.
There are limited datasets of annotated images for specific ML tasks in agriculture, making the use of pre-trained models or transfer learning more accessible for counting algorithms (Wang et al., 2017). Use of these methods could help further improve the accuracy of the models, allowing for the efficient and reliable use of ML models for HTP, both in and out of the field.

CONCLUSIONS
Many methods are implemented by plant breeders for phenotyping crop traits, with little currently existing for wheat spike characteristics. This study used imaging techniques to analyze wheat spike characteristics and developed deep learning models for high-throughput analysis of wheat spike images.
Images and techniques used in this study could be utilized for training models to analyze phenotypic information collected from rovers in fields or from UAVs, further increasing the efficiency of the phenotyping pipeline for spike morphology traits. Improving the efficiency of analyzing these traits will improve breeders' ability to discover and understand genetic components behind wheat spike architecture and their relationship to yield component traits. This will allow for the development of more environmentally efficient and higher yielding cultivars to meet future demands.

CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.

DATA AVAILABILITY STATEMENT
Code utilized for this study can be found in a public folder on the author's GitHub: https://github.com/mikhammers/Predicting-SPS. Image data are available by request to the author due to the file sizes being too large for a GitHub repository.

FIGURE 2 A phenotyping device allowed for a controlled environment and consistency for photos of wheat spikes. (A) A camera rests on top of the device, which was used on the same red background for each photo. The black interior of the cover and attached light strip create uniform lighting and minimal reflection from camera flash while taking pictures. (B) Controlled-environment images of six wheat spikes from each plot. Each spike was inserted into clay by the peduncle, rachis side up. Spikelets were fully visible for each spike to allow for phenotyping. (C) Example input image for the specified deep learning algorithms. Each image of six spikes was cropped down to images of individual spikes before being reduced in size and padded to ensure identical sizing.
Four different regression CNNs were trained, with the first being trained from scratch and having five alternating sets of convolutional and max-pooling layers. The second model was based on a pre-trained visual geometry group (VGG)16 network (Simonyan & Zisserman, 2014). The VGG16 model was selected since it has been utilized in several image-counting studies (Khaki et al., 2020; Ubbens & Stavness, 2017). Next was the pre-trained ResNet152V2 model (He et al., 2016), which is among the most complex and recent model architectures available in its model series. The ResNet152V2 model was selected due to its prevalence in image classification literature in a variety of different fields, from disease ratings in agriculture (Kanchanadevi & Sandhia, 2023; Nigam et al., 2023) to medical research (Sulaiman et al., 2023). The final model was the pre-trained EfficientNetV2L model (Tan & Le, 2021), which was selected due to its recent use in plant disease detection (Shovon et al., 2023; Ulutaş & Aslantaş, 2023). Mean squared error (MSE) was used as the loss function for each model, which evaluates differences between values predicted by the model and their true value. All of the models were implemented using the TensorFlow Keras package (Abadi et al., 2016) and trained on a local computer server utilizing no graphic processing units and 500 gigabytes of RAM.
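A pre-trained backbone with a regression head, as described above, can be sketched with the Keras applications API. The head sizes and training settings here are illustrative assumptions, not the study's exact configuration; the early-stopping settings (patience of five epochs, maximum of 100) follow the text.

```python
# Sketch of a pre-trained-backbone regression model using the Keras
# applications API. Head widths and optimizer settings are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks
from tensorflow.keras.applications import EfficientNetV2L
# Pixel arrays should be passed through the backbone's matching
# preprocessing function before training, as described in the text.
from tensorflow.keras.applications.efficientnet_v2 import preprocess_input

def build_transfer_model(input_shape=(600, 140, 3), weights="imagenet"):
    backbone = EfficientNetV2L(include_top=False, weights=weights,
                               input_shape=input_shape, pooling="avg")
    model = models.Sequential([
        backbone,
        layers.Dense(64, activation="relu"),
        layers.Dense(1),  # predicted spikelet count
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

# Early stopping as described: halt if validation loss has not improved
# for five consecutive epochs, out of a maximum of 100 epochs.
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True)
# model.fit(train_ds, validation_data=val_ds, epochs=100,
#           callbacks=[early_stop])
```

Swapping `EfficientNetV2L` for `VGG16` or `ResNet152V2` (with their respective preprocessing functions) yields the other pre-trained variants.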
TABLE 1 The model mean absolute error (MAE) and root mean square error (RMSE) for each regression model.

Each model was set to run for a maximum of 100 epochs but utilized early stopping to end training before 100 epochs if the validation loss had not improved for five consecutive epochs.

FIGURE 3 Regression of predicted spikelet number versus true spikelet value for each image for the basic (A) convolutional neural network (CNN), (B) ResNet152V2, and (C) EfficientNetV2L regression models. The x axis represents the true values of spikelet number present in each image and the y axis is the spikelet number value predicted by each respective model and image. Displayed within each graph is the equation of the linear regression and the coefficient of determination (R2).

FIGURE 4 Histograms of error for each predicted spikelet number for the basic convolutional neural network (CNN), ResNet152V2, and EfficientNetV2L models. The x axis shows the difference between the predicted value for each image in the test set and the true spikelet number value in the image. The y axis shows the number of images represented for a given error range.

FIGURE 5 Examples of images in the test set that resulted in underestimation or overestimation of spikelet number. Common observations of images that produced larger spikelet estimation errors include images where the rachis was not fully visible due to turned spikelets (A) or the spike being slightly turned at the insertion point of the peduncle (B), crowding of a spike with awns of other neighboring spikes (C), supernumerary spikelets (D), and small images that contained large amounts of padding (E).
AUTHOR CONTRIBUTIONS

Mik Hammers: Data curation; formal analysis; investigation; methodology; project administration; software; visualization; writing-original draft; writing-review and editing. Zachary Winn: Conceptualization; data curation; methodology; project administration; writing-review and editing. Asa Ben-Hur: Methodology; supervision; validation; writing-review and editing. Dylan Larkin: Data curation; writing-review and editing. Jamison Murry: Data curation; writing-review and editing. Richard Esten Mason: Conceptualization; funding acquisition; project administration; resources; supervision; writing-review and editing.

ACKNOWLEDGMENTS

This work was supported by the USDA-ARS, the Agriculture and Food Research Initiative (AFRI) of the USDA National Institute of Food and Agriculture (NIFA) Grants 2022-68013-36439 and 2017-67007-25939 (Wheat-CAP), and in collaboration with SunGrains.