Toward improved image-based root phenotyping: Handling temporal and cross-site domain shifts in crop root segmentation models

Crop root segmentation models developed through deep learning have increased the throughput of in situ crop phenotyping studies. However, models trained to identify roots in one image dataset may not accurately identify roots in another dataset, especially when the new dataset contains known differences, called domain shifts. The objective of this study was to quantify how model performance changes when models are used to segment image datasets that contain domain shifts and to evaluate approaches to reduce error associated with domain shifts. We collected maize root images at two growth stages (V7 and R2) in a field experiment and manually segmented images to measure total root length (TRL). We developed five segmentation models and evaluated each model's ability to handle a temporal (growth-stage) domain shift. For the V7 growth stage, a growth-stage-specific model trained only on images captured at the V7 growth stage was best suited for measuring TRL. At the R2 growth stage, combining images from both growth stages into a single dataset to train a model resulted in the most accurate TRL measurements.

, totaling 30 h per meter of minirhizotron. Once annotated, many of these programs have built-in trait extraction features that translate the annotations into meaningful traits such as total root length (TRL), number of roots, and average root diameter.
Recently, convolutional neural networks (CNNs) and other deep learning techniques have been applied to automatically segment roots in minirhizotron images, drastically reducing the time required for manual annotation (Peters et al., 2023; Smith et al., 2020; Xu et al., 2020; Yu et al., 2020). To make deep learning techniques easily accessible to individuals without computer science expertise, collaborations between plant, environmental, and computer scientists have resulted in open-access applications specifically for the purpose of segmenting roots in images (Peters et al., 2023; Smith et al., 2022; Wang et al., 2019). Although deep learning has been found to be effective for a variety of tasks (Coleman & Salter, 2023; Karamov et al., 2023; Sell et al., 2022) and has accelerated in situ root research (Delory et al., 2022; Rambla, 2022; York et al., 2022), substantial challenges remain when training and using segmentation models to process several datasets.
One key challenge in using model-based segmentation tools is that model performance suffers when models encounter images that are systematically different from the original training data (Feng et al., 2020). Such systematic shifts are referred to as domain shifts (Amodei et al., 2016; Sun et al., 2016). Domain shift is a common problem when using machine learning in practice, as real-world data often varies in unpredictable ways. Domain shifts often cause a reduction in generalization performance, that is, the performance of a model on data not used in the training process. Even subtle changes in datasets can lead to substantial error (Alcorn et al., 2019; Recht et al., 2018).
Domain shifts in plant datasets may arise due to root system architecture changes between vegetative and reproductive growth stages, which can drastically alter the ratio of soil (image background) to root (image foreground) in minirhizotron images. As roots grow, primary roots often branch to develop lateral roots that contribute to greater variation in root diameter and length. The soil environment also changes with time, season, and depth. Soil color often changes noticeably from the surface to deeper horizons, and seasonal rain or drought periods lead to distinct differences in soil color that can impact the contrast between soil and roots. Further consequences of wetting-drying cycles are the deposition of water films on external minirhizotron surfaces, translocation of growing roots, and the rearrangement of soil particles in contact with the minirhizotron surface, which can cause smudges and streaks to appear in images. This dynamic environment introduces challenges not only for identifying roots in images but also for quantifying root growth and decomposition over time (Gillert et al., 2023).

Core Ideas
• Minimizing model error is key to accurately depicting phenotypes and detecting treatment effects.
• Domain shifts can exacerbate model error and reduce model accuracy.
• Growth-stage-specific models minimize error at growth stages with low total root length (TRL).
• Multi-growth-stage models minimize error at growth stages with high TRL.
• Cross-site domain shifts inflate error even when experimental conditions are highly similar.
Researchers using deep learning have several options when segmenting image datasets that include a domain shift. They could develop specialized models for each dataset, which requires additional time but may increase precision. They could also train models to handle datasets with a domain shift. For example, an image dataset could be extended by including new images, and the model could continue training with both the original and additional images. Alternatively, researchers could train a single model on all datasets combined. This approach is possible if all images are available simultaneously or if model training begins at the end of a study. However, waiting until the end of image collection to begin model training may be undesirable in multi-year plant studies, where preliminary data analysis can be especially helpful to ensure that no pertinent data are overlooked. Lastly, fine-tuning, also known as transfer learning, could be used. Transfer learning accelerates model training by applying weights from an existing trained model to a new dataset (Tajbakhsh et al., 2017). The best approach to creating accurate segmentation models may depend on the degree and type of domain shift present between the datasets and on the requirements and timescale of the specific study.
In this study, we investigated the extent to which root segmentation model performance degrades due to commonly encountered domain shifts in real-world crop root images. We also investigated ways to address these domain shifts, leading to practical advice for those seeking to train and use deep learning segmentation models to process multiple image datasets that include variable conditions and features. Using a maize (Zea mays L.) field experiment, we utilized the RootPainter + RhizoVision Explorer pipeline (Bauer et al., 2022) to address three primary questions. First, what is the capacity of growth-stage-specific models to handle a temporal domain shift; in other words, can models developed to segment maize roots present at one growth stage accurately segment roots present at a different growth stage? Second, how accurately do models trained on root images from more than one growth stage segment root images captured at different growth stages? Last, can models developed to segment maize roots at one site be directly applied to images of maize roots collected in an entirely separate study and generate accurate segmentations?

Images collected in the field
To obtain a set of minirhizotron images that included phenotypic diversity and temporal domain shifts, we conducted a field study consisting of four maize (Z. mays L.) hybrids, each planted at two densities (30,000 and 69,000 seeds ha⁻¹), using a randomized complete block design with four replicates in 2021. The field site used for minirhizotron image collection was located at the University of Kentucky North Farm (38°7′ N, 84°29′ W). The soil was classified as Maury silt loam (fine, mixed, active, mesic Typic Paleudalfs). The experimental site received 556 mm of precipitation and had an average temperature of 21.6°C during the 2021 growing season, comparable to the site's 1991-2020 averages of 533 mm and 21.8°C for May through September (NASA, 2021; Sparks, 2018). Within each plot, two minirhizotrons 1 m in length with 7 cm outer diameter were installed 10 cm away from and parallel to the crop row at a 45° angle, for a total of 64 minirhizotrons. Installation occurred as soon as possible following emergence, once plants showed evidence of consistent crop stand. A bucket auger was used to remove soil to a length between 95 and 98 cm before inserting the 1-m-long minirhizotrons, which reached a vertical depth of approximately 56 cm. Images were collected using a CI-600 In-Situ Root Imager (CID Bio-Science Inc., Camas, WA) at a resolution of 300 dpi at the V7 growth stage, which occurs during vegetative growth when the crop has seven collared leaves, and at the R2 growth stage, which takes place after pollination when the kernels begin to expand (Lee, 2011). Although research in the medical sciences suggests that image resolution may impact segmentation accuracy depending on the task at hand (Sabottke & Spieler, 2020), research in the plant sciences has not established an optimum resolution for segmenting root images. Therefore, 300 dpi was selected to balance image quality and the time required to collect images. Imaging each minirhizotron generated one image from each of three depths: surface soil, shallow subsoil, and deep subsoil, which approximately correspond to vertical depth increments of 5-23, 23-39, and 39-56 cm. Captured images were 2550 × 2273 pixels, each depicting approximately 733 cm² of soil, or approximately 2198 cm² of soil per minirhizotron. Each field imaging session (i.e., at the V7 and R2 growth stages) resulted in 192 images.

Images collected in a greenhouse study

The soil used in the greenhouse mesocosms was the same soil series found in the field experiment. Soils were packed to final bulk densities of ∼1.2 and ∼1.4 g cm⁻³, respectively. Minirhizotrons were inserted halfway through the horizon-filling process to reduce void formation and were sealed into place with GS121 gutter sealant. Minirhizotrons were capped (70 mm caps; MOCAP, Park Hills, MO) and were painted with a coat of white and then black paint to prevent heating and light penetration in the root zone.
A single maize plant (B73 × MO17) was grown in each mesocosm. All nutrient management and liming requirements were completed before maize planting according to University of Kentucky recommendations (Ritchey & McGrath, 2020).
Watering was adjusted to deliver approximately 22.7 L of deionized water to each mesocosm once each week via stake drippers (Mister Landscaper, Dundee, FL), replicating the annual rainfall for Lexington, KY. Root images were collected every 2 weeks in the same manner as in the field study, using the same scanner-based CI-600 In-Situ Root Imager (CID Bio-Sciences Inc., Camas, WA). For the purposes of this study, we used images collected in the 2018 experimental year from mesocosms that contained only maize roots, captured when the plants had nine collared leaves at the V9 growth stage and again during reproductive growth at the R2 growth stage.

Manual annotation of minirhizotron images
To create a ground-truth dataset that could be compared to traits extracted from images using crop root segmentation models, a subset of field images was manually annotated using the RootSnap! application (CID Bio-Sciences Inc.), since manual annotation programs have commonly been the standard for quantifying root traits in images. At the V7 growth stage, the manually annotated subset consisted of all 96 images from the plots under the high planting density treatment, since root images under this treatment provided the greatest variety of root characteristics. At the R2 growth stage, 15 images were selected at random for manual annotation as a ground-truth dataset using the "random" R package (Eddelbuettel, 2017). Prior to annotation, features built into RootSnap! were used to adjust both image brightness and contrast to 120% to better identify roots. The images collected in the greenhouse mesocosm study were manually annotated in the same manner as those collected in the field. Further details about manual annotation of the images collected in the mesocosm study are described in McGrail (2021).

Model development for image segmentation
RootPainter (version 0.2.25) was used to create models to segment crop root images and was deployed on a Windows 11 operating system equipped with an NVIDIA GeForce RTX 3090 GPU and an AMD Ryzen 9 5900X 12-core processor. RootPainter's built-in "Create Training Dataset" function was used to tile the V7 and R2 field image datasets. Target tile size was set to 900 × 900 pixels, and the maximum number of tiles created per image was set to 10. This resulted in 1728 tiles of 850 × 757 pixels for each of the V7 and R2 image datasets collected from the field experiment. Preliminary experimentation suggested that models trained on image tiles could be developed more quickly than models trained on whole images without sacrificing model accuracy (Figures S1 and S2).
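As a sanity check, the reported tile dimensions are consistent with a simple 3 × 3 grid per image. A worked sketch in R follows; the regular-grid assumption is ours and is not a documented description of RootPainter's internal tiling logic:

```r
# Worked check of the reported tiling numbers, assuming tiles are cut on
# a regular grid sized so that no tile exceeds the 900-pixel target.
img_w <- 2550  # image width (pixels)
img_h <- 2273  # image height (pixels)
target <- 900  # target tile size (pixels)

n_cols <- ceiling(img_w / target)  # 3 columns
n_rows <- ceiling(img_h / target)  # 3 rows

c(tile_w = img_w %/% n_cols,           # 850 pixels
  tile_h = img_h %/% n_rows,           # 757 pixels
  total_tiles = n_cols * n_rows * 192) # 9 tiles/image x 192 images = 1728
```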
We used a corrective annotation protocol (Smith et al., 2022), which first entails annotating soil and roots in two images that contain clear examples of both classes. Then, the user prompts RootPainter to start training and annotates four more images that contain clear examples of roots and soil. Finally, the user begins correctively annotating the predicted segmentations by annotating where the model in training has incorrectly classified soil as root and vice versa. As opposed to the 2 h of corrective annotation recommended by Smith et al. (2022), we elected to correctively annotate tiles for 4 h, since the segmentations generated by models following the 2-h corrective annotation period still showed room for improvement, particularly in the images collected in the field at the V7 growth stage (Figures S3 and S4). Following 4 h of corrective annotation, the models continued training until 60/60 epochs without progress was achieved, after which point model metrics were obtained from RootPainter's "Extract metrics" feature. In particular, we were interested in how the difference between the predicted and corrected segmentations (i.e., Dice scores) changed as the number of correctively annotated images increased. Dice scores can be computed because RootPainter uses interactive machine learning, with the user annotating by correcting errors in segmentations of as-yet unseen images. When the Dice scores plateau in response to the number of correctively annotated images, the additional information gained by newly developed models is typically minimal. We generally observed this trend in our Dice plots, which suggested that additional corrective annotation to further train our models would likely contribute little more to overall model performance (Figures S5-S9).
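For reference, the Dice score compares a predicted mask P with its corrected mask C as 2|P ∩ C| / (|P| + |C|). A minimal R sketch follows, assuming both masks are logical matrices; this is the standard definition, not RootPainter's internal implementation:

```r
# Dice score between two binary masks (TRUE = root pixel).
# A value of 1 indicates perfect agreement; 0 indicates no overlap.
dice_score <- function(predicted, corrected) {
  stopifnot(all(dim(predicted) == dim(corrected)))
  overlap <- sum(predicted & corrected)
  2 * overlap / (sum(predicted) + sum(corrected))
}

# Toy example with 2 x 2 masks:
pred <- matrix(c(TRUE, TRUE, FALSE, FALSE), nrow = 2)
corr <- matrix(c(TRUE, FALSE, FALSE, FALSE), nrow = 2)
dice_score(pred, corr)  # 2 * 1 / (2 + 1) = 0.667
```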
For the field image datasets, five models were developed using these tiled image datasets. Two growth-stage-specific models were created and are referred to as the "V7 field model" and the "R2 field model." Three multi-growth-stage models were created using distinct techniques that incorporated images from both growth stages into a single model. The growth-stage-specific models (i.e., the V7 field model and the R2 field model) were developed by exclusively annotating the V7 and R2 image datasets, respectively. The third model was developed using a transfer learning approach and is referred to as the "fine-tuned V7 + R2 field model." As opposed to using random weights to initialize a model, weights from the V7 field model were used to begin training the fine-tuned V7 + R2 field model by specifying the V7 field model as the initial model. From this point, the R2 images collected in the field were annotated to train the fine-tuned V7 + R2 model. The fourth model was developed by combining all image tiles from the V7 and R2 growth stages into a single image dataset used for training and is referred to as the "combined V7 + R2 field model." The last model was created by appending image tiles collected at the R2 growth stage to the original V7 field model's image dataset using RootPainter's "Extend dataset" function. This allowed the existing V7 field model to be further refined and is referred to as the "extended V7 + R2 field model." The image tiles used to train the R2 field model were pasted into the image dataset used to train the V7 field model, the V7 field model project was opened and extended using "Extend dataset" from RootPainter's "Extras" menu, and image tile annotation proceeded for 4 h only on image tiles captured at the R2 growth stage. This dataset extension approach means that the training process includes the old images and annotations from the previous dataset in addition to the newly added images and corrective annotations from the domain-shifted dataset. Figure S10 demonstrates the workflow used to construct these models.
Growth-stage-specific root segmentation models were also developed for the image datasets collected in the greenhouse experiment. Two growth-stage-specific models, called the "V9 greenhouse model" and the "R2 greenhouse model," were developed by exclusively annotating images of maize roots collected in the greenhouse at the V9 and R2 growth stages, respectively. These models were developed in the same manner as the field models by first tiling the images, correctively annotating for 4 h, and allowing model training to proceed until 60/60 epochs without progress. Root images collected in the field and the greenhouse that were used to train models are openly published (https://doi.org/10.5281/zenodo.8224956).

Image segmentation and trait extraction
The trained models were used to generate segmentations for the field and greenhouse images using RootPainter's "Segment folder" function from the "Network" menu. These segmentations were then converted to binary (i.e., black and white) segmentations using RootPainter's built-in "Convert segmentations for RhizoVision Explorer" function from the "Extras" menu. The binary segmentations were then analyzed by RhizoVision Explorer in batch analysis mode to extract data from the segmentations. In RhizoVision Explorer v2.0.3 (Seethepalli & York, 2020), the "Image Pre-Processing" settings were set as follows: "whole root mode" was enabled since roots in minirhizotron images are largely connected, the pixel-to-physical-unit conversion was set equal to 300 dpi since this was the resolution of the minirhizotron images, and the "filter non-root objects" option was enabled and set to 1 mm to conservatively remove non-root features identified by segmentation models. RhizoVision Explorer's "Feature Extraction" settings were set such that the "root pruning" feature was enabled with its threshold set to 10 pixels, since the RhizoVision Explorer software manual recommends values between 1 and 20 to reduce identification of false root tips. For easy reference, a metadata file that can be used to apply the same settings is openly published (https://doi.org/10.5281/zenodo.8224956). Lastly, binary segmentations of the V7 and R2 image datasets, as produced by the V7 field model and combined V7 + R2 field model, respectively, were analyzed using the "By Color Select" tool in GIMP version 2.10.34 (The GIMP Development Team, 2023) to quantify the average number of root pixels present in each image dataset in order to compare the class balance (i.e., the number of root vs. non-root pixels) present at each growth stage.

Model validation and data analysis
Linear regression and model accuracy metrics were used to compare the models. TRL was the only root trait evaluated since RootSnap! and RhizoVision Explorer quantify TRL in a very similar manner, while other root traits quantified by the two applications can differ widely. We used major axis regressions (also called "type 2" or "ordinary least products" regressions) to relate TRL predicted by models to ground-truth data from the images that were manually annotated using RootSnap!. This type of regression was used because the data regressed on the x- and y-axes were both measurements subject to error, as opposed to the commonly used ordinary least squares regression, for which y-axis measurements are subject to error but x-axis variables are fixed treatments (Bauer et al., 2022; Delory et al., 2017). We tested whether the regression line was significantly different from the line of equal outcomes (i.e., the 1:1 line). If the 95% confidence interval around the intercept did not include zero, we considered the model to have a fixed bias, and if the 95% confidence interval around the slope did not include 1, we considered the model to have a proportional bias (Ludbrook, 1997, 2002, 2010). To compare how well each model segmented roots in images, root mean square error (RMSE; Equation 1), range-normalized root mean square error (NRMSE; Equation 2), and mean bias error (hereafter referred to as "bias"; Equation 3) were calculated. For reference, bias is the average difference between the predicted and observed values. Negative bias indicates that the TRL predicted using the RootPainter + RhizoVision Explorer pipeline is underestimated relative to the TRL obtained manually using the RootSnap! software. Likewise, positive bias indicates that TRL predicted by the RootPainter + RhizoVision Explorer pipeline is overestimated relative to TRL measured with the RootSnap! software.
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \tag{1}$$

where $y_i$ is TRL measured using RootSnap!, $\hat{y}_i$ is the TRL predicted by RootPainter models, and $n$ is the sample size.
$$\text{NRMSE} = \frac{\text{RMSE}}{y_{\max} - y_{\min}} \tag{2}$$

where $y_{\max}$ and $y_{\min}$ are the maximum and minimum TRL values measured using RootSnap!, and RMSE is as defined in Equation 1, as described by Bauer et al. (2022).
$$\text{Bias} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right) \tag{3}$$

where $y_i$ is TRL measured using RootSnap!, $\hat{y}_i$ is the TRL predicted by RootPainter models, and $n$ is the sample size.

All data analyses were conducted in R version 4.3.2 (R Core Team, 2023) with RStudio version 6.1.524 (Posit Team, 2023). Linear regression was conducted using the lmodel2() function from the lmodel2 package (Legendre, 2018). Model performance parameters were calculated using the rmse() and bias() functions from the Metrics package (Hamner & Frasco, 2018), and NRMSE was calculated by dividing the rmse() output by the range of the measured values using the max() and min() functions from base R (R Core Team, 2023). Data visualization was completed utilizing the R packages contained in the tidyverse (Wickham et al., 2019) and ggpubr (Kassambara, 2020) packages.
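To make Equations 1-3 and the regression test concrete, here is a minimal R sketch; the TRL values are hypothetical, and the package functions are those named above:

```r
library(lmodel2)  # major axis ("model II") regression
library(Metrics)  # rmse()

# Hypothetical example data: manually measured vs. model-predicted TRL (mm)
measured  <- c(520, 810, 1150, 2300, 4100)
predicted <- c(610, 790, 1240, 2150, 4480)

# Equations 1-3: RMSE, range-normalized RMSE, and mean bias error
rmse_val  <- rmse(actual = measured, predicted = predicted)
nrmse_val <- rmse_val / (max(measured) - min(measured))
bias_val  <- mean(predicted - measured)  # positive = TRL overestimated

# Major axis regression; the 95% CIs on the intercept and slope are
# compared to 0 and 1 to test for fixed and proportional bias.
fit <- lmodel2(measured ~ predicted)
fit$confidence.intervals
```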

Rolling averages presented in the Dice plots were calculated using the geom_ma() function from the tidyquant package (Dancho & Vaughan, 2023).

TABLE 1  Model performance metrics for field models applied to images collected in-field at the V7 growth stage, as displayed in Figure 1.

Handling site-specific temporal domain shifts: Accuracy of growth-stage-specific and multi-growth-stage field models
Creating ground-truth datasets using RootSnap! required approximately 170 h to manually segment the 96 images collected at the V7 growth stage under the high planting density treatment and 158 h for the 15 images randomly selected from the R2 growth-stage image dataset. For the complete image datasets used in model training, we estimated that, on average, 0.9% of the pixels in images contained roots at the V7 growth stage, whereas 10.2% of pixels contained roots at the R2 growth stage.
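These foreground fractions can be reproduced for any binary segmentation with a few lines of R. A minimal sketch follows, assuming the png package and a black-and-white mask with white roots; the authors used GIMP's "By Color Select" tool for this step, and the file path below is illustrative only:

```r
library(png)

# Fraction of root (white) pixels in a binary segmentation image.
root_pixel_fraction <- function(path) {
  img <- readPNG(path)           # pixel values scaled to [0, 1]
  if (length(dim(img)) == 3) {   # collapse RGB(A) to a single channel
    img <- img[, , 1]
  }
  mean(img > 0.5)                # proportion of foreground pixels
}

root_pixel_fraction("segmentations/V7_plot01_depth1.png")  # hypothetical path
```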
As indicated by model performance metrics (Table 1), when the field models were applied to segment images collected in the field at the V7 growth stage, the V7 field model demonstrated a lower level of error than the R2 field model and the fine-tuned, combined, and extended V7 + R2 field models. On average, each model overestimated TRL compared to the ground truth, although both growth-stage-specific models exhibited much better control of bias than the multi-growth-stage models (Table 1 and Figure S11). When TRL was generally low (e.g., <1000 mm), growth-stage-specific model estimates slightly overestimated TRL but were relatively accurate. At higher levels of TRL, however, growth-stage-specific models typically underestimated TRL (Figure 1). Compared to the growth-stage-specific models, the multi-growth-stage models were less accurate when TRL was low, which likely contributed to their greater levels of bias (Figure 1 and Table 1).
When the field models were applied to segment root images collected in the field at the R2 growth stage, model performance metrics (Table 2) showed that the combined V7 + R2 model had the lowest level of error, only slightly better than the extended V7 + R2 model, whereas the fine-tuned V7 + R2 model and both growth-stage-specific models had notably greater levels of model error. The growth-stage-specific R2 model, combined V7 + R2 model, and extended V7 + R2 model exhibited relatively low levels of positive bias, indicating that they overestimated TRL, which contrasted with the V7 and fine-tuned V7 + R2 models that underestimated TRL to a much greater extent (Table 2 and Figure S12). The combined and extended V7 + R2 models followed the 1:1 line closely, indicating a high degree of agreement between the predicted TRL and the TRL measured by RootSnap! (Figure 2 and Table S1). The R2 field model deviated slightly from the 1:1 line when TRL was less than 7500 mm and deviated more noticeably beyond this level of TRL, while the V7 field model and fine-tuned V7 + R2 field model began to deviate noticeably when TRL was greater than 5000 mm.

FIGURE 2  Scatterplots of total root length as predicted by models developed using root images collected in-field and as measured manually on images collected in-field at the R2 growth stage. (A) shows growth-stage-specific models and (B) shows multi-growth-stage models. Colored lines are linear regression trendlines.
FIGURE 3  Scatterplots of total root length as predicted by models developed using root images collected from the greenhouse experiment and total root length as measured manually on images collected from the greenhouse experiment at the R2 growth stage. Colored lines are linear regression trendlines.

Handling cross-site domain shifts: Application of field models to image datasets from a greenhouse experiment
When the R2 field model was applied to maize root images collected at the same growth stage in the greenhouse study, it had a higher level of error than the R2 greenhouse model (Table 3). Both models overestimated TRL compared to manual TRL measurements, although they exhibited no proportional bias because their slopes were near one (Figure 3 and Table S1). Likewise, when the V7 field model was applied to the V9 greenhouse images, the V9 greenhouse model had a lower level of error than the V7 field model (Table 3). As was the case for the R2 greenhouse image dataset, both the V7 field model and the V9 greenhouse model overestimated TRL when applied to the V9 greenhouse images (Figure 4).

TABLE 3  Model performance metrics for models applied to images collected in the greenhouse experiment at the V9 and R2 growth stages, as displayed in Figure 3.

FIGURE 4  Scatterplots of total root length as predicted by models developed using root images collected from the greenhouse experiment and total root length as measured manually on images collected from the greenhouse experiment at the V9 growth stage. Colored lines are linear regression trendlines.

DISCUSSION AND CONCLUSIONS
Recent applications of machine learning techniques to segment minirhizotron images have dramatically increased the throughput of in situ crop root phenotyping. Yet, guidance is needed regarding how to develop and deploy crop root segmentation models in experiments that collect root images at distinct points in the crop's lifecycle, so that researchers can appropriately handle temporal domain shifts. Moreover, guidance is needed regarding how cross-site domain shifts may impact the performance of existing crop root segmentation models when they are applied to similar image datasets, collected in separate studies, that the models have not encountered during training. To address these gaps, we evaluated how model development strategies affect segmentation accuracy when models are used to segment root images of the same crop species collected at earlier or later growth stages and under different study conditions.
The temporal domain shift between V7 and R2 images is largely related to the increase in root length density between these two growth stages: average TRL in the images increased from 767 to 5726 mm between the V7 and R2 growth stages (Figures 1 and 2 and Figure S13). As such, we expected that the V7 field model would not accurately segment the root-dense images at the R2 growth stage and would perform worse than the R2 field model when used to segment images from the R2 growth stage, as was observed. We found that the V7 model had high error, negative bias, and a relatively low coefficient of determination when applied to R2 images, supporting this hypothesis and confirming the increased model error associated with temporal domain shifts (Table 2). On the other hand, since the R2 field model was trained on the highly root-dense R2 images and the multi-growth-stage V7 + R2 models were trained on images from both growth stages, we expected that they would handle the less complex V7 root images just as well as the V7 field model. However, we were surprised to find that the R2 field model and all three multi-growth-stage models performed worse than the V7 field model when used to segment the images collected at the V7 growth stage (Table 1 and Figure 1).
Interestingly, our results both align with and diverge from other research. Han et al. (2022) developed models to identify biopores in images collected at four unique sites and found that combining the images into a single dataset to train a "multi-site" model saved time and improved model performance relative to developing site-specific models for images collected at each location. While we found that the combined and extended multi-growth-stage models were highly accurate when used to segment root images collected in the field at the R2 growth stage, this was not the case for the images collected at the earlier V7 growth stage.
While it is unclear what exactly caused the decline in accuracy of the combined V7 + R2 and extended V7 + R2 models on the V7 images collected in the field, we suspect that it may be related to a difference in class balance, that is, the number of foreground (i.e., root) and background (i.e., non-root) pixels present in the images used to train models. Class imbalances can lead to decreased model performance on underrepresented classes (Li et al., 2021), and designing deep learning networks that can handle class-balance issues accurately is an ongoing research avenue (Litjens et al., 2017). We estimated that, on average, images collected at the R2 growth stage contained approximately 10 times more root pixels than images collected at the V7 growth stage. It is likely that this dramatic imbalance was problematic in developing a model that could segment both growth stages without losing some performance capability.
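For illustration, one common mitigation for such imbalance is to weight the loss function by inverse class frequency; whether RootPainter applies any such weighting internally is not established here. A minimal R sketch using the foreground fractions estimated in this study:

```r
# Inverse-frequency class weights for a binary segmentation loss.
# fg_frac is the fraction of root (foreground) pixels in the training set.
class_weights <- function(fg_frac) {
  w <- c(background = 1 / (1 - fg_frac), root = 1 / fg_frac)
  w / sum(w)  # normalize so the two weights sum to 1
}

class_weights(0.009)  # V7 images, ~0.9% root pixels: root weight ~0.99
class_weights(0.102)  # R2 images, ~10.2% root pixels: root weight ~0.90
```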
In contrast to the V7 growth stage, training multi-growth-stage models resulted in improved model performance at the R2 growth stage compared to the growth-stage-specific model (Figure 2 and Table 2). However, this was only the case for the combined and extended V7 + R2 field models, as the fine-tuned V7 + R2 field model had greater error than the growth-stage-specific R2 field model (Table 2). The primary difference between the fine-tuned V7 + R2 model and the other two multi-growth-stage models is the annotations used during training. The combined and extended V7 + R2 field models were developed by actively training on annotations from both the V7 and R2 growth stages. The fine-tuned V7 + R2 field model, however, was created by using the V7 field model's weights to initialize training and was developed by actively training only on annotations from image tiles collected at the R2 growth stage. This suggests that the improved performance exhibited by the combined and extended V7 + R2 field models is likely due to simultaneous training on annotations and their corresponding images from both growth stages; the greater diversity of images accessed during training may have improved the models' predictive ability when applied to the relatively complex R2 images.
As opposed to the combined and extended V7 + R2 models, the fine-tuned V7 + R2 model was trained to identify roots at the V7 and R2 growth stages in multiple phases: first by training on the V7 images and then by fine-tuning the V7 model using the R2 images. A limitation of transfer learning is "catastrophic forgetting," a known phenomenon in neural networks whereby training a model to recognize new features can degrade its ability to recognize features that the model previously learned (Kirkpatrick et al., 2017). It is likely that during fine-tuning the model forgot useful information from annotations on the V7 images. It is also possible that the interactive machine learning process as implemented in RootPainter exacerbates forgetting when discarding the original dataset, as the transfer learning process may begin when only a small subset of the target dataset is annotated.
Solutions to the forgetting problem may be considered as future improvements to RootPainter. When a trained model is available but access to the original training data and images is not, forgetting can be mitigated by selectively slowing down changes to existing model parameters (Aljundi et al., 2018; Kirkpatrick et al., 2017). When some of the original training data and labels are available, rehearsal methods, also referred to as interleaved learning (McClelland et al., 1995), may be more effective. Such methods involve periodically using a combination of data from the old and new datasets during the re-training process (Chaudhry et al., 2019; Hayes et al., 2019; Lin, 1992; Rebuffi et al., 2017; Rolnick et al., 2019); see the sketch at the end of this section.

Furthermore, by using the growth-stage-specific field models to segment maize root images from the greenhouse experiment, we evaluated how the field models performed when subjected to known cross-site domain shifts. As expected, we found that field model performance degraded and that models trained on root images collected at one location were not suitable for direct application to root images collected at a separate location. While the R2 field model was subjected to a single domain shift in the form of experimental site, the V7 field model was subjected to two domain shifts in the form of site and growth stage. Compared to their respective greenhouse models, the R2 field model's bias and RMSE increased by 49% and 20%, respectively, while the V7 field model's bias and RMSE increased by 89% and 74%, respectively (Table 3). Given that the V7 field model was subjected to two domain shifts, it was unsurprising that its increases in bias and RMSE were greater than those observed for the R2 field model, which was subjected to only a single domain shift.

Although we focused on temporal and cross-site domain shifts, other research suggests that model performance can decline with exposure to additional domain shifts, such as a cross-species shift. For example, Wang et al. (2019) trained a CNN-based segmentation model on soybean (Glycine max) root images. When these authors applied their soybean model to images collected in a forest soil that contained root species other than soybean, they visually noted that model accuracy had decreased but argued that a high level of accuracy was maintained (Wang et al., 2019). In contrast, other researchers using a CNN-based model trained on maize root images visually demonstrated that high levels of accuracy were maintained when the model was applied to barley (Hordeum vulgare) and Arabidopsis (Arabidopsis thaliana) root images (Narisetti et al., 2021). In the future, research should aim to quantify the level of error that arises when a cross-species domain shift is introduced to crop root segmentation models.
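Returning to the rehearsal methods mentioned above, the core idea is simply that each re-training batch mixes replayed examples from the old dataset with examples from the new one. A framework-agnostic R sketch follows; the batch size and mixing ratio are illustrative assumptions, not RootPainter parameters:

```r
# Rehearsal / interleaved learning: each re-training batch mixes newly
# annotated tiles with a replay sample drawn from the old dataset.
make_rehearsal_batch <- function(old_ids, new_ids, batch_size = 8,
                                 old_fraction = 0.5) {
  n_old <- round(batch_size * old_fraction)
  n_new <- batch_size - n_old
  c(sample(old_ids, n_old),  # replayed tiles from the original dataset
    sample(new_ids, n_new))  # tiles from the domain-shifted dataset
}

set.seed(42)  # reproducible example
make_rehearsal_batch(old_ids = paste0("V7_tile_", 1:1728),
                     new_ids = paste0("R2_tile_", 1:1728))
```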
For studies that collect root images at a single growth stage (Rinehart et al., 2022), at many time points over a short timeframe (Alonso-Crespo et al., 2022), or that have immediate access to image datasets from multiple sites or experimental years (Bauer et al., 2022; Han et al., 2022), it may be practical to combine all images into one dataset to train a single model. Yet for other experiments with imaging sessions that span months or years (Qian et al., 2022; Wacker et al., 2022; Wu et al., 2022), waiting until all images are captured to begin analysis could be unrealistic, especially considering that early results can provide researchers with information that can shape downstream data collection strategies. Our findings demonstrate that it is imperative for researchers to consider the scope of the image dataset used for training before crop root segmentation models are deployed. When studies collect images from more than one growth stage, we found that developing a growth-stage-specific model was best suited for the first growth stage. For images collected at the next growth stage, we found that combining images from both growth stages into a single dataset to train a segmentation model was the best approach to minimize model bias, RMSE, and NRMSE. As an alternative approach for studies that are unable to immediately combine images from more than one growth stage into a single dataset, we found that model bias and error can be controlled nearly as well by extending the initial growth-stage-specific model's dataset to include images from successive growth stages.

FIGURE 5  Proposed iterative approach for developing crop root segmentation models in studies that image multiple growth stages. When images from only the first growth stage have been collected, a growth-stage-specific model can be developed and used to segment images from the first growth stage (A). When images from more than one growth stage are collected, multi-growth-stage models can be developed by combining images from multiple growth stages (B and C).
Considering the apparent need for an initial growth-stage-specific model, along with the level of model bias and error that is controlled by combining images collected at different growth stages, we recommend an iterative approach to developing crop root segmentation models (Figure 5). Following image capture at the first growth stage of interest, a growth-stage-specific model can be developed and used to segment images from this first growth stage. With our V7 field image dataset, we achieved a relatively low NRMSE value of 0.11, with fixed bias but no proportional bias, using this approach. When images are collected at the second growth stage of interest, a new segmentation model can be developed from scratch by combining images from both growth stages into a single image dataset. With our R2 field image dataset, we achieved a very low NRMSE value of 0.04, similar to the most accurate models reported by Bauer et al. (2022), with no fixed or proportional bias, using this approach. Hypothetically, this iterative process can continue for as many growth stages as are of interest, though it is important to bear in mind that only two distinct growth stages were evaluated here. Future research should evaluate the possible tradeoffs of training models using combined or extended image datasets when roots are imaged more frequently and should demonstrate TRL measurements that are comparable between growth stages.
Furthermore, we caution against deploying an updated model on images from previous growth stages, since we found that doing so causes overestimation of roots from previous growth stages (i.e., the combined V7 + R2 model applied to the V7 image dataset; Table 1). That is, if a model has been trained on a dataset composed of images from three growth stages, we recommend using it only to segment images from the latest growth stage. Lastly, we caution against directly applying crop root segmentation models trained on images from only one experiment to image datasets collected in separate experiments, even when the species and growth stage are identical, as this can also degrade TRL estimation and increase model error.

FIGURE 1  Scatterplots of total root length as predicted by models developed using root images collected in-field and as measured manually on images collected in-field at the V7 growth stage. (A) shows growth-stage-specific models and (B) shows multi-growth-stage models. Colored lines are linear regression trendlines for each model.
TABLE 2  Model performance metrics for field models applied to images collected in-field at the R2 growth stage, as displayed in Figure 2.