Early‐season biomass and weather enable robust cereal rye cover crop biomass predictions

Farmers need accurate estimates of winter cover crop biomass to make informed decisions on termination timing or to estimate potential release of nitrogen from cover crop residues to subsequent cash crops. Utilizing data from an extensive experiment across 11 states from 2016 to 2020, this study explores the most reliable predictors for determining cereal rye cover crop biomass at the time of termination. Our findings demonstrate a strong relationship between early‐season and late‐season cover crop biomass. Employing a random forest model, we predicted late‐season cereal rye biomass with a margin of error of approximately 1,000 kg ha−1 based on early‐season biomass, growing degree days, cereal rye planting and termination dates, photosynthetically active radiation, precipitation, and site coordinates as predictors. Our results suggest that similar modeling approaches could be combined with remotely sensed early‐season biomass estimations to improve the accuracy of predicting winter cover crop biomass at termination for decision support tools.


Funding information
National Institute of Food and Agriculture, Grant/Award Numbers: 8042-22000-16600D, MD-ENST-22008; Natural Resources Conservation Service, Grant/Award Number: NR21-13G022

INTRODUCTION
Winter cover crops are increasingly being integrated into crop rotations due to their multifaceted agroecosystem benefits. Winter cover crops can reduce, and in some cases reverse, the rate of soil erosion (Evans et al., 2020), and they can also reduce nitrate leaching losses (Thapa, Mirsky, et al., 2018). Cover crops suppress weeds through biotic competition during the growth phase and through physical impedance and altered surface soil conditions post-termination (Menalled et al., 2022). Cover crops also improve soil water infiltration rates (Basche & DeLonge, 2019), thereby increasing soil water storage to buffer against negative effects of extreme precipitation events, which are increasing with climate change (Basche et al., 2016; Gowda et al., 2018). In addition, residual nitrogen that might otherwise be susceptible to leaching or denitrification can be scavenged by winter cover crops and later released for the subsequent cash crop (Alonso-Ayuso et al., 2018; Thapa, Tully, et al., 2022). However, these benefits are contingent on winter cover crop biomass accumulation; poorly established cover crops are less effective at providing them (Finney et al., 2016; Jian et al., 2020). Conversely, excessively high winter cover crop biomass can pose substantial management challenges for farmers (O'Connell et al., 2015). Decision support tools, such as the "Cover Crop N-Calculator Tool" (https://covercrop-ncalc.org/), are available to help farmers manage cover crop termination and estimate nitrogen release from cover crop residue (Thapa, Cabrera, et al., 2022). Models capable of predicting in-season cover crop biomass accumulation from readily available data are needed to inform decision support tools for cover crop and nitrogen management. These tools can assist farmers' decisions about when to terminate cover crops to prevent excessive biomass accumulation and about whether synthetic nitrogen fertilizer applications can be reduced, depending on whether nitrogen will be released from the cover crop residue. The potential for nitrogen release from cereal rye monoculture residue may be relatively low compared to cover crop mixtures with legumes, which have higher aboveground N content and lower C:N ratios (Thapa, Poffenbarger, et al., 2018; Thapa, Tully, et al., 2022). However, we still lack accurate models to predict biomass from cereal rye, which is the most common cover crop species in the United States (CTIC et al., 2020).
Previous work has found that cereal rye biomass is sensitive to termination timing, soil nitrogen availability, fall nitrogen application rates, general soil fertility, seeding method, and length of growing season (Mirsky et al., 2017; Ruffo et al., 2004; Ruis et al., 2019). In this study, we explored how management and environmental factors can contribute to predicting cereal rye (Secale cereale) cover crop biomass across 35 site-years of data throughout the eastern United States. Our goals were to identify (1) which covariates best explain variation in cereal rye cover crop biomass and (2) a modeling approach with high accuracy that can be adapted for future cover crop management decision support tools. Previous research has indicated that fall and spring growing degree days and soil nitrogen availability are key determinants of late-season cereal rye biomass (Kuo & Jellum, 2000; Mirsky et al., 2017; Ryan et al., 2011). Although soil nitrogen availability helps determine late spring cereal rye biomass, it is an expensive parameter to estimate. We hypothesized that early-season cereal rye biomass would correlate with late-season cereal rye biomass well enough to obviate measurement of soil nitrogen availability for late-season biomass prediction.

Field sites and operations
Cereal rye cover crop biomass data used in the modeling approach were obtained from a field experiment conducted on research farms in 11 states between 2016 and 2020 (as outlined in Table S1). Cereal rye was planted in 9.1 m by 12.2 m plots in the late fall of each year, with four or five replicates per site-year. Management practices (i.e., cereal rye variety, seeding rates, and methods) specific to each site were based on local norms. Biomass samples were collected from two 0.5 m² quadrats in each plot at 6 weeks (hereafter referred to as "early-season biomass") and 2 weeks ("late-season biomass") prior to target dates for soybean planting. Cereal rye variety, planting date, and early and late-season biomass sampling dates are summarized across sites and years in Table S2.

Data assembly and preparation
In addition to early-season biomass, which was hypothesized to predict winter cover crop growth, weather variables related to temperature and radiation were used to model late-season biomass. Daily minimum and maximum air temperatures (°C) and shortwave incoming solar radiation (W m−2) were extracted for each site-year at a spatial resolution of 0.125° by 0.125° from the North American Land Data Assimilation System phase 2 dataset (Xia et al., 2012). Cumulative growing degree days (CGDD) (−4.5°C base) were calculated over two time periods, and negative values were omitted (Pessotto et al., 2023). "Early CGDD" and "early precipitation" were summed between cereal rye planting date and early termination date (6 weeks prior to soybean planting), and "late CGDD" and "late precipitation" were summed between early termination and late termination date. Precipitation data were extracted from the multi-radar/multi-sensor system (NOAA, 2023). Daily photosynthetically active radiation (PAR) was calculated from shortwave radiation using the "sw.to.par" function in the LakeMetabolizer v.1.5.0 R package (Winslow et al., 2016). The mean of daily PAR was calculated for the period between early and late cover crop termination dates.
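The CGDD summation described above can be sketched as follows. This is a minimal Python illustration, not the code used in the study (the analysis was run in R). It assumes that daily growing degree days are the mean of minimum and maximum air temperature minus the base temperature, and that each period is a half-open date interval; neither convention is stated in the text. The function names, the temperature records, and the base value in the example are hypothetical.

```python
from datetime import date, timedelta


def daily_gdd(t_min, t_max, base):
    """Growing degree days for one day from min/max air temperature (deg C).

    Assumption: simple averaging method, (t_min + t_max) / 2 - base.
    """
    return (t_min + t_max) / 2.0 - base


def cumulative_gdd(daily_temps, start, end, base):
    """Sum daily GDD over [start, end), omitting negative daily values.

    daily_temps maps a date to a (t_min, t_max) tuple in deg C.
    """
    total = 0.0
    day = start
    while day < end:
        t_min, t_max = daily_temps[day]
        gdd = daily_gdd(t_min, t_max, base)
        if gdd > 0:  # negative daily values are left out of the sum
            total += gdd
        day += timedelta(days=1)
    return total


# Hypothetical four-day record; the base value is illustrative only.
planting = date(2019, 10, 1)
early_sampling = date(2019, 10, 3)
late_sampling = date(2019, 10, 5)
temps = {planting + timedelta(days=i): (2.0, 12.0) for i in range(4)}
temps[date(2019, 10, 3)] = (-10.0, 0.0)  # a cold day with negative GDD
illustrative_base = 4.0

early_cgdd = cumulative_gdd(temps, planting, early_sampling, illustrative_base)
late_cgdd = cumulative_gdd(temps, early_sampling, late_sampling, illustrative_base)
print(early_cgdd, late_cgdd)
```

Passing the base temperature as an argument keeps the sketch independent of any particular crop threshold, so the value reported in the text can be supplied directly.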

Model selection to evaluate support for each covariate
All predictor variables (early-season cereal rye biomass, cereal rye planting date [Julian days], late termination date [Julian days], mean PAR, and both early and late CGDD) were standardized by subtracting the mean and dividing by the standard deviation of each variable (Gelman, 2008). We examined all candidate predictor variables for collinearity using the vif function in the car package (v3.0-10) (Fox et al., 2018) and removed the precipitation variables because their variance inflation factor scores exceeded 3 (Zuur et al., 2010). Site location (which varied occasionally from year to year within states) was input as a unique categorical variable for each set of field location coordinates.
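As an illustration of the standardization step, the short Python sketch below centers a predictor on its mean and scales it by its standard deviation. It is not the study's R code. The use of the sample standard deviation (n − 1 denominator) is an assumption, and the biomass values are hypothetical.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (value - mean) / standard deviation.

    Assumption: sample standard deviation (n - 1 denominator).
    """
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical early-season biomass values in kg ha-1.
early_biomass = [1000, 2000, 3000]
z_scores = standardize(early_biomass)
print(z_scores)
```

After this transformation each coefficient in a regression describes the change in the response per one standard deviation of the predictor, which places effect sizes for variables with different units on a common scale.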
We fit a generalized linear mixed effects model (GLMM) using the glmer function in the lme4 package (Bates et al., 2015) with a Gaussian error distribution and log link function due to overdispersion in the response variable (late-season cereal rye cover crop biomass in kg ha−1). We specified a hierarchical model with random intercepts for each location and for blocks (nested under each location) to address the non-independence of repeated measurements within the same locations and blocks through time (Pinheiro & Bates, 2000). We fit a "global" model with all covariates that we hypothesized to be important, including early-season cereal rye biomass, cereal rye planting date (in Julian days), late termination date, mean PAR, and both early and late CGDD. We visually assessed model assumptions of homogeneity of variance across groups and normality of fitted residuals.
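One way to write out the model structure described above is given below. The notation is ours and is introduced only for illustration: y_{ijk} is late-season biomass for observation k in block j of location i, x_{p,ijk} are the six standardized covariates, and u_i and v_{j(i)} are the random intercepts for location and for block nested within location.

```latex
\begin{aligned}
y_{ijk} &\sim \mathcal{N}\!\left(\mu_{ijk}, \sigma^{2}\right), \\
\log \mu_{ijk} &= \beta_{0} + \sum_{p=1}^{6} \beta_{p}\, x_{p,ijk} + u_{i} + v_{j(i)}, \\
u_{i} &\sim \mathcal{N}\!\left(0, \sigma_{u}^{2}\right), \qquad
v_{j(i)} \sim \mathcal{N}\!\left(0, \sigma_{v}^{2}\right).
\end{aligned}
```

Under the log link, each fixed effect acts multiplicatively on expected biomass: a one standard deviation increase in covariate p scales the expected value by exp(β_p).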

Random forest model and validation
To improve the accuracy of predictions, we also fit a random forest machine learning model on the dataset using the randomForest package v. 4.7-11 in R (Breiman, 2001; Liaw & Wiener, 2002). We specified a random forest model with the training parameters ntree set to 1000 and mtry set to 2 and included the same covariates as the GLMM, except we included site latitude and longitude coordinates separately rather than as categorical locations. Variable importance was calculated with the randomForest package; variables were ranked using %IncMSE, the mean decrease in prediction accuracy on the out-of-bag samples as each variable is randomly permuted. The dataset was randomly partitioned so that the random forest model was trained on 70% of the total data, and 30% was withheld and used for model validation. We also used the same data partition to validate a version of the "global model" (GLMM) that was refitted to include only the training data. To assess how model performance varied across "low" and "high" cover crop biomass values, we evaluated it separately for "low" biomass observations of 4000 kg ha−1 or less and "high" cereal rye biomass values greater than 4000 kg ha−1.
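The partitioning logic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the study's R code: the function names, the record layout, the random seed, and the simulated biomass values are all hypothetical. The total of 250 records is inferred from the validation set size reported with Figure 2 (n = 75, or 30% of the dataset).

```python
import random


def split_train_test(rows, train_fraction=0.7, seed=0):
    """Randomly partition rows into training and validation subsets."""
    rng = random.Random(seed)  # fixed seed so the partition is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def split_by_biomass(rows, threshold=4000.0, key="late_biomass"):
    """Separate rows into 'low' (<= threshold) and 'high' (> threshold)."""
    low = [r for r in rows if r[key] <= threshold]
    high = [r for r in rows if r[key] > threshold]
    return low, high


# Hypothetical records with simulated late-season biomass in kg ha-1.
sim = random.Random(1)
records = [{"id": i, "late_biomass": sim.uniform(0.0, 10000.0)} for i in range(250)]

train, validation = split_train_test(records)
low, high = split_by_biomass(validation)
print(len(train), len(validation), len(low), len(high))
```

Reusing one partition for both models, as the text describes, means that differences in validation error reflect the models rather than the particular observations that happened to be withheld.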

RESULTS AND DISCUSSION
Early-season cereal rye biomass had the strongest correlation with late-season cereal rye biomass (0.71), followed by late CGDD (0.38) (Figure S1). In contrast, mean PAR had the weakest correlation with late-season biomass (>−0.01) (Figure S1). Notably, late-season biomass varied across states, with generally higher biomass production in Maryland, Louisiana, and Missouri and generally lower production in Minnesota, Arkansas, and Nebraska (Figure 1a). Interestingly, the relationship between early-season biomass and late-season biomass varied across states but appeared to cluster within site-year (Figure 1b), suggesting the importance of cover crop establishment and early-season biomass in determining final late-season biomass. The significant variables in the GLMM, with effect size estimates not overlapping zero, were late CGDD, early-season biomass, mean PAR, and cereal rye planting date (Figure 2a). After controlling for differences by location in the random effects, the GLMM suggested that the fixed effects of late CGDD, mean PAR, and early-season biomass had positive effects on late-season cereal rye biomass, whereas increases in cereal rye planting dates (i.e., later dates) had a negative effect (Figure 2a). The random forest model ranked early-season biomass as the most important variable, followed by termination date, longitude, and cereal rye planting date (Figure 2b). Previous studies have also identified the importance of cereal rye planting and termination dates for cereal rye biomass (Mirsky et al., 2011; Nord et al., 2011), and both modeling approaches suggested that there is a strong relationship between early and late-season cover crop biomass.
Our results corroborate the strength of the relationship between early and late-season cereal rye biomass across 35 site-years, thereby extending the geographical range examined in previous studies (Mirsky et al., 2017). Decision support tools for farmers could feasibly incorporate algorithms based on estimates of early-season cover crop biomass from proximal or remote sensing (Jennewein et al., 2022; Thieme et al., 2023). It is worth noting that the remote-sensing approaches are currently limited to a ceiling of ∼1900 kg ha−1 biomass due to the saturation of vegetation index-based cover crop biomass estimation (Jennewein et al., 2022). In our dataset, 207 observations, or 83% of our data, fell below this biomass threshold. In the future, early-season biomass estimates could be derived from imagery early in the spring, before this saturation point, and used in these models to predict cereal rye biomass accumulation at later termination dates.
With these high-quality data from a large, replicated study across many site-years, we achieved moderately accurate predictions of late-season cereal rye biomass (Figure 2f). For lower biomass observations, the model tended to overpredict biomass, and the root mean square error (RMSE) represented a larger proportion of cereal rye biomass than it did for high biomass observations. Both the GLMM and random forest models were validated using a randomly withheld "test" dataset, which constituted 30% of the original data.

FIGURE 2
Variable importance summaries and model performance from the "global" generalized linear mixed effects model (GLMM) and random forest (RF) models. Effect size estimates from the GLMM, where all covariates were standardized and significant relationships are indicated by *p < 0.05, **p < 0.01, and ***p < 0.001 (a). Variable importance plot from the RF model; %IncMSE is the mean decrease in prediction accuracy on the out-of-bag samples as each variable is randomly permuted (b). Model performance as measured by root mean square error (RMSE) and R² values for model predictions on validation observations (n = 75, or 30% of the original dataset) for the GLMM for low biomass observations of 4000 kg ha−1 or less (c), RF model for low biomass observations of 4000 kg ha−1 or less (d), GLMM for high biomass observations above 4000 kg ha−1 (e), and RF model for high biomass observations above 4000 kg ha−1 (f). Linear model fits are displayed as blue solid lines with shaded 95% confidence intervals, and the dashed red line is the 1:1 line (c-f). CGDD, cumulative growing degree days; PAR, photosynthetically active radiation.
The GLMM's predictions as compared to the observations had an RMSE of 1,378 kg ha−1 and R² of 0.28 for low biomass observations (Figure 2c) and an RMSE of 2,336 kg ha−1 and R² of 0.43 for high biomass observations (Figure 2e), whereas the random forest model achieved a markedly lower RMSE of 962 kg ha−1 and R² of 0.40 for low biomass observations (Figure 2d) and an RMSE of 1,086 kg ha−1 and R² of 0.72 for high biomass observations (Figure 2f).
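The two validation metrics reported above can be computed as in the following Python sketch. It is illustrative only and not the study's code. The text does not state how R² was calculated; the sketch assumes it is the squared Pearson correlation between observed and predicted values, which corresponds to a linear fit of one against the other. The observed and predicted values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between paired observations and predictions."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared_linear_fit(observed, predicted):
    """Squared Pearson correlation between observed and predicted values.

    Assumption: R2 is taken from a linear fit, not from 1 - SSE/SST.
    """
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    s_op = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    s_oo = sum((o - mean_o) ** 2 for o in observed)
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    return s_op ** 2 / (s_oo * s_pp)


# Hypothetical biomass values (kg ha-1) where every prediction is 500 too high.
observed = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1500.0, 2500.0, 3500.0, 4500.0]

error = rmse(observed, predicted)
fit = r_squared_linear_fit(observed, predicted)
relative_error = error / (sum(observed) / len(observed))
print(error, fit, relative_error)
```

In this example a constant overprediction leaves the linear-fit R² at 1 while the RMSE is 500 kg ha−1, which is why the two metrics are read together. Dividing the RMSE by mean observed biomass expresses the error as a proportion of biomass, the quantity that is larger for low biomass observations.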

CONCLUSION
To support the adoption and improve management of winter cover crops, we need to equip farmers with predictive tools. These tools should facilitate in-season winter cover crop biomass prediction to optimize management for various goals such as soil moisture, N availability, and weed suppression.
In this study, we found that late-season cereal rye biomass could be predicted within approximately 1,000 kg ha−1 with relatively few data inputs: early-season cereal rye biomass, cereal rye planting date, termination date, CGDD, precipitation, late PAR, and site coordinates. This level of error is considerable for sites with low biomass levels. However, the results are a promising advance for sites that produce relatively high cereal rye biomass, which may be more likely to rely on decision support tools for agronomic management decisions. Moreover, we anticipate that integrating more geospatial variables, such as soil type and remotely sensed normalized difference vegetation index estimates of early-season cereal rye biomass, may predict the full range of cereal rye biomass values more accurately. In the future, similar approaches can be improved upon using new datasets, particularly with data on more complex cover crop systems such as cereal-legume mixtures, which may have greater ecosystem service benefits.

• Cereal rye winter cover crop biomass modeled on data from 35 site-years.
• We found a strong relationship between early and late-season biomass.
• Random forest model with early-season biomass and weather data performed well.
• Similar approach could improve decision support tools for cover crop management.
This study was made possible by funding from USDA Area-Wide Pest Management (Project Number 8042-22000-16600D), USDA Natural Resources Conservation Service Conservation Innovation Grants (award no. NR21-13G022), and Hatch Project (award no. MD-ENST-22008). The data and code used in this article can be accessed at the following repository link: https://doi.