Ordinal regression models for predicting deoxynivalenol in winter wheat

Authors

  • S. Landschoot,

    Corresponding author
    1. Faculty of Applied Bioscience Engineering, University College Ghent, Gent, Belgium
    2. KERMIT, Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, Gent, Belgium
  • W. Waegeman,

    1. KERMIT, Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, Gent, Belgium
  • K. Audenaert,

    1. Faculty of Applied Bioscience Engineering, University College Ghent, Gent, Belgium
    2. Department of Crop Protection, Laboratory of Phytopathology, Ghent University, Gent, Belgium
  • G. Haesaert,

    1. Faculty of Applied Bioscience Engineering, University College Ghent, Gent, Belgium
    2. Department of Crop Protection, Laboratory of Phytopathology, Ghent University, Gent, Belgium
  • B. De Baets

    1. KERMIT, Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, Gent, Belgium

Abstract

Deoxynivalenol (DON) is one of the most prevalent toxins in Fusarium-infected wheat samples. Accurate forecasting systems that predict the presence of DON are useful to underpin decision making on the application of fungicides, to identify fields under risk, and to help minimize the risk of food and feed contamination with DON. To this end, existing forecasting systems often adopt statistical regression models, in which attempts are made to predict DON values as a continuous variable. In contrast, this paper advocates the use of ordinal regression models for the prediction of DON values, by defining thresholds for converting continuous DON values into a fixed number of well-chosen risk classes. Objective criteria for selecting these thresholds in a meaningful way are proposed. The resulting approach was evaluated on a sizeable field experiment in Belgium, for which measurements of DON values and various types of predictor variables were collected at 18 locations during 2002–2011. The results demonstrate that modelling and evaluating DON values on an ordinal scale leads to a more accurate and more easily interpreted predictive performance. Compared to traditional regression models, an improvement could be observed for support vector ordinal regression models and proportional odds models.

Introduction

Fusarium head blight (FHB) is one of the most important diseases in small grain cereals, caused by a complex of Fusarium species. Hitherto, 17 species of Fusarium have been described as potentially associated with FHB symptoms (Leonard & Bushnell, 2003). Although FHB may cause grain yield losses, the interest in FHB is primarily fuelled by the ability of the majority of the Fusarium species to produce mycotoxins. Deoxynivalenol (DON) is one of the most common mycotoxins produced by Fusarium spp. This secondary fungal metabolite can accumulate to significant levels in the grain and, as such, is able to seriously affect human and animal health (Xu et al., 2005). Mycotoxins may cause technological problems, such as negative effects on malting of beer (inhibition of enzyme synthesis), on fermentation (inhibition of yeast growth) or on baking quality (Prange et al., 2005). Furthermore, they can cause human and animal toxicoses, the principal symptoms of which are nausea, lethargy, vomiting, digestive and haemolytic disorders, impairment of both humoral and cellular immune responses, and nervous disorders (Champeil et al., 2004; Pestka, 2007). To minimize feed and food safety concerns, the European Commission has set guideline limits for Fusarium mycotoxins in animal feed (EC 576/2006) and maximum limits for unprocessed cereals and cereal products for human consumption (EC 1881/2006). The limits for DON are 0·20 mg kg−1 for processed cereal-based food for infants and baby food; 0·50 mg kg−1 for bread, biscuits, cereal snacks and breakfast cereals and 0·75 mg kg−1 for cereals intended for direct human consumption, cereal flour, bran and germ as end products marketed for human consumption. For unprocessed cereals the threshold is 1·25 mg kg−1. These regulations provide an extra economic motivation for farmers to prevent FHB infection and mycotoxin accumulation in small grain cereals such as wheat.

Early insight into the expected mycotoxin contamination at harvest is useful for the various stakeholders of the wheat supply chain, in order to reduce year-to-year losses by applying appropriate fungicides in conjunction with other management strategies (Del Ponte et al., 2005). Therefore, agricultural researchers in many countries have developed various types of forecasting systems for predicting the occurrence of FHB and DON values in wheat (Prandini et al., 2009). The predictions of these systems often originate from statistical models that establish relationships between DON values and a variety of predictor variables, including weather conditions, agronomic variables, soil type and wheat variety information. Traditionally, linear regression models have been widely applied in such scenarios (Moschini & Fortugno, 1996; Hooker et al., 2002; Van Der Fels-Klerx et al., 2010). However, De Wolf et al. (2003) used binary logistic regression models to assess the risk of FHB incidence of wheat based on within-season weather variables. In addition, more complex non-linear models, such as neural networks, have also been investigated (Klem et al., 2007).

In most previous studies, the DON value was treated as a continuous response variable, i.e. metric regression techniques were used to obtain point estimates of DON values. However, it can be argued that this is often impossible due to uncertainty in the measurement of DON values, as a result of the variance and the detection limits. In this study, an enzyme-linked immunosorbent assay (ELISA) was used to quantify the amount of DON, as this method is commonly used for commercial screening purposes. Additionally, the DON value is the result of complex interactions between agronomic factors, weather conditions, and chemotype and aggressiveness of the prevalent Fusarium species. In particular, factors associated with the behaviour of the Fusarium spp. and DON formation make it impossible to predict the exact DON value of a certain field. Furthermore, the complex interactions among the variables influencing FHB and DON values demand an advanced mathematical approach to predicting the DON values. More conceptually, it can also be argued that predicting continuous DON values is not actually required. From an application-oriented perspective, it is more important to know whether the DON value of a given wheat sample exceeds certain well-defined thresholds, like those stated by the European Commission for human consumption.

As a consequence, instead of applying traditional regression models, it makes more sense to use ordinal regression models (Agresti, 2010), which provide mechanisms to represent thresholds for DON values in an explicit way. These models also require fewer assumptions with respect to relationships between the predictor variables and the response variable (Minetos & Polyzos, 2010).

Ordinal regression can be situated somewhere in between multiclass classification and traditional regression because labels are chosen from a finite set of ordered classes (Waegeman & De Baets, 2011). Applications frequently arise when humans participate in the data generation process and when the response variable cannot be measured with high precision. These are also the two main reasons why ordinal scales are encountered for assessing disease severity in plant disease prediction (Isebaert et al., 2009). Other application domains include ecology (Guisan & Harrell, 2000; Minetos & Polyzos, 2010), bioinformatics and agriculture in general (Henderson et al., 2007).

This article aims to introduce ordinal regression models for predicting DON values in Belgium. Agriculture in Belgium is characterized by very intensive animal production, implying that most of the wheat is processed as concentrate animal feed or directly incorporated in animal diets, especially for pigs and chickens. As wheat contaminated with Fusarium mycotoxins often occurs in Belgium, it gives rise to a direct and strong economic impact (Isebaert et al., 2009). Wheat-growing areas in Belgium are characterized by a complex crop rotation system (including maize and ryegrass), small fields and a very complex structure of the Fusarium population, with huge inter- and intraseasonal variations (Audenaert et al., 2009; Landschoot et al., 2011).

This analysis is performed on a sizeable data set, containing DON values and more than 300 predictor variables, including agronomic variables, wheat variety information and whole-year time series of weather conditions for 18 locations from 2002 until 2011. This data set was extensively analysed in Landschoot et al. (2012a), and the most important variables affecting FHB and DON contamination were identified. Using these initial results as a starting point, this study concentrates on the modelling techniques, while focussing on building predictive models. Several traditional and ordinal regression models are compared, including proportional odds models (McCullagh, 1980) and support vector ordinal regression models (Chu & Keerthi, 2007). The main objectives are: (i) comparing various traditional and ordinal regression models; (ii) analysing the influence of the choice of thresholds and the number of ordinal classes on the performance of the models; (iii) discussing different performance measures for predicting DON values on an ordinal scale; and (iv) studying the impact of several validation strategies on the performance of the models.

Materials and methods

Field trials and sampling

From 2002 until 2011, different winter wheat field trials were established throughout Belgium as part of the ACG (Flanders Agricultural Centre for Small Grains, Belgium) trial network. Each year, at each location, commercial winter wheat varieties were sown in a randomized complete block design with four replications (Table 1). At all locations, wheat was produced under normal cropping conditions for Belgium, including three applications of N fertilization. Additionally, one general fungicide treatment was applied at growth stage (GS) 59 (Zadoks et al., 1974). At each location, the same fungicides, containing a strobilurin and a triazole, were applied in accordance with the manufacturer's recommendations. Zero tillage is not common in Belgian agriculture, therefore all fields were ploughed before sowing. The wheat varieties were sown following different previous crops, both host crops for Fusarium (maize or wheat) as well as non-host crops (beans, sugar beets, onions or chicory). At each location, at least 10 wheat varieties were grown and the varieties differed from year to year. At the time of writing, the database contained more than 80 different wheat varieties. Each year the varieties ranged from moderately resistant to highly susceptible for Fusarium infection and DON accumulation (Landschoot et al., 2012a).

Table 1. Locations of the experimental field trials in Belgium from 2002–2011
Growing season: locations
2001–2002: Assent, Bottelare, Gingelom, Helkijn, Koksijde, Nieuwenhove, Poperinge, Zuienkerke, Zwevegem
2002–2003: Bottelare, Gistel, Helkijn, Poperinge, Tongeren, Zuienkerke, Zwevegem
2003–2004: Bottelare, Gistel, Helkijn, Poperinge, Tongeren, Verrebroek, Zuienkerke, Zwevegem
2004–2005: Helkijn, Koksijde, Poperinge, Tongeren, Verrebroek, Zuienkerke, Zwevegem
2005–2006: Bottelare, Koksijde, Leefdaal, Lierde, Poperinge, Tongeren, Verrebroek, Zuienkerke, Zwevegem
2006–2007: Bottelare, Koksijde, Lierde, Poperinge, Tongeren, Verrebroek, Zuienkerke, Zwevegem
2007–2008: Bottelare, Koksijde, Leefdaal, Poperinge, Tongeren, Verrebroek, Zuienkerke, Zwevegem
2008–2009: Bottelare, Ciney, Havelange, Hollogne, Koksijde, Linter, Poperinge, Tongeren, Verrebroek, Zuienkerke, Zwalm, Zwevegem
2009–2010: Bottelare, Ciney, Havelange, Hollogne, Koksijde, Linter, Poperinge, Tongeren, Verrebroek, Zuienkerke, Zwalm, Zwevegem
2010–2011: Bottelare, Ciney, Havelange, Hollogne, Koksijde, Linter, Poperinge, Tongeren, Verrebroek, Zuienkerke, Zwalm, Zwevegem

The degree of FHB resistance of the different wheat varieties was determined through independent artificial inoculation tests with a mixture of Fusarium spores. Spores of each species were diluted to a final concentration of 10⁶ spores mL−1 and the spray volume was 300 L ha−1. The inoculation tests were performed over several years to account for year-to-year variability. Based on the results of these inoculation tests, the wheat varieties were classified into five risk classes for DON accumulation: Class 1, varieties with a DON value >40% lower than the mean DON value for that year; Class 2, DON value between 40 and 20% lower than the mean; Class 3, DON value between 20% lower and 20% higher than the mean; Class 4, DON value between 20 and 60% higher than the mean; and Class 5, the most susceptible varieties, DON value >60% higher than the mean.
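The five-class rule above can be sketched as a small function. This is only an illustration: the function name is hypothetical, and the treatment of values that fall exactly on a boundary (−40%, −20%, +20%, +60%) is an assumption, since the text does not say to which class such values belong.

```python
def variety_risk_class(don_value, yearly_mean):
    """Return the resistance class (1-5) of a variety from its DON value.

    The class depends on the relative deviation of the variety's DON value
    from the mean DON value of the same year. Boundary handling is an
    assumption made for this sketch.
    """
    if yearly_mean <= 0:
        raise ValueError("yearly_mean must be positive")
    # Relative deviation from the yearly mean, in percent.
    deviation = 100.0 * (don_value - yearly_mean) / yearly_mean
    if deviation < -40:
        return 1  # more than 40% below the mean
    if deviation < -20:
        return 2  # between 40% and 20% below the mean
    if deviation <= 20:
        return 3  # within 20% of the mean
    if deviation <= 60:
        return 4  # between 20% and 60% above the mean
    return 5      # more than 60% above the mean
```

For instance, a variety with half the yearly mean DON value falls in Class 1, and one with twice the yearly mean falls in Class 5.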

In early July at GS 71 or GS 75, all experimental fields at all locations were evaluated for the presence of Fusarium symptoms. In each plot 100 ears, originating from distinct plants, were randomly sampled and scored using an ordinal scoring system based on the surface area of the ear covered with Fusarium symptoms: 1 = healthy; 2 = <25% covered; 3 = 25–50% covered; 4 = 50–75% covered; 5 = 75–100% covered. The disease index (DI) was calculated as follows: DI = (0 × n1 + 1 × n2 + 2 × n3 + 3 × n4 + 4 × n5)/(4n) × 100%, where n is the number of evaluated ears and ni the number of ears in disease class i (Isebaert et al., 2009). At the time of harvest (mid-August), all individual plots were harvested separately and a homogeneous sample was taken for DON analysis. The DON values were measured with an ELISA (Veratox DON 5/5 kit Biognost; Neogen; Audenaert et al., 2009).
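As a worked illustration of the disease index, the sketch below computes DI from the five class counts. It assumes that the denominator is 4n (four times the number of evaluated ears), so that a plot with only healthy ears scores 0% and a plot with only class-5 ears scores 100%; the function name is hypothetical.

```python
def disease_index(counts):
    """Disease index (in %) from the number of ears in classes 1..5.

    counts[0] is the number of healthy ears (class 1), counts[4] the number
    of ears in class 5. Class i contributes with weight i - 1.
    """
    if len(counts) != 5:
        raise ValueError("expected counts for exactly five disease classes")
    n = sum(counts)
    if n == 0:
        raise ValueError("at least one ear must be evaluated")
    # enumerate yields weights 0..4 for classes 1..5.
    weighted = sum(weight * count for weight, count in enumerate(counts))
    return 100.0 * weighted / (4 * n)
```

With 100 ears spread evenly over the five classes (20 per class), the weighted sum is 200 and the index is 200/400 × 100% = 50%.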

Meteorological variables

Daily weather data during the entire growing season were collected using the automated weather monitoring equipment of the Agricultural Centre for Potatoes (Flanders, Belgium) within 5 km of the field experiments. Daily rainfall (mm), temperature (°C), relative humidity (RH, %) and leaf wetness duration (h) were recorded. All weather observations were evaluated graphically to identify potential errors or missing values. Additional variables (wind (m s−1), air pressure (hPa) and dew point temperature (°C)) were obtained from Wolfram Mathematica 7.0, a software package for scientific computing which includes tools for retrieving historical time series of a large number of weather variables. The quality of the data and the coverage of weather stations are relatively high for Belgium.

Information from previous research efforts was used to construct predictor variables that are possibly useful for forecasting FHB symptoms and DON values (Moschini et al., 2001; Hooker et al., 2002; Schaafsma & Hooker, 2007). For each month of the growing season (November to July) the average, median, 10th, 25th, 75th and 90th percentiles of temperature, relative humidity, air pressure, wind, number of freezing days in winter, number of rainy days, and the number of days with RH greater than 80% were calculated. Note that 10th and 90th percentiles of weather variables can be used to estimate the impact of extreme weather conditions. For example, one might theoretically expect that extreme rainfall might have an effect on Fusarium infection and subsequently on DON production.

Spearman rank correlation coefficients were calculated (R software v. 2.10.1; R Development Core Team, 2006) for all combinations of weather variables in the different months. Each month, three to 14 variables that were mutually sufficiently uncorrelated (correlation less than 0·7) and contributed most to explaining the variation in DON values were considered for further analysis. This variable selection method based on a correlation analysis can clearly be improved upon, but this is beyond the scope of this paper. This method might lead to overfitting problems, but these are expected to be marginal because the FHB data set can be considered as low-dimensional, at least compared to modern data sets with hundreds of thousands of variables, as encountered in areas such as genetics and molecular biology (Saeys et al., 2007).
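A correlation-based filter of this kind could look roughly like the sketch below. It is not the authors' R code: the greedy order (strongest association with the response first), the function names and the use of absolute correlations are assumptions, and constant variables (zero variance) are not handled.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def select_uncorrelated(candidates, response, max_abs_rho=0.7):
    """Greedily keep variables whose mutual |rho| stays below max_abs_rho.

    candidates maps a variable name to its list of values. Variables are
    visited in order of decreasing |rho| with the response.
    """
    ranked = sorted(
        candidates,
        key=lambda name: abs(spearman(candidates[name], response)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(abs(spearman(candidates[name], candidates[k])) < max_abs_rho
               for k in kept):
            kept.append(name)
    return kept
```

Because the ranking uses the response itself, running such a filter on the full data set before cross-validation is one route by which the overfitting mentioned above can arise.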

Furthermore, it could be argued that 30-day periods lined up with the actual months may not be the most appropriate time periods to summarize the weather variables. In view of the short period of susceptibility of wheat to FHB, it is necessary to analyse the correlation between DON values and weather conditions during the period of preanthesis and anthesis. Therefore, the window-pane methodology (Kriss et al., 2010) was employed to determine shorter periods (window lengths 5–30 days) around anthesis, in which weather conditions significantly contribute to the FHB incidence and measured DON values. Table 2 shows the predictor variables that were retained to fit the models. All models were fitted with the same set of predictor variables.
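The basic bookkeeping behind a window-pane analysis can be sketched as follows: enumerate every window of a given length inside a period of interest and summarize a daily weather series over it. The one-day step between consecutive windows, the function names and the dictionary layout (day of year mapped to a value) are assumptions of this sketch, not details taken from Kriss et al. (2010).

```python
def window_panes(first_day, last_day, lengths):
    """Yield (start_day, end_day) for every full window inside the period.

    Days are days of the year (day 1 is 1 January); both ends are inclusive.
    """
    for length in lengths:
        for start in range(first_day, last_day - length + 2):
            yield start, start + length - 1


def window_mean(daily_values, start_day, end_day):
    """Mean of a daily series (dict: day of year -> value) over one window."""
    values = [daily_values[day] for day in range(start_day, end_day + 1)]
    return sum(values) / len(values)
```

Each window summary then becomes a candidate predictor (for example, mean RH for days 156–170) whose association with the DON values can be tested.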

Table 2. Predictor variables used to fit both metric and ordinal regression models
Risk previous crop (a) | Total rainfall March | 25% P dew point June
Wheat variety (b) | Mean temp April | 75% P dew point June
10% P (c) temp Nov | 10% P RH April | 90% P temp June
25% P temp Nov | 10% P dew point April | 90% P RH June
25% P RH Nov | 25% P dew point April | 90% P dew point June
50% P RH Dec | 90% P temp April | Total rainfall June
75% P dew point Dec | 90% P RH April | Days with rainfall June
90% P temp Dec | Days with rainfall April | Mean temp July
Days with rainfall Dec | Mean temp May | Mean RH July
Days with frost Dec | Mean RH May | Mean dew point July
10% P RH Jan | 50% P temp May | 50% P temp July
75% P temp Jan | 50% P dew point May | 50% P RH July
75% P RH Jan | 10% P temp May | 10% P temp July
Total rainfall Jan | 10% P RH May | 10% P RH July
Days with frost Jan | 10% P dew point May | 25% P RH July
50% P temp Feb | 25% P temp May | 75% P dew point July
10% P RH Feb | 90% P temp May | Total rainfall July
Total rainfall Feb | Total rainfall May | Days with rainfall July
Days with frost Feb | Days with rainfall May | RH days 156–160
Days with RH >80% Feb | Days with RH >80% May | Days with rainfall 166–171
10% P temp March | Mean temp June | Days with rainfall 155–159
10% P dew point March | Mean RH June | Mean RH days 156–170
25% P RH March | 10% P dew point June | Total rainfall 155–184
90% P dew point March | 25% P RH June | Total rainfall 147–165

(a) The risk class of the previous crop (host or non-host crop for Fusarium).
(b) Wheat variety susceptibility.
(c) 10% P, 25% P, 50% P, 75% P and 90% P mean 10th, 25th, 50th, 75th and 90th percentiles of the monthly weather variables, respectively.
The variables determined by the window-pane analysis are RH days 156–160, days with rainfall 166–171, days with rainfall 155–159, mean RH days 156–170, total rainfall days 155–184 and total rainfall days 147–165, in which day 1 is 1 January.

Statistical analysis

Metric regression models

The term metric regression is commonly used to denote regression models that consider continuous response variables (Waegeman et al., 2008). The traditional multiple linear regression model is the simplest and best-known metric regression model. In an ordinary linear regression model, the residual sum of squares is minimized. Yet least-squares minimization in multiple linear regression may not always provide accurate predictions, for reasons such as a lack of robustness to outliers and the absence of mechanisms for handling multicollinearity and for controlling model complexity in the presence of many predictor variables (Hastie et al., 2008). Ridge regression offers a solution to the last two problems by shrinking the regression coefficients towards zero (Hastie et al., 2008). As such, a penalized residual sum of squares is minimized. Multiple regression and ridge regression models were fitted using the R software v. 2.10.1 with the functions ‘lm’ and ‘ols’.
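For reference, the two least-squares criteria described in this paragraph can be written in their common textbook form. The notation is an assumption introduced here and does not come from this article: β0 is the intercept, β the coefficient vector, λ ≥ 0 the penalty weight, and there are n observations and p predictors.

```latex
\begin{align*}
\text{least squares:}\quad
  & \min_{\beta_0,\,\beta}\ \sum_{i=1}^{n}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^{2} \\
\text{ridge:}\quad
  & \min_{\beta_0,\,\beta}\ \sum_{i=1}^{n}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^{2}
    +\lambda\sum_{j=1}^{p}\beta_j^{2}
\end{align*}
```

Setting λ = 0 recovers ordinary least squares, while larger values of λ shrink the coefficients more strongly towards zero.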

Regression trees were used as a third type of predictive model, which allows for the construction of non-linear functions, unlike the two former regression techniques. Trees are built using a process known as binary recursive partitioning. The algorithm recursively splits the data into two groups based on a splitting rule. The partitioning intends to increase the homogeneity of the two resulting subsets or nodes, based on the response variable. The partitioning stops when no splitting rule can improve the homogeneity of the nodes significantly (Hastie et al., 2008). The R function ‘rpart’ was employed to construct regression trees.
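A single step of binary recursive partitioning can be illustrated as follows. The sketch searches one predictor for the split point that minimizes the summed squared error of the two resulting nodes; it is a simplified illustration under that assumption and does not reproduce the stopping rules or other details of ‘rpart’.

```python
def sum_squared_error(values):
    """Sum of squared deviations from the mean (node impurity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Best threshold on one predictor x for a continuous response y.

    Returns (threshold, cost), where cost is the summed squared error of
    the left (x < threshold) and right (x > threshold) nodes, or None if
    x takes a single value and no split is possible.
    """
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical x values
        left = [response for _, response in pairs[:i]]
        right = [response for _, response in pairs[i:]]
        cost = sum_squared_error(left) + sum_squared_error(right)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best
```

A full tree applies the same search to every predictor, keeps the best split, and repeats the procedure within each of the two resulting nodes.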

Support vector machines (SVMs) are a class of universal approximators based on statistical learning theory and quadratic programming (Cortes & Vapnik, 1995). SVMs support both regression and classification tasks and they can handle thousands of predictor variables of quite different natures. In a regression context, a quadratic program is solved to minimize a so-called ε-insensitive error function that is known to be a better alternative to the least-squares error function in terms of robustness to outliers. This setting is known as support vector regression (SVR; Fig. 1a). Moreover, SVMs (and SVR) can infer complex non-linear functions by mapping the predictor variables to a high-dimensional space that can be represented by means of kernel functions. The candidate functions of a linear ε-SVR can be represented as:

\[ f(x) = \langle \hat{a}, x \rangle + b \tag{1} \]

where x is the matrix of the predictor variables, â the vector of the parameter values and b the bias term. In ε-SVR the goal is to find a function f(x) that has at most ε deviation from the actual targets and at the same time remains as flat as possible. Flatness in the case of Eqn (1) means a small â. One way to ensure this is to minimize the norm, ||â||², which can be written as:

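\[ \min_{\hat{a},\,b}\ \tfrac{1}{2}\lVert \hat{a} \rVert^{2} \quad \text{subject to} \quad \lvert y_i - \langle \hat{a}, x_i \rangle - b \rvert \le \varepsilon \ \text{ for all } i \tag{2} \]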
Figure 1.

(a) Example of linear support vector regression in combination with the insensitivity tube depicted by the dashed lines. The points scattered around the regression line are training examples. (b) Example of support vector ordinal regression with three classes. The samples from different ordinal classes, represented as circles filled with different colours, are mapped by f(x) onto the axis of function value (Chu & Keerthi, 2007).

The tacit assumption in Eqn (2) is that a function f exists that approximates all pairs (xi, yi) with ε precision. However, this may not always be the case: as Figure 1a shows, not all points lie within the insensitivity tube. To accommodate these non-fitting points, slack variables are introduced. Using the slack variables ζ and ζ*, Eqn (2) can be written as:

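\[ \min_{\hat{a},\,b,\,\zeta,\,\zeta^{*}}\ \tfrac{1}{2}\lVert \hat{a} \rVert^{2} + C \sum_{i} (\zeta_i + \zeta_i^{*}) \quad \text{subject to} \quad y_i - \langle \hat{a}, x_i \rangle - b \le \varepsilon + \zeta_i, \quad \langle \hat{a}, x_i \rangle + b - y_i \le \varepsilon + \zeta_i^{*}, \quad \zeta_i,\ \zeta_i^{*} \ge 0 \tag{3} \]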

The constant C determines the trade-off between the flatness of f and the amount to which deviations larger than ε are tolerated. Eqn (3) is also called the primal objective function. Using the kernel trick, the above optimization problem can be solved in a dual formulation, similar to support vector machines (Cortes & Vapnik, 1995). Two common types of kernel functions are linear kernels, leading to linear models, and Gaussian radial basis function (RBF) kernels, resulting in non-linear models by mapping the predictor variables to a high-dimensional space.
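The trade-off governed by C can be made concrete with a short sketch. It assumes the standard form of the linear ε-SVR primal objective, in which the slack of each observation at the optimum equals its ε-insensitive loss, and it uses the usual parametrization exp(−γ‖x1 − x2‖²) for the Gaussian RBF kernel; the function names and the parameter name gamma are not taken from this article.

```python
import math


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the insensitivity tube, linear in the excess outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def svr_primal_objective(weights, bias, X, y, epsilon, C):
    """Flatness term plus C times the total deviation beyond epsilon."""
    flatness = 0.5 * sum(w * w for w in weights)
    total_slack = 0.0
    for x_i, y_i in zip(X, y):
        prediction = sum(w * v for w, v in zip(weights, x_i)) + bias
        total_slack += epsilon_insensitive_loss(y_i, prediction, epsilon)
    return flatness + C * total_slack


def rbf_kernel(x1, x2, gamma):
    """Gaussian RBF kernel between two predictor vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * squared_distance)
```

A large C penalizes points outside the tube heavily and favours a close fit, whereas a small C favours a flatter function.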

Ordinal regression models

In ordinal regression problems, the response variable can take a small number of discrete, ordered values that are often referred to as classes. Ordinal regression differs from multiclass classification due to the existence of an order relation on the response variable, so that typically different model structures and different performance measures are considered in these two settings. In contrast to metric regression, the response variable is discrete and finite (Waegeman et al., 2008). In this article a penalized proportional odds model is used (fitted using the R function ‘lrm’), in which, similar to ridge regression, the penalty term prevents overfitting by shrinking the regression coefficients towards zero.

The proportional odds model (McCullagh, 1980) is the best-known and most applied technique to represent ordinal response variables as a function of multiple predictor variables. This kind of model involves modelling cumulative logits, in which the cumulative probability of observing an outcome greater than or equal to k is defined as follows:

display math

for k = 1, …, r with r being the number of classes considered. Similar to the logistic regression model for binary responses, the proportional odds model fits a linear model to the log-odds of these cumulative probabilities, i.e.

display math

for k = 1, …, r. As a single vector of parameters â is used, the model has the same effect for each class, guaranteeing that all response curves for individual classes have the same shape. These curves share exactly the same rate of increase or decrease, but they are horizontally displaced by the thresholds bk. Hence, the proportional odds model forms a direct generalization of the logistic regression model by considering a threshold for each class. If a different slope for each class is considered as well, a one-versus-one ensemble would be obtained with logistic regression models as binary classifiers (Waegeman & De Baets, 2011).
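To illustrate how one coefficient vector and a set of thresholds produce class probabilities, the sketch below assumes the convention P(y ≥ k | x) = 1/(1 + exp(−(bk + â·x))) for k = 2, …, r, with P(y ≥ 1 | x) = 1 and strictly decreasing thresholds. This sign convention and the function names are assumptions of the sketch; the article's own display equations are not reproduced here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def class_probabilities(x, coefficients, thresholds):
    """Class probabilities P(y = k | x) for k = 1..r.

    thresholds holds the r - 1 intercepts for classes 2..r and must be
    strictly decreasing so that the cumulative probabilities are ordered.
    """
    score = sum(c * v for c, v in zip(coefficients, x))
    # P(y >= k) for k = 1..r, followed by P(y >= r + 1) = 0.
    at_least = [1.0] + [sigmoid(b + score) for b in thresholds] + [0.0]
    return [at_least[k] - at_least[k + 1] for k in range(len(thresholds) + 1)]
```

Because every class shares the same score, changing x only shifts probability mass along the ordered classes, which is the proportional odds assumption described above.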

Additionally, support vector ordinal regression (SVOR) models were used, which allow modelling of non-linear relationships between the response variable and the predictor variables. The support vector ordinal regression algorithm was introduced by Shashua & Levin (2003) as a direct generalization of SVMs (Cortes & Vapnik, 1995) to more than two ordered classes. Later, this algorithm was enhanced by Chu & Keerthi (2005, 2007), who repaired some shortcomings of the initial algorithm. In essence, these authors presented two slightly different support vector approaches for ordinal regression. Only one of these methods is analysed here, namely the version with implicit constraints on the thresholds. In SVOR, thresholds can be interpreted as hyperplanes in the kernel space that lie perpendicular to the direction â (Fig. 1b). The vector â and r−1 thresholds bk are inferred, such that the weighted sum of the error on the training data and the regularizer on â is minimized. Similar to SVR, individual errors are denoted with slack variables ζ. More details about the derivation of this method and the implementation can be found in Chu & Keerthi (2005). For the experiments here, the publicly available SVORIM-package was used (http://www.gatsby.ucl.ac.uk/~chuwei/code/svorim.tar).
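Once the function value f(x) and the thresholds are available, prediction in a threshold-based ordinal model reduces to locating f(x) among the ordered thresholds, as depicted in Fig. 1b. The sketch below shows only this final step; it is not part of the SVORIM package, the function name is hypothetical, and assigning a value that coincides with a threshold to the lower class is an assumption.

```python
def predict_ordinal_class(function_value, thresholds):
    """Predicted class in 1..r from f(x) and r - 1 ascending thresholds.

    The class is one plus the number of thresholds that f(x) exceeds.
    """
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in ascending order")
    return 1 + sum(1 for b in thresholds if function_value > b)
```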

Choice of thresholds for the class boundaries

The thresholds that define the boundaries of the risk classes for DON contamination are based on the limits of the ELISA kit and the current European legislation for contaminants in foodstuff (Commission regulation (EC) No. 1881/2006):

  • 0·10 mg kg−1: detection limit of the ELISA kit;
  • 0·20 mg kg−1: the limit for processed cereal-based food for infants and baby food;
  • 0·50 mg kg−1: the limit for bread (including small bakery wares), pastries, biscuits, cereal snacks and breakfast cereals;
  • 0·75 mg kg−1: the limit for cereals intended for direct human consumption, cereal flour, bran and germ as end product marketed for direct human consumption.

The threshold of 1·25 mg kg−1 for unprocessed cereals was not considered, as the data set contained only a very small number of observations (0·1%) exceeding this threshold. It appears that in Belgium, DON values higher than 1·25 mg kg−1 in wheat occur only in extreme situations (susceptible wheat variety combined with a host crop as previous crop and weather conditions conducive for Fusarium development).

First, proportional odds models and support vector ordinal regression models were applied in a five-class setting with thresholds of 0·10, 0·20, 0·50 and 0·75 mg kg−1. Subsequently, the lowest threshold was omitted and the resulting three thresholds were used in a four-class setting. Finally, three two-class settings were analysed, with thresholds of 0·20, 0·50 and 0·75 mg kg−1.
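The conversion of a continuous DON value into a risk class for the settings listed above can be sketched as follows. The constant and function names are hypothetical, and placing a value that exactly equals a threshold in the lower class (a limit counts as exceeded only when the value is strictly greater) is an assumption.

```python
from bisect import bisect_left

# Thresholds in mg/kg for the class settings described in the text.
FIVE_CLASS_THRESHOLDS = (0.10, 0.20, 0.50, 0.75)
FOUR_CLASS_THRESHOLDS = (0.20, 0.50, 0.75)
TWO_CLASS_THRESHOLDS = ((0.20,), (0.50,), (0.75,))


def don_class(don_mg_per_kg, thresholds):
    """Ordinal class (1 = lowest risk) of a DON value.

    bisect_left counts the thresholds strictly below the value, so the
    class is one plus the number of thresholds that the value exceeds.
    """
    return 1 + bisect_left(thresholds, don_mg_per_kg)
```

The same function can be applied to both measured and predicted continuous values, which gives ordinal labels on a common scale for metric and ordinal models.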

Evaluation procedures and performance measures

The predictions obtained with the different models were evaluated via cross-validation, a standard approach that uses the available data in an optimal way for evaluating predictive models (Hastie et al., 2008). Two versions of cross-validation were considered. The first version involved a standard 20-fold cross-validation for which the data was randomly subdivided into 20 parts. Subsequently, 20 iterations of model calibration and validation were performed, also known as training and testing, leaving out one particular fold for testing in each iteration, while using the remaining 19 folds for training.
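The random 20-fold procedure can be sketched with the standard library alone. The function name, the fixed seed and the round-robin assignment of shuffled indices to folds are assumptions of this sketch.

```python
import random


def k_fold_indices(n_samples, n_folds=20, seed=0):
    """Yield (train_indices, test_indices) for random k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds disjoint folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for test in folds:
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test
```

Every observation is used for testing exactly once and for training in the remaining 19 iterations.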

The second evaluation strategy was a more specific cross-year cross-location strategy, proposed in Landschoot et al. (2012b) to account for year and location effects. This procedure can be summarized as follows.

First, the available data was grouped by year and by location. In this way, K × L subsets were generated for a data set with observations from K years and L locations. It is important to include all locations and years. However, some of these subsets contained no data because not all locations were analysed in all years. Secondly, for each of the non-empty subsets, a predictive model was fitted to the whole data set without using the observations of the subsets from the same year or the same location. Subsequently, observations in the subset under consideration were used to evaluate the predictive performance of the model. Finally, a global estimate of the performance of the predictive modelling technique was obtained by weighted averaging of the results for every individual subset, in which the weights were proportional to the number of observations in each subset. A more detailed assessment of the usefulness of this cross-validation strategy is addressed in the discussion.
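The splitting logic of this procedure can be sketched as below. The function names and data layout (parallel lists of years and locations, one entry per observation) are assumptions of the sketch; the model fitting and scoring steps are left out.

```python
from collections import defaultdict


def cross_year_cross_location_splits(years, locations):
    """Yield ((year, location), train_indices, test_indices) per non-empty subset.

    The test set holds all observations of one year-location combination.
    The training set excludes every observation from the same year or the
    same location.
    """
    groups = defaultdict(list)
    for index, key in enumerate(zip(years, locations)):
        groups[key].append(index)
    for (year, location), test in groups.items():
        train = [
            i for i, (y, loc) in enumerate(zip(years, locations))
            if y != year and loc != location
        ]
        yield (year, location), train, test


def weighted_average(scores, sizes):
    """Global performance estimate, weighted by subset size."""
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)
```

Only combinations that actually occur in the data are generated, so empty year-location subsets are skipped automatically.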

The two most common performance measures are reported for regression models with a continuous response variable, the coefficient of determination R2 and the mean squared error (MSE). The latter is formally defined by normalizing the residual sum of squares. Five different performance measures are reported for the ordinal regression models, as defined in Waegeman et al. (2008). The accuracy is calculated as the proportion of correctly classified examples. The mean squared error and the mean absolute error (MAE) represent the mean of squared or absolute difference between the output of the classifier and the correct label. Furthermore, two versions of the concordance index (C-index) were calculated. The C-index is often reported as a measure of the predictive power of an ordinal regression model. It is an estimator of the concordance probability by counting the number of (lower-class; higher-class) object couples that are correctly ranked by the model. A C-index of 1 represents a perfect model, whereas a value of 0·5 represents a completely random model. When only two classes are considered, the C-index reduces to the area under the receiver operating characteristics (ROC) curve (AUC) or, equivalently, to the Mann–Whitney–Wilcoxon test statistic. Two alternative approaches are usually considered to calculate multivariate performance measures like the AUC and the C-index in cross-validation, namely pooling and averaging (Parker et al., 2007; Airola et al., 2011). For the pooling approach, predictions made in each cross-validation fold are pooled into one set and one common C-index is calculated from this pool. The assumption made when using pooling is that predictions returned in different folds come from the same distribution. Parker et al. (2007) showed that this assumption is in most cases not valid, potentially leading to large pessimistic biases. 
An alternative approach, averaging, is to calculate the C-index separately for each fold and average them to obtain one common performance estimate. Unlike with pooling, predictions made for instances in different folds are never compared, so this approach will be unbiased. However, in K-fold cross-validation, only a small subset of all the possible positive–negative instance pairs present in the training set is considered when calculating the C-index. This can lead to a high variance in the estimates when using small data sets. As an extreme case, if there are more folds than observations for the minority class, some of the folds will not have instances from this class (Airola et al., 2011).
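The ordinal performance measures and the difference between pooling and averaging can be sketched as follows. The function names are hypothetical, and counting tied predictions as half-concordant in the C-index is an assumption of this sketch; Waegeman et al. (2008) should be consulted for the exact definitions used in the study.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def c_index(y_true, scores):
    """Fraction of (lower-class, higher-class) pairs ranked correctly."""
    concordant, comparable = 0.0, 0
    for label_i, score_i in zip(y_true, scores):
        for label_j, score_j in zip(y_true, scores):
            if label_i < label_j:
                comparable += 1
                if score_i < score_j:
                    concordant += 1.0
                elif score_i == score_j:
                    concordant += 0.5  # ties count as half (assumption)
    if comparable == 0:
        raise ValueError("no pair of observations from different classes")
    return concordant / comparable


def pooled_c_index(folds):
    """Merge the predictions of all folds, then compute one C-index."""
    labels = [t for fold_labels, _ in folds for t in fold_labels]
    scores = [s for _, fold_scores in folds for s in fold_scores]
    return c_index(labels, scores)


def averaged_c_index(folds):
    """Compute the C-index per fold, then take the unweighted mean."""
    values = [c_index(labels, scores) for labels, scores in folds]
    return sum(values) / len(values)
```

As a small usage example, two folds with labels (1, 2) and predictions (0.1, 0.2) and (0.5, 0.9) are each ranked perfectly, so the averaged C-index is 1, whereas the pooled C-index is 0.75 because the class-1 prediction of 0.5 in the second fold is compared with the class-2 prediction of 0.2 in the first.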

To compare the results of the metric and ordinal regression models, the accuracy, MSE, MAE and C-index were also calculated for the metric regression models. These were obtained by transforming both the measured and the predicted continuous DON values into five or four ordinal categories, matching the five- and four-class ordinal settings.
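The transformation described above can be sketched as follows. The class boundaries used here (0·20, 0·50, 0·75 and 1·25 mg kg−1) are an assumption based on the thresholds mentioned elsewhere in the text, and assigning a value that equals a threshold to the higher class is likewise an assumption; the function names are hypothetical.

```python
import numpy as np

# Assumed class boundaries in mg/kg (not confirmed by this section).
THRESHOLDS_5 = [0.20, 0.50, 0.75, 1.25]   # five classes
THRESHOLDS_4 = [0.20, 0.50, 0.75]         # four classes

def to_classes(don_values, thresholds):
    """Map continuous DON values to ordinal classes 1..len(thresholds)+1."""
    # np.digitize returns 0 for values below the first threshold,
    # so adding 1 makes the lowest risk class equal to 1.
    return np.digitize(don_values, thresholds) + 1

def ordinal_scores(measured, predicted, thresholds):
    """Accuracy, MSE and MAE computed on the class labels."""
    y = to_classes(measured, thresholds)
    y_hat = to_classes(predicted, thresholds)
    diff = y_hat - y
    return {
        "accuracy": float(np.mean(diff == 0)),
        "mse": float(np.mean(diff ** 2)),
        "mae": float(np.mean(np.abs(diff))),
    }

# Hypothetical measured and predicted DON values (mg/kg).
measured = [0.10, 0.30, 0.60, 1.00, 2.00]
predicted = [0.15, 0.55, 0.60, 0.10, 1.50]
scores = ordinal_scores(measured, predicted, THRESHOLDS_5)
```

Here one prediction is off by one class and another by three classes, so the MSE penalizes the second error nine times as heavily as the first, whereas the accuracy treats both errors alike.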

Results

Metric regression models

The results for the metric regression models, where continuous DON values are used as the response variable, are summarized in Table 3. For all regression models the random cross-validation strategy resulted in the lowest MSEs, ranging from 0·039 (SVR with RBF kernel) to 0·043 (regression trees), and the highest R², ranging from 0·560 (SVR with RBF kernel) to 0·516 (regression trees). The cross-year cross-location validation strategy gave rise to substantially higher MSEs and a substantially lower R². Additionally, the differences between the various regression models were more prominent for cross-year cross-location validation than for random cross-validation. When the cross-year cross-location strategy was used to evaluate the models, the MSEs ranged from 0·073 (ridge regression) to 127 (multiple linear regression) and the R² ranged from 0·192 (ridge regression) to less than 0·001 (SVR with linear kernel).

Table 3. Performance (MSE (Mean Squared Error) and R²) of different ordinary regression methods. The variance of the DON values is 0·094

Random CV (a)

| Measure | Linear reg. (c) | Ridge reg. | Reg. tree | SVR linear | SVR RBF |
| --- | --- | --- | --- | --- | --- |
| MSE | 0·040 | 0·040 | 0·043 | 0·040 | 0·039 |
| R² | 0·554 | 0·554 | 0·516 | 0·530 | 0·560 |

CYL CV (b)

| Measure | Linear reg. (c) | Ridge reg. | Reg. tree | SVR linear | SVR RBF |
| --- | --- | --- | --- | --- | --- |
| MSE | 127 | 0·073 | 0·104 | 3·620 | 0·090 |
| R² | 0·010 | 0·192 | 0·032 | <0·001 | 0·020 |

(a) Random cross-validation.
(b) Cross-year cross-location validation.
(c) Linear reg.: multiple linear regression; Ridge reg.: ridge regression; Reg. tree: regression trees; SVR linear: support vector regression with linear kernel; SVR RBF: support vector regression with Gaussian RBF kernel.

Ordinal regression models

The results of the five- and four-class ordinal regression settings are summarized in Table 4 and the results of the three different two-class classification problems are shown in Table 5. In all cases, the performance based on the random cross-validation strategy was characterized by a higher accuracy, lower MSE and MAE and a higher C-index than when the performance was based on the cross-year cross-location validation strategy. In the random cross-validation setting, the proportional odds model resulted in an accuracy at least as high, and an MSE and MAE at least as low, as those of the support vector models; the two models tied only for the two-class problem with a threshold of 0·75 mg kg−1.

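Table 4. Performance of the five- and four-class proportional odds model (Prop odds) and support vector ordinal regression model (SVOR). The results are presented for random cross-validation (Random CV) and cross-year cross-location validation (CYL CV)

Five classes

| Performance (a) | Random CV: Prop odds | Random CV: SVOR | CYL CV: Prop odds | CYL CV: SVOR |
| --- | --- | --- | --- | --- |
| Accuracy | 0·625 | 0·570 | 0·464 | 0·442 |
| MSE | 0·615 | 0·702 | 1·390 | 1·200 |
| MAE | 0·479 | 0·487 | 0·770 | 0·745 |
| C-index-A | 0·892 | 0·875 | 0·617 | 0·632 |
| C-index-P | 0·892 | 0·810 | 0·770 | 0·713 |

Four classes

| Performance (a) | Random CV: Prop odds | Random CV: SVOR | CYL CV: Prop odds | CYL CV: SVOR |
| --- | --- | --- | --- | --- |
| Accuracy | 0·727 | 0·695 | 0·676 | 0·652 |
| MSE | 0·410 | 0·518 | 0·825 | 0·740 |
| MAE | 0·333 | 0·378 | 0·470 | 0·462 |
| C-index-A | 0·900 | 0·834 | 0·625 | 0·581 |
| C-index-P | 0·897 | 0·820 | 0·756 | 0·723 |

(a) MSE: Mean Squared Error; MAE: Mean Absolute Error; C-index-A: C-index with averaging method; C-index-P: C-index with pooling method.
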
Table 5. Performance of the two-class proportional odds model (Prop odds) and support vector ordinal regression model (SVOR) with different thresholds and different cross-validation strategies (random cross-validation (Random CV) and cross-year cross-location validation (CYL CV))

Threshold 0·20 mg kg−1

| Performance (a) | Random CV: Prop odds | Random CV: SVOR | CYL CV: Prop odds | CYL CV: SVOR |
| --- | --- | --- | --- | --- |
| Accuracy | 0·851 | 0·810 | 0·755 | 0·756 |
| MSE | 0·149 | 0·196 | 0·245 | 0·243 |
| MAE | 0·149 | 0·196 | 0·245 | 0·243 |
| C-index-A | 0·919 | 0·868 | 0·680 | 0·670 |
| C-index-P | 0·915 | 0·845 | 0·714 | 0·745 |

Threshold 0·50 mg kg−1

| Performance (a) | Random CV: Prop odds | Random CV: SVOR | CYL CV: Prop odds | CYL CV: SVOR |
| --- | --- | --- | --- | --- |
| Accuracy | 0·902 | 0·873 | 0·848 | 0·875 |
| MSE | 0·098 | 0·127 | 0·152 | 0·125 |
| MAE | 0·098 | 0·127 | 0·152 | 0·125 |
| C-index-A | 0·914 | 0·719 | 0·675 | 0·532 |
| C-index-P | 0·912 | 0·575 | 0·846 | 0·500 |

Threshold 0·75 mg kg−1

| Performance (a) | Random CV: Prop odds | Random CV: SVOR | CYL CV: Prop odds | CYL CV: SVOR |
| --- | --- | --- | --- | --- |
| Accuracy | 0·950 | 0·950 | 0·945 | 0·950 |
| MSE | 0·050 | 0·050 | 0·055 | 0·050 |
| MAE | 0·050 | 0·050 | 0·055 | 0·050 |
| C-index-A | 0·929 | 0·611 | 0·701 | 0·502 |
| C-index-P | 0·924 | 0·550 | 0·803 | 0·500 |

(a) MSE: Mean Squared Error; MAE: Mean Absolute Error; C-index-A: C-index with averaging method; C-index-P: C-index with pooling method.

For the cross-year cross-location strategy in the four- and five-class setting, the proportional odds model performed better than the SVOR model based on the accuracy, whereas the SVOR model resulted in a lower MSE and MAE than the proportional odds models. For the C-index the results were variable. The highest accuracy obtained with the five-class model validated with the realistic cross-year cross-location strategy was 46% and with the four-class model 67%. The lowest MSE and MAE reached were 1·20 and 0·75 for the five-class model and 0·74 and 0·46 for the four-class setting, respectively. The highest C-index, calculated with the averaging method in the five- and the four-class setting, was 0·63. The C-index calculated with the pooling method was slightly higher, at 0·77 in the five- and 0·76 in the four-class setting.

For the two-class classification task, it is clear that the higher the threshold of the two-class models, the higher the accuracy and the lower the MSE and the MAE (Table 5). For the C-index no clear trend could be observed. The differences between the thresholds were more clear-cut for the models validated with the cross-year cross-location validation strategy than for the models validated with the random cross-validation strategy. The SVOR models validated with cross-year cross-location validation performed slightly better than the proportional odds models based on the accuracy, MSE and MAE. With the cross-year cross-location validation strategy, an accuracy of 76% was reached when the threshold was 0·20 mg kg−1, and 95% when the threshold was 0·75 mg kg−1. In most cases, the C-index was higher for the proportional odds models than for the SVOR models; the only exception was the pooled C-index of the model with threshold 0·20 mg kg−1 validated with cross-year cross-location validation. For the SVOR models with thresholds 0·50 mg kg−1 and 0·75 mg kg−1 validated with cross-year cross-location validation, the averaged C-index was only slightly higher than 0·50. Additionally, especially for the higher thresholds, the DON values in one fold often lie in the same class. If all DON values of a particular fold are in the same class, no C-index can be calculated for that fold. Therefore, in the models validated with the cross-year cross-location validation strategy, the averaged C-index was based on 36 and 15 values for a threshold of 0·50 and 0·75 mg kg−1, respectively, whereas for the threshold of 0·20 mg kg−1 the averaged C-index was based on 50 values (in total there were 89 folds in the cross-year cross-location validation strategy). Concerning the pooled C-index, a value of about 0·50 was obtained with the SVOR models for thresholds of 0·50 and 0·75 mg kg−1.

Comparison of metric regression and ordinal regression

To compare metric and ordinal regression, the performance of the multiple linear regression and ridge regression models was compared with that of the ordinal regression models when cross-year cross-location validation was applied. These two model types were chosen because, among the metric regression models, they resulted in the worst and the best predictions, respectively.

The results of this comparison are given in Table 6. Note that the MSE and MAE in Table 6 cannot be compared with the MSE in Table 3. To calculate the MSE and MAE in Table 6, the continuous values were transformed into ordinal categories, whereas the MSE in Table 3 was based on the continuous values.

Table 6. Comparison of the performance of the metric regression models (linear and ridge regression models) and the ordinal regression models (proportional odds and SVOR models) validated with cross-year cross-location validation. Note that the continuous output of the metric regression models was converted into five and four different classes, respectively, to calculate the performance measures

Five classes

| Performance | Metric: Linear reg. | Metric: Ridge reg. | Ordinal: Prop. odds | Ordinal: SVOR |
| --- | --- | --- | --- | --- |
| Accuracy | 0·181 | 0·340 | 0·464 | 0·442 |
| MSE | 7·050 | 1·350 | 1·390 | 1·200 |
| MAE | 2·200 | 0·870 | 0·770 | 0·745 |
| C-index-A | 0·617 | 0·616 | 0·617 | 0·632 |
| C-index-P | 0·407 | 0·772 | 0·770 | 0·713 |

Four classes

| Performance | Metric: Linear reg. | Metric: Ridge reg. | Ordinal: Prop. odds | Ordinal: SVOR |
| --- | --- | --- | --- | --- |
| Accuracy | 0·310 | 0·547 | 0·676 | 0·652 |
| MSE | 4·303 | 0·642 | 0·825 | 0·740 |
| MAE | 1·622 | 0·516 | 0·470 | 0·462 |
| C-index-A | 0·625 | 0·625 | 0·625 | 0·581 |
| C-index-P | 0·772 | 0·798 | 0·756 | 0·723 |

Within the ordinal regression setting, for five and four ordinal categories, more observations were predicted in the right class, which resulted in a higher accuracy. The MSE and MAE were clearly lower in the ordinal regression setting than in the continuous setting. For the C-index, the differences between the two approaches were rather small, except for the pooled C-index of the multiple linear regression model in the five-class setting, which was very low. In the four-class setting the C-indices of the ordinal regression models were similar to or slightly lower than those of the continuous approach. Additionally, it can be seen that the pooled C-index is in most cases higher than the averaged C-index (except for the multiple linear regression model with five categories). This can be explained by the fact that the DON values are largely determined by year and location effects.

In this data set, the DON values within a particular test set of the cross-year cross-location validation strategy were similar. If not all classes are represented in the test set, the averaged C-index suffers from a high variance. Furthermore, as mentioned above, in the ordinal setting the pooling method suffers from a bias because function values from different folds are compared. In the continuous setting, the returned function values are direct estimates of the DON values and can thus be compared directly, which results in a lower bias. Therefore, the C-index is not the most appropriate measure to compare continuous and ordinal regression models. In conclusion, based on the accuracy, MSE and MAE, the ordinal regression models outperformed the metric regression models.

Discussion

To the authors' knowledge, this work is the first attempt to apply ordinal regression models for predicting threshold-based DON values in winter wheat. Previous research has mainly focused on forecasting exact DON values by means of conventional multiple linear regression models (Moschini & Fortugno, 1996; Hooker et al., 2002; Van Der Fels-Klerx et al., 2010) and artificial neural networks (Klem et al., 2007). The experimental results suggest that a threshold-based prediction of DON values can lead to more meaningful and more easily interpreted results, especially from an application-oriented perspective. By transforming the continuous measured and predicted DON values into the five and four ordinal categories, it was illustrated that ordinal regression models outperformed the more traditional regression approaches. The predictive performance obtained for the ordinal regression models indicated that a substantial portion of the variance in the response variable can be explained by means of the predictor variables. In contrast, such an effect was not seen for the metric regression models when the cross-year cross-location validation strategy was adopted. Thus for the specific situation of Belgium, whilst it remains very difficult to forecast specific DON values, it is possible to predict whether certain thresholds will be exceeded using ordinal regression models.

All models were developed based on a variety of predictor variables, including agronomic variables, wheat variety information and time series of weather variables such as temperature, rainfall and relative humidity. Most of these variables are known to influence the DON value of wheat (Hooker et al., 2002; Klem et al., 2007; Schaafsma & Hooker, 2007; Prandini et al., 2009; Kriss et al., 2010; Van Der Fels-Klerx et al., 2010; Landschoot et al., 2012a). Because non-linear techniques were used, the models constructed in this study are intended to provide accurate predictions rather than to give further insight into the relative importance of the predictor variables. See Landschoot et al. (2012b) for more details on this subject.

The thresholds that define the class boundaries of the risk classes for DON contamination were based on the limits of the ELISA kit and the current European legislation for contaminants in foodstuff (Commission regulation (EC) No. 1881/2006). The EU legislation is based on toxicological experiments with various animals, and it has been suggested that such results are difficult to extrapolate to other animals or humans, and that maximum levels may be too high. In some animals, negative health effects also seem to occur at a low DON value (Pestka, 2007; van der Burgt et al., 2011). Additionally, mycotoxins rarely occur as a single contaminant. Food and feed may contain a blend of mycotoxins, and intake of combinations of mycotoxins may lead to interactive toxic effects. Furthermore, the toxic effect of a single mycotoxin may be amplified due to the presence of other contaminants. More research studying the additive and synergistic effects of a mixture of mycotoxins is needed. Therefore, without including the limit of 1·25 mg kg−1, the developed ordinal regression models remain useful for predicting the thresholds between which the DON value will lie.

Remarkable differences were observed between the different cross-validation strategies. These results support relevant findings from Landschoot et al. (2012b), where various evaluation methodologies for plant disease prediction models were discussed and empirically analysed. In general, it can be expected that the random cross-validation strategy returns a better performance than the specific cross-year cross-location validation strategy, because the former method is known to be statistically biased if predictions have to be made for future years or new locations (Kaundal et al., 2006; Hastie et al., 2008). Cross-year cross-location validation should therefore be the preferred methodology for evaluating models that intend to forecast DON values.
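A cross-year cross-location split can be sketched as below. The text states only that each test fold contains observations from a single year and a single location; removing every observation that shares the test year or the test location from the training set is an assumption made here so that both the year and the location are new to the model. The function name is hypothetical.

```python
def cyl_splits(years, locations):
    """Yield (year, location, train_indices, test_indices) per fold.

    Test fold: all observations of one (year, location) combination.
    Training set (assumption): observations from other years AND other
    locations, so neither the test year nor the test location was seen.
    """
    observations = list(zip(years, locations))
    for year, loc in sorted(set(observations)):
        test = [i for i, (y, l) in enumerate(observations)
                if y == year and l == loc]
        train = [i for i, (y, l) in enumerate(observations)
                 if y != year and l != loc]
        yield year, loc, train, test

# Hypothetical toy data: one entry per field observation.
years = [2002, 2002, 2003, 2003, 2004]
locations = ["A", "B", "A", "B", "C"]
splits = list(cyl_splits(years, locations))
```

With random cross-validation, by contrast, observations from the same year and location can end up in both the training and the test set, which is the source of the optimistic estimates discussed above.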

Concerning the metric regression techniques, it was observed that the multiple linear regression technique resulted in huge positive and negative predictions for new years and/or new locations in the cross-year cross-location validation strategy, due to overfitting. This problem was resolved in the ridge regression technique, in which the regularization parameter controlled the effective complexity of the model. The ridge regression technique outperformed the other modelling techniques when the cross-year cross-location validation strategy was used to validate the models. When support vector regression models with a linear kernel and with a Gaussian RBF kernel were compared under cross-year cross-location validation, it was clear that the Gaussian RBF kernel performed better than the linear kernel. This confirmed that non-linear relationships such as pairwise or higher-order interactions existed between the response variable and several of the predictor variables.
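The stabilizing effect of the regularization parameter can be illustrated with a small synthetic example. The data, the penalty value and the function name below are hypothetical and unrelated to the DON data set; the closed-form solution assumes centred predictors and no intercept.

```python
import numpy as np

def fit_linear(X, y, ridge_lambda=0.0):
    """Closed-form least squares with an optional L2 (ridge) penalty.

    ridge_lambda = 0 gives ordinary least squares; larger values shrink
    the coefficients and reduce the effective complexity of the model.
    """
    n_features = X.shape[1]
    A = X.T @ X + ridge_lambda * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data with two almost collinear predictors (illustration only).
rng = np.random.default_rng(0)
x1 = rng.normal(size=30)
x2 = x1 + 1e-3 * rng.normal(size=30)
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=30)

w_ols = fit_linear(X, y)                      # unregularized
w_ridge = fit_linear(X, y, ridge_lambda=1.0)  # regularized

print("OLS coefficients:  ", w_ols)
print("Ridge coefficients:", w_ridge)
```

With nearly collinear predictors, the unregularized coefficients can become large with opposite signs, which is one way in which extreme predictions for unseen conditions arise; the penalty shrinks the coefficient vector.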

In a second attempt, ordinal regression techniques were used to obtain a threshold-based prediction of the DON values. The proportional odds model fitted the data reasonably well, but it is restricted to linear relationships, whereas the true relationship between the response variable and the predictor variables might be non-linear in many predictive modelling problems (Hastie et al., 2008). As a preprocessing step, variables were selected that were sufficiently uncorrelated. This might explain the comparable performance of the proportional odds model and the support vector ordinal regression model in this analysis.

The models were evaluated using five different performance measures, all of them having advantages and disadvantages. The accuracy can be calculated for both ordinal regression and multiclass classification models. The accuracy is also easy to compute and interpret, but it gives a biased view of the performance of an ordinal regression model because it does not take into account the magnitude of an error. For example, misclassifying an object of class four into class one should be evaluated as a more severe mistake than misclassifying the same object into class three. The MSE and MAE, on the other hand, evaluate the predictive performance by considering the magnitude of errors, but these measures require that a metric is defined on the risk classes. Mathematically, this is an incorrect assumption, as no metric is defined on an ordinal scale. Often the C-index or concordance index is reported in the statistical literature as a measure of the predictive power of an ordinal regression model (Parker et al., 2007; Airola et al., 2011). This measure is an estimator of the concordance probability by counting the number of (lower-class; higher-class) object couples that are correctly ordered by the model. Compared to the mean squared error, it has the important advantage that no metric on the risk classes is required.

Both the pooled C-index and the averaged C-index were calculated. The difference between the two approaches was quite small in most of the experiments, but some inconsistencies were observed when cross-year cross-location validation was applied to the two-class models. The discrepancy is caused by problems in fold sampling during cross-validation, when certain folds do not contain instances from the minority class. This is especially true in the cross-year cross-location validation strategy, where the folds contain observations from a single year and a single location. Because DON contamination is shown to be location- and year-specific, most of the observations were likely to be in the same class. The averaged C-index then becomes highly variable, and thus less reliable. In the most extreme case, if all instances in the test set lie in the same class, no C-index can be calculated. This was the case in most of the folds of the cross-year cross-location validation strategy with a threshold of 0·75 mg kg−1, where very few observations exceeded this threshold. With this threshold the C-index could only be calculated for 15 of the 89 folds.

The pooled alternative on the other hand can be statistically biased, because the assumption made when using the pooling method is that classifiers produced in different cross-validation folds come from the same population. This is generally not valid, because some of the positive-negative pairs are constructed using observations from different folds (Parker et al., 2007; Airola et al., 2011). In the two-class setting, when the goal was to predict if the DON value was lower or higher than 0·50 mg kg−1 or lower or higher than 0·75 mg kg−1 with the SVOR model, a C-index of 0·50 was obtained, which was lower than the C-index obtained with the proportional odds model. It was clear that the function values obtained in the different folds came from a different population. For example, in fold x a function value of 0·0063 corresponds to a classification in class 1, while in fold y this function value corresponds to a classification in class 2. This led to test samples in different folds being ranked in the wrong order.
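The effect described in the fold x and fold y example can be reproduced with a toy calculation. Apart from the function value 0·0063 mentioned above, all numbers below are hypothetical, as is the function name; tied function values are counted as half-concordant, which is an assumption.

```python
import numpy as np

def c_index_binary(labels, scores):
    """Two-class C-index (equivalent to the AUC); ties count as 0.5.

    Classes are coded 1 (lower) and 2 (higher). Returns None when one of
    the two classes is absent.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    low = scores[labels == 1]
    high = scores[labels == 2]
    if low.size == 0 or high.size == 0:
        return None
    # diff[i, j] > 0 means higher-class object j is ranked above lower-class object i
    diff = high[None, :] - low[:, None]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Fold x: the function value 0.0063 belongs to class 1.
fold_x = ([1, 1, 2, 2], [0.0063, 0.0070, 0.0080, 0.0090])
# Fold y: the same function value 0.0063 belongs to class 2.
fold_y = ([1, 1, 2, 2], [0.0010, 0.0020, 0.0063, 0.0065])

per_fold = [c_index_binary(*fold_x), c_index_binary(*fold_y)]
pooled = c_index_binary(fold_x[0] + fold_y[0], fold_x[1] + fold_y[1])
```

Each fold is ranked perfectly on its own, yet the pooled value drops below 1 because class 1 objects of fold x receive function values that are as high as or higher than those of class 2 objects in fold y.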

Additionally, it can be observed that the predictive performance improves when the number of risk classes decreases. This is not at all surprising, as a higher degree of precision is required if more risk classes are introduced. When comparing the accuracy of the binary classification settings, it is clear that the models with a threshold of 0·75 mg kg−1 performed the best of all. However, the pooled C-index of the SVOR models was low for this threshold, due to a skewed class distribution: only 51 of the 1034 observations in the data set had a DON value higher than 0·75 mg kg−1. A value of 0·50 for the C-index indicates that a predictive model has no discriminative power (Waegeman et al., 2008), and most of the other values were substantially greater than 0·50.

The proportional odds model, which can be easily implemented, resulted in a performance that was comparable to that of the more complex SVOR model when the selected predictor variables were sufficiently uncorrelated. The main objective in developing a useful predictive model may be timely and accurate predictions, but this must be accomplished by means of the simplest technique, using inexpensive and easily acquirable data. If the data are too expensive to acquire, or if the modelling technique is too complicated, it remains unlikely that the predictive model will be widely deployed or accepted as a predictive tool at the farm level.

As mentioned above, it remains difficult to compare the performance of metric and ordinal regression models, because the MSE calculated in an ordinal regression setting cannot be compared directly with the MSE calculated on continuous values. Therefore, to compare the performance of the metric and ordinal regression approaches, the continuous values obtained in the metric regression approach were transformed into ordinal categories before the accuracy, MSE, MAE and C-indices were calculated. It was clear that the ordinal models outperformed the metric regression models, because metric regression models are not designed to predict ordinal responses. Furthermore, an ordinal regression model yields a probability that the DON value is in a certain category, which is more practically useful and more easily interpreted by growers and the authorities, as they are interested in the probability that the DON values exceed certain thresholds.
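How a proportional odds model turns a linear predictor into class and exceedance probabilities can be sketched as follows. The sketch assumes the common parametrization P(Y ≤ k | x) = σ(θ_k − w·x) with a logistic link; the cut-points and the value of the linear predictor are hypothetical, and the fitted models of this study are not reproduced here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def class_probabilities(linear_predictor, cutpoints):
    """P(Y = k) for k = 1..K, given K-1 increasing cut-points theta_k.

    Assumed parametrization: P(Y <= k) = sigmoid(theta_k - linear_predictor).
    """
    cumulative = [sigmoid(theta - linear_predictor) for theta in cutpoints]
    cumulative.append(1.0)  # P(Y <= K) = 1
    probabilities = []
    previous = 0.0
    for c in cumulative:
        probabilities.append(c - previous)
        previous = c
    return probabilities

def exceedance_probability(linear_predictor, cutpoint):
    """P(Y > k): probability that the class boundary theta_k is exceeded."""
    return 1.0 - sigmoid(cutpoint - linear_predictor)

# Hypothetical cut-points for a four-class model and one field's predictor.
cutpoints = [-1.0, 0.0, 1.0]
eta = 0.0
probs = class_probabilities(eta, cutpoints)
p_exceed_second = exceedance_probability(eta, cutpoints[1])
```

The exceedance probability for a given class boundary is the quantity that corresponds most directly to the question of whether a legal or practical DON threshold will be exceeded.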

Finally, the last step of the modelling process is building a single predictive model based on the cross-validation results. However, this step is not discussed here, because it does not bring any added value with regard to comparing different methods. From that perspective, the methodology does not differ from typical predictive modelling papers, where cross-validation is commonly used for comparing different methods. As such, this paper does not compare different models, but different modelling techniques, a subtle difference. However, after comparing different methods, a final model can be easily constructed by using the whole data set as training data. The ultimate goal then consists of delivering accurate predictions for future data. Cross-validation has been applied to simulate such a scenario in a reliable manner.

Acknowledgements

WW was supported as a postdoc by the Research Foundation-Flanders (FWO Flanders). This work was financially supported by the Agency for Innovation by Science and Technology, project 70575 (IWT, Brussels, Belgium). The authors thank the ACG Trial Network for providing the wheat samples.
