Regression Models for Order-of-Addition Experiments

The purpose of order-of-addition (OofA) experiments is to identify the best order in a sequence of m components in a system or treatment. Such experiments may be analysed by various regression models, the most popular ones being based on pairwise ordering (PWO) factors or on component-position (CP) factors. This paper reviews these models and extensions and proposes a new class of models based on response surface (RS) regression using component position numbers as predictor variables. Using two published examples, it is shown that RS models can be quite competitive. In the case of model uncertainty, we advocate the use of model averaging for analysis. The averaging idea leads naturally to a design approach based on a compound optimality criterion assigning weights to each candidate model.


Introduction
The purpose of order-of-addition (OofA) experiments is to identify the best order in a sequence of m components in a system or treatment (Van Nostrand 1995). For example, m herbicides may need to be combined to obtain a herbicide mixture targeting a range of weed species and it is not clear in which order the mixture components should be added to the tank (Mee 2020). Another example is a medical treatment involving m drugs, and the optimal order of administration needs to be determined (Table 1; Yang et al. 2020).
[Table 1 about here]

With m components, there are m! possible orders (see Table 1). The full design comprises one run for each possible order. When m is large, only a subset of the possible orders may be tested. With smaller m, replication of some or all orders may be feasible, thus providing an independent estimate of error. The choice of design for OofA experiments has received considerable attention recently. Most authors focus on D-optimal designs, primarily based on combinatorics for design generation, e.g. using block designs or component orthogonal arrays (Yang et al. 2020; Zhao et al. 2020; Huang 2021), whereas other authors have proposed purely numerical search strategies (Voelkel 2019; Mee 2020; Winkler et al. 2020). There are currently two popular basic models for analysis, i.e., a model using pairwise ordering (PWO) factors (Van Nostrand 1995) and a model using component-position (CP) factors, whose parameters $\tau_{cj}$ are subject to linear constraints (Yang et al. 2020). Which model is preferable will depend on the application. We conjecture, however, that for either model it will be beneficial to have components well spread out across positions over the runs of a design so that the optimal position of each component can be determined.
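For concreteness, the following minimal Python sketch (ours, for illustration only; not part of the original analysis) enumerates the full design for m = 3:

```python
# Enumerate the full OofA design: one run per possible order of m components.
from itertools import permutations

m = 3
orders = list(permutations(range(1, m + 1)))  # all m! = 6 orders for m = 3
for run, order in enumerate(orders, start=1):
    print(run, order)
```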
The assumptions underlying the PWO and CP models are certainly plausible in principle, but they may not always be the best possible options. For use of OofA experiments in practice, it is useful to have a number of alternative models for analysis (Buckland et al. 1997). The purpose of this paper is to consider such alternative models and illustrate them using examples. The focus will be on regression models that use the position of components to define regressor variables. In particular, we will propose response-surface (RS) regression models, which can be thought of as accounting for both relative and absolute positions of the components at the same time. The general picture emerging from our review is that there is potentially quite a substantial number of regression models. Hence, we advocate model averaging (Buckland et al. 1997) as a viable analysis option. In the discussion, we will also consider the implications for design.

A second-order response surface model
The CP model assumes that the effect of a component c depends solely on its absolute position. It should be pointed out that a regression on the absolute positions $p_c$ ($q_c$) of the components can be considered as also accounting for relative position. This is because the m absolute positions $q_c$ can be re-expressed by the absolute position of one common reference and $m-1$ distances relative to that reference. For example, taking the m-th component as the reference, we may replace $q_c$ by $q_m + \delta_{cm}$, where $\delta_{cm} = q_c - q_m$. If this replacement is made in (10), we obtain a second-order model in the distances $\delta_{cm}$, meaning that the model allows for an optimal position for each $c < m$ relative to the reference component m.
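For illustration, the following Python sketch (ours; variable names and the simulated response are illustrative assumptions, not the paper's data) fits a second-order response-surface model in the spirit of model (10):

```python
# A sketch of a second-order RS model: regress the response on component
# position numbers p_c, their squares, and pairwise products, dropping p_m
# because the m positions always sum to m(m+1)/2.
import numpy as np
from itertools import combinations, permutations

def rs_design_matrix(orders, m):
    """One row per run: intercept, p_c, p_c^2 and p_c*p_d for c, d < m."""
    rows = []
    for order in orders:
        p = [order.index(c) + 1 for c in range(1, m + 1)]  # position of component c
        x = [1.0]
        x += [float(p[c]) for c in range(m - 1)]                       # linear
        x += [float(p[c] ** 2) for c in range(m - 1)]                  # quadratic
        x += [float(p[c] * p[d]) for c, d in combinations(range(m - 1), 2)]
        rows.append(x)
    return np.array(rows)

m = 4
orders = list(permutations(range(1, m + 1)))
X = rs_design_matrix(orders, m)
y = np.random.default_rng(1).normal(size=len(orders))  # placeholder response
beta, *_ = np.linalg.lstsq(X, y, rcond=None)           # OLS fit
best = orders[int(np.argmax(X @ beta))]                # order with largest prediction
```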

Modifications of the PWO model: A nearest neighbour model and a tapered-effects model
With small m, the tapering function $h(z)$ can take only a few distinct values, so it will not provide much improvement of the tPWO model over the PWO model (1).
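To make the coding concrete, a short Python sketch (ours) of PWO and tapered-PWO factors follows; the reciprocal taper is an illustrative choice only, not necessarily the $h(z)$ of Peng et al. (2019):

```python
# For each pair (j, k), the PWO factor is +1 if j precedes k in a run and
# -1 otherwise; the tPWO variant scales this by a tapering function h of
# the positional gap between the two components.
from itertools import combinations

def pwo_factors(order, taper=None):
    """Return {(j, k): coded factor} for all component pairs j < k."""
    pos = {c: i + 1 for i, c in enumerate(order)}      # component -> position
    out = {}
    for j, k in combinations(sorted(order), 2):
        sign = 1.0 if pos[j] < pos[k] else -1.0
        out[(j, k)] = sign * (taper(abs(pos[j] - pos[k])) if taper else 1.0)
    return out

print(pwo_factors((2, 1, 3)))                           # plain PWO coding
print(pwo_factors((2, 1, 3), taper=lambda d: 1.0 / d))  # tapered coding
```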

Criteria for judging the predictive accuracy of a model for a given design
When m is large (m ≥ 10, say), even identifying the best order(s) from a fitted model may be computationally far from trivial, and the best strategy may well depend on the kind of model. In this paper, we will focus on the situation when m is small (m < 7, say).
Then an obvious strategy is to predict the response for all possible orders and then perform all pairwise comparisons between the predictions in order to find the best one.
One may simply use the order with the best prediction. Alternatively, a subset of best solutions could be identified, and the choice substantiated by significance tests.
Specifically, Hsu's multiple comparison with the best (Hsu 1996, p.81) could be used to account for the multiple testing problem involved in identifying this subset. This approach is completely general and would work for any regression model. But it will become infeasible even for modest m, e.g. m = 10, where m! = 3,628,800. Even when predictions are not actually computed for all possible orders, however, identification of the best order still comes down, at least implicitly, to comparing the predictions of all possible orders.
Thus, regardless of the value of m, in judging the efficiency of a design, it is appropriate to consider the predictions for all m! orders. Let the linear predictor for a design be given by $X\boldsymbol{\beta}$, where $\boldsymbol{\beta}$ is the parameter vector, and let the full set of m! orders be represented by the matrix $X_f$. Thus, we are aiming to predict $X_f\boldsymbol{\beta}$. From this, we may obtain the average pairwise variance of predictions as (Piepho 2019)

$$\mathrm{apv} = \binom{n_f}{2}^{-1} \sum_{i<j} \mathrm{var}\left\{(\boldsymbol{x}_i - \boldsymbol{x}_j)^\top \hat{\boldsymbol{\beta}}\right\} = \sigma^2 \operatorname{tr}\left\{(X^\top X)^{-1} D_f\right\}, \quad (12)$$

where $\boldsymbol{x}_i^\top$ is the i-th row of $X_f$, $n_f = m!$, and $D_f = \binom{n_f}{2}^{-1} \sum_{i<j} (\boldsymbol{x}_i - \boldsymbol{x}_j)(\boldsymbol{x}_i - \boldsymbol{x}_j)^\top$. When apv is to be used in design search, we may set $\sigma^2 = 1$, and the matrix $D_f$ needs to be computed only once. We note that the apv bears some resemblance to the I-optimality criterion (Goos et al. 2016), sometimes also denoted as V-optimality (Atkinson et al. 2007, p.143), which focuses on the average variance (av) of predictions over the experimental region. For the full set of m! orders in an OofA experiment this equals

$$\mathrm{av} = n_f^{-1} \sigma^2 \operatorname{tr}\left\{X_f (X^\top X)^{-1} X_f^\top\right\},$$

whereas (12) is based on pairwise differences of predictions. For av it is advisable to use an orthogonal coding of $X$, as is done in OPTEX when the option CODING=ORTH is used (Atkinson et al. 2007, p.188).
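The trace form of (12) makes apv cheap to evaluate repeatedly in a design search, since $D_f$ is fixed. A small Python sketch (ours; the PWO coding is used purely for illustration):

```python
# Compute apv (eq. 12) via apv = tr{(X'X)^{-1} D_f} with sigma^2 = 1, where
# D_f is the average cross-product of all pairwise row differences of X_f
# and hence needs to be computed only once per model.
import numpy as np
from itertools import combinations, permutations

def pwo_row(order):
    pos = {c: i for i, c in enumerate(order)}
    return [1.0] + [1.0 if pos[j] < pos[k] else -1.0
                    for j, k in combinations(sorted(order), 2)]

def pairwise_diff_matrix(X_f):
    """D_f = average of (x_i - x_j)(x_i - x_j)' over all pairs i < j."""
    n_f, s = X_f.shape[0], X_f.sum(axis=0)
    return (n_f * X_f.T @ X_f - np.outer(s, s)) / (n_f * (n_f - 1) / 2)

def apv(X_design, D_f):
    return float(np.trace(np.linalg.inv(X_design.T @ X_design) @ D_f))

m = 4
X_f = np.array([pwo_row(o) for o in permutations(range(1, m + 1))])
D_f = pairwise_diff_matrix(X_f)
print(apv(X_f, D_f))  # apv of the full 24-run design; a fraction's rows
                      # would be passed as X_design instead
```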

Model averaging
Model choice may not be obvious, in which case model averaging (Buckland et al. 1997; Burnham and Anderson 2002, p.450) is a good option for obtaining predictions of $X_f\boldsymbol{\beta}$. A model-averaged prediction is given by (Buckland et al. 1997)

$$\hat{\boldsymbol{\mu}}_f = \sum_{k=1}^{K} w_k \hat{\boldsymbol{\mu}}_{f,k}, \quad (14)$$

where $\hat{\boldsymbol{\mu}}_{f,k}$ is the prediction based on the k-th model $g_k$ ($k = 1, \ldots, K$) and $w_k$ is its Akaike weight,

$$w_k = \frac{\exp(-I_k/2)}{\sum_{k'=1}^{K} \exp(-I_{k'}/2)}, \quad (15)$$

where $I_k$ denotes an information criterion such as the Akaike Information Criterion (AIC) for the k-th model. The variance of the model-averaged prediction of the i-th order of application can be estimated from (Burnham and Anderson 2002, p.450)

$$\widehat{\mathrm{var}}(\hat{\mu}_{f,i}) = \left[\sum_{k=1}^{K} w_k \sqrt{\widehat{\mathrm{var}}(\hat{\mu}_{f,i,k}) + (\hat{\mu}_{f,i,k} - \hat{\mu}_{f,i})^2}\right]^2. \quad (16)$$

For alternatives to estimate both the weights and the variance of predictions, see Buckland et al. (1997). To evaluate the overall performance of the prediction, (16) may be averaged across the complete set of application orders.
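A short sketch (ours) of eqs. (14)-(16) in Python; all numbers below are placeholders, not results from the paper:

```python
# Akaike weights, the model-averaged prediction of one order, and the
# Burnham-Anderson variance estimate.
import numpy as np

aic = np.array([100.2, 101.1, 105.7])   # I_k for K = 3 candidate models
delta = aic - aic.min()                 # shifting all I_k leaves w_k unchanged
w = np.exp(-delta / 2) / np.exp(-delta / 2).sum()                 # eq. (15)

mu_k = np.array([5.1, 4.8, 5.6])        # per-model predictions of order i
var_k = np.array([0.20, 0.25, 0.30])    # per-model prediction variances

mu_avg = np.sum(w * mu_k)                                         # eq. (14)
var_avg = np.sum(w * np.sqrt(var_k + (mu_k - mu_avg) ** 2)) ** 2  # eq. (16)
print(w, mu_avg, var_avg)
```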

Examples
In Figure 1, the second-order response surface model (10) is illustrated for m = 3 using the data of Table 1. Circles correspond to design points; the circle filled in red marks the component order with the largest predicted response.
[Figure 1 about here]

We further consider the data for N = 24 runs and m = 4 anti-tumor drugs in Table 3 of Yang et al. (2020). The fitted models are reported in Table 2. The RS model yields the best fit in terms of the root mean squared error (RMSE) and the Akaike Information Criterion (AIC) (Burnham and Anderson 2002), whereas the PWO and tPWO models have a smaller apv. The model-averaged predictions for the ten best combinations are shown in Table 3. The RS model has the largest Akaike weight (w_k = 0.410), closely followed by the tPWO model (w_k = 0.376) (Table 2). In this example, it would be hard to choose between these two best-fitting models, so the model-averaged inference is very apt. The ranking of the model-averaged predictions for the ten best combinations agrees rather well with those obtained for the individual models, though there are some rank changes (Table 3).

[Tables 2 and 3 about here]
A third example is the data given in Table 4. The results in Table 4 show that the RS model fits best, rather better than the second-best CP model. The very large Akaike weight (w_k = 0.99996) for the RS model means that it dominates the model-averaged prediction in this example. The rankings of the model-averaged prediction and of the prediction based on the RS model, which agree perfectly, are rather different from those based on the other models (Table 5).
[Tables 4 and 5 about here]

Model selection
Our review has revealed that there are several candidate regression models. This raises the question of which model should be used for analysis. The answer will very much depend on the application. If the best suitable model is known a priori for the application at hand, both the design and the analysis may proceed considering just that model. But what if model choice is not clear a priori? As regards analysis, one obvious option is to fit all candidate models and pick the best one based on some standard criterion such as an information criterion (Burnham and Anderson 2002) or apv. A better strategy to deal with model-selection uncertainty is model averaging (Buckland et al. 1997), as illustrated briefly in Section 2.5.
The set of candidate models can be expanded in many ways. For example, each of the regression models considered in our review may be extended or modified. Mee (2020) suggested extending the PWO model by including interaction terms, e.g. for triplets of components applied in sequence. The tPWO model of Peng et al. (2019) gives rise to a whole family of models depending on the choice of the tapering function $h(z)$. It may be argued that there is a difference between the components preceding a position and those following a position, calling for a directional version of $h(z)$. Conversely, our NN model (11) assumes that direction matters for the neighbour effects. This could be simplified by assuming a non-directional NN model. We do not want to expand this list of examples further, but merely re-iterate that there are potentially many candidate models.
A further option to expand the candidate set substantially is variable selection, e.g. by stepwise regression. This has been considered by various authors, e.g., Mee (2020) and Yang et al. (2020). With some proposed models, such as when higher-order interaction terms are included (Mee 2020), variable selection is a necessity when the number of runs is limited. Again, rather than settling for a single final model, which is a decision that is always subject to uncertainty and entails the risk of over- or under-fitting (Burnham and Anderson 2002; Heinze et al. 2018), model averaging may be a better strategy. It also seems prudent to limit the candidate set by making best use of prior knowledge and focusing on the more parsimonious models.

Implications for design generation
There is a growing body of literature on optimal design for OofA experiments, much of which focuses on a single model, which is the obvious strategy when the best model is known with certainty a priori. For example, Winkler et al. (2020) and Chen et al. (2020) focus on the PWO model, while Huang (2021) focuses on the CP model. Yang et al. (2020) and Zhao et al. (2020) consider both of these models in finding a design, first constructing component orthogonal arrays. These designs are globally optimal under the CP model. Then, among these designs, they select the best designs in terms of D-optimality for the PWO model. If there is uncertainty as regards the best model for analysis and there is a larger set of candidate models, however, it may not be sensible to optimize the design for particular models. Instead, a compound criterion could be optimized that averages an optimality criterion across a select set of candidate models. This idea has an obvious counterpart in analysis, i.e., model averaging (Buckland et al. 1997) (see Sections 2.4 and 2.5).
Thus, for design generation, it seems reasonable to consider two quite different scenarios: (i) the most suitable model is known a priori; (ii) the best model is not known and a set of candidate models will be considered for analysis. In scenario (i), the obvious design strategy is to find an optimal design, e.g. one that minimizes apv, for the known best model. In scenario (ii), a compound criterion covering the set of candidate models may be used to find an optimal design, e.g.

$$\Phi = \sum_{k=1}^{K} a_k \Phi_k,$$

where $\Phi_k$ is the design criterion under the k-th candidate model and $a_k$, $k = 1, \ldots, K$, is a set of non-negative weights (Atkinson et al. 2007, p.144); a small code sketch is given at the end of this section. D-optimal designs (Dykstra 1971) also usually do quite well for other optimality criteria, including I-optimality (Atkinson et al. 2007, p.153); Donev and Atkinson (1988) showed this for response surface designs.

Abbreviations used in the tables: CP = component position; RS = response surface; NN = nearest neighbour; RS-3 = third-order response surface model (A6). $ Model-averaged prediction using Akaike weights (see eq. 14); the fitted model comprises a dummy for the two component orthogonal arrays of the design (Table 3) or for the two batches of the design (Table 5).

Figure 1. Contour plot for the fit of model (10) plotted in the (p_1, p_2)-plane for the data in Table 1.
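The compound criterion referred to above may be sketched as follows (ours; the weights a_k and the model matrices are user-supplied assumptions, and apv is used as the per-model criterion purely for illustration):

```python
# A compound design criterion: a weighted average of a per-model criterion
# (here apv with sigma^2 = 1) over K candidate models.
import numpy as np
from itertools import combinations

def apv(X_design, X_full):
    """Average pairwise variance of predictions over all rows of X_full."""
    A = np.linalg.inv(X_design.T @ X_design)
    pairs = list(combinations(range(X_full.shape[0]), 2))
    return sum(float((X_full[i] - X_full[j]) @ A @ (X_full[i] - X_full[j]))
               for i, j in pairs) / len(pairs)

def compound_apv(models, a):
    """models: list of (X_design, X_full) pairs, one per candidate model,
    all built from the same runs; a: non-negative weights a_k."""
    return sum(a_k * apv(Xd, Xf) for a_k, (Xd, Xf) in zip(a, models))
```

A design search would then, for example, exchange runs of the candidate design so as to minimize compound_apv, in analogy to standard point-exchange algorithms for single-model criteria.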