Data‐driven storage operations: Cross‐commodity backtest and structured policies

Storage assets are critical for physical trading of commodities under volatile prices. State‐of‐the‐art methods for managing storage facilities such as the reoptimization heuristic (RH), which are part of commercial software, approximate a Markov Decision Process (MDP) assuming full information regarding the state and the stochastic commodity price process and hence suffer from informational inconsistencies with observed price data and structural inconsistencies with the true optimal policy, which are both components of generalization error. Focusing on spot trades, we find via an extensive backtest that this error can lead to significantly suboptimal RH policies. We develop a forward‐looking data‐driven approach (DDA) to learn policies and reduce generalization error. This approach extends standard (backward‐looking) DDA in two ways: (i) It represents historical and estimated future profits as functions of features in the training objective, which typically includes only past profits; and (ii) it enforces structural properties of the optimal policy. To elaborate, DDA trains parameters of bang‐bang and base‐stock policies, respectively, using linear‐ and mixed‐integer programs, thereby extending known DDAs that parameterize decisions as functions of features without policy structure. We backtest the performance of RH and DDA on six major commodities, employing feature selection across data from Reuters, Bloomberg, and other public data sets. DDA can improve RH on real data, with policy structure needed to realize this improvement. Our research advances the state‐of‐the‐art for storage operations and can be extended beyond spot trading to handle generalization error when also including forward trades.

Production and Operations Management

Commodity storage assets have limited capacity and constraints on the rates of injection and withdrawal. Maximizing the profit from operating storage requires adapting the timing of constrained injections and withdrawals to the movement of uncertain commodity spot prices. The related optimization of storage operations can be approached using a Markov decision process (MDP) that contains in its state the on-hand inventory (i.e., the endogenous state) and multiple factors (i.e., the exogenous state) of a Markovian stochastic process describing the evolution of spot prices. The extant storage literature formulates this MDP assuming that the stochastic process is known. A common choice for the exogenous state is a vector of futures contract prices (Lai et al., 2010), since they are available for commodities with futures markets and the expected spot price equals the futures price under the risk-neutral measure.
Storage MDPs with realistic price dynamics are high-dimensional and thus intractable to solve directly. Least-squares Monte Carlo (LSM) and reoptimization heuristics (RH) are state-of-the-art approaches (Breslin et al., 2008, 2009; Gray & Khandelwal, 2004a, 2004b) for approximating the aforementioned intractable MDP and are part of commercial storage management software (Energy Quants, 2018; Kyos, 2018; Lacima, 2018; MathWorks, 2018). LSM computes a parametric approximation of the MDP value function using backward induction and regression, which is then used to compute storage decisions (Nadarajah et al., 2015). RH obtains storage decisions at a given stage and state by solving a deterministic linear program, referred to as an intrinsic linear program (ILP), which is based on futures prices available at the current time. For RH, the ILP is reoptimized at each stage after accounting for updated futures price information (Lai et al., 2010; Secomandi, 2015). An advantage of RH over LSM is that its inputs are agnostic to the assumed commodity price process in the MDP. Moreover, in the context of natural gas, RH storage policies have been shown to be near-optimal in computational studies that assume full information about the storage MDP, that is, in a setting where the exogenous state composition and the stochastic process describing the evolution of this state are known and exact (e.g., Lai et al., 2010; Nadarajah & Secomandi, 2018; Secomandi, 2010, 2015; Wu et al., 2012). Empirical studies investigating the performance of RH for managing the storage of commodities other than natural gas are scant.
Focusing on spot trades, we perform an extensive backtest of the RH policy on price data across six commodities (copper, gold, crude oil, natural gas, corn, and soybean) from Thomson Reuters over the period 2000-2017. The goal of this backtest is to understand the true performance of RH by applying its decisions on a historical sample path of prices and benchmarking the resulting profit against the value of an optimal perfect foresight solution on this price path, which is optimistic but immune to assumptions implicit in the MDP. We observe that several insights regarding the performance of RH change fundamentally, as explained next:
• RH yields smaller profits than ILP on 37.0% of our commodity backtest instances, which suggests that the value of reoptimization can be negative, deviating from the substantial positive value of reoptimization reported for the full-information problem (Secomandi, 2015).
• A one-period look-ahead policy leads to higher profits than RH on several instances; that is, ignoring futures price information may be beneficial. This result differs from the literature on forecast horizons in the full-information setting, which argues that far-ahead futures price information does not affect optimal first-stage decisions (Cruise et al., 2019).
• RH yields an average value of 11.0% of the perfect foresight solution, which calls into question whether the near-optimality of RH in the full-information setting extends to real data.
We rationalize the aforementioned stark differences using the train-test paradigm of machine learning (ML). Specifically, existing performance evaluations of RH (as well as other methods such as LSM) in the storage literature both compute policy parameters/decisions (i.e., train the policy) and test the performance of these decisions under the full-information setting, which the actual data may not satisfy.
Informally speaking, the performance difference between the training environment (i.e., the full-information setting) and the testing environment is referred to as generalization error. Our backtest suggests this error may be significant when employing the RH policy. Motivated by the above observations, we take an ML approach to target the reduction of generalization error and learn storage policies. To this end, we relax the full-information assumption and formulate a feature-based storage stochastic dynamic program (F-SDP) where the exogenous state is represented by a generic set of features that evolve according to an unknown stochastic process. We then develop a data-driven approach (DDA) to tackle F-SDP that extends existing approaches in two key ways. First, it is forward-looking and uses financial-market features (e.g., futures prices) to include future estimates of profit in the training objective, in addition to the historical profits considered by standard DDAs (see, e.g., Bertsimas & Kallus, 2020). Second, it allows one to enforce structural properties of an optimal F-SDP policy when computing data-driven policies. Within this framework, we begin by considering standard linear decision rules (DDA-LDRs) from the literature (see, e.g., Ban & Rudin, 2019) that specify decisions as a linear parameterization of random variables. The DDA-LDR parameters are trained using the empirical risk minimization (ERM) framework (Friedman et al., 2001), which involves solving a regularized convex program. DDA-LDRs do not encode any structure of the F-SDP optimal policy, and thus training them in our forward-looking approach allows us to understand the value of such future information alone, without considering the impact of policy structure. We subsequently propose structured data-driven policies (DDA-SPs) that encode bang-bang and double base-stock structures shared by the F-SDP optimal policy for storage assets with different operating characteristics. In contrast to DDA-LDRs, DDA-SPs are
parameterized by coefficients of price thresholds or base-stock levels, which are trained using linear and mixed-integer programming. We discuss how the regularized training procedure and the policy structure used when computing DDA-SP make it robust to price uncertainty (i.e., it accounts for downside risk) and estimation error, respectively.
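The train-on-data idea behind ERM can be illustrated with a deliberately small sketch. The paper trains feature-dependent thresholds and base-stock levels via linear and mixed-integer programs with regularization; the miniature below instead fits a single constant price threshold for a fast asset by grid search over historical price paths, and all prices and candidate values are hypothetical.

```python
# Toy ERM: choose the bang-bang price threshold that maximizes average
# profit on observed (training) price paths. This is only an illustration
# of empirical risk minimization, not the paper's training program.

def path_profit(threshold, prices, C=1.0):
    # Bang-bang rule: fill when the price is below the threshold, empty
    # otherwise. Terminal inventory is not valued; frictions are ignored.
    I, profit = 0.0, 0.0
    for p in prices:
        if p < threshold:
            profit -= p * (C - I)   # inject up to capacity C
            I = C
        else:
            profit += p * I         # withdraw everything
            I = 0.0
    return profit

def train_threshold(paths, candidates):
    # ERM over a finite candidate set: best average historical profit.
    return max(candidates,
               key=lambda th: sum(path_profit(th, p) for p in paths) / len(paths))
```

A threshold trained this way is interpretable (buy below, sell above) and can be evaluated out-of-sample exactly as in the backtests described later.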
We perform a backtest of our DDA approaches across the same six commodities used in our RH backtest. As candidate features, we consider spot and futures prices from Thomson Reuters, analyst forecasts of spot prices from Bloomberg, temperature, the S&P 500 index, and the Trade Weighted U.S. Dollar index. Feature selection reveals several practical insights. First, in the absence of futures prices and analyst forecasts as features, adding the S&P 500 and Trade Weighted U.S. Dollar indices can improve profits compared to using only spot prices. This finding is relevant when futures markets or analyst forecasts are absent, which is the case for commodities such as asphalt and specific types of polyethylene. Second, while futures prices have large errors when treated as forecasts of spot prices, their inclusion on top of spot prices can improve storage profits. Third, embedding analyst forecasts further enhances storage profits by 7%; that is, spot and futures prices along with these forecasts lead to median profits that are undominated by other feature combinations, which is consistent with the hypothesis that futures prices and analyst forecasts account for factors that affect prices. We thus use this feature combination for our performance analysis.
The median profits of DDA-LDR range between 2.1% and 2.4% (of the perfect foresight value) for different feature choices, while the RH median profit is 12%. That is, despite being a data-driven policy, DDA-LDR performs even worse than RH. In contrast, we observe that DDA-SP generates median profits between 15.7% and 26.7% and also improves on the 25th percentile of profits (i.e., the downside). Hence, both the regularization in training and the structure encoded in these data-driven policies can help to improve on the RH profits as well as the downside risk profile of the profit distribution. In addition, the DDA-SP median profits change significantly when forward-looking profits based on futures prices are considered during the training procedure, indicating that our extension of existing DDA approaches can add value. The differences between the DDA-SP and RH policies are 35.9%, 19.8%, 11.2%, −4%, 19.5%, and 5%, respectively, on the copper, gold, crude oil, natural gas, corn, and soybean instances. The performance of RH and DDA-SP thus varies significantly across commodities, with DDA-SP exhibiting good overall performance and RH remaining a strong contender, in particular for natural gas.
Our findings advance the state-of-the-art for commodity storage operations. The extended DDA and structured policies highlight potential opportunities to enhance storage software by considering generalization error. In particular, although we apply our framework to spot trading, it can be extended to evaluate and handle generalization error when combined with forward trading.

Related work and novelty
Our models, methods, and findings extend the literature on commodity storage, data-driven optimization, and commodity finance as discussed below.
The literature on commodity storage dates back to the warehouse management problem introduced by Cahn (1948) and further studied by Charnes and Cooper (1955), Bellman (1956), and Dreyfus (1957). The storage assets in these very early papers were managed under deterministic prices. Charnes et al. (1966) and Secomandi (2010) consider the stochastic version of the storage problem without and with rate constraints, respectively, and characterize the optimal policy. Significant recent effort has gone toward using approximate dynamic programming techniques to find near-optimal policies to the intractable storage SDP in the full-information setting (Cruise et al., 2019; Lai et al., 2010; Nadarajah et al., 2015; Nadarajah & Secomandi, 2018; Nascimento & Powell, 2008; Wu et al., 2012). Secomandi et al. (2015) consider the impact of choosing an incorrect number of factors in a prespecified price model on storage valuation and hedging. They term this price-model error. Secomandi (2015) and Nadarajah and Secomandi (2018) argue, in single and network storage settings, respectively, that the RH policy is price-model-error-free as it uses only market futures prices as input. However, they do not analyze the impact of futures prices providing poor forecasts of the spot price, as they work under the risk-neutral measure where the expected spot price equals the futures price. In summary, the extant storage literature has not empirically studied the impact of generalization error on the storage operating policy or developed data-driven operating policies that target this error. Our backtest of RH, our development of DDA approaches that leverage known policy characterizations, and the related empirical insights are novel to this literature. Moreover, our use of regularization and policy structure provides an ML- and optimization-inspired view of managing storage operations, which is relevant beyond this setting to other real options involving commodities such as soybean, corn, and palm (Boyabatlı et al., 2017; Devalkar et al., 2011, 2018; Goel & Tanrisever, 2017) and energy (Nadarajah & Secomandi, 2021).
Our work builds on methodological work from empirical optimization (Bartlett & Mendelson, 2006; Esfahani et al., 2018) and the emerging data-driven optimization literature (see, e.g., Ban et al., 2018; Bertsimas & Kallus, 2020; Curtis & Scheinberg, 2017; Elmachtoub & Grigas, 2022), which addresses generalization error by explicitly focusing on out-of-sample performance. Strictly speaking, our paper belongs to the growing literature (e.g., Ban & Rudin, 2019; Chenreddy et al., 2019; Mandl & Minner, 2020) that empirically tests the value of data-driven optimization in operations management problems. Data-driven optimization has been applied to single-period inventory control or newsvendor applications (Ban & Rudin, 2019) and in multiperiod settings using linear or piecewise-linear decision rule approximations (Ben-Tal et al., 2005; See & Sim, 2010), for instance, for financial contracting (Mandl & Minner, 2020). In a marketing setting, Chenreddy et al. (2019) combine ERM with polynomial approximations and inverse reinforcement learning. The forward-looking DDA that we propose extends the backward-looking DDAs in this literature. While linear decision rules are known, our assessment of their performance for commodity storage, especially when trained using estimates of future profits, is new. Our structured data-driven policy and the evaluation of the value of enforcing policy structure are both novel. In addition, the parameters of the structured policies that we train are thresholds, which are easily interpretable by managers. Our models for training these data-driven policies add to the literature on interpretable ML, an area that has studied several applications ranging from classification to healthcare (see Lakkaraju & Rudin, 2017, and references therein) but none that share the structure of the commodity storage application. More broadly, our empirical finding that enforcing policy structure can improve the out-of-sample performance of data-driven policies is relevant for other operations management problems where characterizations of the optimal policy structure are known.
Finally, our results contribute to recent work in commodity finance that brings to light the value of features for price prediction (Alquist & Kilian, 2010; Cortazar et al., 2018; Heath, 2019). These papers emphasize the importance of the true distribution of spot prices (as opposed to risk-neutral distributions), which is consistent with our focus. However, the aforementioned papers take a statistical view and do not focus on decision making, while we take an ML perspective and train operating policy parameters. Therefore, our comparison of DDA approaches and our feature selection in the context of storage decisions add novel components to this literature. Our forward-looking DDA shows how the presence of financial markets allows one to obtain future profit estimates that can be leveraged as part of the training objective. We also find that futures prices and analyst forecasts of spot prices are significantly valuable as features when training decision rules for storage. In particular, while futures prices may provide poor forecasts of spot prices, they nevertheless provide valuable information for training policies. This finding motivates further research on the differential impact of data on prediction versus decision making.

COMMODITY STORAGE OPERATIONS AND POLICY PERFORMANCE
In Section 2.1, we present a feature-based extension of the well-known storage SDP. In Section 2.2, we describe the statistical perspective used to evaluate storage policies in the literature and make a case for the value of using an ML perspective instead.

2.1 Feature-based storage MDP

We extend the (stochastic) commodity storage problem formulated by Charnes et al. (1966), Secomandi (2010), and Lai et al. (2010). Consider a single-item, multiperiod, discrete-time, periodic-review inventory replenishment problem at a single commodity storage asset (e.g., a warehouse) with a finite planning horizon T. Periods t = 0, 1, 2, …, T equal decision stages and might correspond to hours, days, weeks, or months. The storage asset state is described by I_t, the amount stored at the beginning of period t. I_t is bounded by the warehouse capacity C, that is, 0 ≤ I_t ≤ C. The holding cost per unit of time and unit of inventory is denoted by c^h_t ≥ 0. We denote by y^i_t ≥ 0 the period t injection quantity and by y^o_t ≥ 0 the period t withdrawal quantity. These decisions are subject to injection and withdrawal limits G^i and G^o, respectively. Storage operations have associated operational frictions. Specifically, injections and withdrawals incur marginal costs of c^i ≥ 0 and c^o ≥ 0, respectively, and have associated loss factors α^i ∈ (0, 1] and α^o ∈ (0, 1]. The commodity spot price in period t is denoted by p_t. The friction-adjusted purchase and selling prices are p^i_t := p_t/α^i + c^i and p^o_t := α^o p_t − c^o, where it is common to assume that storage losses are paid in-kind, that is, using a fraction of the physically traded commodity. Note that p^i_t ≥ p^o_t. As is standard in the merchant operations literature, we assume that the merchant is a price taker (i.e., injections and withdrawals do not affect the spot price). We also assume the merchant has access to the spot market only (physical trading, rather than financial trading via futures contracts).
Injection and withdrawal decisions at each period are conditioned on the information available to the merchant (i.e., the MDP state). Let S_t := {I_t, X_t} denote all information available to the merchant at the beginning of period 0 ≤ t ≤ T. The inventory level I_t is endogenous information, as past injections and withdrawals determine its value. The remaining component is a vector of N features (X_{t,n} ∈ X_n, n = 1, …, N), which is exogenous information and unaffected by storage operations. At period t, the distribution of the (random) spot price p_τ, τ > t, depends on X_t. Examples of features include current and past spot prices, prices of futures contracts, and analyst forecasts of spot prices. Given I_t at period t, the feasible injection and withdrawal set is Y_t(I_t) := {(y^i_t, y^o_t) : 0 ≤ y^i_t ≤ min{G^i, C − I_t}, 0 ≤ y^o_t ≤ min{G^o, I_t}}. Storage results in inventory I_t transitioning to I_{t+1} = I_t − y^o_t + y^i_t. Under nonzero marginal costs, it is easy to verify that it is suboptimal to inject and withdraw in the same period.
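The single-period mechanics above can be sketched in a few lines. The friction-adjusted price formulas follow one common in-kind-loss convention from the storage literature, and all parameter values in the test below are hypothetical.

```python
# A small numerical sketch of the storage mechanics: friction-adjusted
# prices, feasible action bounds, and the inventory transition.

def friction_adjusted_prices(p, c_i, c_o, a_i, a_o):
    # Purchase price per unit placed in storage and selling price per unit
    # removed, under marginal costs c_i, c_o and in-kind loss factors
    # a_i, a_o in (0, 1] (one common convention; an assumption here).
    p_in = p / a_i + c_i
    p_out = a_o * p - c_o
    assert p_in >= p_out            # frictions make round-trips costly
    return p_in, p_out

def feasible_actions(I, C, G_i, G_o):
    # Upper bounds on injection and withdrawal given inventory I,
    # capacity C, and rate limits G_i, G_o.
    return min(G_i, C - I), min(G_o, I)

def step(I, y_i, y_o, p_in, p_out, c_h):
    # One-period cash flow (holding cost charged on beginning inventory,
    # an assumed convention) and transition I_{t+1} = I_t - y_o + y_i.
    cash = p_out * y_o - p_in * y_i - c_h * I
    return I - y_o + y_i, cash
```

With zero losses and costs the purchase and selling prices coincide with the spot price, recovering the frictionless case.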

A storage operating policy π is a collection of decision rules {Y^π_t, t = 0, …, T}, where Y^π_t := (Y^{π,o}_t, Y^{π,i}_t) is a function that assigns a pair of withdrawal and injection decisions (y^o_t, y^i_t) to each state (I_t, X_t) at period t. Denoting by Π the set of all operating policies, the value of optimally managing storage starting from a state (I_t, X_t) at period t is
V_t(I_t, X_t) := max_{π∈Π} E[ Σ_{τ=t}^{T} ( p^o_τ Y^{π,o}_τ(I^π_τ, X_τ) − p^i_τ Y^{π,i}_τ(I^π_τ, X_τ) − c^h_τ I^π_τ ) | I_t, X_t ],
where V_t(I_t, X_t) is the value function at period t and state (I_t, X_t), I^π_τ is the inventory level reached at period τ when using policy π, and E is expectation with respect to the true (and potentially unknown) stochastic process driving the features. We suppress the discount factor without loss of generality as it can be factored into the prices and holding cost.
An optimal policy to the storage MDP can be sequentially computed using the following stochastic dynamic programming recursion:
V_t(I_t, X_t) = max_{(y^i_t, y^o_t) ∈ Y_t(I_t)} { p^o_t y^o_t − p^i_t y^i_t − c^h_t I_t + E[ V_{t+1}(I_t − y^o_t + y^i_t, X_{t+1}) | X_t ] },
∀t = 0, …, T − 1 and (I_t, X_t). Unlike the feature-based SDP presented here, the extant storage literature predefines the feature vector X_t and assumes a stochastic process for its evolution. A popular choice for X_t is the forward curve, that is, X_t = F_t := (f_{t,t}, f_{t,t+1}, …, f_{t,T}), where f_{t,t′}, t′ > t, is the time t price of a futures contract maturing at time t′ and f_{t,t} = p_t (see, e.g., Lai et al., 2010). The stochastic process driving these prices typically has multiple factors and satisfies E[p_{t′} | X_t] = f_{t,t′}. This is true in complete markets under the risk-neutral measure, where market participants may have different risk preferences but attribute a unique value to the asset. However, market incompleteness is common in commodity markets and this assumption may not hold. The structure of the optimal policy known in the commodity storage literature (see, e.g., Secomandi et al., 2015) extends to F-SDP, as stated in Proposition 1 under Assumption 1.
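To make the recursion concrete, here is a deliberately tiny instance: a two-state Markov price feature, inventory restricted to {0, 1} (a fast asset with C = 1), no frictions or holding costs, and a zero terminal value. All prices and transition probabilities are hypothetical.

```python
# Backward induction for a tiny feature-based storage SDP. The state is
# (inventory I, price state x); the expectation over the next feature
# state enters through the transition probabilities.

def solve_fsdp(prices, trans, T):
    # prices: dict state -> spot price; trans: dict state -> dict of
    # next-state probabilities. Returns V_0 over all (I, x) states.
    states = list(prices)
    V = {(I, x): 0.0 for I in (0, 1) for x in states}     # V_T = 0
    for _ in range(T):
        newV = {}
        for I in (0, 1):
            for x in states:
                best = float("-inf")
                for I2 in (0, 1):                          # next inventory
                    cash = prices[x] * (I - I2)            # sell if I2 < I
                    cont = sum(q * V[(I2, x2)]
                               for x2, q in trans[x].items())
                    best = max(best, cash + cont)
                newV[(I, x)] = best
        V = newV
    return V
```

With persistent price states, the recursion captures the option value of buying in the low state: starting empty in the low state has positive value, while starting empty in the high state does not.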
Assumption 1 (Bounded spot price expectation). Assume that for all stages t = 0, 1, …, T it holds that the conditional expectation of the spot price, E[p_t | X_{t′}], t′ ≤ t, is finite. (5)
Proposition 1 (Optimal policy structure). The following holds under Assumption 1:
(a) If the storage asset has no binding rate constraints (G^i, G^o ≥ C), then there is an optimal policy of F-SDP and price threshold functions P_t(X_t) such that at each period t and state (I_t, X_t), the optimal storage injection and withdrawal decisions satisfy: fill storage (y^i_t = C − I_t, y^o_t = 0) if P_t(X_t) > p^i_t; empty storage (y^i_t = 0, y^o_t = I_t) if P_t(X_t) < p^o_t; and do nothing otherwise. (6)
(b) Otherwise, there is an optimal policy of F-SDP and injection and withdrawal base-stock-level functions S^i_t(X_t) and S^o_t(X_t), respectively, with S^i_t(X_t) ≤ S^o_t(X_t), such that at each period t and state (I_t, X_t) the optimal storage injection and withdrawal decisions satisfy: y^i_t = min{G^i, S^i_t(X_t) − I_t} and y^o_t = 0 if I_t < S^i_t(X_t); y^i_t = 0 and y^o_t = min{G^o, I_t − S^o_t(X_t)} if I_t > S^o_t(X_t); and y^i_t = y^o_t = 0 otherwise. (7)
We omit the proof of Proposition 1 as it follows standard reasoning available in the literature. When Assumption 1 holds, the value function can be shown to be bounded following the arguments in Lemma B.1 of Nadarajah and Secomandi (2018). The remaining parts of the proof to establish policy structure mirror Lemma B.2 of Secomandi et al. (2015).
Proposition 1(a) summarizes the bang-bang structure of the optimal policy when the storage asset is fast, that is, it has full operational flexibility (FF) and no rate constraints. In this case, the optimal policy is based on the value taken by a state-dependent price threshold P_t(X_t) in relation to the friction-adjusted spot prices. Depending on the value of P_t(X_t), the optimal decision is to (i) fill storage, (ii) do nothing, or (iii) empty storage. The optimal policy for a slow storage asset with rate constraints, which we refer to as limited operational flexibility (LF), is shown in Proposition 1(b). Injection and withdrawal depend on comparing the inventory level with state-dependent injection and withdrawal base-stock levels. These decisions (i) fill up storage to the injection base-stock level, (ii) do nothing, or (iii) decrease inventory down to the withdrawal base-stock level.
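The two policy structures translate directly into short decision rules. The threshold and base-stock functions are taken as given here (the DDAs of this paper train them from features); their ordering convention (injection target below withdrawal target) and all numbers in the test are assumptions for illustration.

```python
# Illustrative decision rules for a fast (bang-bang) and a slow
# (double base-stock) storage asset; returns (injection, withdrawal).

def bang_bang(I, C, p_in, p_out, P):
    # Fast asset: fill if the threshold exceeds the purchase price,
    # empty if it is below the selling price, otherwise do nothing.
    if P > p_in:
        return C - I, 0.0           # inject up to capacity
    if P < p_out:
        return 0.0, I               # withdraw everything
    return 0.0, 0.0

def base_stock(I, G_i, G_o, S_i, S_o):
    # Slow asset: inject toward the lower target S_i, withdraw toward the
    # higher target S_o (assumes S_i <= S_o), subject to rate limits.
    if I < S_i:
        return min(G_i, S_i - I), 0.0
    if I > S_o:
        return 0.0, min(G_o, I - S_o)
    return 0.0, 0.0
```

Both rules are piecewise constant or piecewise linear in the inventory, which is what makes their parameters trainable by the linear and mixed-integer programs discussed later.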

2.2 Policy performance evaluation
Solving F-SDP directly is challenging since we do not have a feature representation X or knowledge of the stochastic process M driving its evolution. We refer to (X, M) as the feature-model pair. Even if the feature representation were known, the computational burden of solving F-SDP is prohibitive due to the well-known curses of dimensionality. Therefore, it is common to forgo finding an optimal policy and instead solve a tractable optimization model that approximates F-SDP and delivers a heuristic policy. Given such a heuristic policy π̂, its performance needs to be evaluated. We discuss the evaluation procedure used extensively in the merchant storage literature (and more broadly in stochastic optimization) and then present a data-inspired evaluation procedure, also highlighting its implications for methods that compute policies. This subsection forms the conceptual basis for our empirical results and methods in the remaining parts of the paper. The literature on storage operations evaluates the performance of a heuristic policy π̂ via simulation. Let V_π̂(p) be the value of applying the decisions of policy π̂ on the spot-price trajectory p := (p_t, t = 0, …, T). Further, we denote by π*_{X,M} an optimal policy to F-SDP formulated using (X, M). The goal is to evaluate the exact optimality gap
OPT_{X,M}(π̂) := E_{X,M}[V_{π*_{X,M}}(p)] − E_{X,M}[V_π̂(p)], (8)
where E_{X,M} is expectation with respect to the model M over feature trajectories under the feature representation X.
Since the value of the optimal policy in (8) is itself intractable to compute, it is replaced by an upper bound U_{X,M}(X_0); a common upper bounding approach is information relaxation and duality (see Brown et al., 2010, for details). The resulting optimality gap estimate is
OPT̂_{X,M}(π̂) := U_{X,M}(X_0) − E_{X,M}[V_π̂(p)]. (9)
The evaluation of the optimality gap in the literature is tied to the feature representation and stochastic model assumptions. This estimate of policy performance can be misleading if the assumed pair (X, M) is different from the true pair (X*, M*). We term this potential difference between assumed and true feature-model pairs information inconsistency. To illustrate, consider policies π̂_A and π̂_B evaluated using the exact optimality gap (8). Then it is possible that OPT_{X,M}(π̂_A) > OPT_{X,M}(π̂_B) while OPT_{X*,M*}(π̂_A) < OPT_{X*,M*}(π̂_B). Hence, the simulation-based performance ranking of these policies may differ from their ranking on real data due to information inconsistency.
Motivated by the above observation, we consider evaluating the performance of policies in a data-driven manner. Our starting point is the following definition of idealized generalization error used in reinforcement learning (see Murphy, 2005, section 4):
GE(π̂) := E_{X*,M*}[V_{π*_{X*,M*}}(p) − V_π̂(p)]. (10)
Intuitively, this definition is an assessment of how the notion of approximate optimality underlying the problem solved to obtain π̂ "generalizes" to handle the exact optimality associated with F-SDP formulated with (X*, M*), which gives rise to π*_{X*,M*}. While conceptually appealing, similar to the issue in (8), π*_{X*,M*} is unknown and hence so is the value V_{π*_{X*,M*}}(p). We thus replace this value by the perfect foresight value V_PF(p) obtained by optimizing storage operations with knowledge of the true spot prices p. Clearly this value does not depend on (X, M); that is, unlike U_{X,M}(X_0) used to obtain (9) starting from (8), the term V_PF(p) is a feature-model-pair-independent upper bound. The resulting computable generalization error is
GE(π̂) := E_{X*,M*}[V_PF(p) − V_π̂(p)]. (11)
We replace the expectation E_{X*,M*} by its sample average approximation based on H trajectories of observed data p^h := (p^h_0, …, p^h_T) for h = 1, …, H to obtain the empirical generalization error
ĜE(π̂) := (1/H) Σ_{h=1}^{H} [V_PF(p^h) − V_π̂(p^h)]. (12)
Minimizing ĜE(π̂) to find a policy π̂_GE := arg min_{π∈Π} ĜE(π) is equivalent to maximizing empirical performance on observed data; that is, π̂_GE solves max_{π∈Π} (1/H) Σ_{h=1}^{H} V_π(p^h). In other words, unlike the focus of the existing storage literature on finding policies with low optimality gaps (e.g., less than a few percent) under a potentially incorrect model (X, M), the effort when using (12) for evaluation is redirected to ranking policies based on their performance on data. In addition to providing a data-driven ranking of policies, ĜE(π̂) measures the empirical performance of a policy relative to the perfect foresight value. This difference is insightful, especially in volatile commodity markets, as it shows the value that can be gained from perfect knowledge of future information, and has been considered in the literature (Kleindorfer et al., 2012).
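The empirical generalization error admits a minimal numeric sketch. Assuming a fast asset with no frictions or holding costs, the perfect foresight value can be computed exactly with a two-state recursion (inventory empty or full each period); the policy values and price paths in the test are hypothetical.

```python
# Empirical generalization error: sample average of perfect-foresight
# value minus realized policy value over observed price paths.

def perfect_foresight_value(prices, C=1.0):
    # Exact perfect-foresight value for a fast (rate-unconstrained) asset
    # with no frictions: an optimal plan keeps inventory at 0 or C each
    # period, so a two-state backward recursion suffices.
    empty, full = 0.0, 0.0          # value-to-go from I = 0 and I = C
    for p in reversed(prices):
        empty, full = (max(empty, -p * C + full),   # stay empty or buy
                       max(full, p * C + empty))    # stay full or sell
    return empty                    # asset starts empty

def empirical_generalization_error(policy_values, price_paths):
    # Sample-average gap between perfect foresight and the policy,
    # one term per observed trajectory.
    gaps = [perfect_foresight_value(p) - v
            for v, p in zip(policy_values, price_paths)]
    return sum(gaps) / len(gaps)
```

Because perfect foresight upper-bounds every policy on every path, the resulting error is nonnegative, and ranking policies by it is the same as ranking them by average realized profit.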
Minimizing ĜE(π̂) mitigates the possibility of incorrectly ranking policies during evaluation and concluding that a policy with poor empirical performance is near-optimal owing to the previously discussed information inconsistency between an assumed feature-model pair (X, M) and the true pair (X*, M*). Although we focus mainly on the performance evaluation of a given policy here, it is important to note that information inconsistency can also cause the method that computes π̂ to incorrectly rank policies and thus choose one with poor performance on data. Specifically, the generalization error of a policy π̂ can be larger than that of π̂_GE due to this information inconsistency. It is also common for methods to optimize over a smaller policy class Π̂ ⊂ Π to achieve tractability. This restriction can result in Π̂ excluding the optimal policy or, more importantly, near-optimal ones, and thus increase generalization error even in the absence of information inconsistency. We refer to this difference in policy sets as structural inconsistency. In summary, a firm can compare the performance of policies using traditional simulation assuming a feature-model pair and re-evaluate these policies using generalization error as the metric on data. If the ranking

of policies changes, this is a signal that there is information inconsistency. When designing data-driven policies that attempt to reduce generalization error, one needs to be cognizant of both information and structural inconsistencies.

RH BACKTEST
In this section, we perform an extensive backtest to evaluate the performance of RH based on generalization error as defined in (12). We describe RH in Section 3.1. We overview the data set used for our backtest in Section 3.2 and present results in Section 3.3.

3.1 Algorithm
RH, which is sometimes referred to as forward dynamic optimization (Eydeland & Wolyniec, 2003, p. 355), is a sequential reoptimization approach and a type of certainty-equivalent control that determines injection and withdrawal decisions by solving an "intrinsic" linear program formulated using point estimates of future spot prices. It does not require any training, which makes its implementation easy. We denote by f_{t,τ} the time t point estimate of the spot price p_τ with τ > t, set f_{t,t} = p_t, and define the forecast vector F_t := (f_{t,τ}, τ = t, …, T). We denote by F(I, t, T) the feasible set of inventory levels and storage decisions {(y^i_τ, y^o_τ, I_τ), τ ∈ {t, t + 1, …, T}} over a planning period {t, t + 1, …, T} with a starting inventory level of I. This set is defined by the following constraints:
I_{τ+1} = I_τ − y^o_τ + y^i_τ, ∀τ ∈ {t, …, T}, with I_t = I, (13)
0 ≤ y^i_τ ≤ G^i, ∀τ ∈ {t, …, T}, (14)
0 ≤ y^o_τ ≤ G^o, ∀τ ∈ {t, …, T}, (15)
y^i_τ ≤ C − I_τ, ∀τ ∈ {t, …, T}, (16)
y^o_τ ≤ I_τ, ∀τ ∈ {t, …, T}. (17)
Constraints (13) model the inventory transitions. Constraints (14)-(17) enforce the restrictions on the injection and withdrawal amounts.
The ILP at period t is
max Σ_{τ=t}^{T} [ (α^o f_{t,τ} − c^o) y^o_τ − (f_{t,τ}/α^i + c^i) y^i_τ − c^h_τ I_τ ] (18)
s.t. {(y^i_τ, y^o_τ, I_τ), τ ∈ {t, …, T}} ∈ F(Ī_t, t, T). (19)
The objective function (18) maximizes the profit from storage operations estimated using the spot price forecast F_t, with decisions subject to the operational constraints F(Ī_t, t, T) in (19).
The RH policy is based on solving ILP (18)-(19) at each stage. To elaborate, the ILP is solved in the current period t given the forecast F_t and the inventory state information I_t to obtain injection and withdrawal decisions for each future period, that is, (y^i_τ, y^o_τ) for τ ∈ {t, …, T}. The period t decision pair (y^i_t, y^o_t) is the decision implemented by the RH policy at state (I_t, F_t). Then an ILP is formulated in period t + 1 using the updated inventory state information I_{t+1} = I_t − y^o_t + y^i_t and an updated forecast F_{t+1}. Solving the resulting ILP gives the period t + 1 RH decision, and so on. RH thus side-steps the curse of dimensionality involved in tackling F-SDP by solving LPs based on point forecasts.
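The rollout just described can be sketched compactly. For self-containment, each period's deterministic intrinsic problem is solved here by dynamic programming over a discretized inventory grid rather than by the linear program in the text, and frictions and holding costs are ignored; the function names, forecasts, and parameter values are all hypothetical.

```python
# A minimal reoptimization-heuristic (RH) sketch for a slow (rate-
# constrained) asset: solve the deterministic intrinsic problem on a
# forecast, implement only the first decision, then roll forward.

def intrinsic_value(forecast, I0, C, G, levels=11):
    # forecast[k]: point estimate of the spot price k periods from now.
    # Returns (value, first-period net injection) of the deterministic
    # problem, via DP over an inventory grid (an approximation choice).
    grid = [C * j / (levels - 1) for j in range(levels)]
    V = [0.0] * levels                      # terminal value-to-go is zero
    first = [0.0] * levels
    for k in reversed(range(len(forecast))):
        p = forecast[k]
        newV, newF = [], []
        for I in grid:
            best, arg = float("-inf"), 0.0
            for j2, I2 in enumerate(grid):
                d = I2 - I                  # net injection this period
                if abs(d) > G + 1e-9:
                    continue                # rate constraint
                cand = -p * d + V[j2]       # buy if d > 0, sell if d < 0
                if cand > best:
                    best, arg = cand, d
            newV.append(best)
            newF.append(arg)
        V, first = newV, newF
    j0 = min(range(levels), key=lambda j: abs(grid[j] - I0))
    return V[j0], first[j0]

def rh_policy(periods, C=1.0, G=0.25):
    # periods: list over t of (realized spot p_t, forecast vector F_t for
    # periods t..T, with F_t[0] = p_t). Returns realized profit.
    I, profit = 0.0, 0.0
    for p_t, F_t in periods:
        _, d = intrinsic_value(F_t, I, C, G)
        d = max(-min(G, I), min(d, G, C - I))   # keep feasible
        profit += -p_t * d                      # settle at realized spot
        I += d
    return profit
```

With perfect forecasts the rollout recovers the intrinsic plan; in the backtest the forecasts are futures prices, so realized and planned profits generally differ.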
RH is popular for managing natural gas storage, where f_{t,τ} is chosen to be the time t price of a futures contract with maturity at time τ. Choosing f_{t,τ} as a futures price is directly applicable for operating storage assets of other commodities with traded futures contracts. When a futures market is absent, f_{t,τ} could be a point forecast or a prediction of p_τ.
When F-SDP is formulated for a commodity with a futures market (i.e., X_t = F_t), the existing literature that uses RH cites two advantages. The first is that the RH policy is consistent with the structure of an F-SDP optimal policy, which from our discussion in Section 2.2 implies that the structural component of generalization error is zero for RH. The second advantage is that RH is model-free. Specifically, the futures prices in F_t are available from traded contracts in the market and are not based on any statistical model. Under the risk-neutral measure typically used in the literature, this model-free definition of f_{t,τ} as a futures price also results in an unbiased estimator of p_τ, since we have f_{t,τ} = 𝔼[p_τ] for all τ ∈ 𝒯_t. However, the RH policy is applied under the real-world measure (often referred to as the physical measure) that drives spot prices, under which a futures price may provide a poor forecast of the spot price. In other words, although RH is model-free, it can (and likely will) have a nonzero informational component of generalization error. To assess the generalization error and performance of RH on real data, we perform a backtest in Section 3.2.

Data and instances
Our RH backtest is based on spot and futures price data between 2000 and 2017 for the following six commodities: copper, gold, crude oil, natural gas, corn, and soybean. Futures contracts for metals and energy are traded at the New York Mercantile Exchange (NYMEX) and for agricultural commodities at the Chicago Board of Trade (CBOT). We consider futures prices for the first 12 maturities, that is, 1- to 12-months-ahead contracts, and use monthly prices at the first trading day of the corresponding month. Even though contracts beyond 1 year are available for various commodities, these markets are typically highly illiquid with only very few contracts traded, which implies that their predictive content for future spot prices might be low (Alquist & Kilian, 2010). Furthermore, our perfect foresight analysis on the empirical data indicates that planning horizons significantly smaller than 12 months are sufficient for optimal first-stage decisions (see Figure EC.1 of the Supporting Information).
Among the six commodities we consider, four have futures contracts with monthly maturities. The exceptions are CBOT corn and soybean futures. The former mature in March, May, July, September, and December, while the latter mature in January, March, May, July, August, September, and November (www.cmegroup.com). To obtain monthly corn and soybean futures prices, we employ linear interpolation following Nadarajah and Secomandi (2018), who apply the approach described in Guthrie (2009). Table 1 summarizes the sources we use to obtain data. Figure 1 plots the spot prices for each commodity.
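The maturity interpolation can be sketched as follows; this is a minimal linear interpolation in time-to-maturity, not the exact Guthrie (2009) procedure, and the function and argument names are illustrative.

```python
def interpolate_monthly(maturity_months, prices, target_month):
    """Linearly interpolate a futures price for a maturity month that is not
    traded (e.g., a 2-months-ahead CBOT corn contract), using the two
    bracketing traded maturities.

    maturity_months: traded maturities, in months ahead (e.g., [1, 3, 5]).
    prices: the corresponding futures prices.
    """
    pairs = sorted(zip(maturity_months, prices))
    for (m_lo, p_lo), (m_hi, p_hi) in zip(pairs, pairs[1:]):
        if m_lo <= target_month <= m_hi:
            w = (target_month - m_lo) / (m_hi - m_lo)  # distance-based weight
            return (1 - w) * p_lo + w * p_hi
    raise ValueError("target maturity outside the traded range")
```

For example, a missing 2-months-ahead price is recovered as the midpoint of the 1- and 3-months-ahead prices; traded maturities are returned unchanged.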
For each commodity, we consider various operational settings for the storage asset in our backtest. Table 2 summarizes the parameters that we vary to obtain 4 × 2 × 3 × 9 = 216 instances per commodity (i.e., 1296 instances in total), covering the subperiods 2000–2002, 2002–2004, …, 2016–2017. Across all instances, we normalize the warehouse capacity to C = 1, choose the initial inventory I_0 = 0, and set the storage holding cost c^h_t = 0 (although we tested plausible values for this parameter and found that it led to similar results). There are no injection and withdrawal costs (c^i = c^o = 0). Furthermore, we distinguish between fully flexible storage (FF), where injection and withdrawal rates are unconstrained, and limited flexibility (LF), where they are constrained.

Results
Our implementation of RH assumes monthly inventory review periods so that each storage decision period coincides with a futures contract maturity. This assumption is consistent with past studies of RH (see, e.g., Lai et al., 2010; Secomandi, 2015). We also note that futures markets can be more liquid than spot markets, which are often thinly traded (Geman & Smith, 2013). We tested RH based on trading in the futures market with the closest expiry (the so-called front-month contract) as a proxy for the spot price. As the results were similar, we do not report them in the paper. We measure the performance of RH as V_RH ∕ V_PF ⋅ 100%, where V_RH and V_PF are the RH and perfect foresight profits, respectively. V_PF is calculated based on past data via (18)–(19), given known price trajectories rather than forecasts.

Empirical performance of RH
We find that RH achieves 11.0% of the perfect foresight value on average. Its performance varies significantly across commodities. For instance, the mean profits vary from −5.1% for crude oil to 24.5% for soybean. The negative mean profit for crude oil was intriguing. Further investigation showed that the RH policy results in negative profits on 30.1% of the instances. Negative profits usually occur whenever the forecast shows decreasing prices, in which case the RH policy sells available inventory to the market, disregarding that the purchase costs were higher. Indeed, in practice, these negative profits can be avoided by trading in forward contracts. We also observe that the financial crisis of 2008–2009 and the oil price drop during 2014–2015 significantly impact performance. Notably, if we exclude the instances corresponding to the 2014–2015 subperiod, the average performance of RH for crude oil increases from −5.1% (unprofitable storage) to 7.6% (profitable storage) and its worst-case performance improves from −171.9% to −64.2%.
The performance of RH varies with the operational storage parameters (G^i, G^o, α^i, α^o). We report these results in Table EC.1 of the Supporting Information. While for a given operational setting (e.g., copper, n = 12, α^i = α^o = 0.995) RH yields a positive mean profit for a fully flexible (FF) storage asset, this average profit becomes negative once the storage asset has limited flexibility (LF), that is, once the injections and withdrawals are constrained. Moreover, if frictions are large (i.e., small α^i and α^o) relative to the (expected) price changes, the warehouse slows down its activity (see, for example, the instances of gold with zero mean profit in Table S1).
The small RH profit percentages relative to the perfect foresight solution are not themselves concerning, because our benchmark is anticipative, but they do raise the question of whether RH can be improved. Note that this question does not arise in the RH performance results reported for natural gas in the literature (see, e.g., Lai et al., 2010), which are obtained under statistical model assumptions and show that RH is within a few percent of the optimal policy value. These differences in the assessment of RH suggest that information inconsistency may be at play here, but confirming this suspicion requires comparing against a method that targets generalization error, which is the focus of Sections 4 and 5.

Performance impact of the planning horizon
To understand whether the performance of RH can be improved, we define variants of RH that solve an ILP at each stage formulated over a shorter horizon than T = 12, that is, we exclude futures prices with later maturities. In particular, we consider T ∈ {1, 3, 6, 12} and use RH_T to denote the RH variant with a planning horizon of T periods.
Figure 2 shows that the performance of RH is sensitive to the planning horizon T on almost all instances. Exceptions include the FF instances without frictions, where it is known that a one-period look-ahead policy is optimal (see related results in Table EC.1 of the Supporting Information). Using longer planning horizons in RH helps improve the average performance for crude oil, gold, and soybean but hurts profits for natural gas and corn, while this effect is mixed for copper. Several performance changes with T are substantial. For instance, the average profit percentage for natural gas reduces from over 25% for RH_1 to less than 20% for RH_12. We investigate the results of Figure 2 further in the dominance matrix shown in Table 4. The one-step look-ahead policy based on RH_1 outperforms RH policies with longer planning horizons on a significant number of instances.
Specifically, RH_1 strictly improves on RH_12 in 29.9% of the instances (and weakly in 58.6% of the instances, which is omitted in the paper). Further, we observe that RH_6 is equal to or better than RH_12 on 88.0% of the instances, which suggests that futures prices with later maturities adversely affect the performance of the RH_12 operating policy. This finding is qualitatively different from the literature on forecast horizons (Chand et al., 2002) applied to RH under a risk-neutral measure (see Section 2 for a related discussion), which states that adding futures price information to RH can only benefit its performance and that ignoring futures prices beyond a certain maturity does not affect performance (Cruise et al., 2019). One possible explanation for the longer futures maturities hurting the performance of RH in our real-world backtest is that these futures prices provide a poor forecast of the corresponding spot price, as illustrated in Figure 3.

Value of reoptimization
The preceding qualitative deviation from the literature also brings into question whether there is value in the reoptimization of the ILP, which is needed to define the RH policy. Under the risk-neutral measure and standard statistical model assumptions, reoptimization has been shown to add significant value over the intrinsic (static) policy based on the forward curve available at the initial stage (Lai et al., 2010; Secomandi, 2015). We assess whether this remains the case in our backtest. We define the value of reoptimization as VReO := ((V_RH − V_ILP) ∕ V_PF) ⋅ 100%, with V_ILP denoting the profit obtained using the ILP. While RH revises the injection and withdrawal decision plan in each period based on new futures price information and updated inventory, the ILP determines the plan for all periods based on F_0. Figure 4 summarizes the value of reoptimization for RH with different planning horizons. The value of reoptimization can be either positive or negative. Across the instances, this value is negative for RH_12 on 37.0% of the instances, which is significant. Figure EC.2 of the Supporting Information shows in more detail when reoptimization would have generated positive value and when not. We observe that the ILP outperforms RH especially in phases of sharp price jumps or drops, once again indicating that the inability of futures prices to forecast spot price changes can cause the behavior of RH for spot trading on real data to differ substantially from what has been observed in controlled simulations.

Value of perfect price information
In Section EC.2.3 of the Supporting Information, we investigate what information (albeit idealistic) could be provided to RH in lieu of futures prices to improve its performance. Our results show that one-step-ahead spot price information generates significant additional profits compared to standard RH with futures price information. The perfect foresight value for flexible storage assets can almost fully (on average 98.5%) be captured by correctly classifying the direction of one-step-ahead price movements. The improvement for limited flexibility is over 50%. Therefore, we show that there is opportunity to improve on RH for spot trading. This observation motivates the development of DDAs for managing storage that target generalization error.

[Figure 4: Value of reoptimization for T ∈ {1, 3, 6, 12} (light gray to dark gray). Boxplots show the minimum, first, second, and third quartiles, and the maximum.]

DATA-DRIVEN DECISION RULES
In this section, we focus on data-driven decision rules for managing commodity storage. In Section 4.1, we present a forward-looking DDA. We apply this approach using linear decision rules and structured policies in Sections 4.2 and 4.3, respectively. Finally, we discuss robustness aspects of these policies in Section 4.4.

Forward-looking policy training and evaluation framework
We consider the computation of data-driven injection and withdrawal decision rules (u^i_t(X, θ^i), u^o_t(X, θ^o)) at each stage, which are functions of features X and parameter vectors θ^i and θ^o. The decision rule parameters θ^i and θ^o are the trained coefficients of features that fully characterize the decision rules for purchasing and selling (see Equations (28) and (30)). They are computed using the ERM framework (Vapnik, 1998, p. 32). Suppose we have a historical spot price sample path covering T′ periods. This procedure divides the sample path into training, validation, and testing segments corresponding to the subperiods {0, …, T_s}, {T_s + 1, …, T_v}, and {T_v + 1, …, T′}, respectively. The coefficients of the decision rules are chosen to maximize the regularized profit on the training sample path segment. Regularization is added to avoid overfitting decisions to a single sample path (see Mohri et al., 2012, p. 28). The standard backward-looking math program used for training (see Bartlett and Mendelson (2006) and, for recent applications, Ban and Rudin (2019) and Mandl and Minner (2020)) is (20)–(23), where the first term of the objective (20) is the average profit over the training period and the second term is a 1-norm regularization with λ ≥ 0 controlling the weight of this term in the objective. Constraint (21) enforces the operational constraints, while constraints (22)–(23) encode the decision rule structure. Given θ coefficients, a data-driven storage policy π_DDA is the collection of feasible injection and withdrawal decision rules {(y^i_t(X_t, θ^i), y^o_t(X_t, θ^o))}. Math program (20)–(23) is solved several times by varying λ ≥ 0. Each solution produces a possibly different set of θ coefficients and a corresponding policy. Among these policies, π_DDA is chosen to be the one that results in the largest profit on the validation segment. Finally, the profit of π_DDA is evaluated on the testing segment against the perfect foresight solution.
An important property of (20)–(23) is that all the data used in its definition are available at or before period T_s. Hence, it is common in the literature to maximize the profit over the training period only. In the context of commodity storage, as well as other applications with financial markets, forward-looking market information beyond the training set is available via futures prices at period T_s. As already discussed in earlier sections, the futures price f_{t,τ} may be a reasonable predictor of the spot price p_τ, at least for near-term maturities. Based on this observation, we consider solving the forward-looking math program (24)–(27). Constraints (25)–(27) are analogous to (21)–(23) but defined over a longer time horizon T_f. The objective function (24) has an extra term not found in (20). This term corresponds to the estimated profits over the periods {T_s + 1, …, T_f} obtained using futures prices available at T_s. Indeed, choosing T_f = T_s recovers the standard ERM math program. We illustrate our forward-looking framework to train storage policies in Figure 5. The data-driven framework described above can be used to target generalization error, which, as discussed in Section 2.2, can be viewed as having components due to information inconsistency and structural inconsistency. The effect of the former inconsistency can be mitigated via feature selection in the context of the training, validation, and testing framework. The effect of the latter inconsistency depends on the choice of (u^i_t(X, θ^i), u^o_t(X, θ^o)), which is the focus of Sections 4.2 and 4.3. In this paper, we evaluate whether our forward-looking ERM adds value compared to the standard ERM math program.
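The forward-looking objective can be sketched as follows. This is a simplified illustration, not the math program (24)–(27): a candidate policy is simulated along the historical spot path and then along the futures curve available at T_s, feasibility is handled by clipping rather than by explicit constraints, and the function names (`simulate_profit`, `erm_objective`) are hypothetical.

```python
def simulate_profit(prices, features, policy, theta, C=1.0, I0=0.0):
    """Run a decision rule along a price path, clipping decisions to
    feasibility, and return (total spot-trading profit, final inventory).
    policy(X_t, theta) returns a tentative (inject, withdraw) pair."""
    I, profit = I0, 0.0
    for p_t, X_t in zip(prices, features):
        yi, yo = policy(X_t, theta)
        yo = min(max(yo, 0.0), I)            # cannot withdraw more than on hand
        yi = min(max(yi, 0.0), C - (I - yo)) # cannot exceed capacity
        profit += p_t * (yo - yi)
        I += yi - yo
    return profit, I

def erm_objective(spot, spot_X, futures, futures_X, policy, theta, lam):
    """Forward-looking training objective, sketched: average of realized
    profit over the training path plus estimated profit over the futures
    curve, minus an l1 regularizer with weight lam (the role of (24))."""
    hist_profit, I_Ts = simulate_profit(spot, spot_X, policy, theta)
    fwd_profit, _ = simulate_profit(futures, futures_X, policy, theta, I0=I_Ts)
    n = len(spot) + len(futures)
    return (hist_profit + fwd_profit) / n - lam * sum(abs(t) for t in theta)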

Linear decision rules (DDA-LDR)
The common choice for decision rules is an affine mapping of features to decisions, referred to as linear decision rules (LDRs; see Ban & Rudin, 2019, for a newsvendor example and additional references). Such a mapping for (u^i_t(X_t, θ^i), u^o_t(X_t, θ^o)) at period t takes the form u^i_t(X_t, θ^i) = Σ_n θ^i_n X_{t,n} and u^o_t(X_t, θ^o) = Σ_n θ^o_n X_{t,n} (28), where θ^i_n ∈ Θ ⊂ ℝ and θ^o_n ∈ Θ ⊂ ℝ are feature coefficients that are unknown to the merchant and must be learned from historical time series data. To allow for a feature-independent intercept, we set X_{t,0} = 1 for all t ∈ 𝒯_0. The linear parameterization of u^o_t and u^i_t is not very restrictive, since one can introduce new features that are nonlinear functions of the original features. For instance, such functions could involve interaction terms (e.g., θ^i_3 X_{1t} X_{2t}), polynomials (e.g., θ^i_1 X²_{1t}), and lagged observations (e.g., θ^i_2 X_{1,t−1}). We refer to the policy obtained based on the choice (28) as DDA-LDR. A computational advantage of DDA-LDR is that (24)–(27) becomes a linear program that is efficient to solve. In terms of structural consistency, an LDR will in general not have the same structure as an optimal storage policy. Thus, it may suffer from generalization error because of this inconsistency, which motivates the structured decision rules considered next.
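An LDR over a nonlinearly expanded feature vector can be sketched as follows; the function names are illustrative, and the clipping to the rate limit stands in for the feasibility constraints.

```python
def expand_features(X_t, X_prev):
    """Augment two raw features with an intercept, an interaction term,
    a square, and a one-period lag, as in the examples above."""
    x1, x2 = X_t
    return [1.0, x1, x2, x1 * x2, x1 * x1, X_prev[0]]

def ldr(X, theta, rate_cap=1.0):
    """Affine mapping of (expanded) features to a decision, clipped to [0, G]."""
    u = sum(th * x for th, x in zip(theta, X))
    return min(max(u, 0.0), rate_cap)
```

Note that any change in a feature value moves the decision directly, which is the sensitivity to estimation error discussed in Section 4.4.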

Structured decision rules (DDA-SP)
We choose u^i_t(X_t, θ^i) and u^o_t(X_t, θ^o) guided by the policy structures outlined in Proposition 1.
We begin by considering the optimal policy structure in the full flexibility case (i.e., Proposition 1(a)), which is based on a price threshold P_t(X_t). Our goal is to compute P_t(X_t) from feature information in a data-driven manner and to impose the optimal policy structure on the decision rules that we compute. In other words, we enforce this structure through the choice of decision rules in (30).

The choice of u^i_t(X_t, θ^i) and u^o_t(X_t, θ^o) in (30) is fundamentally different from a linear decision rule, as these decisions are not directly parameterized by features but by feature-based thresholds (acting as decision signals) within the optimal policy structure. Math program (24)–(27) under definition (30) has a mixed-integer program representation, which is detailed in Section EC.3.1 of the Supporting Information. This representation facilitates the use of off-the-shelf commercial solvers for solving this math program.
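The threshold rule can be sketched as a bang-bang policy: inject at the maximum feasible rate when the spot price is below the feature-driven threshold and withdraw at the maximum feasible rate when it is above. This is a simplified illustration of the structure in (30), with P_t(X_t) taken to be linear in features and illustrative names throughout.

```python
def threshold_policy(p_t, I_t, X_t, theta, C=1.0, Gi=1.0, Go=1.0):
    """Bang-bang decision rule for the fully flexible case: the trained
    threshold P_t(X_t) = theta . X_t acts as a buy/sell signal."""
    P_t = sum(th * x for th, x in zip(theta, X_t))
    if p_t < P_t:                       # price below threshold: fill up
        return min(Gi, C - I_t), 0.0
    if p_t > P_t:                       # price above threshold: sell down
        return 0.0, min(Go, I_t)
    return 0.0, 0.0                     # at the threshold: hold
```

Small perturbations of X_t that leave p_t on the same side of P_t(X_t) do not change the decision at all, which is the robustness property discussed in Section 4.4.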
For a storage asset with limited flexibility, estimating a single price threshold is not sufficient because, at the same market price p_t, different purchase-and-inject and withdraw-and-sell decisions y^i_t and y^o_t, respectively, can be optimal depending on the current inventory level I_t. We specify u^i_t(X_t, θ^i) and u^o_t(X_t, θ^o) using the policy structure of Proposition 1(b) and parameterized base-stock levels S^i_t and S^o_t, where S^o_t is expressed additively as S^i_t plus a nonnegative feature-dependent term. This additive formulation of the base-stock levels is required to ensure that S^i_t ≤ S^o_t. We provide a mixed-integer linear program to compute the base-stock coefficients in Section EC.3.2 of the Supporting Information.

Robustness of DDA-SP
Storage policies need to account for price and estimation risk. Price risk arises because commodity prices are uncertain, while estimation risk is a consequence of errors incurred when determining the parameters of a policy. DDA-SP accounts for both of these risks, as discussed below. Differences between in-sample and out-of-sample profits of a storage policy can be attributed to the informational and structural components of generalization error (discussed in Section 2.2). The informational component arises because commodity spot prices in the test set are uncertain and differ from the training set (i.e., price risk). Regularization adds bias to the estimator to improve out-of-sample performance by avoiding overfitting (Mohri et al., 2012).
It can also be viewed as ensuring that the policy is trained using a robust objective in the math program (20)–(23). To understand this, note that setting λ to zero in the objective (20) amounts to maximizing profits on the training set without accounting for price risk. For a positive λ, the bias added by the regularization makes this objective robust; that is, the training and validation procedures used to determine policy parameters account for spot prices differing from historical prices within some uncertainty set (see, e.g., Gao et al., 2017, for theoretical results on the robustness interpretation of regularization).
The structural component of generalization error, interestingly, has implications for both model complexity and the impact of estimation error. Consider DDA-LDR, which is inconsistent with the optimal policy structure in general. The complexity of the class of policies represented by LDRs is a function of the richness of the features. For instance, if the features used in definition (28) of the LDR include the class of prespecified threshold and/or base-stock policies, then the class of LDRs subsumes the set of structured policies considered by DDA-SP. In contrast, regardless of the richness of the features, the class of policies that DDA-SP considers is restricted to those satisfying the optimal policy structure, thus potentially reducing model complexity relative to LDRs. Additionally, policy structure makes DDA-SP robust to estimation error. To see this, consider DDA-LDR again. Changes in feature values X_t directly translate into changes in the injection or withdrawal decisions for LDRs, as seen in (28). In contrast, for the choice (30) of DDA-SP, small changes in feature values may not affect decisions if the estimated P_t(X_t) value remains in the same interval as the exact threshold. Analogous reasoning holds for the double base-stock structure in the case of a slow storage asset.
We numerically verify the robustness of DDA-SP compared to DDA-LDR in Section 5.3.

PERFORMANCE EVALUATION OF DDA
In this section, we evaluate the performance of DDA-LDR/SP compared to ILP and RH.

Setup
Table 5 summarizes the approaches we compare and the data that they exploit. In addition to the data considered in the RH backtest (Section 3.2), we also include analyst forecast data, based on a feature selection study detailed in Section 5.4, which shows that spot prices, futures prices, and the median analyst forecast constitute an undominated feature combination.
We use Bloomberg's Analysts' Median Composite Forecast, which reports the median of the price forecasts offered by up to 31 major financial institutions. While individual expert forecasts may exhibit high prediction errors, by using the median forecast over a variety of well-established financial institutions we expect some error diversification (Cortazar et al., 2018). Based on the median forecasts, we generate monthly analyst forecast curves A_t = (a_{t,τ} : τ ∈ {t, t + 1, …, t + T}) for the six commodities for planning horizons up to 12 months for the limited time period of 2008 to 2017. (The RH results based on analyst forecasts and an AR(1) spot history rather than futures curves are reported in Tables EC.2 and EC.3 of the Supporting Information.) Our experimental setup for the operational parameters is identical to the first three rows of Table 2. To obtain instances spanning multiple subperiods, we split the data as shown in Table 6. This yields 2 × 3 × 8 × 4 = 192 instances per commodity. Referring to Table 6, we optimize based on a single sample path (e.g., 2000–2001) and evaluate on a test set (e.g., 2002–2003). We repeat this procedure for all test sets over a broad variation of operational storage parameters and then report the results as mean and quartile statistics across the 192 instances. The rationale is that this represents the setting for decision making in practice: the storage manager trains the policy parameters on a training set (including validation on a validation set) and evaluates on a test set.
We tested sensitivity to the training horizon by evaluating DDA-SP for three different training set lengths: 12, 24, and 36 months. A training length of 24 months resulted in the best performance on average. Using a shorter training cycle of 12 months or a longer training of 36 months can deteriorate the downside performance of data-driven policies and foster downside outliers, either by not fully capturing the underlying price behavior (too-short training sets) or by training on structural breaks (too-long training sets). Apart from that, the median performance when employing 12 and 24 months of training is similar. For more details, we refer to Section EC.7.1 of the Supporting Information. The following results are based on 24 months of training.
We further test DDA-LDR and DDA-SP with and without forward optimization. For the forward optimization, we use F_t containing futures prices with the 12 closest monthly maturities (T_f = T_s + 12), which outperformed both a shorter forward optimization horizon of T_f = T_s + 6 and no forward optimization, that is, T_f = T_s (see Section EC.7.2 of the Supporting Information for more details). We report sensitivity to frictions and storage flexibility in Supporting Information EC.7.3. The effect of discount rates on performance is reported in Supporting Information EC.7.4. To avoid overfitting and to enable feature selection, we regularize DDA-LDR and DDA-SP in a cross-validation procedure, which leads to better performance compared to unregularized DDA for the majority of instances (see Section EC.8 of the Supporting Information for more details).

Performance evaluation
Figure 6 summarizes the performance results of the different storage policies. RH is most competitive for natural gas. This may be reasonable in markets that are particularly efficient, which is the case for natural gas compared to less efficient metal and agricultural markets (Kristoufek & Vosvrda, 2013). In highly efficient markets, all available information is already included in the futures prices. Another reasonable explanation is the high volatility of natural gas prices. (iv) DDA-SP under full and limited flexibilities. Table 7 shows the disaggregated performance of DDA-SP** from Figure 6 with respect to storage flexibility. The results of DDA-SP relative to the perfect foresight bound are not fundamentally different between fully flexible (FF) and limited flexibility (LF) storage assets; that is, DDA-SP performs well in both storage settings. However, we observe for the FF case that it is more effective to train a price threshold P_t than the more general double base-stock structure. While DDA-SP with the FF structure yields an average (median) performance of 21.3% (26.6%), DDA-SP with the LF structure yields 11.9% (22.8%). Furthermore, DDA-SP-FF is more efficient and reduces the computation times of DDA-SP-LF (above 3600 s) on average by almost 90%.

Improvement of downside risk
We investigate the performance of the methods in terms of the 25%-quartile of the profit distribution on each instance, which is representative of downside risk. Figure 7 displays these results. Despite the DDA-LDR policies being trained using regularization, their 25th percentile of profits is worse than RH on roughly half of the commodities and instances. In contrast, DDA-SP policies improve on the downside risk of RH policies or are comparable for all commodities except natural gas. For natural gas, where RH was shown to be a strong competitor, the downside risk measured as the 25%-quartile performance can be improved by monthly reoptimization of DDA-SP, which increases the 25%-quartile performance from −31.8% to −14.0% of the perfect foresight value. Thus, consistent with the discussion in Section 4.4, both regularization and policy structure in DDA-SP are valuable for managing downside risk.

Feature selection
For effectively using DDA-SP, and in particular for reducing information inconsistency, selecting the right initial feature set is crucial. We consider the following candidate features: spot prices, futures prices, analyst forecasts, temperature, the S&P 500 index, and the Trade Weighted U.S. Dollar Index.
We employ as a reference the feature combination used to obtain the results in the earlier sections, specifically spot prices, futures prices, and analyst forecasts. Our results show that all three feature categories were relevant for storage decisions. In addition to the feature type, the lag of features also matters (see Tables EC.16–EC.19 of the Supporting Information).
Our results reported in Supporting Information EC.7.5 show that excluding the futures and analyst forecast features from the reference feature combination and relying on a purely backward-looking approach with spot price features only deteriorates performance.
However, there can be situations in which liquid futures contracts or analyst forecasts are absent. In this case, it may be worth considering other features. Table EC.12 of the Supporting Information therefore compares the performance of DDA-SP with spot price features only to DDA-SP with spot price and macroeconomic features (i.e., the S&P 500 index and the Trade Weighted U.S. Dollar Index) that have been shown to drive commodity prices. The results show that, in the absence of futures and analyst forecasts, adding macroeconomic features can help, in particular with respect to downside risk. This is an important result with practical implications, as there are commodities for which futures and analyst forecasts are not available, for example, commodities without liquid futures markets (e.g., asphalt or specific types of polyethylene such as HDPE, LDPE, and LLDPE).
Our additional results reported in Table EC.13 of the Supporting Information also show that, whenever both futures and analyst forecasts are consistently available, additional macroeconomic features do not lead to a consistent performance improvement. One reason may be that macroeconomic information is already priced into futures and analyst forecasts (Rational Expectation Hypothesis).
The absence of analyst forecasts, however, deteriorates storage performance (see Table EC.14 of the Supporting Information). This observation adds to the empirical findings of Cortazar et al. (2018) by showing that analyst forecasts also improve storage decisions. Cortazar et al. (2018) find similar support for price forecasting: adding analyst forecasts as features on top of futures prices improves spot price forecast accuracy, likely because futures-based forecasts alone may not incorporate explicit information about the risk premium. For natural gas, where DDA performs comparatively poorly in our experiments, we test the effect of the additional feature temperature, which has been shown to drive natural gas prices (see, e.g., Nick & Thoenes, 2014). We therefore collect monthly average temperature data and add it as an additional feature to the original DDA models. Our results in Table EC.15 of the Supporting Information show that temperature does not provide significant additional information for storage decisions; it only slightly improves performance on single instances.

Summary of insights
Our findings have implications for both storage practice and data-driven optimization research. The existing literature evaluates the performance of the RH policy relative to the optimal policy of a storage MDP under full-information assumptions. In this setting, RH has been shown to yield near-optimal profits. However, we show that this evaluation may be misleading when storage is operated on real data, due to generalization error. We make four related observations from Section 3.2: (i) RH can yield unprofitable storage operations (V_RH < 0), (ii) ignoring futures price information can be beneficial, (iii) the value of reoptimization is not necessarily positive, and (iv) the direction (upward or downward) of the one-step-ahead price forecast is essential.
We show that there are two potential sources of generalization error: informational inconsistencies and structural inconsistencies. To mitigate the adverse effects of generalization error, we propose data-driven and ML-based policies that exploit feature data (e.g., available analyst forecasts and futures prices or macroeconomic and weather features). We find that these policies can outperform RH without requiring the reoptimization of a linear program or tuning of the planning horizon. Further, the linear decision rule approach from the data-driven optimization literature is not effective in our setting. Structured policies that encode properties of an optimal policy are instead needed to improve on RH. Finally, extending the standard ERM approach to include forward-looking information (if available, as is the case in commodity markets) can improve the performance of data-driven policies.

CONCLUSIONS
We study the fundamental commodity storage problem. RHs are widely used in academia and practice to compute storage operating policies due to their computational attractiveness and known near-optimality in simulation experiments based on specific model assumptions. We demonstrate on real data that the empirical performance of RH can be suboptimal due to generalization error and propose a forward-looking ERM approach to compute linear decision rules and structured data-driven policies, also highlighting how it addresses informational and structural inconsistencies. We find that data-driven policies that encode an optimal policy structure exhibit robust performance across commodities and time periods in our data set, while linear decision rules perform worse than RH, despite being trained using data. In addition to uncovering the importance of policy structure in a data-driven optimization setting, we find that using forward-looking futures price information in the training phase on top of historical spot prices can be crucial to improve out-of-sample performance. On the other hand, the additional value of learning from historical spot price data in our best data-driven storage policy, compared to using only futures prices for this purpose, sheds light on its performance relative to RH, which uses futures prices alone. For markets such as natural gas, which are highly efficient and exhibit high volatility, the value of learning from historical spot prices appears to be limited, and both RH and our DDA show good performance. In contrast, this value is substantial in less efficient and/or less volatile commodity markets such as copper, gold, crude oil, corn, and soybean, where the DDA can outperform RH.
Our DDA and structured policies advance the state of the art for commodity storage. They suggest potential value in having existing software, which already incorporates backtesting capabilities, also directly target generalization error when computing storage decisions. This research can be enhanced in several ways, of which we briefly state three. The first is to improve our backtest by allowing more granular intramonthly trading and leveraging data on trading volume to select only "liquid" futures contracts for use in RH and DDA. The second is to extend our backtest to understand the impact of generalization error and the performance of RH and DDA when forward trades are combined with spot trades. The third is to investigate DDAs that directly minimize downside risk when computing operating policies, as opposed to relying on regularization for potential risk mitigation as we do in this paper.

FIGURE 3: NYMEX futures curves (dashed) and realized spot prices (•) for natural gas (prices refer to closing prices on the first trading day of the corresponding month).
[Figure: Optimization (training) and evaluation framework with forward-looking training math program.]
FIGURE 7: 25%-quartile of V/V^PF ⋅ 100% of RH versus DDA-LDR with forward optimization and RH versus DDA-SP with forward optimization for the six combinations of G_i = G_o ∈ {0.5, 1} and i = o ∈ {1, 0.995, 0.99}.

TABLE 1: Commodity spot and futures price data from Thomson Reuters (2000–2017).
FIGURE 1: Spot prices for copper, gold, crude oil, natural gas, corn, and soybean from 2000 to 2017.

TABLE 2: Summary of the numerical design.

TABLE 3: Performance of futures-based RH in V^RH/V^PF ⋅ 100% across all instances.

Table 3 reports statistics of the RH backtest across the 1296 instances summarized in Table 2. We obtain an assessment of the true generalization error by measuring V^RH

TABLE 7: Performance of DDA-SP in V/V^PF ⋅ 100% with respect to storage flexibility.