An iterative algorithm for regret minimization in flexible demand scheduling problems

A major challenge in developing optimal strategies for the allocation of flexible demand within the smart grid paradigm is the uncertainty associated with real-time prices and electricity demand. This article presents a regret-based model and a novel iterative algorithm that solves the minimax regret optimization problem. The algorithm exhibits a low computational burden compared with traditional linear programming methods and converges iteratively through updates of feasible power schedules, thus enabling a scalable parallel implementation for large device populations. Specifically, our approach seeks to minimize the induced worst-case regret over all price scenarios and computes the optimal charging strategy for the electrical devices. The convergence of the method and the optimality of the computed solution are established, and numerical simulations are discussed for the case of a single device operating under different types of price realizations and uncertainty bounds.


INTRODUCTION
The electrification of the transportation sector is envisaged as a key step toward a decarbonized and sustainable energy system. 1 The increasing penetration of electric vehicles (EVs) will reduce carbon emissions and enable a more efficient integration of renewable generation. The intrinsic flexibility of the EVs' charging process and the implementation of vehicle-to-grid technologies are expected to support the reliability and the efficiency of the power system by providing a wide range of ancillary services, [2][3][4] such as balancing, voltage support or frequency regulation.
In order to fully achieve these potential benefits while limiting undesired consequences (e.g., network congestion, price volatility), it is crucial to develop efficient and robust control strategies for the charging of the EVs. A significant amount of research has investigated this topic after the seminal papers, 5,6 proposing decentralized and game-theory approaches, 7-10 adaptive strategies 11 or Lagrange relaxation methods 12 to achieve an efficient large-scale integration of the EVs in the power system. These analyses have been expanded to consider additional elements such as coupling constraints between the devices, 13 network topology 14 or the degradation costs of the EVs' batteries. 15 Most of the cited works adopt a deterministic framework that assumes perfect knowledge of all the relevant parameters of the problem, such as electricity prices, network conditions or EVs' availability. However, these elements are subject to relevant uncertainties, which might have a substantial impact on the efficiency and benefits of the proposed control schemes. One possible approach to tackle this issue is to adopt receding-horizon schemes 16,17 or real-time algorithms 7 that modify the charging profile of the EVs on the basis of updated system conditions, implicitly accounting for imperfect predictions. Other works have considered the application of competition online algorithms, 18 stochastic dynamic programming 19 or robust min-max optimization. 20 The analysis presented in this article also adopts a robust optimization framework, considering bounded uncertainties on the electricity prices and determining the cost-minimizing charging profile of a single EV as the solution of a regret minimization problem. In the proposed approach, the EV charging is determined in order to minimize the worst-case regret, that is, the largest additional cost (with respect to the deterministic perfect-information case) due to price uncertainties. 
The optimal solution is calculated in a constructive manner: a tentative feasible charging profile is iteratively updated through elementary operations consisting of power swaps between two or more time instants, reducing the maximum regret at each iteration while preserving feasibility. Unlike the numerical approach presented in Reference 20, it is analytically demonstrated, using Lyapunov techniques, that the proposed method always converges to the minimum-regret solution under all the considered price uncertainty hypotheses and for any initial charging profile. Thanks to these features, the method is well suited for a large-scale parallel implementation involving asynchronous schedule updates and energy price recalculations. In addition, numerical case studies are presented to (i) demonstrate the increased computational efficiency of the proposed method with respect to standard linear programming approaches while achieving the same optimal control solutions and (ii) qualitatively characterize the optimal power schedules with respect to different uncertainty hypotheses and flexibility parameters of the EV.
The remainder of this article is organized as follows: Section 2 presents the main regret model of a single EV and discusses the impact of correlated price uncertainties on the regret. The description of the iterative optimization algorithm and the proof of its optimality are included in Sections 3.1 and 3.2, respectively. Simulations of the proposed control strategy are presented in Section 4, and Section 5 concludes this article.

PRELIMINARIES AND SETUP
Next we introduce our modeling framework and formulate a regret criterion to characterize optimal power allocation strategies in the presence of price uncertainty. In essence, the minimax regret criterion captures the idea that devices are concerned with foregone opportunities. It departs significantly from a worst-case approach, where devices optimize the cost incurred for the highest possible electricity price occurrence. The latter is a risk-averse strategy that may result in conservative decisions and is therefore better suited to malicious uncertainties. This section introduces the minimax regret formulation for flexible devices in an uncertain electricity market. The impact of a correlated uncertainty model on the regret analysis is also discussed.

Regret modeling
Let us denote the discrete time interval by $\mathcal{T} = \{1, \dots, T\}$, and the required total energy and rated power by $E_r$ and $P_r$, respectively. The power consumption schedule is subject to the availability window $\mathcal{A} \subseteq \mathcal{T}$. If the power consumption profile of the appliance is denoted by $u : \mathcal{T} \to \mathbb{R}_+$, then the collection of feasible power schedules can be formulated as the following set:

$$\mathcal{U} = \Big\{ u : \mathcal{T} \to \mathbb{R}_+ \ \Big|\ 0 \le u(t) \le P_r\, I_{\mathcal{A}}(t)\ \ \forall t \in \mathcal{T}, \quad \sum_{t \in \mathcal{T}} u(t)\, \Delta t = E_r \Big\}, \tag{1}$$

where $I_{\mathcal{A}}(t)$ is an indicator function:

$$I_{\mathcal{A}}(t) = \begin{cases} 1, & t \in \mathcal{A}, \\ 0, & \text{otherwise}. \end{cases}$$

While for the case of a single device there is no loss of generality in assuming $\mathcal{A} = \mathcal{T}$, it is meaningful, for multiple devices, to allow possibly distinct availability windows. Therefore, we adopt this more general viewpoint in the description of our modeling framework and algorithms.

Assumption 1. The feasible set of power consumption profiles characterized by the parameters $(E_r, P_r, \mathcal{A})$ is not empty, that is, $E_r \le P_r\, |\mathcal{A}|\, \Delta t$.

Assumption 2. The required energy is a multiple of the rated power, that is, $E_r = K P_r\, \Delta t$ for some $K \in \mathbb{N}$.

If Assumption 2 holds and the price function $\Pi : \mathcal{T} \to \mathbb{R}_+$ is convex, at least one optimal power consumption profile will be an element of the following set:

$$\mathcal{U}_K = \big\{ u \in \mathcal{U} \ \big|\ u(t) \in \{0, P_r\}\ \ \forall t \in \mathcal{T} \big\}.$$

In the following, we normalize the discrete time sampling period to $\Delta t = 1$ to ease the notation. We assume that energy prices at time t, denoted by $\Pi(t)$, are uncertain but known to belong to an a priori assigned interval:

$$\Pi(t) \in [\underline{\Pi}(t), \overline{\Pi}(t)], \quad \forall t \in \mathcal{T}. \tag{6}$$

Remark 1. While in this article we do not model the impact of power demand on energy costs, the inclusion in (6) can be justified by assuming a simple model of the power market where prices at any time t are represented as an increasing function of total demand (both flexible and inflexible). 5,6,21 Although this assumption is only justifiable with 100% traditional generation, it can be easily generalized to more realistic scenarios with high penetration of renewable energy sources by using net demand instead of absolute demand to calculate the electricity price.
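The feasibility conditions and the two assumptions above can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names `is_feasible` and `min_slots`, the boolean list `available` encoding the availability window, and the tolerance are all hypothetical choices made for illustration.

```python
from math import isclose

def is_feasible(u, available, e_req, p_rated, dt=1.0, tol=1e-9):
    """Check the bounds 0 <= u(t) <= P_r inside the availability window,
    u(t) = 0 outside it, and the total energy requirement."""
    for t, power in enumerate(u):
        upper = p_rated if available[t] else 0.0
        if power < -tol or power > upper + tol:
            return False
    return isclose(sum(u) * dt, e_req, abs_tol=tol)

def min_slots(e_req, p_rated, dt=1.0):
    """Return K = E_r / (P_r * dt) when it is an integer (Assumption 2),
    otherwise None."""
    k = e_req / (p_rated * dt)
    return round(k) if isclose(k, round(k), abs_tol=1e-9) else None

available = [False, True, True, True, True, False]
u = [0.0, 10.0, 5.0, 5.0, 10.0, 0.0]
print(is_feasible(u, available, e_req=30.0, p_rated=10.0))  # True
print(min_slots(30.0, 10.0))                                # 3
```

With a sampling period of 0.5 h, the same energy and rated power give `min_slots(30.0, 10.0, dt=0.5) == 6`, that is, six half-hour slots.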
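Since the renewable contributions could vary according to the weather conditions, the forecast electricity price is uncertain within the considered time window and is assumed to belong to the interval in (6). Moreover, the impact of network congestion on the price is also neglected, as commonly assumed in previous research. [22][23][24] If the signal $\Pi(t)$ were known in advance, one could accurately compute the cost of energy for a power allocation u over the availability window $\mathcal{A}$ as:

$$J(u, \Pi) = \sum_{t \in \mathcal{A}} \Pi(t)\, u(t),$$

and an optimal power schedule could be found by solving the following linear program over the set $\mathcal{U}$ of feasible schedules:

$$\min_{u \in \mathcal{U}} J(u, \Pi).$$

In particular, one might without loss of generality assume an optimal power schedule in $\mathcal{U}_K$, the subset of feasible schedules taking values in $\{0, P_r\}$. Thus, the optimal energy cost can be computed by identifying the K time instants with the lowest prices, and can simply be evaluated according to:

$$J^*(\Pi) = \min_{S \subseteq \mathcal{A},\ |S| = K}\ P_r \sum_{t \in S} \Pi(t),$$

where |S| denotes the cardinality of the set S. We call this cost the utopia cost; for each price profile, it can only be achieved in the absence of uncertainty. In general, each schedule u(⋅) and each price profile will induce a regret, defined as the difference between the cost incurred by u(⋅) and the corresponding utopia cost. Our objective is to schedule u(⋅) so as to minimize the worst-case regret, defined as:

$$R(u) = \max_{\Pi(t) \in [\underline{\Pi}(t), \overline{\Pi}(t)]} \big[ J(u, \Pi) - J^*(\Pi) \big].$$

The objective is to find the optimal power schedule with the minimal worst regret:

$$u^* \in \arg\min_{u \in \mathcal{U}} R(u). \tag{10}$$

It can be shown that the minimax regret problem in (10) can be reformulated as the linear program detailed below, where, for a fixed subset S, the worst-case price takes its lower bound on S (since $u(t) - P_r \le 0$) and its upper bound elsewhere:

$$\min_{u,\ \rho}\ \rho \tag{11a}$$

$$\text{subject to} \quad u \in \mathcal{U}, \tag{11b}$$

$$\rho \ \ge \sum_{t \in \mathcal{A} \setminus S} \overline{\Pi}(t)\, u(t) + \sum_{t \in S} \underline{\Pi}(t)\, [u(t) - P_r], \quad \forall S \subseteq \mathcal{A},\ |S| = K. \tag{11c}$$

Note that the number of constraints used to formulate the regret lower bounds (11c) is a combinatorial function of the size of the availability window $\mathcal{A}$ and of the minimum number of time slots needed for task completion. As a result, standard linear optimization algorithms might become computationally inefficient when the cardinality of $\mathcal{A}$ is large.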
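As an illustration of the utopia cost and of the regret induced by one known price profile, the sketch below picks the K cheapest available slots and subtracts the resulting cost from the cost of a given schedule. The function names and the toy numbers are hypothetical. The last line counts the subsets of size K in an availability window, under the assumption that the linear program carries one regret constraint per subset; the values 48 and 6 correspond to a 24-h window at 30-min resolution and a device needing six slots.

```python
from math import comb

def utopia_cost(prices, available, k, p_rated):
    """Cost of charging at rated power in the k cheapest available slots."""
    cheapest = sorted(p for p, a in zip(prices, available) if a)[:k]
    return p_rated * sum(cheapest)

def regret_for_prices(u, prices, available, k, p_rated):
    """Cost of schedule u minus the utopia cost, for one known price profile."""
    cost = sum(p * x for p, x in zip(prices, u))
    return cost - utopia_cost(prices, available, k, p_rated)

prices = [4.0, 1.0, 3.0, 2.0]
available = [True] * 4
u = [5.0, 5.0, 5.0, 5.0]  # 20 units of energy, so K = 2 when P_r = 10

print(utopia_cost(prices, available, 2, 10.0))           # 30.0
print(regret_for_prices(u, prices, available, 2, 10.0))  # 20.0

# Number of size-K subsets of a 48-slot window with K = 6.
print(comb(48, 6))  # 12271512
```

The count grows quickly with the window length, which is the reason a method that avoids enumerating subsets is attractive.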
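To solve the minimax regret optimization problem more efficiently, we propose an iterative algorithm in which the power schedule u(⋅) converges to the optimal solution u*(⋅) in (10), and accordingly the regret R(u) converges to R(u*). In order to do so, it is useful to derive a more explicit expression of the regret, as detailed below, where $\underline{\Pi}(t)$ and $\overline{\Pi}(t)$ denote the lower and upper price bounds, $\mathcal{A}$ is the availability window, and S ranges over the subsets of $\mathcal{A}$ with $|S| = K$:

$$\begin{aligned} R(u) &= \max_{\Pi} \Big[ \sum_{t \in \mathcal{A}} \Pi(t)\, u(t) - \min_{S} P_r \sum_{t \in S} \Pi(t) \Big] \\ &= \max_{S}\ \max_{\Pi} \Big[ \sum_{t \in \mathcal{A} \setminus S} \Pi(t)\, u(t) + \sum_{t \in S} \Pi(t)\, [u(t) - P_r] \Big] \\ &= \max_{S} \Big[ \sum_{t \in \mathcal{A} \setminus S} \overline{\Pi}(t)\, u(t) + \sum_{t \in S} \underline{\Pi}(t)\, [u(t) - P_r] \Big]. \end{aligned}$$

Now, let us define a function Q(t, u(t)) as follows:

$$Q(t, u(t)) = \overline{\Pi}(t)\, u(t) - \underline{\Pi}(t)\, [u(t) - P_r].$$

The quantity Q allows the regret to be expressed in a form that lends itself to explicit evaluation, without the need for a combinatorial test of all possible subsets of cardinality K as in the previous expression. In fact:

$$R(u) = \max_{S} \Big[ \sum_{t \in \mathcal{A}} \overline{\Pi}(t)\, u(t) - \sum_{t \in S} Q(t, u(t)) \Big].$$

The latter maximization problem can be easily solved by sorting the values of Q in ascending order and assigning to S the time instants in $\mathcal{A}$ with the K smallest values of Q. More precisely, the Kth smallest value of Q is defined as:

$$\bar{Q}(u) = \min \big\{ q \ :\ |\{ t \in \mathcal{A} : Q(t, u(t)) \le q \}| \ge K \big\}.$$

Accordingly, the discrete time availability interval is divided into three sets:

$$L_{in} = \{ t \in \mathcal{A} : Q(t, u(t)) < \bar{Q}(u) \}, \quad \bar{L} = \{ t \in \mathcal{A} : Q(t, u(t)) = \bar{Q}(u) \}, \quad L_{out} = \{ t \in \mathcal{A} : Q(t, u(t)) > \bar{Q}(u) \}. \tag{15}$$

Remark 2. Notice that the Q function, rather than an evaluation of the energy cost, is the basis for defining the three sets in (15). According to their definition, the sets $L_{in}$ and $L_{out}$ can be empty, whereas the set $\bar{L}$ is always nonempty. Concerning cardinality, it follows that $|L_{in}| + |\bar{L}| \ge K$.

Since the support of any power consumption profile in $\mathcal{U}_K$ has K elements, two cases should be taken into account in order to calculate the regret. The simplest case is $|L_{in}| + |\bar{L}| = K$, when every element of $L_{in}$ and $\bar{L}$ corresponds to a time instant where the regret is evaluated by costing energy at the lower bound of the price. In the second case, $|L_{in}| + |\bar{L}| > K$, one must choose a subset $\bar{L}_{in} \subseteq \bar{L}$ with cardinality $K - |L_{in}| > 0$, whose elements correspond to time instants where energy is costed using the lower bound of the price, while for the remaining elements in $\bar{L}_{out} = \bar{L} \setminus \bar{L}_{in}$ (of cardinality $|\bar{L}| - (K - |L_{in}|) > 0$) energy is costed using the price upper bound.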
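The regret for both cases is expressed as follows, with $\underline{\Pi}(t)$ and $\overline{\Pi}(t)$ denoting the lower and upper price bounds:

$$R(u) = \begin{cases} \displaystyle \sum_{t \in L_{in} \cup \bar{L}} \underline{\Pi}(t)\, [u(t) - P_r] + \sum_{t \in L_{out}} \overline{\Pi}(t)\, u(t), & |L_{in}| + |\bar{L}| = K, \\[2ex] \displaystyle \sum_{t \in L_{in} \cup \bar{L}_{in}} \underline{\Pi}(t)\, [u(t) - P_r] + \sum_{t \in \bar{L}_{out} \cup L_{out}} \overline{\Pi}(t)\, u(t), & |L_{in}| + |\bar{L}| > K. \end{cases} \tag{16}$$

Note that the set $\bar{L}$ is partitioned into two nonempty subsets of suitable cardinality in (16) when $|L_{in}| + |\bar{L}| > K$. While this partition might not be unique, the expression of the regret computed in (16) is independent of the choice made. This fact is illustrated by the following lemma:

Lemma 1. If $|L_{in}| + |\bar{L}| > K$, the elements in $\bar{L}_{in}$ and $\bar{L}_{out}$ are exchangeable without varying the value of the regret computed according to (16), that is, for all $t_1 \in \bar{L}_{in}$ and $t_2 \in \bar{L}_{out}$,

$$\underline{\Pi}(t_1)\, [u(t_1) - P_r] + \overline{\Pi}(t_2)\, u(t_2) = \overline{\Pi}(t_1)\, u(t_1) + \underline{\Pi}(t_2)\, [u(t_2) - P_r].$$

Proof. According to the definition of the set $\bar{L}$, for all $t_1 \in \bar{L}_{in}$ and $t_2 \in \bar{L}_{out}$, it holds that $Q(t_1, u(t_1)) = Q(t_2, u(t_2))$, or explicitly,

$$\overline{\Pi}(t_1)\, u(t_1) - \underline{\Pi}(t_1)\, [u(t_1) - P_r] = \overline{\Pi}(t_2)\, u(t_2) - \underline{\Pi}(t_2)\, [u(t_2) - P_r],$$

which implies

$$\underline{\Pi}(t_1)\, [u(t_1) - P_r] + \overline{\Pi}(t_2)\, u(t_2) = \overline{\Pi}(t_1)\, u(t_1) + \underline{\Pi}(t_2)\, [u(t_2) - P_r].$$

Therefore, when $|L_{in}| + |\bar{L}| > K$, we can select K − $|L_{in}|$ arbitrary elements of $\bar{L}$ as $\bar{L}_{in}$ without changing the regret. ▪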
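The sorting-based evaluation of the regret can be sketched as follows. This is an illustration under a stated assumption: Q is taken to be `hi(t)*u(t) - lo(t)*(u(t) - P_r)`, with `lo` and `hi` the lower and upper price bounds, which is the form obtained by expanding the worst-case regret for schedules with 0 ≤ u(t) ≤ P_r. All function names are hypothetical. The brute-force routine enumerates every subset of size K and is included only to cross-check the sorted evaluation on a small instance.

```python
from itertools import combinations

def q_values(u, lo, hi, p_rated):
    """Assumed form: Q(t, u(t)) = hi(t)*u(t) - lo(t)*(u(t) - P_r)."""
    return [h * x - l * (x - p_rated) for x, l, h in zip(u, lo, hi)]

def worst_regret_sorted(u, lo, hi, p_rated, k):
    """Worst-case regret from the K smallest Q values (no subset enumeration)."""
    q = sorted(q_values(u, lo, hi, p_rated))
    return sum(h * x for x, h in zip(u, hi)) - sum(q[:k])

def worst_regret_bruteforce(u, lo, hi, p_rated, k):
    """Enumerate every subset S of size K: lower prices on S, upper elsewhere."""
    slots = range(len(u))
    best = float("-inf")
    for subset in combinations(slots, k):
        chosen = set(subset)
        value = sum(
            lo[t] * (u[t] - p_rated) if t in chosen else hi[t] * u[t]
            for t in slots
        )
        best = max(best, value)
    return best

def partition(u, lo, hi, p_rated, k, tol=1e-9):
    """Split the slots into (L_in, L_bar, L_out) around the K-th smallest Q."""
    q = q_values(u, lo, hi, p_rated)
    q_bar = sorted(q)[k - 1]
    l_in = [t for t, v in enumerate(q) if v < q_bar - tol]
    l_bar = [t for t, v in enumerate(q) if abs(v - q_bar) <= tol]
    l_out = [t for t, v in enumerate(q) if v > q_bar + tol]
    return l_in, l_bar, l_out

lo = [1.0, 2.0, 1.0, 2.0]
hi = [3.0, 2.0, 3.0, 4.0]
u = [5.0, 5.0, 5.0, 5.0]
print(worst_regret_sorted(u, lo, hi, 10.0, 2))      # 20.0
print(worst_regret_bruteforce(u, lo, hi, 10.0, 2))  # 20.0
print(partition(u, lo, hi, 10.0, 2))                # ([], [0, 1, 2], [3])
```

In this instance three slots share the K-th smallest Q value while only two are needed, and any two of them give the same regret, which is the exchangeability stated in Lemma 1.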

Regrets with correlated uncertainty
The price uncertainty model considered so far can be equivalently formulated by introducing the vector Π of prices, and specifying that Π is known to belong to the set: Since  is a Cartesian product, this can be interpreted as the deterministic counterpart of independent stochastic prices at each time instant. One might argue that, in practice, this is not often the case. For instance, on a sunny or windy day availability of renewable cheap energy could be uniformly high throughout the day and assuming a strong fluctuation of energy price on a hour to hour basis (or less) might seem not very plausible. We show next how to extend our approach to include a more general type of uncertain set , one which accounts for some form of correlated price uncertainty. In particular, we define  as follows: Notice that  is no longer a Cartesian product as an uncertain constant price shift is added to each price interval. Alternatively  can be written as: Notice that the following inclusion holds: however, this approach tends to overestimate the price fluctuations possible on an hour to hour basis so that using these new bounds on minimal and maximum price at each time t might result in suboptimal decisions. In the following we show that one may still use the regret expression computed with uncertainty bounds Π(t) ∈ [Π(t), Π(t)] in order to devise optimal decisions also in the case of uncertainty set  as in (18). While the expression for the utopia price is clearly unchanged, a different formula for regret evaluation is needed in this case: More explicitly, by substituting the formula for the utopia price, we obtain: where the last equality follows since: Notice that Π av and its bounds have no impact at all on the regret and therefore optimization of the regret can proceed along the same lines as with a Cartesian product uncertainty set. We show next another form of correlated uncertainty which has no impact on the modeling of the problem. 
In particular we consider an uncertainty set of the following type: From the physical point of view this corresponds to a correlated deterministic uncertainty that results in a time-independent rescaling of upper and lower prices. It could be seen as a (time-invariant) uncertainty in the exchange rate of the currency used to cost energy and the one owned by the agent minimizing its cost. We proceed to the evaluation of regret for this type of uncertainty set.
By substituting the formula for the utopia price, we obtain: where the second equality follows by redefiningΠ = Π∕ and rewriting the corresponding maximization problem.
In particular, the regret R(u) equals times the regret corresponding to the situation [ , ] = {1}. Because of this simple calculation, we conclude that multiplicative constant uncertainty does not affect the choice of optimal regret allocation.
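Both invariance properties can be checked numerically for a single price realization. The sketch below uses hypothetical names and toy numbers, and assumes a schedule whose total energy equals K times the rated power: adding the same constant to every price leaves the regret unchanged, while multiplying every price by the same positive factor rescales the regret by that factor. Since this holds for each realization, it carries over to the maximum over the uncertainty set.

```python
from math import isclose

def regret(u, prices, k, p_rated):
    """Cost of u minus the cost of charging in the k cheapest slots."""
    cost = sum(p * x for p, x in zip(prices, u))
    utopia = p_rated * sum(sorted(prices)[:k])
    return cost - utopia

u = [2.0, 8.0, 6.0, 4.0]  # total energy 20 = K * P_r with K = 2, P_r = 10
prices = [3.0, 1.0, 2.5, 4.0]

base = regret(u, prices, 2, 10.0)
shifted = regret(u, [p + 1.7 for p in prices], 2, 10.0)  # common additive shift
scaled = regret(u, [1.3 * p for p in prices], 2, 10.0)   # common rescaling

print(isclose(shifted, base, abs_tol=1e-9))         # True
print(isclose(scaled, 1.3 * base, abs_tol=1e-9))    # True
```

The shift cancels because it adds the same amount, the shift times the total energy, to both the schedule cost and the utopia cost.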

ITERATIVE OPTIMIZATION STRATEGY
An iterative regret minimization algorithm is introduced in this section and its convergence to the optimal power allocation of minimal worst regret is justified. Before studying the technical details of the algorithm, we first illustrate its overall structure. The proposed iterative algorithm is described in Algorithm 1. It updates the power schedule sequentially to reach the optimal equilibrium. The algorithm is initialized with a feasible power schedule $u_0$ and two variables: k is the iteration index, whereas conv indicates whether the optimal equilibrium has been reached. Then, in the main part, a while loop is executed in which the agent checks whether there is any nontrivial (strictly positive) power swap that can reduce its regret. At each execution of the loop, one of the three elementary power swap policies, which will be defined in Section 3.1, is applied to update the previous power schedule $u_{k-1}$. Finally, the indicator conv is set to 0 and the algorithm proceeds to the next iteration. When no strictly positive power swap decreases the regret, the while loop terminates and the power schedule $u_k$ has reached an optimal solution.

Elementary power swap
An arbitrary initial power allocation is sequentially updated in response to uncertain price signals, with the objective of reducing the regret function. In particular, this sequential process consists of two types of swap operations: (i) a power swap from a single time instant to a single time instant, that is, 1 → 1; (ii) a power swap from a single time instant to multiple time instants, that is, 1 → m, or from multiple time instants to a single time instant, that is, n → 1.

Algorithm 1. Iterative scheme: worst regret optimization
  if a power swap δ > 0 exists such that R(u_k + δ) − R(u_k) < 0 then
    at least one of the following swap policies holds: 1 → 1 power swap, 1 → m power swap, n → 1 power swap
  otherwise
    the power schedule is the optimal solution with the minimum worst regret: u_k → u*.
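The overall loop structure can be sketched with a deliberately simplified descent that uses pairwise (1 → 1) swaps only and a trial step that is halved whenever no improving swap is found. This is not Algorithm 1 itself: the exact step sizes and the multi-instant swaps are omitted, so the sketch can stall before reaching the optimum. The regret evaluation assumes Q(t, u(t)) = hi(t)·u(t) − lo(t)·(u(t) − P_r), and all names are hypothetical.

```python
def worst_regret(u, lo, hi, p_rated, k):
    """Worst-case regret from the K smallest values of the assumed Q function."""
    q = sorted(h * x - l * (x - p_rated) for x, l, h in zip(u, lo, hi))
    return sum(h * x for x, h in zip(u, hi)) - sum(q[:k])

def pairwise_descent(u0, lo, hi, p_rated, k, step=1.0, min_step=1e-6):
    """Accept any feasible pairwise swap that strictly lowers the regret."""
    u = list(u0)
    best = worst_regret(u, lo, hi, p_rated, k)
    while step >= min_step:
        conv = True  # stays True if no improving swap exists at this step size
        for src in range(len(u)):
            for dst in range(len(u)):
                if src == dst:
                    continue
                # Feasibility: cannot remove more than u[src] or exceed P_r.
                delta = min(step, u[src], p_rated - u[dst])
                if delta <= 0.0:
                    continue
                trial = list(u)
                trial[src] -= delta
                trial[dst] += delta
                value = worst_regret(trial, lo, hi, p_rated, k)
                if value < best - 1e-12:
                    u, best, conv = trial, value, False
        if conv:
            step /= 2.0
    return u, best

lo = [1.0, 1.0, 1.0, 1.0]
hi = [1.0, 2.0, 3.0, 4.0]
u0 = [0.0, 0.0, 10.0, 10.0]
start = worst_regret(u0, lo, hi, 10.0, 2)
u, best = pairwise_descent(u0, lo, hi, 10.0, 2)
print(start, best)
```

Every accepted swap moves power between two instants, so the total energy and the bounds are preserved throughout.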

1 → 1 power swap
The first type of elementary power swap transfers power units from time t to t. Starting from an initial power schedule u ∈ , the updated power profileũ after the swap will have the following expression: An equivalent compact definition reads as:ũ where ⃗ e s is the vector of the canonical basis associated with the sth element. For a given power schedule u ∈ , the amount of power that can be swapped by the single device is limited by the quantity , defined as follows: Each of the four terms of the minimum function in (24) is now described in detail: • Maximum feasible power increase at t: • Maximum feasible power decrease at t: The bounds a and b ensure the new power scheduleũ after the swap fulfills the constraints in (1) for ≤ a and ≤ b.
• Maximum regret-reducing power swap within region of constant gradient: The last term c in (24) ensures that > 0 only if the associated power swap reduces the regret and, for the sake of its explicit evaluation, it is computed within a region of constant gradient, which is always guaranteed to exist for a piecewise linear continuous function: Notice that the feasibility conditions (25) and (26) are easily determined according to the given power allocation, whereas the determination of c(u, t, t) in (27) is not straightforward. Since the worst regret R(u) can be evaluated using the Q values as expressed in (13), it is worth to analyse the relation between the power swap and the worst regret variation via Q values. The detailed examination for regret variation as well as the determination of c(u, t, t) are elaborated numerically in Appendix A.1.
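The two feasibility bounds of the 1 → 1 swap can be illustrated with a short sketch. The names are hypothetical, and the third bound (the regret-reducing limit derived in Appendix A.1) is not computed here; it is stood in for by the caller-supplied argument `amount`.

```python
def swap_bounds(u, t_out, t_in, p_rated):
    """Return (a, b): headroom at the receiving instant and power
    available at the source instant."""
    a = p_rated - u[t_in]
    b = u[t_out]
    return a, b

def swap_1to1(u, t_out, t_in, amount, p_rated):
    """Move at most `amount` from t_out to t_in without leaving [0, P_r]."""
    a, b = swap_bounds(u, t_out, t_in, p_rated)
    delta = max(0.0, min(amount, a, b))
    new_u = list(u)
    new_u[t_out] -= delta
    new_u[t_in] += delta
    return new_u, delta

u = [10.0, 4.0, 0.0]
print(swap_1to1(u, t_out=1, t_in=2, amount=6.0, p_rated=10.0))  # ([10.0, 0.0, 4.0], 4.0)
print(swap_1to1(u, t_out=1, t_in=0, amount=6.0, p_rated=10.0))  # ([10.0, 4.0, 0.0], 0.0)
```

In the second call the receiving instant is already at rated power, so the admissible swap is zero and the schedule is returned unchanged.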

1 → m and n → 1 power swap
The second type of elementary power swap is between the multiple time instants and a single time instant. Let us denote the power units from a single time t to the set S of m time instants by 1→m , and the proportion of power units into t ∈ S by (t) 1→m . The weighting parameter fulfils 0 ≤ (t) ≤ 1, ∀t ∈ S and ∑ t∈S (t) = 1. Accordingly, the updated power scheduleũ is provided:ũ and its compact formũ Similarly, the n → 1 power swap policy from a set S of n time instants to a single time t with power units n→1 can be defined asũ where 0 ≤ (t) ≤ 1, ∀t ∈ S and ∑ t∈S (t) = 1. Next, we determine the maximum feasible amount of power 1→m and n→1 that can be swapped: Each of the terms of the minimum function in (31) is now described in detail: • Maximum feasible power increase at t: • Maximum regret-reducing power swap within region of constant gradient: To guarantee that the regret decreases by implementing 1 → m or n → 1 power swaps, the bounds c 1→m (u, t, S) in (31a) and c n→1 (u, S, t) in (31b) are imposed in each case. This also guarantees that the rate of decrease as a function of the considered amount of power swapped is constant. In particular, we let: and: Similar to the approach in the 1 → 1 power swap, Q values are adopted to investigate the impact of power swap between multiple time instants on the worst regret variation. The calculation of both c 1→m and c n→1 is analytically illustrated in Appendix A.2.
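A 1 → m swap with given weights can be sketched as below. The names are hypothetical and only the feasibility limits are enforced: the amount removed cannot exceed the power at the source instant, and each weighted share must fit in the headroom of its target. The regret-reducing bound from Appendix A.2 is again represented by the caller-supplied `amount`. The n → 1 swap is the mirror image, with power removed from the set and added to the single instant.

```python
def swap_1_to_m(u, t_out, targets, weights, amount, p_rated):
    """Move power from t_out to several targets, split according to weights."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to 1")
    limit = u[t_out]  # cannot remove more than is scheduled at the source
    for t, w in zip(targets, weights):
        if w > 0:
            limit = min(limit, (p_rated - u[t]) / w)  # headroom at each target
    delta = max(0.0, min(amount, limit))
    new_u = list(u)
    new_u[t_out] -= delta
    for t, w in zip(targets, weights):
        new_u[t] += w * delta
    return new_u, delta

u = [10.0, 2.0, 6.0, 0.0]
new_u, delta = swap_1_to_m(u, 0, [1, 2], [0.5, 0.5], amount=10.0, p_rated=10.0)
print(new_u, delta)  # [2.0, 6.0, 10.0, 0.0] 8.0
```

Here the second target saturates first, which caps the swap at 8 units even though 10 were requested.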
Based on the elementary swap policies described so far, the proposed optimization strategy can be summarized as follows:
1. At the beginning, we initialize a feasible power schedule u(⋅) and determine the associated Q values over the availability window.
2. By sorting the Q values, the three sets in (15) are defined and the initial regret is calculated according to (16).
3. Each power swap scenario described in Appendix A is tested in search of a feasible swap that reduces the worst regret. Specifically:
(a) The 1 → 1 power swap comprises 6 subcases, according to the sets to which the considered time instants belong and the swapping direction. The Q value at the time instant with power swapped out decreases, whereas the other instant, with power swapped in, has an increased Q value.
(b) Both the 1 → m and the n → 1 power swaps comprise 2 scenarios. In a 1 → m (n → 1) power swap, the Q values of the m (n) time instants increase (decrease) by the same amount.
4. When the amount of power that can be swapped in all scenarios is zero, the device can no longer improve its regret and the algorithm has reached an optimal solution u*(⋅).
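Step 3(b) requires the Q values of the receiving (or releasing) instants to move by the same amount. Assuming that Q is linear in u(t) with slope hi(t) − lo(t), the width of the price interval, this is achieved by splitting the swapped power in inverse proportion to the widths. The sketch below computes such weights; the names are hypothetical and strictly positive widths are assumed.

```python
def equalizing_weights(lo, hi, targets):
    """Weights, summing to 1, that give every target the same change in Q
    when Q has slope hi(t) - lo(t) with respect to u(t)."""
    inverse_widths = [1.0 / (hi[t] - lo[t]) for t in targets]
    total = sum(inverse_widths)
    return [v / total for v in inverse_widths]

lo = [1.0, 1.0, 1.0]
hi = [2.0, 3.0, 5.0]
targets = [0, 1, 2]
weights = equalizing_weights(lo, hi, targets)

delta = 7.0  # total power moved into the targets
q_shifts = [(hi[t] - lo[t]) * w * delta for t, w in zip(targets, weights)]
print(weights)   # approximately [4/7, 2/7, 1/7]
print(q_shifts)  # approximately [4.0, 4.0, 4.0]
```

Instants with a narrow price interval receive a larger share, because their Q value reacts less to each unit of power.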
By adopting this optimization strategy, from the theoretical point of view, only asymptotic convergence to the optimum is guaranteed as analyzed in the next section.

Optimality of the algorithm
This section engages to prove our iterative algorithm asymptotically converges to power allocations with the minimum worst regret by considering all elementary power swapping directions in Section 3.1 that strictly decrease the regret of the agent. For a feasible power allocation we define Δ * (u) as follows: where F(u) denotes the set of power allocations resulting from u after a single swap of any kind, which result in a nonincreasing regret. In this respect Δ * (u) denotes the maximum regret decrease that can result from u after a single swap. For technical reasons, and in order to avoid the possibility of swaps that do not lead to a strict regret improvement, we embed our algorithm in the following difference inclusion: for some positive ∈ (0, 1]. The index k denotes the iteration number and it is increased by one. Taking = 1, forces the algorithm to always pick a swap that yields the maximum regret reduction, however, more flexibility can be allowed and arbitrary positive values of < 1 are also suitable for convergence analysis. Whenever, for a particular kind of swap, the minimum in (24) or in the subsequent Equation (31) is 0 we include u in F(u). This guarantees upper semicontinuity of F(u) and consequently upper-semicontinuity ofF(u). Our main result for this subsection is the following:

Theorem 1. Under Assumptions 1 and 2, any solution of (37) initialized from a feasible power schedule asymptotically converges to a set of power allocations (normally a singleton) of minimal regret.
To prove that the proposed algorithm always converges to allocations with minimal worst regret, it is necessary to show that the considered set of potential power swap directions is rich enough, and specifically that, when no power swap can be found, the current power allocation is necessarily one of minimum regret. The following technical lemmas are presented.

Lemma 2.
Let u ∈ be a feasible power allocation, S ⊆ and S ⊆ be two sets such that S ∩ S = ∅. If there exists > 0 such that: then, at least one of the following holds: 1. There exists t ∈ S and̂> 0 such thatũ is feasible, and 2. There exists t ∈ S and̂> 0 such thatũ is feasible, and Proof. See Appendix B. ▪ Lemma 3. Let u be a feasible power allocation, t ∈ and S ⊆ such that t ∉ S. If there exists > 0 such that: then, at least one of the following holds: 1. There exists t ∈ S and̂> 0 such thatũ is feasible, and 2. There exists S † ⊆ S L ⊆ S and̂> 0 such that is feasible and Proof. See Appendix C. ▪ Lemma 4. Let u be a feasible power allocation, t ∈ and S ⊆ such that t ∉ S. If there exists > 0 such that: then, at least one of the following holds: 1. There exists t ∈ S and̂> 0 such thatũ is feasible, and 2. There exists S † ⊆ S L ⊆ S and̂> 0 such that is feasible, and Proof. See Appendix D. ▪ Corollary 1. According to Lemmas 2,3,and 4, whenever there exists a feasible swap > 0 such that R(u + ) < R(u), at least one of the following swaps exists and decreases the regret: Corollary 1 enunciates that any power swap can be achieved by employing sequentially the two types of elementary swaps. We can conclude that the application of the iterative algorithm guarantees convergence to the optimal solution u * (⋅) for all feasible initial power schedule under Assumptions 1 and 2. In the final optimal power profile, no elementary power swap is convenient to reduce the regret.
We are now ready to prove Theorem 1.

EXAMPLES
The performance of the proposed iterative algorithm is evaluated in this section through a series of case studies. All simulations are performed over a 24-h time horizon with a resolution of Δt = 30 min. For illustration, a single agent under various types of uncertain prices is considered, followed by the implementation on a large population of agents. The required energy and rated power of the single EV are assumed to be E r = 30 kWh and P r = 10 kW, respectively. Therefore, the minimum charging duration required for a full charge is 3 h. be realistic to have more accurate prediction in the near future, an increasing price interval has been adopted. According to the resulting charging schedules, it is observed that the nature of the optimal power allocation, in this scenario, depends on the ratio between the available charging hours and the minimum required charging duration. In Figure 1, where the available charging time is less than twice the minimum required charging duration, time instants with lower uncertainty are assigned a higher charging power. When the available charging time is greater than twice the minimum required charging duration, the optimal charging schedule is reversed, as in Figure 3. Figure 2 shows that the optimal charging power is evenly allocated when the available charging time is equal to twice the minimum required charging duration. Notice that minimum regret policies are, in this case, also optimal with respect to nominal (average) prices (this is because the average price is constant and any policy leads to the same cost). On the other hand, neglecting uncertainty in the decision, by simply using nominal cost minimization, could lead to significantly suboptimal energy bills if prices happen to be lower or higher than expected.
Moreover, the computational times for different problem sizes are recorded in Table 1, which indicates that our proposed algorithm has no advantage over the linear programming approach if the number of decision variables is small and the prices have a constant average value.
• Varying nominal prices, constant uncertainties: To consider more diversified patterns of overlapping price intervals, a slowly varying nominal price profile with a constant uncertainty interval is considered. Based on the results shown in Figures 4-6, it is concluded that the optimal schedule allocates more power to time instants with smaller average prices if the uncertainties are identical. Regarding computational complexity, the proposed algorithm significantly outperforms the linear programming approach, especially when the number of decision variables is large, as shown in Table 2. For instance, the number of constraints in the linear OP is
• Varying nominal prices, increasing uncertainties: This scenario is likely to occur more frequently in applications, for example when the prediction of the price is more accurate at the beginning of the window and less certain afterwards. Figure 7 shows that the optimal charging powers are almost flat, owing to the similar average prices and uncertainties. By contrast, the optimal schedule allocates less power at time instants with higher average prices and higher uncertainty, as in Figures 8 and 9. Moreover, in this scenario our proposed algorithm also achieves significantly higher computational efficiency for larger availability windows (Table 3).
• Constant lower prices, increasing uncertainties: Figure 10 shows the results under a price realization with identical lower bounds and increasing upper bounds. The uncertainty in this scenario increases over time, so the optimal power profile allocates more power at the beginning and less power later on.
• Constant upper prices, increasing uncertainties: Although the worst-case prices are all the same, as in Figure 11, the optimal minimum regret strategy is to consume more power at time instants with smaller lower prices. Notice that if the upper prices are virtually constant but in fact slowly increasing, the minimum regret optimal policy will not change significantly. On the other hand, a worst-case approach would concentrate all power at the initial times, potentially missing out on considerable savings if the worst case does not materialize. Conversely, the cost of the minimum regret strategy when the highest prices occur is virtually the same as that of the optimal worst-case strategy.

CONCLUSION
This article proposes a novel optimization method for a single agent to schedule its energy absorption profile in order to minimize the worst-case regret over all price scenarios. The algorithm guarantees convergence to an equilibrium charging schedule whose optimality is justified. Under different uncertainty hypotheses and flexibility parameters, the resulting computational efficiency is compared with the corresponding linear programming approach. While only a single EV was considered in this article, it is possible to deploy this same algorithm in order to schedule power absorption profiles of EV fleets. However, unlike the case of a single EV which has negligible market power in electricity pricing, a large population of devices could induce price changes and this aspect has not been modeled in the current analysis, where a fixed (albeit uncertain) price profile is considered. Specifically, the worst regret of an individual agent may change if the broadcast price profile is updated by other devices through modifying their operation schedules. The coupling of the regret minimization and price variations can be simulated for large populations, but the overall convergence of the scheme is still an open research question. Future research will extend the algorithm to systems with large device populations and analytically characterize the convergence to the social equilibrium and optimality of the solution. The robustness and disturbance rejection properties will also be investigated. Finally, the performance of the worst regret-based scheme and worst cost method will be compared.

ACKNOWLEDGMENTS
This research was partly supported by two research projects: Integrated Development of Low-Carbon Energy Systems (IDLES, Grant EP/R045518/1) and Active Building Centre (ABC, Grant EP/V012053/1).

APPENDIX A. ITERATIVE ALGORITHM SCHEME
For ease of explanation, we also introduce notation for the maximum Q value in L in and the minimum Q value in L out. The iterative optimization algorithm then proceeds according to the following procedures, which fall into two main categories.

A.1 Swapping between two time instants
The first and second terms in (A2) correspond to a(u, t) and b(u, t) in (24), while the last quantity in (A2), which represents c(u, t, t), is obtained by the inequality where is to preserve the order of Q values and hence the lower price. Then, the updated regret is Since the definition in (A3) implies [Π(t) − Π(t)] ≥ 0, the regret is nonincreasing.

2. |L_out| ≥ 2, t_1 ∈ L_out, t_2 ∈ L_out, t_1 ≠ t_2
We swap power The last quantity in (A6), corresponding to c(u, t, t) in (24), is obtained by the inequality Then, the updated regret is Since the definition in (A7) implies [Π(t) − Π(t)] ≥ 0, the regret is nonincreasing.

3. |L| ≥ 2, t_1 ∈ L, t_2 ∈ L, t_1 ≠ t_2
Since the adopted prices for t ∈ L in the regret evaluation depend on its cardinality, we consider the following two cases:
• |L_in| + |L| = K
From t_1 to t_2, we swap power where , is determined by the following inequalities Then, with the prices Π(t_1) and Π(t_2) as |L_in| + |L| = K, the regret variation is expressed as where c(u, t_1, t_2) = min (A15)
Accordingly, with the prices Π(t_1) and Π(t_2) as |L_in| + |L| > K, the regret is updated by
Therefore, in the case of swapping power between time instants t_1 ∈ L, t_2 ∈ L, the policy is to swap Δ as in (A10) where c(u, t_1, .

The regret is updated by
- From t_2 to t_1, we swap power and the regret is updated by
Hence, when |L_in| + |L| = K, the swap policy is where c(u, t_1, , and the regret is updated by
- From t_2 to t_1, we swap power , and the regret is updated by
Hence, when |L_in| + |L| > K, the swap policy is where c(u, t_1, t_2) = min , } is determined according to the inequality

The regret is updated by
- From t_2 to t_1, we swap power , and the regret is updated by
Hence, when |L_in| + |L| = K, the swap policy is where c(u, t_1, t_2) = min , and the regret is updated by
- From t_2 to t_1, we swap power where c(u, t_2, t_1) = , and the regret is updated by
Hence, when |L_in| + |L| > K, the swap policy is

6. L_in ≠ ∅, L_out ≠ ∅, t_1 ∈ L_in, t_2 ∈ L_out
• From t_1 to t_2, we swap power in which there is no condition on c(u, t_1, t_2) since the prices at all time instants will be preserved for all Δ a > 0. The regret is updated by
• From t_2 to t_1, we swap power where , Then, the updated regret is
Therefore, in the case of swapping power between time instants t_1 ∈ L_in, t_2 ∈ L_out, the policy is defined as follows:

A.2 Power transfer between a single time instant and multiple times
Besides the swaps between two time instants, one needs to consider power swaps that operate simultaneously between several instants in L while preserving the alignment of the Q variables. Since only feasible swaps are of interest, we only consider time instants in L where either the upper or the lower bound is not active; the corresponding sets have cardinalities |L*| = K_L* ≥ 2 and |L**| = K_L** ≥ 2.
To evaluate the regret after swapping, we first need to partition the set L* into two subsets L*_in and L*_out such that L* = L*_in ∪ L*_out, L*_in ∩ L*_out = ∅ and |L_in| + |L*_in| = K. The same applies to L**, and the two resulting subsets are denoted by L**_in and L**_out. While this partition is not uniquely defined, the cardinality condition is enough to guarantee that any subsequent regret evaluation is independent of the chosen partition.
1. From L* to t_0 ∈ L_in
Let us denote the amount of power added to t_0 by (t_0). In order to decrease the corresponding Q function by the same value for any time in L*, the power to be swapped out must satisfy the following equations: By solving the above equations, we obtain the swapped out power at ∈ L* where . (A48) We swap power where the second term corresponds to b_{n→1} in (31b) and the third term is selected according to Then, the regret is updated by

2. From t_0 ∈ L_in ∪ L* to L**
Let us denote the amount of power removed from t_0 by (t_0). In order to increase the corresponding Q function by the same value for any time in L**, the power to be swapped out must satisfy the following equations: ∑ By solving the above equations, we obtain the swapped out power at ∈ L** where . (A54) We swap power where the first term corresponds to a_{1→m} in (31a) and the third term . (A56) Then, the regret is updated by

3. From L* to t_0 ∈ L** ∪ L_out
Let us denote the amount of power added to t_0 by (t_0). In order to decrease the corresponding Q function by the same value for any time in L*, the power to be swapped out must satisfy the following equations: By solving the above equations, we obtain the swapped out power at ∈ L* where . (A60) We swap power where the second term corresponds to b_{n→1} in (31b) and the third term Then, the regret is updated by

4. From t_0 ∈ L_out to L**
Let us denote the amount of power removed from t_0 by (t_0). In order to increase the corresponding Q function by the same value for any time in L**, the power to be swapped out must satisfy the following equations: By solving the above equations, we obtain the swapped out power at ∈ L** where . (A66) We swap power where the first term corresponds to a_{1→m} in (31a) and the third term , Then, the regret is updated by
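The need for transfers between one instant and several others can be illustrated on a deliberately simplified model, which is an assumption of this sketch and not the Q-based regret evaluation used in the article: a single device with fixed total energy, power bounds [0, p_max] in every slot, and an independent price interval [lo, hi] per slot. The names `best_cost`, `regret`, `LO`, `HI` and `P_MAX` are hypothetical. Because the cost gap is convex in the prices, the worst case over the price box is attained at one of its vertices, so a brute-force enumeration is sufficient for a few slots.

```python
from itertools import product


def best_cost(prices, energy, p_max):
    """Perfect-information cost: fill the cheapest slots first (linear cost)."""
    remaining, cost = energy, 0.0
    for price in sorted(prices):
        power = min(p_max, remaining)
        cost += price * power
        remaining -= power
    return cost


def regret(u, lo, hi, p_max):
    """Worst-case regret of schedule u over all vertices of the price box."""
    energy = sum(u)
    return max(
        sum(p * x for p, x in zip(prices, u)) - best_cost(prices, energy, p_max)
        for prices in product(*zip(lo, hi))
    )


# Hypothetical data: three slots, identical price intervals, 3 units of energy.
LO, HI, P_MAX = [1.0, 1.0, 1.0], [3.0, 3.0, 3.0], 3.0

start = [3.0, 0.0, 0.0]      # all energy in the first slot
one_to_one = [2.0, 1.0, 0.0]  # move 1 unit from slot 1 to slot 2 only
one_to_two = [2.0, 0.5, 0.5]  # move 1 unit from slot 1, split over slots 2 and 3
uniform = [1.0, 1.0, 1.0]

print(regret(start, LO, HI, P_MAX))       # 6.0
print(regret(one_to_one, LO, HI, P_MAX))  # 6.0, no improvement
print(regret(one_to_two, LO, HI, P_MAX))  # 5.0
print(regret(uniform, LO, HI, P_MAX))     # 4.0
```

In this toy instance, moving power from the first slot to a single other slot leaves the worst-case regret unchanged, whereas moving the same amount to two slots at once reduces it, which is the kind of situation that motivates the one-to-many and many-to-one transfers of this subsection.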

APPENDIX B. PROOF OF LEMMA 2
Notice that the function R(⋅) is convex and there exists > 0 such that R(u + ) < R(u), then for any ∈ (0, 1), the regret of power allocation u + fulfills Since R(⋅) is piecewise linear, there exists ∈ (0, 1) with = sufficiently small such that for all̃∈ (0, ) and̃=̃, the sets Lĩ n , L̃L and Lõ ut are constant, where Lĩ n , L̃L and Lõ ut corresponds to L in , L and L out for the schedule u +̃. Moreover, the power schedule u +̃is also feasible because u +̃=̃(u + ) + (1 −̃)u. Next, based on the power schedule u, let us partition S and S according to: Swapping power̃> 0 from S to S only changes the costs associated with the time instants belonging to these two sets, therefore it is essential to evaluate their relation to Lĩ n , L̃L, and Lõ ut which is dependent on the cardinality of S L and S L . Specifically, for S L , the following two scenarios should be considered: Similarly, for S L , the set S fulfills Note that |S L | + |S L | ≤ | L|, hence at least one of the following inequalities holds: Thus, the variation of total regret due tõ> 0 follows one the three cases: For the feasible swap̃> 0, Then, the regret reduction R(u +̃) − R(u) can be rewritten in any of the following two forms: (i) based on t ∈ S: where (t, S) =̃( t) ∑ t∈S̃( t) and the power swap̂( t,S) is defined aŝ Since by construction R(u +̃) − R(u) < 0, there exists t ∈ S such that the power allocation u +̂( t,S) is feasible and the resulting regret variation fulfils R(u +̂( t,S) ) − R(u) < 0, which implies conclusion 1 in the Lemma.
(ii) based on t ∈ S: and the power swap Since by construction R(u +̃) − R(u) < 0, there exists t ∈ S such that the power allocation u +̂( S,t) is feasible and the resulting regret variation fulfils R(u +̂( S,t) ) − R(u) < 0, which implies conclusion 2 in the Lemma.
For the feasible swap̃> 0, with the power swap policŷ( wherê(t) = (t, S)̃(t), ∀t ∈ S. Notice that̂( t,S) feeds energy into S according to the same relative powers as̃, which entails the same relative variations of Q and accordingly the same prices in the expression of the regret variation. Thus, there exists t ∈ S such that the power allocation u +̂( t,S) is feasible and the resulting regret variation fulfils R(u +̂( t,S) ) − R(u) < 0, which implies conclusion 1 in the Lemma.
For the feasible swap̃> 0, where S L = S * L ∪ S * * L is some set partition such that |S * with the power swap policŷ( wherê(t) = (t, S)̃(t), ∀t ∈ S. Notice that̂( S,t) extracts energy from S according to the same relative powers as̃, which entails the same relative variations of Q and accordingly the same prices in the expression of the regret variation. Thus, there exists t ∈ S such that the power swap u +̂t is feasible and the resulting regret variation fulfils R(u + (S,t) ) − R(u) < 0, which implies conclusion 2 in the Lemma.
Therefore, if the S → S (n → m) swap decreases the total regret, there exists at least one 1 → m or n → 1 power swapping policy that also reduces the total regret.
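The convexity step used at the start of this proof can be written out explicitly. The symbol names below are assumptions made for this sketch, since the original glyphs are not legible here: \delta denotes the regret-reducing swap with R(u + \delta) < R(u), and \lambda \in (0, 1) is the scaling factor.

```latex
% Assumed notation: \delta is a swap with R(u + \delta) < R(u), \lambda \in (0,1).
\begin{align*}
R(u + \lambda\delta)
  &= R\bigl(\lambda\,(u + \delta) + (1 - \lambda)\,u\bigr) \\
  &\le \lambda\, R(u + \delta) + (1 - \lambda)\, R(u) \\
  &< \lambda\, R(u) + (1 - \lambda)\, R(u) \;=\; R(u).
\end{align*}
```

The first equality is also what makes the scaled schedule feasible: it is a convex combination of the two feasible schedules u and u + \delta.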

APPENDIX C. PROOF OF LEMMA 3
Following the same lines of proof in Lemma 2, there exists̃= −̃(t)⃗ e t + ∑ t∈S̃( t)⃗ e t sufficiently small such that the power swap u +̃is feasible, the sets Lĩ n , L̃L, and Lõ ut are constant, and R(u +̃) − R(u) < 0. Specifically, the regret variation is evaluated according to the following three cases: (i) | L| = K − |L in |: For the feasible power swap̃> 0, it holds S in ⊆ Lĩ n , S L ⊆ Lĩ n ∪ L̃L, S out ⊆ Lõ ut , hence, the total regret variation is calculated as where the change of power allocation is defined aŝ (t,t) = −̃(t)⃗ e t +̃(t)⃗ e t , ∀t ∈ S.
Sincê( t,t) transfers energy from t into t according to the same relative powers as̃, it preserves the same relative variations of Q and accordingly the same prices in the expression of the regret variation. Then, by Equation (C1), there exists t ∈ S such that the power allocation u +̂( t,t) is feasible and the resulting regret variation fulfils R(u +̂( t,t) ) − R(u) < 0, which implies conclusion 1 of the Lemma.
Note that̂( t,t) transfers energy from t into t according to the same relative powers as̃, which entails the same relative variations of Q and accordingly the same prices in the expression of the regret variation. Then, by Equation (C2), there exists t ∈ S such that the power allocation u +̂( t,t) is feasible and the resulting regret variation fulfils R(u +̂( t,t) ) − R(u) < 0, which implies conclusion 1 of the Lemma.
Notice that for all t ∈ S in ∪ S out , we can define the 1 → 1 power swap̂( t,t) = −̃(t)⃗ e t +̃(t)⃗ e t such that in which the same prices are taken as in Equation (C3). However, this is not true for t ∈ S L , as the prices at t ∈ S * L take their lower values, whereas those at t ∈ S * * L take their upper values. In other words, the resulting regret R(u −̂(t)⃗ e t + (t)⃗ e t ) might not be a convex combination of R(u) and R(u +̃). The reason is that the lower prices should always be taken for all t ∈ S L in a 1 → 1 power swap operation, which contradicts the convenient 1 → m power swap. Thus, in the following, we consider the situation where it is not convenient to have 1 → 1 swaps from t to any t ∈ ) − R(u) < 0 holds, which corresponds to conclusions 1 and 2, respectively, in the Lemma.
Therefore, if the 1 → m swap is convenient, at least one of the two categories of power swaps decreases the total regret.

APPENDIX D. PROOF OF LEMMA 4
The proof follows the same lines as the proof of Lemma 3, but with the power swapping direction reversed.