Improving reinforcement learning algorithms: towards optimal learning rate policies

This paper investigates to what extent one can improve reinforcement learning algorithms. Our study is split into three parts. First, our analysis shows that the classical asymptotic convergence rate $O(1/\sqrt{N})$ is pessimistic and can be replaced by $O((\log(N)/N)^{\beta})$ with $\frac{1}{2}\leq \beta \leq 1$ and $N$ the number of iterations. Second, we propose a dynamic optimal policy for the choice of the learning rate $(\gamma_k)_{k\geq 0}$ used in stochastic approximation (SA). We decompose our policy into two interacting levels: the inner and the outer level. In the inner level, we present the PASS algorithm (for "PAst Sign Search") which, based on a predefined sequence $(\gamma^o_k)_{k\geq 0}$, constructs a new sequence $(\gamma^i_k)_{k\geq 0}$ whose error decreases faster. In the outer level, we propose an optimal methodology for the selection of the predefined sequence $(\gamma^o_k)_{k\geq 0}$. Third, we show empirically that our selection methodology for the learning rate significantly outperforms standard algorithms used in reinforcement learning (RL) in the three following applications: the estimation of a drift, the optimal placement of limit orders, and the optimal execution of a large number of shares.


Introduction
We consider a discrete state space Z = ℕ or Z = {1, …, d} with d ∈ ℕ*. We are interested in finding q* ∈ Q ⊂ R^Z solution of

M(q, z) = E[m(q, X(z), z)] = 0, for all z ∈ Z, (1)

where X(z) ∈ X is a random variable with an unknown distribution, and m is a function from Q × X × Z to R. Although the distribution of X(z) is unspecified, we assume that we can observe some variables (Z_n)_{n≥0} valued in Z and X_{n+1}(Z_n) drawn from the same distribution as X(Z_n). Reinforcement learning (RL) addresses this problem through the following iterative procedure:

q_{n+1}(Z_n) = q_n(Z_n) + γ_n(Z_n) m(q_n, X_{n+1}(Z_n), Z_n), q_{n+1}(z) = q_n(z) for z ≠ Z_n, (2)

where q_0 is a given initial condition, and each γ_n is a component-wise non-negative vector valued in R^Z. The connection between RL, problem (1), and Algorithm (2) is detailed in Section 2. It is possible to recover the classical SARSA, Q-learning, and double Q-learning algorithms used in RL by taking a specific expression for m and X_{n+1}. Note that Algorithm (2) is different from the standard Robbins-Monro (RM) algorithm used in stochastic approximation (SA), with m̄ ∈ R^Z whose z-th coordinate is defined by m̄(q, x)(z) = m(q, x(z), z) for any z ∈ Z and γ_n ≥ 0, mainly because, as is frequent in RL, we do not observe the entire variable (X_{n+1}(z))_{z∈Z} but only its value at the coordinate Z_n. Indeed, the way (Z_n)_{n≥1} visits the set Z plays a key role in the convergence of Algorithm (2). The RM algorithm was first introduced by Robbins and Monro in [26]. It was then studied by many authors who proved the convergence of q_n towards q*, see [3,5,6,17]. The asymptotic convergence rate has also been investigated in many papers, see [3,16,27]. They show that this speed is in general proportional to 1/√N, with N the number of iterations.
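The asynchronous update above can be sketched in a few lines. The following is a minimal illustration (not the paper's code): `m`, `sample_x`, `next_state`, and the schedule `gamma` are all placeholder choices, and the point is that only the visited coordinate Z_n of q is touched at each step.

```python
import random

def asynchronous_sa(m, sample_x, next_state, q0, gamma, n_iter, seed=0):
    """Asynchronous stochastic approximation in the spirit of Algorithm (2):
    only the visited coordinate Z_n of q is updated at each step."""
    rng = random.Random(seed)
    q = dict(q0)
    z = next_state(None, rng)          # initial state Z_0
    for n in range(n_iter):
        x = sample_x(z, rng)           # observation X_{n+1}(Z_n)
        q[z] = q[z] + gamma(n, z) * m(q, x, z)
        z = next_state(z, rng)         # Z_{n+1}
    return q

# Toy instance: estimate the mean of X(z) on Z = {0, 1} by solving
# E[m(q, X(z), z)] = E[X(z) - q(z)] = 0.
means = {0: 1.0, 1: -2.0}
q = asynchronous_sa(
    m=lambda q, x, z: x - q[z],
    sample_x=lambda z, rng: means[z] + rng.gauss(0.0, 0.1),
    next_state=lambda z, rng: rng.randrange(2),
    q0={0: 0.0, 1: 0.0},
    gamma=lambda n, z: 1.0 / (1 + n // 2),
    n_iter=4000,
)
print(q)  # each q[z] should be close to means[z]
```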
In this work, we give a special focus to RL problems. Nowadays, RL covers a very wide collection of recipes to solve control problems in an exploration-exploitation context. This literature started in the seventies, see [32,33], and became famous mainly with the seminal paper of Sutton, see [30]. It largely relies on advances in control theory developed in the late 1950s, see [2]. The key tool borrowed from this theory is the dynamic programming principle satisfied by the value function. This principle enables us to solve control problems numerically when the environment is known and the dimension is not too large. To tackle the curse of dimensionality, recent papers, see [29], use deep neural networks (DNN). For example, in [14], the authors use DNN to derive optimal hedging strategies for financial derivatives and in [20] a similar method is used to solve a high-dimensional optimal trading problem. To overcome the fact that the environment is unspecified, it is common to use an RM-type algorithm which estimates quantities of interest on-line. The combination of control theory and SA techniques gave birth to numerous papers on RL.
Our contributions are as follows.
• First, we conduct an error analysis to show that the classical asymptotic rate O(1/√N) is pessimistic and can be improved in many situations. For this, we borrow tools from statistical learning theory and show how to use them in an RL setting to get an O((log(N)/N)^β) asymptotic speed with 1/2 ≤ β ≤ 1 and N the number of iterations.
• Second, we propose a dynamic policy for the choice of the step size (γ k ) k≥0 used in (2).
Our policy is decomposed into two interacting levels: the inner and the outer level. In the inner level, we introduce the PASS algorithm, for "PAst Sign Search". This algorithm builds a new sequence (γ^i_k)_{k≥0}, using a predefined sequence (γ^o_k)_{k≥0} and the sign variations of m(q_n, X_{n+1}(Z_n), Z_n). The error of (γ^i_k)_{k≥0} decreases faster than the one of (γ^o_k)_{k≥0}. In the outer level, we present an optimal dynamic policy for the choice of the predefined sequence (γ^o_k)_{k≥0}. These two levels interact in the sense that PASS influences the construction of (γ^o_k)_{k≥0}.
• Third, the convergence of the PASS algorithm is established and error bounds are provided.
• Finally, we show that our selection methodology provides better convergence results than standard RL algorithms in three numerical examples: drift estimation, the optimal placement of limit orders, and the optimal execution of a large number of shares. When needed, the proofs of convergence of our numerical methods are given.
The structure of this paper goes as follows: Section 2 describes the relation between RL and Equation (1). Section 3 reformulates (1) as an optimization problem and precisely defines the different sources of error. This enables us to derive the convergence speed of RL algorithms. Section 4 contains our adaptive learning rate algorithm. Finally, Section 5 provides numerical examples taken from the optimal trading literature: the optimal placement of a limit order, and the optimization of the trading speed of a liquidation algorithm. Proofs and additional results are relegated to an appendix.

Reinforcement learning
We detail in this section the relation between (1) and RL since we are interested in solving RL problems. RL aims at estimating the Q-function, which quantifies the value for the player of choosing the action a when the system is at s. Let t be the current time, U_t ∈ U be a process defined on a filtered probability space (Ω, F, F_t, P) which represents the current state of the system, and A_t ∈ A the agent's action at time t. We assume that the process (U_t, A_t) is Markov. The agent aims at maximizing an expected cumulative reward, with g the terminal constraint, f the instantaneous reward, ρ a discount factor, and T the final time.
Let us fix a time step ∆ > 0 and allow the agent to take actions only at times k∆ with k ∈ ℕ. The Q-function is defined for (t, u, a) ∈ R_+ × U × A, with A = {A_t, t < T} a possible control process for the agent. Note that the action of the agent depends on s = (t, u) with t the current time and u the current state. We view the agent's control A as a feedback process (i.e. adapted to the filtration F_t). The Q-function satisfies the classical dynamic programming principle (DPP) (5), with R_{t+∆} = ∫_t^{t+∆} ρ^{(s−t)} f(s, U_s, A_s) ds. Equation (5) shows that the optimal expected gain when the agent starts at s and chooses action a at time t is the sum of the next expected reward R_{t+∆} plus the value of acting optimally starting from the new position U_{t+∆} at time t + ∆. By reformulating (5), we obtain that Q solves Equation (6), where m is defined for any x = (u, r) ∈ X and z_1 = (t_1, u_1, a_1) ∈ Z.
Note that Equation (6) shows that one can study Q only on the time grid D_T = {n∆, n ≤ T/∆}. Thus, we define A_k and U_k such that A_k = A_{k∆} and U_k = U_{k∆} for any k ∈ ℕ. The key variable to study is not the agent's decision alone but the triplet Z_k = (k∆, U_k, A_k). Thus, the rest of the paper formulates the results in terms of Z_k only.
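To make the connection with stochastic approximation concrete, here is a minimal tabular Q-learning sketch of the DPP fixed point on the grid D_T. The one-period toy problem, the pure-exploration policy, and the zero terminal value are illustrative assumptions, not the paper's setting.

```python
import random
from collections import defaultdict

def q_learning(n_actions, reward, transition, horizon, rho,
               n_episodes, u0=0, gamma0=0.5, seed=0):
    """Tabular Q-learning: a stochastic-approximation solver for the
    discrete DPP  Q(k,u,a) = E[R + rho * max_b Q(k+1, U', b)],
    with terminal value taken to be 0 in this toy sketch."""
    rng = random.Random(seed)
    Q = defaultdict(float)                 # keyed by z = (k, u, a)
    for _ in range(n_episodes):
        u = u0
        for k in range(horizon):
            a = rng.randrange(n_actions)   # pure exploration policy
            r = reward(k, u, a)
            u2 = transition(k, u, a, rng)
            target = r + rho * max(Q[(k + 1, u2, b)] for b in range(n_actions))
            Q[(k, u, a)] += gamma0 * (target - Q[(k, u, a)])
    return Q

# One-period toy problem: action 1 pays 1, action 0 pays nothing.
Q = q_learning(n_actions=2,
               reward=lambda k, u, a: float(a == 1),
               transition=lambda k, u, a, rng: u,
               horizon=1, rho=0.9, n_episodes=200)
print(round(Q[(0, 0, 1)], 3), round(Q[(0, 0, 0)], 3))  # ≈ 1.0 and 0.0
```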
Actions of the agent. It is important in practice to visit the space D_T × U × A sufficiently often. Thus, to learn Q, it is common not to always choose the maximizing action, but to encourage exploration by visiting the states where the error is large. We give in Appendix A examples of policies that promote exploration and others that maximize the Q-function.
Remark 1. In general, a solution q* of (1) does not necessarily solve an optimization problem in the form of (4). However, when M can be written as the gradient of some given function f (i.e. ∇f = M), q* becomes a solution of a problem in the form of (4).

Improvement of the asymptotic convergence rate
In [4, Part 2, Section 4], [16, Section 10] and [17, Section 7], the authors show a central limit theorem for the procedure (3), which ensures a convergence rate of O(1/√N), where N is the number of iterations. In this section, we extend this convergence rate to Algorithm (2) and aim at understanding how one can improve it. For this, we decompose the total error into two standard components: the estimation error and the optimization error.

Error decomposition
In this section, the space Z = {1, …, d} is finite with d ∈ ℕ*. In this case, we view q and M(q) as vectors of R^Z. Moreover, the process (Z_n)_{n≥1} is a homogeneous Markov chain. We consider the following assumption.
Assumption 1 (Existence of a solution). There exists a solution q* of Equation (1).
Under Assumption 1, the function q* is a solution to the minimization problem (7), where g can be selected as follows:
• If M can be written as the gradient of some function f, see Remark 1, one can take g = f.
• Otherwise, it is always possible to set g(q) = ‖M(q)‖. For simplicity, we place ourselves in this case for the rest of the section.
In our context, we do not have direct access to the distribution of X(z). Nevertheless, we assume that at time n we keep in memory a training sample of n(z) independent variables (X_i^z)_{i=1,…,n(z)} drawn from the distribution of X(z), where n(z) is the number of times the Markov chain Z_n has visited z. We define q^n as a solution of the minimization problem (8) with g_n(q) = ‖M_n(q)‖, the expected value being taken under the empirical measure μ = (1/n(z)) Σ_{j=1}^{n(z)} δ_{X_j(z)} (i.e. the empirical risk). We finally define q^n_k as an approximate solution of the problem (8) returned by an optimization algorithm after k iterations. Thus, since q* minimizes g, we can bound the error g(q^n_k) by the sum of an estimation error and an optimization error.
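The per-state sample memory and the empirical counterpart M_n can be sketched as follows; the sup norm used for g_n and the toy choice of m are illustrative assumptions, not the paper's specification.

```python
from collections import defaultdict

class EmpiricalRisk:
    """Keeps, for each state z, the n(z) observations seen so far and
    evaluates the empirical counterparts M_n and g_n(q) = ||M_n(q)||."""
    def __init__(self):
        self.samples = defaultdict(list)      # z -> [x_1, ..., x_{n(z)}]

    def observe(self, z, x):
        self.samples[z].append(x)

    def M_n(self, m, q, z):
        xs = self.samples[z]
        return sum(m(q, x, z) for x in xs) / len(xs)   # empirical mean

    def g_n(self, m, q):
        # sup norm over the visited states (an illustrative choice of norm)
        return max(abs(self.M_n(m, q, z)) for z in self.samples)

# With m(q, x, z) = x - q[z], the empirical root is the per-state sample mean.
risk = EmpiricalRisk()
for x in [1.0, 2.0, 3.0]:
    risk.observe(0, x)
q = {0: 2.0}
print(risk.M_n(lambda q, x, z: x - q[z], q, 0))  # 0.0
```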

Convergence rate of the estimation error

Slow convergence rate
We have the following result.
Proposition 1. We assume that the Markov chain Z_n is irreducible. Then there exists c_1 > 0 such that the corresponding bound holds.
For the sake of completeness, we give the proof of this result in Appendix C. Proposition 1 allows us to derive the associated bound for the estimation error. This bound is known to be pessimistic.

Fast convergence rate
We obtain the following fast statistical convergence rate.
Proposition 2. Assume that the Markov chain Z_n is irreducible and that condition (10) holds, with 1/2 ≤ β ≤ 1, c > 0, and n̄(z) = n(z) ∨ 1. Then, there exists c_2 > 0 such that the fast rate bound holds.
The proof of this proposition is given in Appendix D. Since the conclusion of Proposition 2 relies on the condition (10), we give below two settings under which this condition is fulfilled.
Fast convergence rate for classification problems. It is possible to establish (10) for classification problems when:
• The loss function g satisfies regularity conditions, the most important of which are Lipschitz continuity and convexity, see [1, Section 4].
• The data distribution satisfies some noise conditions, see for instance [31].
It is also possible to get rid of the log(n) factor in (10), see the end of Section 5 in [8].
Fast convergence rate for reinforcement learning. Since our numerical examples focus on RL applications, we propose to derive Inequality (10) for RL problems. To do so, we adopt the same notation as in Section 2.
Let t be the current time. We denote by A = {A_t, t < T} the control process of the agent. Since we work in a Markov setting, A_t depends only on t and the current state of the system U_t. Moreover, we assume that the agent can take at most a finite set of actions, which means N_A = |A| < ∞. For each A, we define q^A for any (t, u, a) ∈ R_+ × U × A. By definition of q^A, we have Inequality (11), where R_{t+∆} and ρ^∆ are defined in (5). Moreover, the DPP ensures that the Q-function achieves equality in (11). Thus, we can define a loss for any control process A, with m and X introduced in (6). Note that this setting is very similar to the classification one:
• The variable X plays the role of the input variable.
• The control process A can be assimilated to the function to learn. Since there are finitely many time steps and actions, the strategy A predicts a finite set of options that can be interpreted as labels.
Thus, we can try to apply the same techniques and recover similar bounds. Such results are given in the proposition below.
Proposition 3. Let B > 0 and define the class of controls A_B.
• Then, there exists a constant C > 0 such that, with probability at least 1 − δ, the stated bound holds, with Var(A, z) the variance of the random variable m(q^A, X(z), z).
• If in addition there exist c > 0 and β ∈ [0, 1] such that the variance condition holds, then the improved bound holds with probability at least 1 − δ.
The main steps of the proof of Proposition 3 are given in [8, Theorem 8]. It is then standard to derive (10) from Proposition 3.

Convergence rate of the optimization error
We now turn to the optimization error. Here, the expected value in (7) is replaced by the empirical risk, which is known. In this case, one can use many algorithms to find q^n. We present in the table below the most important properties of some gradient methods.
Table 1: Asymptotic properties of some gradient methods. Note that d is the dimension of the state space Z and ε is a desired level of accuracy; here ε corresponds to 1/n. GD stands for Gradient Descent, SGD for Stochastic Gradient Descent, Proximal for stochastic proximal gradient descent [10,28], Acc. prox. for accelerated proximal stochastic gradient descent [24,28], SAGA for stochastic accelerated gradient approximation [12], and SVRG for stochastic variance reduced gradient [15].

Conclusion
Following the formalism of [7], we have decomposed our initial error into:
• Estimation error: its convergence is O(1/√n) in pessimistic cases, with n the number of iterations. In the other situations, the convergence is faster, i.e. O((log(n)/n)^β) with 1/2 ≤ β ≤ 1.
• Optimization error: the convergence is exponential under suitable conditions. In unfavourable cases, the convergence rate is O(1/n).
The comparison of these error sources shows that the estimation error is the dominant component. Thus, one can overcome the O(1/√n) asymptotic speed, in some situations, by improving the estimation error.

Optimal policy for the learning rate γ
In this section, we take Z = ℕ and consider algorithms of the type (2). One can recover the classical SARSA, Q-learning, and double Q-learning algorithms used in RL by considering a specific expression for m and X_{n+1}. In such algorithms the choice of γ_n is crucial. One can find in the literature general conditions on γ_n needed for the convergence of (2), such as the Robbins-Monro conditions Σ_{n≥0} γ_n(z) = ∞ and Σ_{n≥0} γ_n(z)² < ∞ for every z ∈ Z. However, since the set of processes (γ_n)_{n≥0} satisfying these conditions can be large, and even empty (when (Z_n)_{n≥0} is not recurrent), many authors suggest taking γ_n proportional to 1/n^α. The exponent α may vary from 0 to 1 depending on the algorithm used, see [13,22]. Nonetheless, such a choice may be suboptimal. For example, Figure 1.a shows that the blue curve is way higher than the orange one. Here, the blue (resp. orange) curve shows how the logarithm of the error varies with n when γ_n = η/n (resp. γ_n is constant). The constant η selected here ensures the fastest convergence for the blue curve.
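The phenomenon behind Figure 1.a can be reproduced on a toy mean-estimation problem. The constant step 0.05 and the deliberately mis-tuned η = 0.1 below are illustrative choices, not the figure's actual parameters: a too-small η in an η/n schedule freezes the iterate far from the target, while a constant step forgets the initial condition quickly at the price of a residual variance.

```python
import random

def sa_error(gamma, n_iter, target=1.0, noise=0.5, seed=1):
    """Runs q_{n+1} = q_n + gamma(n) * (X_{n+1} - q_n) starting from q_0 = 0
    and returns the final absolute error |q_N - q*|."""
    rng = random.Random(seed)
    q = 0.0
    for n in range(1, n_iter + 1):
        x = target + rng.gauss(0.0, noise)
        q += gamma(n) * (x - q)
    return abs(q - target)

err_decr = sa_error(lambda n: 0.1 / n, 2000)   # mis-tuned eta/n schedule
err_cst = sa_error(lambda n: 0.05, 2000)       # constant schedule
print(err_decr, err_cst)  # the decreasing schedule keeps a large bias here
```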
In this paper, we propose to use a stochastic learning rate (γ_k)_{k≥0}; our learning policy is decomposed into two interacting levels: the inner and the outer level. In the inner level, we use the PASS algorithm, for "PAst Sign Search". This algorithm builds a new sequence (γ^i_k)_{k≥0}, based on a predefined sequence (γ^o_k)_{k≥0} and the sign variations of m(q_n, X_{n+1}(Z_n), Z_n), whose error decreases faster than the predefined one. In the outer level, we propose an optimal methodology for the selection of the predefined sequence (γ^o_k)_{k≥0}. These two levels interact in the sense that the PASS algorithm influences the construction of (γ^o_k)_{k≥0}.

The algorithms
In this part, we introduce three algorithms. We start with our benchmark, which is the standard algorithm used in RL. Then, we present a second algorithm inspired by SAGA [12], a method used to accelerate the convergence of stochastic gradient descent. Under suitable conditions, SAGA has an exponential convergence. Finally, we describe the PASS algorithm, which modifies the learning rate (γ_k)_{k∈ℕ} based on the sign variations of m(q_n, X_{n+1}(Z_n), Z_n). The main idea is to increase γ_n as long as the sign of m(q_n, X_{n+1}(Z_n), Z_n) remains unchanged, and to reinitialize or lower γ_n using a predefined sequence (γ^o_k)_{k∈ℕ} when the sign of m(q_n, X_{n+1}(Z_n), Z_n) switches. This algorithm can be seen as an adaptation of the line search strategy, which determines the maximum distance to move along a given search direction. The line search method requires complete knowledge of the cost function because it demands to evaluate g(q_k + γM(q_k)) − g(q_k) several times for different values of γ, with g the loss and M a proxy of ∇g. Our approach, however, has access to neither g nor M. It can only compute m(q_n, X_{n+1}(Z_n), Z_n) when the state z = Z_k is visited. Moreover, to get a new observation it needs to wait for the next visit of the state z = Z_k. Nevertheless, it has instantaneous access to previously observed values. Thus, the main idea here is to use these past observations. Some theoretical properties of these algorithms are investigated in Section 4.3.
Algorithm 1 (RL). We start with an arbitrary q_0 ∈ Q and define q_k by induction.
Algorithm 2 (SAGA). We start with an arbitrary q_0 ∈ Q and M_0 = 0, and define q_k and M_k by induction, with i picked from a distribution p.
For the next algorithm, we give ourselves a predefined learning rate (γ^o_k)_{k≥0}, a function h : R_+ × R_+ → R_+ to increase the current learning rate, and another one l : R_+ × R_+ → R_+ to lower it. The function h is used to accelerate the descent, while the function l goes back to a slower pace.
Algorithm 3 (PASS). We start with an arbitrary q_0 and define q_k and γ̃_k by induction: when the sign of m at the currently visited state agrees with the one observed at the last visit of that state, the learning rate is increased through h; else, it is lowered through l using the predefined sequence.
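A minimal sketch of the PASS inner level for a single state can be written as follows, assuming a simple doubling rule for h; the paper's h and l are generic functions, so both the doubling rule and the fallback to the predefined sequence are illustrative choices.

```python
def pass_step_sizes(signs, gamma_o, h=lambda g: 2.0 * g):
    """Sketch of the PASS inner level for one state z: increase the current
    step while the sign of m(q_n, X_{n+1}(Z_n), z) is unchanged, and fall
    back to the predefined sequence (gamma_o_k) on a sign switch.
    h is a hypothetical doubling rule standing in for the generic h."""
    k = 0                          # position in the predefined sequence
    gamma = gamma_o(k)
    out = []
    prev = signs[0]
    for s in signs:
        if s == prev:
            gamma = h(gamma)       # same sign: accelerate the descent
        else:
            k += 1
            gamma = gamma_o(k)     # sign switch: back to a slower pace
        prev = s
        out.append(gamma)
    return out

# Three identical signs then a switch: the step doubles, then resets.
gammas = pass_step_sizes([+1, +1, +1, -1, -1], gamma_o=lambda k: 1.0 / (k + 1))
print(gammas)  # [2.0, 4.0, 8.0, 0.5, 1.0]
```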

Assumptions
In this section, we present the assumptions needed to study the convergence of Algorithms RL, SAGA, and PASS. We assume that Assumption 1 is in force. Hence, there exists q*, a solution of (1). We write m* for the vector defined by m*(x, z) = m(q*, x, z) for all (x, z) ∈ X × Z. Recall that E[m*(X(z), z)] = 0 for all z ∈ Z. Let us consider the following assumptions:
Assumption 2 (Pseudo strong convexity). There exists a constant L > 0 such that the corresponding inequality holds.
Note that Assumption 2 is natural in the deterministic framework. For instance, if we take a strongly convex function f and call m its gradient (i.e. m = ∇f), then m satisfies Assumption 2.
Additionally, the pseudo-gradient property (PG) considered in [5, Section 4.2] is close to Assumption 2. However, Assumption 2 is slightly more general than PG since it involves only the norm of the component (q_k − q*)(Z_k) instead of that of the vector (q_k − q*). To get tighter approximations, we also introduce the quantity L_k, which is non-negative under Assumption 2; it is the largest constant satisfying Assumption 2 for a fixed k, so that in particular L_k ≥ L.
Assumption 3 (Lipschitz continuity of m). There exists a positive constant B > 0 such that the corresponding inequality holds for any random variables X and X′ valued in X.
Assumption 3 guarantees that m is Lipschitz. The authors in [5, Section 4.2] use a similar condition. To get better bounds, we introduce B_k, the smallest constant satisfying Assumption 3 for a fixed k, so that B_k ≤ B. We finally add an assumption on the learning rate (γ_k)_{k≥0}.
Assumption 4 (Learning rate explosion). For any z ∈ Z, we have Σ_{k≥0} γ_k(z) = ∞.
When the process Z is Markov and γ_k(z) is bounded, Assumption 4 ensures that Z is recurrent. To see this, assume without loss of generality that γ_k(z) is uniformly bounded: there exists A such that γ_k(z) ≤ A for all k ≥ 1. Since γ_k(z) vanishes except at the visit times of z, we get Σ_k γ_k(z) ≤ A · #{k : Z_k = z}. Since the left-hand side of this inequality diverges under Assumption 4, the state z is visited infinitely often, which proves that Z is recurrent.

Main results
In this section, we compare Algorithms RL, SAGA, and PASS and prove the convergence of PASS. Let c be a positive constant and k ∈ ℕ. We define the error function e_k for Algorithms RL and PASS, for all z ∈ Z and j ∈ {1, …, M}. We write E_k for the total error E_k = ‖e_k‖_ν = Σ_{z∈Z} e_k(z) ν_z, with (ν_z)_{z∈Z} a non-negative sequence used to ensure that the error is bounded when needed. We also use the notations below.
Proposition 4. Let z ∈ Z. Under Assumptions 1, 2, and 3, and when there exists r_1 ≥ 1 such that the stated condition holds, the bound (14) is satisfied,
with A = {Z_k = z}. The constants α_k and M_k vary from one algorithm to another. The proof of Proposition 4 is given in Appendix E. Equation (14) reveals that the performance of Algorithms RL, SAGA, and PASS depends on the interaction between two competing terms:
• On the one hand, the slope α_k controls the decrease of the error from one step to the next.
• On the other hand, the quantity M_k gathers two sources of imprecision: the estimation and optimization errors. Both sources have a variance term v_n (because the distribution of Z is unknown) and a positive constant (coming from the noisy nature of the observations).
There is a competition between these two terms: to decrease M_k we need to send γ_k towards zero, while the reduction of α_k requires a relatively small but still non-zero γ_k. Thus, γ_k should satisfy a trade-off in order to ensure the convergence of the algorithms. The RM conditions (13) are a way to address this trade-off. Now, in order to analyse the properties of each algorithm, we compare, for a fixed γ_k, the respective values of α_k and M_k in Table 2. For the sake of clarity, we present the variable (1 − α_k) instead of α_k in this table; note that a large value of 1 − α_k means that α_k is small and thus induces a fast convergence.
Table 2: Comparison of the algorithms RL, SAGA and PASS.
• If in addition (Z_n)_{n≥1} is a homogeneous Markov chain, the convergence holds.
• Moreover, under the same condition, there exists a constant B̄ ≥ 0 such that the bound (19) holds.
Equations (17) and (18) ensure the convergence of the error E_n towards zero in both probability and L². Equation (19) gives an upper bound for the error. In particular, it shows how the terms ε̄_n, μ̄^n_k, and ā*_∞ interact to decrease the error E_0[e_n(z_1)]. We recall that ε̄_n is a noise term that gathers the sources of imprecision, μ̄ represents the influence of the slope factor α, and ā*_∞ is related to the probability distribution of the process (Z_k)_{k≥0}. Note that the right-hand side of (19) is not trivial and converges towards 0 under suitable conditions.

The upper level
In practice, to apply PASS we need an appropriate predefined sequence (γ_k)_{k∈ℕ}. It is possible to take γ_k proportional to 1/k^α with α ∈ (0, 1], as proposed in [22,26]. However, in this section, we present an optimal dynamic policy for the choice of the learning rate (γ_k)_{k∈ℕ}. To do so, we assume the error recursion (20), with A = {Z_n = z}, where S_n does not depend on the learning rate and verifies E[1_A S_n] ≤ 0. Equation (20) is consistent with Proposition 4. Moreover, we force e_n to stay below the upper bound x_2 = arg sup_{x∈R_+} g(x).
Such a constraint is not very restrictive since we know that the error e_n converges towards 0. Since α_n and M_n are both functions of the learning rate, the idea is to choose the learning rate γ_n that maximizes the one-step decrease of the error, which gives an explicit expression for γ_n for Algorithm PASS. The constants L, L_n, B, B_n, d_1, and c_n are defined in Proposition 4. Note that the value L/B is known to be a good choice for the learning rate. Thus, the proposed γ_n introduces a variation around this value that takes into account both the variance v_n and an estimate of the past observed error e_n. The algorithm PASS adds a supplementary optimization layer since the global constants L and B are replaced by the more local ones L_n and B_n.
We write e n γ to point out the dependence of the error e n on the chosen control γ.
The proof of the above result is given in Appendix G. Proposition 5 shows that γ ensures the fastest convergence speed of the error, which guarantees its optimality.
We end this section with some practical considerations. Note that e_n is not known in practice because we do not have access to q*. However, one can take the average value of m(q_n, X_{n+1}(Z_n), Z_n) over the last p ∈ ℕ* visit times as a proxy of e_n(Z_n). Moreover, the constants L and B are also unknown in practice. To tackle this issue, a first solution consists of starting with arbitrary values for B and L and generating a sequence of learning rates. If the error m(q_n, X_{n+1}(Z_n), Z_n) increases, we take a larger value for B and a smaller one for L; otherwise, the values of B and L are kept unchanged. Finally, an alternative solution for the choice of the upper-level learning rate consists of considering a piece-wise constant (PC) policy. To do so, one can track the average error of m(q_n, X_{n+1}(Z_n), Z_n) over the last p visit times. If this average error does not decrease, the step size is divided by a factor α.
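The piece-wise constant (PC) policy just described can be sketched as follows; the window length p and the division factor α are tuning parameters, and the flat-error toy sequence at the end is purely illustrative.

```python
class PiecewiseConstantPolicy:
    """Piece-wise constant (PC) step-size policy: track the average of
    |m(q_n, X_{n+1}(Z_n), Z_n)| over the last p visits of a state and
    divide the step by alpha whenever this average stops decreasing."""
    def __init__(self, gamma0=1.0, p=10, alpha=2.0):
        self.gamma, self.p, self.alpha = gamma0, p, alpha
        self.window, self.last_avg = [], float("inf")

    def update(self, residual):
        self.window.append(abs(residual))
        if len(self.window) == self.p:
            avg = sum(self.window) / self.p
            if avg >= self.last_avg:      # error no longer decreases
                self.gamma /= self.alpha
            self.last_avg = avg
            self.window = []
        return self.gamma

# A flat error over two windows of p = 3 triggers exactly one halving.
policy = PiecewiseConstantPolicy(gamma0=1.0, p=3, alpha=2.0)
for r in [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]:
    g = policy.update(r)
print(g)  # 0.5
```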
When γ_n(z) = 0 for z ≠ Z_n, we recover the standard PASS algorithm. Thus, the vectorial version is slightly more general and uses the scalar product between vectors instead of the product between two coordinates.
Some examples

Methodology
The code and numerical results presented in this section can be found at https://github.com/othmaneM/RL_adap_stepsize. Here, we compare four algorithms. The first two are two different versions of RL. In the first version, the learning rate is taken such that γ_k(z) = η/n_k(z), with η > 0 selected to provide the best convergence results and n_k(z) the number of visits to the state z. In the second version, the step size follows the piece-wise constant (PC) policy described at the end of Section 4. The third algorithm is SAGA, where the step size is derived from the PC policy. Finally, we use the PASS algorithm presented in the previous sections with a predefined learning rate following the PC policy. We consider three numerical examples to compare the convergence speed of these algorithms: drift estimation, the optimal placement of limit orders, and the optimal liquidation of shares.

Drift estimation
Formulation of the problem. We observe a process (S_n)_{n≥0} which satisfies the dynamics (22), with W_n a centred noise with finite variance. We want to estimate the quantities f_i with i ∈ {1, …, n_max}. Using (22) and E[W_n] = 0, we can estimate f_i using stochastic iterative algorithms. The pseudo-code of our implementation of PASS for this problem can be found in Appendix B under the name Implementation 1.
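A minimal sketch of the drift-estimation procedure, assuming increments of the form S_{n+1} − S_n = f_i + W with i cycling through {1, …, n_max} (a simplified reading of (22)) and the step γ = 1/n_i; the noise level and the true drifts below are illustrative values.

```python
import random

def estimate_drifts(increments_of, n_max, n_passes, seed=0):
    """Stochastic-approximation estimate of the drifts f_i from noisy
    increments, solving E[dS - f_i] = 0 with the step gamma = 1/n_i,
    which makes each f_i the running mean of its observed increments."""
    rng = random.Random(seed)
    f = [0.0] * n_max
    visits = [0] * n_max
    for _ in range(n_passes):
        for i in range(n_max):
            visits[i] += 1
            dS = increments_of(i, rng)          # observed S_{n+1} - S_n
            f[i] += (dS - f[i]) / visits[i]     # gamma_n = 1 / n_i
    return f

true_f = [0.5, -1.0, 2.0]
f_hat = estimate_drifts(lambda i, rng: true_f[i] + rng.gauss(0.0, 0.2),
                        n_max=3, n_passes=2000)
print([round(x, 2) for x in f_hat])  # close to [0.5, -1.0, 2.0]
```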
Numerical results. Figure 2 shows the variation of the L²-error as the number of iterations increases. We can see that the PASS algorithm outperforms standard stochastic approximation algorithms. Moreover, the other algorithms behave as expected: the standard RL decreases very slowly (but we know it will drive the asymptotic error to zero), the constant learning rate and SAGA provide better results than RL, while PASS seems to have captured the best of both worlds for this application: very fast acceleration at the beginning, and the asymptotic error goes to zero.
Figure 2: L²-error against the number of iterations.

Optimal placement of a limit order
Formalisation of the problem. We consider an agent who aims at buying a unit quantity using limit orders and market orders during the time interval [0, T] (see [19] for detailed explanations). In this case, the agent wonders how to find the right balance between fast execution and avoiding the trading costs associated with the bid-ask spread. The agent's state at time t is modelled by X_t = (Q^Before, Q^After, P), with Q^Before the number of shares placed before the agent's order, Q^After the queue size after the agent's order, and P_t the mid price, see Figure 3. The agent wants to minimise a cost functional defined with the following quantities:
• T is the final time horizon.
• T_exec is the first time when the limit order gets a transaction.
• τ is the first time when a market order is sent.
• X = (Q Bef ore , Q Af ter , P ) is the state of the order book.
• F(u) is the price of the transaction (i.e. F(u) = p + ψ when the agent crosses the spread and F(u) = p otherwise).
We show in Section 2 that the Q-function is a solution of (6); see details at https://github.com/othmaneM/RL_adap_stepsize. Thus, we can use Algorithms RL, SAGA, and PASS to estimate it. The pseudo-code of our implementation of PASS is available as Implementation 2 in Appendix B.

Figure 3: The state space of our limit order control problem.
Numerical results. Figure 4 shows three control maps: the x-axis reads the quantity on the "same side" (i.e. Q^same = Q^Before + Q^After) and the y-axis reads the position of the limit order in the queue, i.e. Q^Before. The colour and numbers give the control associated with a pair (Q^same, Q^Before): 1 (blue) means "stay in the book", while 0 (red) means "cross the spread" to obtain a transaction. The left panel gives the reference optimal controls obtained with a finite difference scheme, the middle panel corresponds to the controls obtained with an RL algorithm where the step size (γ_k)_{k≥0} is derived from the upper-level policy, and the right panel shows the optimal control obtained with our optimal policy (i.e. upper level and inner level combined). It shows that after a few iterations our optimal policy has already found the optimal controls. Figure 5 compares the log of the L² error, averaged over 100 trajectories, between the different algorithms. We see clearly that our methodology improves upon basic stochastic approximation algorithms. Again, the other algorithms behave as expected: SAGA is better than a constant learning rate, which is better than the standard RL (at the beginning, since we know that asymptotically RL will drive the error to zero whereas a constant learning rate does not).
Figure 4: Comparison of the optimal controls after 300 iterations for different methods: left is the theoretical optimal control, middle is RL with a step size derived from the upper level, and right is our optimal policy for the step size (i.e. upper level and inner level combined).

Optimal execution
Formalisation of the problem. This is not the first work where RL is used to solve optimal trading problems. For example, the authors in [23] apply RL techniques to solve optimal execution issues and in [20] deep reinforcement learning is used to solve a high-dimensional market making problem. However, we consider here a different application. An investor wants to buy a given quantity q_0 of a tradable instrument (see [9] and [18] for details about this setting). The price S_t of this instrument follows the dynamics (23), where α ∈ R is the drift and σ is the price volatility. The state of the investor is described by two variables: its inventory Q_t and its wealth X_t at time t. The evolution of these two variables is given by (24), with ν_t the trading speed of the agent and κ > 0. The term κν_t corresponds to the temporary price impact. The investor wants to maximize its final wealth X_T at time T, plus the value of liquidating its inventory, minus a running quadratic cost. The value function V is defined accordingly. We remark that v(t, w, q, s) = V(t, w, q, s) − w − qs does not depend on the initial values of the wealth and the price: using (23) and (24), v is a function of only two variables, the time t and the inventory q. The dynamic programming principle ensures that v satisfies the corresponding equation. We fix a maximum inventory q̄. Let k = (k_T, k_q) ∈ (ℕ*)², ∆ = T/k_T, D_T = {t_i^{k_T}; i ≤ k_T}, and D_q = {q_i^{k_q}; i ≤ k_q}, with t_i^{k_T} = i∆ and q_i^{k_q} = −q̄ + 2iq̄/k_q. To estimate v we use the numerical scheme (v_n^k)_{n≥1, k∈(ℕ*)²} defined below, with Z_n = (n∆, Q_{n∆}) and A(Z_n) ⊂ D_q the set of admissible actions. When the final time T is reached (i.e. n = k_T), we pick a new initial inventory from the set D_q and start its liquidation again. At first sight, it is not clear that v_n^k approximates v. However, we have the following result.
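The objective under these dynamics can be sketched with a small Monte-Carlo routine for a hypothetical constant-speed strategy; the Euler discretization and all parameter values below are illustrative assumptions, not the paper's scheme.

```python
import random

def execution_objective(nu, q0, T, n_steps, alpha, sigma, kappa,
                        phi=0.0, s0=100.0, seed=0):
    """Monte-Carlo value of a buying strategy nu(t, q) under the
    arithmetic price dS = alpha dt + sigma dW with temporary impact
    kappa * nu: returns X_T + Q_T * S_T - phi * integral of Q_t^2 dt.
    phi is a hypothetical running-cost weight."""
    rng = random.Random(seed)
    dt = T / n_steps
    s, q, x, cost = s0, q0, 0.0, 0.0
    for k in range(n_steps):
        v = nu(k * dt, q)
        q += v * dt                           # inventory (buying: v > 0)
        x -= v * (s + kappa * v) * dt         # cash paid, impact included
        cost += phi * q * q * dt
        s += alpha * dt + sigma * rng.gauss(0.0, dt ** 0.5)
    return x + q * s - cost

# Buying 1 unit at constant speed over T = 1 with no noise or drift:
# the only loss is the temporary impact kappa * nu^2 * T = 0.01.
val = execution_objective(nu=lambda t, q: 1.0, q0=0.0, T=1.0, n_steps=1000,
                          alpha=0.0, sigma=0.0, kappa=0.01)
print(round(val, 4))  # ≈ -0.01 (pure impact cost)
```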
The proof of Proposition 6 is given in Appendix H. See Appendix B for a detailed implementation of the algorithm with the corresponding pseudo-code (Implementation 3).
Numerical results. Figure 6 shows the value function v for different values of the elapsed time t and the remaining inventory Q_t. The left panel gives the reference value function, computed by following the same approach as [25]. The middle panel shows the value function obtained after 120,000 iterations of the RL algorithm whose step size (γ_k)_{k≥0} is derived from the upper level of our optimal policy, and the right panel the value function obtained with our optimal policy (i.e. upper level and inner level combined). It shows that our optimal strategy leads to better performance. We also plot, in Figure 7, a simulated path of the log L^2 error for the different algorithms. Here again, we notice that our methodology improves the basic RL algorithm and that the ordering of the other approaches is similar to the one observed in the "drift estimation" application (i.e. SAGA and the constant learning rate behave very similarly).

A Actions of the agent
We present here policies that encourage exploration and others that favour the maximising action.
Exploration policies. Since it is important to visit the state space sufficiently, we propose to set the conditional distribution of the random variable A_k such that with where b > 0 encourages exploration, q_k satisfies (2), and r_1(z) is the last observation time of the state z.
Optimal policies. To give more importance to the maximising action, one may consider the following policy: Any mixture of these two procedures can also be considered.
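The two families of policies above, and a mixture of them, can be sketched as follows. The exact functional forms are our assumptions (the displayed formulas are not reproduced here): the paper only requires b > 0 to encourage exploration via the last observation times r_1(z), and a softmax-style weighting toward the maximising action.

```python
import numpy as np

def exploration_weights(last_visit, k, b=1.0):
    # Boltzmann-style weights favouring states/actions not seen for a long
    # time; `last_visit` plays the role of r_1(z) and k is the current step.
    staleness = np.asarray(k - last_visit, dtype=float)
    w = np.exp(b * (staleness - staleness.max()))  # shift for numerical stability
    return w / w.sum()

def greedy_weights(q_row, beta=5.0):
    # Softmax concentrating mass on the maximising action ("optimal" policy).
    w = np.exp(beta * (q_row - q_row.max()))
    return w / w.sum()

def mixture_policy(q_row, last_visit, k, eps=0.2, b=1.0, beta=5.0, rng=None):
    # Any convex mixture of the two procedures is admissible.
    rng = rng if rng is not None else np.random.default_rng()
    p = eps * exploration_weights(last_visit, k, b) \
        + (1 - eps) * greedy_weights(q_row, beta)
    return int(rng.choice(len(q_row), p=p))
```

For instance, with three states last visited at times 0, 9 and 3 and current step k = 10, the exploration component puts most of its mass on the first, stalest state, while the greedy component concentrates on the largest entry of q.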

B Implementations
We give here the pseudo code used for each one of the three numerical examples considered in Section 5.
We use Implementation 1 for the numerical experiments.
(28) We use Implementation 2 for the numerical experiments. Note that we do not need to send a market order to know our expected future gain.
Optimal execution of a large number of shares. To solve this problem, we use the same functions h and l as in the previous problem, see (28). Then, we apply Implementation 3. In this problem, it is crucial to select actions according to the policy (26) in order to encourage exploration. The coefficient β used by the agent to select its actions is set to the constant value β = 5. We use the same policy for all tested algorithms.
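The outer-level step-size rule shared by the three implementations ("every w episodes, if the average error has not dropped by at least p, shrink γ, floored at 0.01") can be condensed into a small controller. This is a sketch of the halving variant with Implementation 2's parameters (w = 40, p = 5%); Implementation 3 instead subtracts 0.01. The class name and interface are ours, not the paper's.

```python
from collections import deque

class StepSizeController:
    """Outer-level PASS rule sketched from the pseudo-code: once every w
    episodes, if the average error norm over the last w episodes has not
    decreased by at least a fraction p, shrink the step size (floor gamma_min)."""

    def __init__(self, gamma0=0.5, w=40, p=0.05, gamma_min=0.01):
        self.gamma = gamma0
        self.w, self.p, self.gamma_min = w, p, gamma_min
        self.window = deque(maxlen=w)   # rolling window of the last w error norms
        self.prev_avg = None            # average over the previous window
        self.episodes = 0

    def update(self, error_norm):
        # Call once per episode with the saved norm E of the vector E_past.
        self.window.append(error_norm)
        self.episodes += 1
        if self.episodes % self.w == 0:  # the check is done each w episodes
            avg = sum(self.window) / len(self.window)
            if self.prev_avg is not None and avg > (1 - self.p) * self.prev_avg:
                self.gamma = max(self.gamma / 2, self.gamma_min)
            self.prev_avg = avg
        return self.gamma
```

Feeding a stagnating error sequence triggers one halving per window: after 80 episodes of constant error, γ goes from 0.5 to 0.25.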

C Proof of Proposition 1
Proof of Proposition 1. Let z ∈ Z. Standard uniform convergence results ensure that

(Implementation 1, continued)
17: if the average value of E over the last w = 5 episodes is not reduced by p = 1% then

(Implementation 2, episode loop body)
 4: for each step within the episode do
 5:   Take the action "stay in the order book"
 6:   Observe the new order book state X_next
 7:   for a ∈ {0, 1} do
 8:     if this is the first visit time to X_next then
 9:       q_0(X_0, a) ← q_0(X_0, a) − γ̃_0(X_0, a) m_a(q_0, X_next, X_0)
16:     end if
17:     E_past(X_0, a) ← m_a(q_0, X_next, X_0)
18:   end for
19: end for
21: Save the norm E of the vector E_past(t).
22: if the average value of E over the last w = 40 episodes is not reduced by p = 5% then
23:   γ^o(t) ← max(γ^o(t)/2, 0.01) (this is done each w episodes)
24: end if
25: end for

Implementation 3 PAst Sign Search (PASS) for (RL) optimal execution problem
 1: Algorithm parameters: step size (γ^o_n)_{n≥0} ∈ (0, 1], number of episodes n, initial guess q_0, past error value E_past; initialise γ̃_0 = γ^o_0
 2: for episode in n do
 3:   Select the initial inventory Q_0
 4:   Observe the new price state S_next and set X_0 = (t, Q_0)
 7:   if this is the first visit time to X_0 then
 8:   end if
16:   Select an action A and observe Q_next
18: end for
20: Save the norm E of the vector E_past(t).
21: if the average value of E over the last w = 300 episodes is not reduced by p = 1% then
22:   γ^o(t) ← max(γ^o(t) − 0.01, 0.01) (this is done each w episodes)
23: end if
24: end for

with c > 0 a positive constant. Since the Markov chain (Z_n)_{n≥1} is irreducible and the set Z is finite, the sequence (Z_n)_{n≥1} is positive recurrent and we have with μ the unique invariant distribution of (Z_n)_{n≥1}. Thus, we have This shows that u_n(z) is bounded by a constant, called u_∞(z), and ensures that with c_1(z) = c u_∞(z). Since Z is finite, we close the proof by summing Inequality (29) over all the coordinates z.

D Proof of Proposition 2
Proof of Proposition 2. Let z ∈ Z. We follow the same approach as in the proof of Proposition 1 to get Using (10), the same manipulations as in the proof of Proposition 1, and Inequality (9), we complete the proof.

E Proof of Proposition 4
Proof of Proposition 4. Let k ≥ 0, A be the set A = {Z_k = z}, and m(q_k) ∈ R^Z be such that m(q_k)(z') = m(q_k, X_{k+1}(z'), z') for any z' ∈ Z. We split the proof into three cases. In each of these steps, we prove (14) for a given algorithm.

Case (i):
In this step, we prove (14) for Algorithm RL. Let us fix z ∈ Z. For simplicity, we drop the dependence of m and q on z and write m(q_k) and q_k instead of m(q_k)(z) and q_k(z). We have We now use two independent copies of X_k, denoted by X^1 and X^2 respectively. We also write m(q_k)_X = m(q_k, X(z'), z') to emphasise the dependence of m(q_k) on X. Using Jensen's inequality and Assumption 3, we get Thus, we deduce that , which shows (14) for Algorithm RL.
Case (ii): Here we show (14) for Algorithm SAGA. Let z ∈ Z, and We drop here the dependence of m, q, and M on z and write m(q_k), q_k, and M_k respectively. (Note that the dependence of X^1 and X^2 on k is omitted since there is no possible confusion.)

Case (iii):
In this final step, we show (14) for Algorithm PASS. Here again, we drop the dependence of the variables on z, as in the previous steps. We have For the term (i), using Assumption 2 and E_k[m*] = 0, we have Using Assumption 3 and E_k[m*] = 0, we get Thus, we deduce that We write γ̃_k for the quantity , use the following canonical decomposition of the function p_k: and γ_k = γ̃_k to get with d_1 = (r_1 − 1)^2 B_k. Thus, using (34), we conclude This completes the proof.

F Proof of Theorem 1
For simplicity, the proof is split into two parts.

F.1 Proof of Equation (17)
Proof of Equation (17). Using Proposition 4, we get with R_n = 1_A μ_n e_n and L_n = 1_A M_n. Using the assumption ∑_{n≥1} γ_n^2 < ∞ and the expression of M_n, we obtain that ∑_{n≥1} L_n < ∞. We can then apply the supermartingale convergence theorem to deduce that e_n converges towards a random variable with probability 1, and that ∑_{n≥1} R_n < ∞. Since ∑_{n≥1} γ_n^2 < ∞, we know that γ_n converges towards 0. Thus, for n large enough, we have

We introduce the following notations. Let j ∈ N* and (μ_n, a_n, b_n)_{n≥1} be a sequence valued in R_+^3. We write (μ, b)^j = (μ^j_n, b^j_n)_{n≥1} for the delayed sequence μ^j_n = μ_{j+n} and b^j_n = b_{n+(j−1)} with n ≥ 1. Additionally, we define recursively the sequence (a^{μ,b}_n)_{n≥1} as follows:

Lemma 1. By convention, an empty sum is equal to zero. Let (v_n)_{n≥1} be the sequence defined as follows: where (ε_n)_{n≥1} is a sequence. Then, we have

Proof of Lemma 1. Let us prove the result by induction on n ≥ 1. By definition, Equation (36) is satisfied for n = 1. By applying the induction hypothesis (36) to all j ≤ n, we get

Lemma 2. Let n ∈ N, (a^{μ,b}_n)_{n≥0} be the sequence defined in (35), (μ_n)_{n≥0} be a positive non-decreasing sequence, and r_n = 1 − μ_n.
• When ∑_{n≥0} r_n = +∞, we have
• There exists a non-negative constant B such that

Proof of Lemma 2.
• Step 1. We first prove by induction the following subsidiary inequality: where (ã^n_k)_{k≥0} is a non-negative sequence that does not depend on (μ_n)_{n≥0}. We also prove that Inequality (39) is sharp, since it becomes an equality when the sequence (μ_n)_{n≥0} is constant. Inequality (39) clearly holds for n = 2 with ã^2 = a. Using (35) and the induction assumption, we get where the inequality μ̃^l_k ≤ μ̃^n_k for any l ≤ n comes from the monotonicity assumption on (μ_n)_{n≥0}. Note that μ̃^l_k = μ̃^n_k when the sequence (μ_n)_{n≥0} is constant; in that case, all the previous inequalities can be replaced by equalities. Inequality (40) gives with and ã^{n+1}_k = 0 otherwise. Combining (40) and (41) proves (39).
• Step 2. Here, we show that ∑_{m≥1} ã^n_m ≤ 1 for all n ≥ 2. To do so, consider the worst case where μ_n = 1 for all n ≥ 0. In that case, Inequality (39) becomes an equality, which gives a_n = ∑_{k=1}^{n−1} ã^n_{n−k}. It is easy to show, by a direct induction using (35), that a_n ≤ 1. Thus, we deduce that ∑_{m≥1} ã^n_m ≤ 1.

• Step 4. Let us prove by induction
When a_1 b_∞ = 1, we obtain a_j b_k = 0 for any k, or j = 1. Thus, a direct induction applied to (42) shows that ã^n_k = 0 for any k ≥ 2. Letting n tend to +∞, we find ã^∞_k = 0 for any k ≥ 2, which guarantees (43). We can then assume that a_1 b_∞ < 1. Under this assumption, Equation (44) reads with ā_l = a_l/(1 − a_1 b_∞). The same line of argument shows that (a^{*∞}_k)_{k≥1} also satisfies (45). We can then apply the induction assumption to complete the proof of (43).

F.2.2 Propagation of the error
We introduce the following notations. Let n ∈ N* and z_1 ∈ Z. We write τ_{z_1} = inf{l > 0 : Z_l = z_1}, and We have the following result.
Proof of Proposition 7. Using the last-exit decomposition (see Section 8.2.1 in [21]), we have with ε_n, a_j, and b_j defined in Proposition 7. The variables α_n and M_n are defined in Proposition 4. In the second equality, we use the fact that e_n(z_1) does not change as long as the state z_1 is not reached. This completes the proof.

F.2.3 Proof of Inequalities (18) and (19)
Proof of Theorem 1. We split the proof into two steps. In Step (i), we show (18), and then in Step (ii) we show (19).
Step (i): In this part, we first prove (18) when the space Z is finite. Then, we show how to extend it to the general case. Sub-step (i-2): In this second sub-step, we show (18) when the space Z is countable. Let ε > 0. Since E[E_1] < ∞, there exists k_0 ∈ N such that ∑_{k≥k_0} E[e_1(z_k)] ν_k < ε/2.
We write A_{k_0} for the set A_{k_0} = {z_k, k ≤ k_0}. Since A_{k_0} is finite, we use Sub-step (i-1) to show the existence of k_1 ∈ N such that ∑_{k≤k_0} E[e_{k_1}(z_k)] ν_k < ε/2, for all k ≥ k_1 and z_k ∈ A_{k_0}.
We now take k ≥ k_1. Using the fact that (E[e_l(z)])_{l≥1} is non-increasing for any z ∈ Z, we get

Step (ii): In this step, we show (19). By applying Lemma 2, we obtain the existence of a constant B such that For n = 0, Inequality (50) is directly satisfied since the initial error e_0 does not depend on the choice of the learning rate. Let γ ∈ Γ. We now assume that e_n^γ ≤ e_n^{γ*} a.s. Using Equation (20) and the definition of (γ*_n)_{n≥0}, we have with g(x) = x − L^2 x^2 / (2B(x + (2 + v_n))). The previous expression of g is obtained by minimisation. A study of the function g shows that it is non-decreasing on the interval [0, x_2), with x_2 = arg sup_{x∈R_+} g(x), and that g(x) ≤ x for any x ∈ R_+. Thus, we necessarily have 1_A e_n^γ ≤ 1_A g(e_{n−1}^γ) ≤ g(x_2) ≤ x_2 for any n ≥ 1 and γ ∈ Γ. Note that when x_2 = +∞, we do not need to assume that 1_A S_n ≤ 0. Using e_n^γ ≤ e_n^{γ*} a.s., the monotonicity of g on [0, x_2), and Equation (52), we get We complete the proof by combining (51), (53), and the induction assumption.

H Proof of Proposition 6
Proof of Proposition 6. We prove this result in three steps. First, we show that v can be approximated by a numerical scheme ṽ^k. Then, we replace ṽ^k by another scheme v^k that also converges towards v. Finally, we show that v^k_n tends to v^k when n → ∞.
Step (i): We start with our initial control problem, in which the agent may choose its trading speed at any time. It has been studied by many authors (see for example [25]), who show that the optimal trading speed verifies with h_1 : [0, T] → R and h_2 : [0, T] → R_+ a positive function. This means that the optimal inventory follows: with Q(0) = q_0 given. Thus, Q verifies Using Equation (55) and basic inequalities, one can show that if we take q̄ ∈ R_+ large enough and place ourselves in S_q̄ = [−q̄, q̄], then Q(t) ∈ S_q̄ whenever Q(0) ∈ S_q̄. Hence, we rewrite the dynamic programming principle as follows: where Ā(t, q) ⊂ S_q̄ is the set of admissible actions. We can focus on the set Ā(t, q) instead of R since the other controls are not optimal. Now, we approximate this problem in a classical way using the numerical scheme ṽ^k defined such that ṽ^k(n_t, n_q) = sup_{ν∈D_a} E[M^{n_t}_{n_t+1} − φ Q^2_{n_t ∆_t} ∆ + ṽ^k(n_t + 1, n^ν_{q+1}) | Q_{n_t ∆_t} = n_q ∆_q], for all (n_t, n_q) ∈ D_T × D_q, with M^{n_t}_{n_t+1} = M^{n_t ∆_t}_{∆_t(n_t+1)}, n^ν_{q+1} the index such that Q^ν_{(n_t+1)∆_t} = n^ν_{q+1} ∆_q, and Ā = A ∩ {i∆_q, i ∈ Z}. The convergence of (ṽ^k)_{k∈(N*)^2} towards v on the set D_T × D_q as k → ∞ is standard.
Let us show that ṽ^k and v^k have the same limit. For this, we use a backward recurrence. For the moment, we assume that sup_ν E[M^{n_t}_{n_t+1}] − E[sup_ν M^{n_t}_{n_t+1}] ≤ K∆_t^2, and we will prove it

Figure 1: L^2-error for the estimation of the drift, when γ_k is constant (orange) and when γ_k ∝ 1/k (blue).

Figure 2: The L 2 -error between f k and f for different numerical methods averaged over 1000 simulated paths.

Figure 4: Comparison of the optimal controls after 300 iterations for different methods: left, the theoretical optimal control; middle, RL with a step size derived from the upper level; right, our optimal policy for the step size (i.e. upper level and inner level combined).

Figure 5: The log L 2 -error against the number of iterations averaged over 1000 simulated paths.

(Implementation 1, continued)
18:   γ^o(t) ← max(γ^o(t)/2, 0.01) (this is done each w episodes)
19: end if
20: end for

Implementation 2 PAst Sign Search (PASS) for (RL) optimal placement problem
 1: Algorithm parameters: step size (γ^o_n)_{n≥0} ∈ (0, 1], number of episodes n, initial guess q_0, past error value E_past; initialise γ̃_0 = γ^o_0
 2: for episode in n do
 3:   Select initial state X_0

with $\tilde{\mu}^n_k = e^{-\sum_{j=n-k+1}^{n} r_j}$ and $a^{*\infty}_k = \lim_{n\to\infty} a^{*n}_k$. The sequence $(a^{*n}_k)_{k\geq 1}$ is defined recursively such that $a^{*1}_k = a_k$ and $a^{*(n+1)}_k = \sum_{l=1}^{k} a_{k+1-l}\, b_{n-k+l}\, a^{*n}_l$ for all $k \geq 1$.

for any ε > 0 when n becomes large enough. By summing the previous inequality over all the possible values of k, we get lim Thus, there exists B ≥ 0 such that $a^{\mu,b}_n \leq B \sum_{k=1}^{n} \tilde{\mu}^n_k\, \tilde{a}^{*\infty}_{n-k}$ for all n ≥ 2, which proves (38).

G Proof of Proposition 5
with μ̃^n_k and ā^{*∞} defined in Equation (38). We can then use Equation (47) to get v_n ≤ B Since α_n converges towards 1 almost surely, the variable b_n is bounded from below, and thus there exists a constant B̄ such that E[e_n(z_1)] ≤ B̄ For simplicity, we prove the result for Algorithm 1; the same argument holds for Algorithm 3. Let us prove by induction on n ∈ N that e_n^γ ≤ e_n^{γ*} a.s., for all γ ∈ Γ.