Differences and similarities between reinforcement learning and the classical optimal control framework

In this contribution, we discuss Reinforcement Learning as an alternative way to solve optimal control problems. Especially for biomechanical models, well-established classical techniques can become complex and time-consuming, because biomechanical models often have many more actuators than degrees of freedom. Furthermore, the solution obtained by such a technique is normally only applicable to one specific setting: even a slight change of the initial value of the model or of the desired end position renders the computed solution useless. We give a short overview of Reinforcement Learning and apply it to an optimal control problem containing the challenges mentioned above. We use an algorithm that updates the weights and biases of a neural network, which takes the role of a controller, based on simulated trajectories of the model generated with the current network.


Optimal control techniques
We focus on the usability of Reinforcement Learning (RL) [1] for finding the optimal control of biomechanical models performing desired tasks. Typically, such an optimal control problem consists of equations of motion representing the biomechanical model, which form the constraints of the optimization problem and ensure that the motion stays consistent with the model. Additionally, an objective function is given, and one is typically interested in minimizing (resp. maximizing) this function with respect to the control. There are several well-established ways to solve such a problem.
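As a reference point, the classical problem described above can be sketched in a generic discrete-time form; the symbols f, ℓ and Φ below are generic placeholders for the discretized dynamics, running cost and terminal cost, not notation taken from the model considered later:

```latex
\min_{u_0,\dots,u_{N-1}} \; \sum_{t=0}^{N-1} \ell(x_t, u_t) + \Phi(x_N)
\quad \text{s.t.} \quad x_{t+1} = f(x_t, u_t), \qquad x_0 = \hat{x}_0 .
```

Classical direct methods discretize and solve exactly this nonlinear program, which is why the full equations of motion f must be known.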
RL benefits from a different representation of this optimal control problem. It is assumed that, instead of the discrete equations of motion, only a transition probability is known. That means only a distribution is needed that describes the transition from the state x_t ∈ S, where S is the state space, at time t to the state x_{t+1} ∈ S at time t + 1. Here, a Markov property of the model is assumed; thus, the transition distribution from time t to time t + 1 depends only on the state x_t ∈ S and a given control (or action) u_t ∈ A from the control space A, and not on previous states and controls. This transition probability is denoted by P_{u_t}(x_t, x_{t+1}). This is the only restriction RL imposes on the dynamical model, which is a huge advantage compared to the classical framework, where the equations of motion must be available. It allows, for instance, applying RL to real-world models where the equations are not fully known or where disturbances cannot be modelled. The only part still missing in the RL framework is the objective function, which is necessary to define the desired task. For this, a reward function r : S × A → R is defined, which rates the current situation with respect to the task. Overall, the quadruple (S, A, P_u, r) forms a Markov decision process [2].
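The key point above is that the dynamics are accessed only through sampled transitions x_{t+1} ~ P_{u_t}(x_t, ·), together with a reward r(x_t, u_t). A minimal sketch of such an interface, using a toy double integrator with Gaussian disturbances (all names, constants and the dynamics are our illustrative choices, not the paper's arm model):

```python
import numpy as np

class NoisyIntegratorMDP:
    """Toy Markov decision process: a double integrator with disturbances.

    State x = (position, velocity); control u = acceleration.
    The dynamics are exposed only through sampled transitions, mimicking a
    setting where no explicit equations of motion are available to the solver.
    """
    def __init__(self, dt=0.1, noise_std=0.01, seed=0):
        self.dt = dt
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)

    def step(self, x, u):
        """Sample x_{t+1} ~ P_{u_t}(x_t, .)."""
        pos, vel = x
        x_next = np.array([pos + self.dt * vel, vel + self.dt * u])
        # Unmodelled disturbances make the transition stochastic.
        return x_next + self.rng.normal(0.0, self.noise_std, size=2)

    def reward(self, x, u):
        """r : S x A -> R, here penalizing distance to the origin and effort."""
        return -(x[0] ** 2 + 0.1 * u ** 2)
```

An RL algorithm only ever calls `step` and `reward`; it never inspects the dynamics themselves.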
RL algorithms aim to maximize the expected value of the summed-up rewards of one simulation trajectory. There are several ways to do this. First, one has to distinguish between policy-based algorithms, which directly train a controller (also called policy), and value-function-based algorithms, which improve step by step a so-called value function that can be used to evaluate the next action. Our application trials convinced us to use a policy-based algorithm, where we parametrize the policy by a neural network. To be more precise, the policy π_θ(u_t | x_t), which gives the probability of the next control u_t for a known current state x_t, consists of a neural network that takes x_t as input and outputs the parameters of a Gaussian distribution from which the next control is sampled. The weights and biases of the neural network (represented by θ) can now be improved iteratively in order to obtain a better controller. These improvements result from iterative methods for the following RL optimization problem:

max_θ E_{τ∼π_θ} [ Σ_t r(x_t, u_t) ],

where τ denotes the set of state-control pairs of a trajectory. Conceivable algorithms for solving this optimization problem are, among others, gradient-based methods [3] or trust-region methods [4].
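The Gaussian policy described above can be sketched as follows: a small network maps the state to the mean of a Gaussian, while the log standard deviation is a free parameter of θ. All layer sizes, names and the single-hidden-layer architecture are illustrative assumptions, not the architecture used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions; the real model has a much larger state.
state_dim, hidden_dim, control_dim = 4, 16, 2

# theta collects all trainable parameters of the policy pi_theta(u_t | x_t).
theta = {
    "W1": rng.normal(0.0, 0.1, (hidden_dim, state_dim)),
    "b1": np.zeros(hidden_dim),
    "W2": rng.normal(0.0, 0.1, (control_dim, hidden_dim)),
    "b2": np.zeros(control_dim),
    "log_std": np.zeros(control_dim),  # state-independent exploration noise
}

def policy(theta, x):
    """Network forward pass: state -> (mean, std) of the Gaussian over u_t."""
    h = np.tanh(theta["W1"] @ x + theta["b1"])
    mean = theta["W2"] @ h + theta["b2"]
    return mean, np.exp(theta["log_std"])

def sample_control(theta, x):
    """Draw the next control u_t ~ pi_theta(. | x_t)."""
    mean, std = policy(theta, x)
    return mean + std * rng.normal(size=mean.shape)
```

A policy-gradient or trust-region method then adjusts θ so that sampled trajectories accumulate higher expected reward.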

Human arm
We consider the biomechanical model studied in [5]. The multibody model of a human arm consists of rigid bodies representing the bones and 29 Hill-type muscle models actuating the arm (see Fig. 1b). The goal of the motion is to reach a target with the finger tip of the arm. We train one controller for all points on an eighth of a circle (represented by ten points in Fig. 1b). The neural network receives the state of the model as input as well as the desired end position of the trajectory. In the classical optimal control framework, one would solve a separate problem for each target. This is, besides the ability of RL to handle overactuated systems (systems with many more actuators than degrees of freedom), another useful property of RL. We applied a proximal policy optimization technique [6] based on the implementation by P. Coady [7]. The reward rates every state-action pair, except the last state of a trajectory, with −0.1. The reward for the last one is minus ten times the distance between the finger tip and the desired end position; if the finger tip reaches the end position, a positive reward is given instead. The training takes a few days on a standard laptop (Intel® Core™ i7). Fig. 1b, Fig. 1c and Fig. 1d show snapshots of a trajectory reaching a point. Fig. 1a shows 45 trajectories of the finger tip for random targets on the eighth of a circle. The summed-up reward per training iteration is plotted in Fig. 1e.
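The reward described above can be sketched as a function of the finger-tip position, the target and a terminal flag. The tolerance `tol` and the success bonus `success_bonus` are our illustrative choices; the text specifies only the −0.1 step cost, the −10 × distance terminal term, and that success yields a positive reward:

```python
import numpy as np

def reward(fingertip_pos, target_pos, is_last_step,
           tol=0.01, success_bonus=10.0):
    """Reward sketch for the reaching task.

    Every non-terminal step costs -0.1, encouraging fast motions. At the
    last step, the reward is -10 times the distance to the target, replaced
    by a positive bonus if the finger tip has reached the target.
    """
    if not is_last_step:
        return -0.1
    dist = np.linalg.norm(np.asarray(fingertip_pos) - np.asarray(target_pos))
    if dist < tol:
        return success_bonus  # target reached: positive terminal reward
    return -10.0 * dist
```

Because the target position is also part of the network input, the same reward shape is reused for every sampled target on the eighth of a circle.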

Conclusion
We have seen that RL is suitable for solving optimal control problems with more actuators than degrees of freedom, such as the model of a human arm. A policy can be trained that also handles small changes in the desired end position. Thus, RL is able to handle the challenges posed by the biomechanical model and its task. The generated movement looks human-like and plausible.