The effect of model uncertainties in the reinforcement learning based regulation problem: An experimental case study with inverted pendulum

This work aims to improve the classical white‐box model of an inverted pendulum in order to reach a more accurate representation of an actual pendulum‐on‐a‐cart system. The purpose of the model is to train different controllers based on machine learning algorithms. The inverted pendulum system considered in this paper is driven by a belt drive that is controlled by a stepper motor. Due to the nature of the controller, the input to the stepper motor is a non‐smooth, bang‐bang‐like signal that moves the cart to the left or right, or stops it. One of the main challenges in this case is to find a proper function to model the stepper motor, as its dynamics cannot be captured by a constant gain. It is shown that the transient behaviour of the stepper motor when changing direction or stopping is not negligible for the closed‐loop control performance. Accordingly, a grey‐box scheme, which accounts for the uncertainties not included in the vanilla white‐box model, is utilized to achieve a lower model mismatch with the actual pendulum. Initially, the equation of motion is derived using the Euler‐Lagrange equation with the force on the cart as the control input. In the real‐time experiment, however, the interface is realized by the stepper motor's frequency modulator; hence a transfer function representing the relationship between the frequency and the force applied to the cart (in the model) is identified as a black‐box model. To improve the accuracy of this transfer function, an experimental data‐driven design is performed based on modern schemes in system identification. For this purpose, the frequency applied to the stepper motor and the states of the actual system are recorded. The force applied to the cart is then calculated using the equation of motion and the recorded states.
It is also shown that the uncertainty of the frequency‐force transfer function due to exogenous disturbances is non‐negligible; with the aim of obtaining a more accurate model, an artificial neural network is therefore introduced. Finally, the effectiveness of this grey‐box model is demonstrated by training and implementing a deep Q‐network based controller to swing up and balance the inverted pendulum.


INTRODUCTION
The recent rise in the use of artificial intelligence (AI) and machine learning (ML) in the context of control and robotics can be attributed to improvements in algorithms and an increase in computing power. In any field, the competency of these techniques highly depends on the availability of reliable data. Reinforcement learning (RL) algorithms like deep Q-learning (DQN) [1] or policy gradient methods [2, 3] are successful in performing certain control tasks, but they require a lot of data to perform effectively. For this reason, these algorithms are trained in simulation rather than on the actual physical system, due to limitations in acquiring rich real-time data. There are two main disadvantages of online training using RL algorithms. The first is the cost of generating data from the real physical system: running the actual system for a long period of time may not only be harmful to its life but can also require a lot of energy. In the work of Deisenroth et al. [4], approximation using a Gaussian process successfully reduced the number of epochs required to train an RL-based controller for different control tasks. The second is that guaranteeing safety while acquiring data can be challenging. For example, teaching a controller to successfully park a car [5] can require a lot of unsuccessful trials before the controller can perform the task autonomously. Also, depending on the complexity of a system, it might take a long time for an AI- or ML-based controller to learn a specific task. So most of the time, it is advantageous to obtain a model of the system and the environment that can accurately represent it, along with the constraints present in the real physical system. Due to the complexity of controlling a system with an unstable equilibrium, the inverted pendulum (IP) is one of the most commonly used benchmark problems for testing different control algorithms. A lot of studies have been conducted in the context of controlling an IP system [6], especially using RL agents [7, 8]. Most of these studies focus on algorithm development and implementation in simulation. In our previous work [9], the swing-up and balancing task was implemented on a real cart-pole type IP system. The swing-up part was performed by an RL-DQN-based controller and the balancing part by a PID controller. The RL-based DQN agent was not adequate to balance the pendulum rod for a longer duration, and for that reason the PID controller was employed. The main reason for this problem is that the model used for training the controller in simulation was not accurate. In this work, we aim to obtain a better representation of the actual system with a lower modelling error and then use this model to train the RL-based DQN agent or controller in simulation. This results in an improvement in performance when the controller is deployed to control the actual system.
This paper is arranged in five sections. The first section introduces the aim of this work. The introduction is followed by a background of the system modelling, where our previous work and the model used are explained briefly. In the methodology of modelling and control section, the grey-box modelling technique, which reduces the modelling error, is explained. In the subsequent section, the results of the grey-box modelling for enhancing the control efficiency are shown, and finally, the paper is concluded.

BACKGROUND OF THE SYSTEM MODELLING
As motivated in the introduction, the uncertainty of a white-box model propagates into the model-based control law. In order to improve the robustness of the stabilizing controller, a grey-box model is employed to capture unmodelled dynamics. The benchmark IP system in this study is explained in more detail in the literature [9]. A model of the IP system is derived by means of the Euler-Lagrange equations of motion and is given by

(M + m) ẍ + m l θ̈ cos θ − m l θ̇² sin θ = F,
(J + m l²) θ̈ + m l ẍ cos θ − m g l sin θ = 0,    (1)

where θ is measured from the upright position and M and m denote the masses of the cart and the pendulum, respectively.

TABLE 1 Nominal parameters of the IP system.
Here l is half the length of the pendulum and J is the moment of inertia of the pendulum about its center of mass.
This model is then used to train a DQN agent [10] to swing up and balance the pendulum. Due to a high level of modelling uncertainty, the balancing part of the task is accomplished by introducing a PID controller when implemented on the actual system. In order to enhance the robustness of the closed-loop system, the DQN-based controller is trained on the extended grey-box model. Here we improve the model and show how effective the DQN-based controller is when trained using a more accurate model. Essential nominal parameters of the IP system are taken directly from [9] and presented in Table 1. The state [9] of the system is composed of the angle (θ) and angular velocity (θ̇) of the pendulum, and the position (x) and velocity (ẋ) of the cart. F is the force on the cart and g is the acceleration due to gravity. The schematic of the actual IP system is shown in Figure 1. The pendulum is hinged to the cart, which is guided by a rail; the cart is driven by a belt-and-pulley system, which in turn is driven by a stepper motor. In the model of the system, the actuation system is not accurately considered: in the literature [9], a constant gain is used as a transfer function between the input to the stepper motor and the force on the cart. In this work, we use a grey-box modelling technique to include unknown dynamics and non-linearities of the IP system, such as the cart's drive mechanism. Also, to accurately measure the position of the cart, a linear magnetic encoder is attached to the guide rail, as opposed to the observer estimating it in the literature [9].
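To make the white-box model concrete, the equations of motion (with the force F on the cart as input) can be integrated numerically. The following sketch uses illustrative parameter values, not the nominal values of Table 1, and takes θ from the upright position:

```python
import numpy as np

# Illustrative parameters (the actual nominal values are those of Table 1).
M, m = 1.0, 0.1      # cart mass, pendulum mass [kg]
l, g = 0.25, 9.81    # half pendulum length [m], gravity [m/s^2]
J = m * l**2 / 3     # pendulum moment of inertia about its centre of mass

def white_box_step(state, F, dt=0.01):
    """One explicit-Euler step of the cart-pole equations of motion.

    state = (theta, theta_dot, x, x_dot); F is the force on the cart [N].
    """
    theta, theta_dot, x, x_dot = state
    s, c = np.sin(theta), np.cos(theta)
    # Solve the coupled linear system for the accelerations:
    #   (M + m) x_dd + m l c theta_dd = F + m l theta_dot^2 s
    #   m l c x_dd + (J + m l^2) theta_dd = m g l s   (theta from upright)
    A = np.array([[M + m,     m * l * c],
                  [m * l * c, J + m * l**2]])
    b = np.array([F + m * l * theta_dot**2 * s,
                  m * g * l * s])
    x_dd, theta_dd = np.linalg.solve(A, b)
    return (theta + dt * theta_dot, theta_dot + dt * theta_dd,
            x + dt * x_dot, x_dot + dt * x_dd)
```

With zero input force, a small initial angle grows, reflecting the unstable upright equilibrium that makes the IP a demanding benchmark.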

METHODOLOGY OF MODELLING AND CONTROL
The novelty of this paper is the grey-box modelling. The available equation of motion of the IP system and data from the real system are combined to improve the IP model's accuracy. The improved model is then utilized to train RL-based agents or controllers, which show improved performance when deployed on the actual system. The complete methodology is explained in detail in the following two subsections.
FIGURE 1 Schematic of the physical IP system. IP, inverted pendulum.
FIGURE 2 Schematic of the grey-box modelling methodology.

Grey-box modelling
Classical white-box modelling is a technique where a model of the physical system is known. In the case of a dynamic system, it relies on the physics of the system for deriving the equation of motion. But if the system has unknown dynamics and non-linearities, it can be cumbersome to obtain a complete equation that describes the system. This problem can be alleviated using the grey-box modelling technique. For example, in the context of this work, Equation (1) does not accurately represent the physical IP system. This is because the actual input to the system is the frequency of the PWM signal for the stepper motor, which drives the cart through a belt-and-pulley drive, and not the force on the cart, F, as it is in the white-box model. For the model to accurately represent the physical system, it is essential to bridge this gap. The grey-box modelling methodology uses data from the actual system together with the available physics to generate a transfer function for the dynamics that are not considered in the white-box model. Figure 2 shows a schematic of this methodology. The white-box model is preceded by a transfer function, portrayed by the yellow box in Figure 2, that represents the dynamics captured from the actual data. The input to this yellow box is the actual input to the physical system, while its target output is calculated using the white-box model. Next, various system identification techniques, including artificial neural networks, are explored to determine this block. This part is elaborated with examples in the next section. One advantage of using a grey-box model over a black-box one [11] is related to the complexity of the model: a less computationally complicated model is sufficient to represent the unknown dynamics, as the dominant system behaviour is already captured by the equations of motion.
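As a minimal sketch of this composition, the yellow box can be thought of as any causal map from motor frequency to force. The stand-in below uses a static gain with a first-order lag (all values illustrative; the paper ultimately identifies the block from measured data, including with a neural network):

```python
# The stepper-motor frequency first passes through the identified actuator
# block (the yellow box of Figure 2); the resulting force is what drives
# the white-box model. Gain and time constant here are illustrative.

class ActuatorBlock:
    """Hypothetical frequency-to-force map: static gain with a lag."""

    def __init__(self, gain=0.02, tau=0.05, dt=0.01):
        self.gain = gain                  # N per Hz (illustrative)
        self.alpha = dt / (tau + dt)      # discrete first-order filter
        self.force = 0.0

    def __call__(self, freq_hz):
        # Low-pass filter the static estimate to mimic the motor transient
        # observed when the cart changes direction or stops.
        self.force += self.alpha * (self.gain * freq_hz - self.force)
        return self.force                 # force fed to the white-box model
```

The lag captures qualitatively why a constant gain is insufficient: the force does not jump instantaneously when the commanded frequency changes.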

Reinforcement learning based controller training
The overall process of building the grey-box model and using it to train an RL agent is described in a step-by-step manner as follows.
Step 1 Train the controller using the white-box model and a suitable RL algorithm.
Step 2 Deploy the controller on the actual system and record the input-output data. If the performance of the controller is not satisfactory, proceed to Step 3.

Step 3 Follow the grey-box modelling process from the previous subsection using the captured input-output data and the white-box model.
Step 4 Use the grey-box model to retrain the already trained controller.
Step 5 Deploy the final controller and compare the performance.
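The five steps can be sketched as a small driver routine in which the concrete operations (training, deployment, identification, performance check) are injected as callables; all names below are illustrative, not part of the paper's implementation:

```python
def grey_box_workflow(train, deploy, identify, performance_ok):
    """Sketch of Steps 1-5 with the concrete operations passed in.

    train(model)       -> controller   (used in Steps 1 and 4)
    deploy(controller) -> io_data      (used in Steps 2 and 5)
    identify(io_data)  -> grey_model   (Step 3)
    performance_ok(io_data) -> bool    (decision after Step 2)
    """
    controller = train("white-box")            # Step 1: train on white-box
    io_data = deploy(controller)               # Step 2: record input-output data
    if not performance_ok(io_data):
        grey_model = identify(io_data)         # Step 3: build grey-box model
        controller = train(grey_model)         # Step 4: retrain the controller
        io_data = deploy(controller)           # Step 5: deploy and compare
    return controller, io_data
```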

RESULTS OF THE GREY-BOX MODELLING FOR ENHANCING THE CONTROL EFFICIENCY
In this section, the results are discussed to show the effectiveness of the methodology. This section is divided into three parts. The first part shows the presence of model inaccuracy in the white-box model. The subsequent subsection shows how the grey-box modelling decreases the modelling error. Finally, in the last subsection, the benefit of using the more accurate model while training the RL agent is highlighted. It is noteworthy that even though the system dynamics (1) are invariant with respect to x, the initial condition dictates the evolution of x, and so for comparison it is kept constant.
FIGURE 3 System response recorded from the simulated white-box model and the actual system. IP, inverted pendulum.

Inaccuracy of the white-box model
For the purpose of validating the white-box model, the same input signal is applied to the model and the actual system. As a DQN agent will be used for swinging up and balancing the pendulum, the input signal is generated by modifying a sine sweep signal; it is shown in the first graph in Figure 3. The DQN action space has three discrete values corresponding to applying a force of 10 N towards the left or the right, or stopping completely. The rest of the graphs in Figure 3 show the comparison between the states recorded from the actual pendulum and the states estimated by the white-box model. From the graphs, the deficiency of the model is evident, especially in the angular position of the pendulum. Thus a more accurate model is essential for performing tasks like balancing, where controlling θ is critical.
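A hypothetical reconstruction of such an excitation signal is a linear chirp quantised to the agent's three discrete actions; the sweep range, dead-band, and force magnitude below are illustrative assumptions, not the paper's exact values:

```python
import numpy as np

# Modified sine sweep: a linear chirp mapped to the three-level action set
# {-force, 0, +force}, mimicking the bang-bang-like stepper-motor input.

def swept_three_level_input(t, f0=0.1, f1=2.0, T=20.0, dead_band=0.3, force=10.0):
    """Chirp from f0 to f1 Hz over T seconds, quantised to three levels."""
    phase = 2.0 * np.pi * (f0 * t + (f1 - f0) * t**2 / (2.0 * T))
    s = np.sin(phase)
    return np.where(s > dead_band, force,
                    np.where(s < -dead_band, -force, 0.0))

t = np.arange(0.0, 20.0, 0.01)        # 20 s at 100 Hz sampling (illustrative)
u = swept_three_level_input(t)        # three-level excitation, as in Figure 3
```

A sweep of this kind excites the system over a band of frequencies while respecting the discrete action set the trained agent will actually use.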
FIGURE 4 RMS errors of the states of the IP for different models. IP, inverted pendulum; RMS, root-mean-square.

FIGURE 5 Structure of the NN used in the model, with ReLU as the activation function and 6 neurons in the hidden layer.

Effectiveness of the grey-box model
Now, with the objective of improving the model, the grey-box modelling technique described in Section 3.1 is implemented. At first, a new input signal, similar to the signal in the preceding subsection, is applied to the IP set-up, and the states are recorded from the data acquisition system; these are used to calculate the force on the cart, F, from Equation (1).
To realize a transfer function, or a model, that can accurately predict the calculated force on the cart from the input signal to the system, three different approaches are considered: a third-order state-space model [12], a non-linear ARX model with a polynomial of order three [13], and a simple feedforward NN. The state-space and ARX models are trained using the System Identification Toolbox in MATLAB [13].
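To illustrate the regression structure behind the ARX approach, the sketch below fits a linear ARX model by least squares, predicting the calculated force from lagged forces and lagged inputs. The paper's model is non-linear (polynomial order three) and fitted with the MATLAB toolbox; this linear version only shows the structure of the regressors:

```python
import numpy as np

# Linear ARX sketch: F_k = sum_i a_i * F_{k-i} + sum_j b_j * u_{k-j},
# where u is the stepper frequency and F the force computed from (1).

def fit_arx(u, F, na=3, nb=3):
    """Least-squares fit of the ARX coefficients [a_1..a_na, b_1..b_nb]."""
    u, F = np.asarray(u, float), np.asarray(F, float)
    n0 = max(na, nb)
    # Each regressor row stacks the na most recent forces and nb most
    # recent inputs, newest first.
    rows = [np.concatenate((F[k - na:k][::-1], u[k - nb:k][::-1]))
            for k in range(n0, len(F))]
    Phi, y = np.asarray(rows), F[n0:]
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta
```

On noise-free data generated by a known ARX law, the least-squares fit recovers the coefficients exactly, which is a useful sanity check before moving to the non-linear variant.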
The grey-box model with the NN significantly reduces the modelling error, which can be validated from the RMS errors calculated between the simulation results and the actual system, shown in Figure 4. The network takes the frequency of the stepper motor and the previous states as input and predicts the force on the cart. The schematic of the network is shown in Figure 5. When the same input signal from Figure 3 is applied to the grey-box model with the trained NN, the estimated states are as shown in Figure 6. This model has better modelling accuracy than the vanilla white-box model, particularly in estimating θ and θ̇.
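A forward pass of a network with the structure of Figure 5 can be sketched as follows; the weights here are random placeholders, whereas in the paper they are fitted to the force computed from the equation of motion and the recorded states:

```python
import numpy as np

# Feedforward NN with one hidden layer of 6 ReLU neurons, mapping
# [stepper frequency, previous states] to the predicted force on the cart.

rng = np.random.default_rng(0)
n_in = 5                                   # frequency + (theta, theta_dot, x, x_dot)
W1, b1 = 0.1 * rng.normal(size=(6, n_in)), np.zeros(6)   # placeholder weights
W2, b2 = 0.1 * rng.normal(size=(1, 6)), np.zeros(1)

def predict_force(freq, prev_state):
    z = np.concatenate(([freq], prev_state))   # input vector, length 5
    h = np.maximum(0.0, W1 @ z + b1)           # hidden layer: 6 ReLU neurons
    return float((W2 @ h + b2)[0])             # linear output: force estimate
```

Feeding the previous states alongside the frequency lets the network capture the state-dependent part of the actuator behaviour that a static frequency-to-force gain cannot.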
FIGURE 6 System response recorded from the simulated grey-box model and the actual IP system. IP, inverted pendulum.

FIGURE 7 Balancing of the IP by the DQN agent which is trained using the white-box model. DQN, deep Q-learning; IP, inverted pendulum.

Training the agent using the grey-box model
The improved grey-box model is used to train the controller in simulation before deploying it to perform the swing-up and balancing task. To compare the effectiveness of the improved model, the performance of the agent trained using only the vanilla white-box model is also recorded. Figures 7 and 8 show the states of the IP for both cases. The agent trained using the improved model performs better: while the swing-up process is similar, it is able to balance the pendulum for a longer duration. From the graph of θ in Figure 8, the agent is able to maintain the balance for more than 5 s at least three times. On the other hand, without the better model, the agent is only able to swing up the pendulum, which then keeps rotating repeatedly.
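One simple way to quantify "balanced for more than 5 s" from a recorded angle trace is the longest contiguous run of samples with |θ| below a threshold; the sample time and threshold below are illustrative choices, not the paper's evaluation criterion:

```python
import numpy as np

# Longest contiguous time [s] with the pendulum near upright, computed
# from an angle trace sampled every dt seconds.

def longest_balanced_interval(theta, dt=0.01, threshold=0.2):
    """Longest run of samples with |theta| < threshold [rad], in seconds."""
    balanced = np.abs(np.asarray(theta, float)) < threshold
    best = run = 0
    for ok in balanced:
        run = run + 1 if ok else 0   # extend the run or reset it
        best = max(best, run)
    return best * dt
```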
FIGURE 8 Balancing of the IP by the DQN agent which is trained using the grey-box model. DQN, deep Q-learning; IP, inverted pendulum.

CONCLUSION
In this work, it is shown that model mismatch can lead to undesirable results in RL-based regulation problems. For a physical IP swing-up and balancing task, the performance of the RL-based controller is then enhanced by improving the training model using the grey-box modelling technique described here. This technique can easily be generalized to other applications where the white-box model cannot accurately describe the actual system. From Figure 6, on one hand, the improvement in the estimation of θ and θ̇ by the grey-box model is significant; on the other hand, an increase in the RMS error for x is observed when compared to the white-box model, which needs further investigation. If the modelling error for all the states of the IP can be decreased further, the agent will perform even better. In outlook, the implementation of other RL-based algorithms with continuous action spaces will be considered. In the future, the IP system will also be upgraded in terms of hardware so that a quantitative analysis can be carried out of the energy required by the controller to train online versus training with the improved model in simulation.

ACKNOWLEDGMENTS
Open access funding enabled and organized by Projekt DEAL.