Autonomous boat driving system using sample‐efficient model predictive control‐based reinforcement learning approach

In this article, we propose a novel reinforcement learning (RL) approach specialized for autonomous boats: sample-efficient probabilistic model predictive control (SPMPC), which iteratively learns control policies for boats in real ocean environments without prior human knowledge. SPMPC addresses the difficulties arising from the large uncertainties of this challenging application, the need for rapid adaptation to dynamic environmental conditions, and the extremely high cost of exploring and sampling with a real vessel. It combines a Gaussian process (GP) model and model predictive control (MPC) under a model-based RL framework to iteratively model and quickly respond to uncertain ocean environments while maintaining sample efficiency. An SPMPC system is developed with features including a quadrant-based action search rule, bias compensation, and parallel computing that contribute to better control capabilities. It successfully learns to control a full-sized single-engine boat, equipped with sensors measuring GPS position, speed, direction, and wind, in a real-world position holding task without models from human demonstration.

However, fully autonomous USVs remain a more distant ambition, since human intervention is still often necessary in real industrial maritime operations (United Nations Conference on Trade and Development, 2018). For example, the recent work of Eriksen et al. (2019) achieved excellent results in real USV collision avoidance by combining MPC with other traditional control methods. However, it requires a dynamical model based on human knowledge to predict USV behaviors, and the parameters of the controller were heuristically selected for good performance. Another recent work (Skulstad et al., 2019) trained a neural-network controller for autonomous ship driving, which requires pre-prepared samples for supervised learning. Its implementation was therefore limited to simulation due to the expensive cost of collecting real training data.
As an integral part of contemporary machine learning, reinforcement learning (RL; Sutton & Barto, 1998) enables agents to learn optimal or suboptimal control policies from unknown environments via trial-and-error interactions (Kober, Bagnell, & Peters, 2013), and therefore presents itself as an appealing prospect for fully autonomous USVs. An RL-based autonomous USV approach could iteratively learn control policies adaptive to different environments without prior human knowledge or heuristic parameter tuning. Although RL has been previously explored in both autonomous ground (Vincent & Sun, 2012; Williams, Drews, Goldfain, Rehg, & Theodorou, 2018) and air (Kim, Jordan, Sastry, & Ng, 2004; Tran et al., 2015) vehicles, its application to USVs remains relatively limited (Liu, Zhang, Yu, & Yuan, 2016). Some recent works applied state-of-the-art deep RL algorithms to USV path following and collision avoidance (Zhao, Roh, & Lee, 2019; Meyer, Robinson, Rasheed, & San, 2020) but were limited to simulation tasks. The scarcity of RL for real USV control is possibly driven by:

1. The difficulty of predicting the uncertainties in a dynamic ocean environment, for example, frequent disturbances due to unpredictable wind and ocean current, and signal noise.
2. The difficulty of quickly providing suitable control signals in such a rapidly changing and often unpredictable environment.
3. The very high sampling cost when using real USVs for data collection.
The next section summarises several existing RL studies that mitigate these difficulties.

| Related works
As a potential solution to naturally modeling system dynamics with uncertainties, Gaussian processes (GPs; Rasmussen & Williams, 2006) represent the dynamics as Gaussian-distributed random variables. Cao, Lai, and Alam (2017) proposed a controller combining a GP model with MPC in an online optimization framework suited to quickly changing situations, handling both the prediction of uncertainties and rapid responses to changing environmental conditions. Unlike an RL-based approach, it learned the GP model from precollected training data without exploration, and utilized a robust MPC controller to successfully control an unmanned quadrotor in simulation while considering state uncertainties.
Model-based RL (Polydoros & Nalpantidis, 2017) learns policies from a trained model instead of the environment for better sample efficiency. Ghavamzadeh, Engel, and Valko (2016) proposed both GP model-based actor-critic and policy gradient algorithms and investigated them in a simplified boat steering task in simulation. A GP-based temporal difference RL approach was proposed in John, Jinkun, and Brendan (2018) and tested in an autonomous submersible navigation task in an indoor pool. PILCO (Deisenroth, Fox, & Rasmussen, 2013), a state-of-the-art model-based RL method, reduces model bias by explicitly incorporating GP model uncertainty into planning and control. Assuming the target dynamics are fully controllable, it learns an optimal policy by long-term planning from the initial state. However, applying PILCO to USVs is difficult due to unforeseeable disturbances such as wind and current. Proper feedback control against these disturbances by replanning is computationally demanding, since a large number of parameters in a state-feedback policy must be optimized, while ignoring the disturbances in the long-term planning may result in poor control performance due to accumulated model error. Williams et al. (2018) introduced MPC into model-based RL and successfully implemented it for driving autonomous ground vehicles. That study approximates the dynamics model by neural networks, whose formulation makes it difficult to follow a fully Bayesian formalism for naturally considering state uncertainties. It would also require a large number of samples for model learning and hyper-parameter tuning.
As the first attempt to combine the respective benefits of GP models, MPC, and model-based RL, Kamthe and Deisenroth (2018) extended PILCO to avoid full-horizon planning by introducing MPC to moderate real-time disturbances within a closed control loop. It successfully showed its sample efficiency in simulated cart-pole and double-pendulum tasks without considering external disturbances. One possible limitation in applying it to challenging real-world control problems is the relatively heavy computational cost, since its optimization is executed in a dimensionality expanded for a deterministic dynamical system with Lagrange parameters and state constraints under Pontryagin's maximum principle (PMP). For autonomous boats, where state constraints are less important, a simpler and more computationally efficient method may be feasible.
From these works, the combination of GP, MPC, and model-based RL is a potentially suitable solution for sample-efficient learning in unpredictable environments. However, its application to challenging real-world tasks remains limited due to the gap between theory and real-world implementation, for example, the computational cost and hardware delays. The motivation of this study is to fill this gap by developing an MPC and GP model-based RL approach specialized for autonomous boats in real ocean environments.

| Contribution
In this article, we present a novel RL approach specialized for autonomous boats: sample-efficient probabilistic model predictive control (SPMPC). Enjoying the sample efficiency of model-based RL, SPMPC iteratively learns a GP model of boat dynamics to increase the robustness of control against unpredictable and frequently changing noises and disturbances. Furthermore, it efficiently optimizes control signals under a closed-loop MPC to reduce the heavy computational cost of the full-horizon planning in Deisenroth et al. (2013). Unlike the method in Kamthe and Deisenroth (2018), SPMPC directly optimizes the long-term cost with neither expanded dynamics nor state constraints, by separating the uncertain state and the deterministic control signal during prediction for computational efficiency.
An instrumented autonomous boat driving system was then built, consisting of a full-sized boat equipped with a single engine and sensors for GPS position, speed, direction, and wind (Figure 1). Several features were proposed, including a quadrant-based action search rule, bias compensation, and parallel computing, to improve the boat's control capabilities. The proposed system was then evaluated in a real-world position holding task.
Experimental results show the capability of the proposed system in terms of both robustness to disturbances and sample efficiency. To complement the real experimental results, several simulation experiments were also conducted to investigate SPMPC's learning behaviors and control performance under more challenging conditions.
Our preliminary work was published as a conference paper (Cui, Osaki, & Matsubara, 2019).¹ This article builds upon the preliminary work as follows:

1. Implementation of a parallel-computation-based communication node to alleviate control delay.
2. Addition of domain knowledge to limit the search range of optimization for improved control capability.
3. Extension of the state with engine speed and rudder angle to reduce the effect of the delay between the control signal and the hardware.
4. Extension of the real-world experiment to a position holding task based on the target reaching task in Cui et al. (2019), with a more detailed analysis of the results.

| Outline
The remainder of this article is organized as follows. Section 2 details the algorithm of the proposed SPMPC. Section 3 describes the SPMPC-based autonomous boat driving system. The real-boat experiment in a position holding task is presented in Section 4. Several simulation experiments are conducted in Section 5 to investigate the RL learning behaviors, model accuracy, the effects of changing algorithm settings, and an extension towards a position reaching and holding task. Finally, discussions and conclusions follow in Sections 6 and 7.

| APPROACH
In this section, the algorithm of SPMPC is detailed. As a model-based RL approach, SPMPC stores its knowledge of the environment in a learned GP model (Section 2.1). Figure 2 shows an example of SPMPC driving a boat. At step t, after observing the current state, SPMPC predicts future states (e.g., boat velocity and direction, shown in opaque gray) with uncertainties caused by disturbances such as wind and current, using the modified moment-matching approach introduced in Section 2.2. Given a long-term cost function, for example, the squared Euclidean distance to the target (red cross), SPMPC then optimizes a control sequence to minimize the cost function, and employs an MPC framework as a quick feedback controller against the dynamic ocean environment with its unpredictable and unobservable disturbances (Section 2.3). By repeating this process at each step, SPMPC controls the boat to minimize the task cost while considering both the uncertainties and the frequently changing dynamics of the challenging ocean environment. We finally introduce how to run SPMPC in an RL process in Section 2.4.

| Gaussian process (GP) model
A GP is a collection of random variables, any finite number of which have a joint Gaussian distribution. It is widely used as a nonparametric regression model (Rasmussen & Williams, 2006). Consider a stochastic dynamical system

$$x_{t+1} = f(\tilde{x}_t) + w, \qquad \tilde{x}_t = (x_t, u_t), \qquad w \sim \mathcal{N}(0, \Sigma_w), \quad (1)$$

where the unknown latent transition function $f$ is modeled, for each output dimension $a$, by a GP with the squared-exponential (SE) covariance function

$$k_a(\tilde{x}, \tilde{x}') = \alpha_a^2 \exp\left(-\frac{1}{2}(\tilde{x} - \tilde{x}')^\top \Lambda_a^{-1} (\tilde{x} - \tilde{x}')\right) \quad (2)$$

with two hyper-parameters: $\alpha_a^2$ is the overall variance of $f_a$, and $\Lambda_a$ is the diagonal matrix of squared characteristic length-scales for each input dimension. Given training inputs $\tilde{X}$ and targets $y$, the GP posterior at a test input $\tilde{x}_*$ is Gaussian with mean and variance

$$m(\tilde{x}_*) = k_*^\top (K + \sigma_w^2 I)^{-1} y, \qquad \sigma^2(\tilde{x}_*) = k_{**} - k_*^\top (K + \sigma_w^2 I)^{-1} k_*,$$

where $k_* = k(\tilde{X}, \tilde{x}_*)$ and $k_{**} = k(\tilde{x}_*, \tilde{x}_*)$.
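For concreteness, a minimal numpy sketch of this GP regression is given below (one independent GP per output dimension; hyper-parameter fitting by marginal-likelihood maximization is omitted, and all names are ours rather than the article's implementation):

```python
import numpy as np

def se_kernel(X1, X2, alpha2, ell):
    """SE kernel of Equation (2): alpha2 * exp(-0.5 (x-x')^T Lambda^{-1} (x-x')),
    with Lambda = diag(ell**2)."""
    diff = X1[:, None, :] / ell - X2[None, :, :] / ell
    return alpha2 * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def gp_posterior(X, y, X_star, alpha2, ell, noise_var):
    """Posterior mean and variance of one output dimension at test inputs X_star."""
    K = se_kernel(X, X, alpha2, ell) + noise_var * np.eye(len(X))
    k_star = se_kernel(X, X_star, alpha2, ell)     # k(X~, x~*)
    k_ss = se_kernel(X_star, X_star, alpha2, ell)  # k(x~*, x~*)
    L = np.linalg.cholesky(K)
    beta = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, k_star)
    mean = k_star.T @ beta
    var = np.diag(k_ss) - np.sum(v ** 2, axis=0)
    return mean, var
```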

| Efficient Gaussian process prediction with uncertainties
In general, considering the model uncertainties in a long-term prediction with a GP is difficult, since the prediction at each step with uncertain input $\tilde{x}_t \sim \mathcal{N}(\tilde{\mu}_t, \tilde{\Sigma}_t)$ follows

$$p(x_{t+1}) = \iint p(f(\tilde{x}_t) \mid \tilde{x}_t)\, p(\tilde{x}_t)\, df\, d\tilde{x}_t, \quad (5)$$

which is a non-Gaussian predictive distribution and cannot be computed analytically. Approximating such an intractable marginalization of the model input by traditional methods such as Monte-Carlo sampling is computationally demanding, especially when it must be repeated for every candidate control sequence during optimization. As one solution, analytic moment matching (Deisenroth, Huber, & Hanebeck, 2009; Girard, Rasmussen, Candela, & Murray-Smith, 2003) approximates the non-Gaussian predictive distribution in Equation (5) by a Gaussian distribution with the same mean and variance, which can be expressed analytically. In this study, we propose a modified moment matching that efficiently optimizes the deterministic control sequence by separating the uncertain state and the deterministic control in the prediction:

$$p(x_{t+1} \mid \mu_t, \Sigma_t, u_t^*) = \int p(f(x_t, u_t^*) \mid x_t)\, p(x_t)\, dx_t.$$

By assuming the state and control signal are independent, the SE covariance function in Equation (2) can be separated as

$$k(\tilde{x}, \tilde{x}') = k_x(x, x')\, k_u(u, u'), \quad (8)$$

$$k_x(x, x') = \alpha^2 \exp\left(-\tfrac{1}{2}(x - x')^\top \Lambda_x^{-1} (x - x')\right), \qquad k_u(u, u') = \exp\left(-\tfrac{1}{2}(u - u')^\top \Lambda_u^{-1} (u - u')\right). \quad (9)$$

Introducing Equations (8) and (9) into Equation (5), we obtain an exact analytical expression of moment matching with deterministic action $u^*$ given the mean and variance of the state (Equation (10)). The expression is detailed in Appendix A.1.
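The separation in Equations (8) and (9) relies on the SE kernel factorizing across input blocks, so the control part can be evaluated deterministically while the state part is marginalized. A quick numerical check of this factorization (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
x, x2 = rng.normal(size=3), rng.normal(size=3)   # state parts of two inputs
u, u2 = rng.normal(size=2), rng.normal(size=2)   # control parts
ell = rng.uniform(0.5, 2.0, size=5)              # length-scales over (x, u)
alpha2 = 1.7                                     # overall variance

def se(a, b, ell, alpha2=1.0):
    return alpha2 * np.exp(-0.5 * np.sum(((a - b) / ell) ** 2))

joint = se(np.r_[x, u], np.r_[x2, u2], ell, alpha2)
factored = se(x, x2, ell[:3], alpha2) * se(u, u2, ell[3:])
assert np.isclose(joint, factored)  # SE kernel over (x, u) = k_x(x, x') * k_u(u, u')
```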

| MPC with multistep prediction
Now we are able to efficiently predict the future state of the GP model with an uncertain input state and a deterministic control via Equation (10). In SPMPC, an MPC controller is employed to provide quick feedback to the constantly changing environment. Defining a one-step cost function $l(\cdot)$ for a specific task, for example, the squared Euclidean distance for target reaching and position holding, SPMPC optimizes an $H$-step control sequence $u_t^*, \ldots, u_{t+H-1}^*$ to minimize the expected long-term cost:

$$u_t^*, \ldots, u_{t+H-1}^* = \mathop{\arg\min}_{u_t, \ldots, u_{t+H-1} \in \mathcal{U}} \; \sum_{h=1}^{H} \gamma^{h-1}\, \mathbb{E}\left[l(x_{t+h})\right],$$

where the $H$-step states are predicted via Equation (10) with consideration of the uncertainties, $\gamma \in [0, 1)$ is the discount parameter that encourages the optimization to focus on more recent states, and $\mathcal{U}$ is the constrained space of control signals. The multistep prediction of $x_s$ is calculated via Equations (6) and (10) given an initial variance $\Sigma_0$. Any constrained nonlinear optimization method can be applied to search for the optimal control sequence; in this study we utilized single-shooting sequential quadratic programming (SQP; Nocedal & Wright, 2006) implemented with the MATLAB optimization toolbox. After optimizing the control sequence on the fly, SPMPC executes only the first control signal $u_t^*$ and then moves to step $t+1$, where it repeats the process of observation and optimization, forming an implicit closed-loop controller that minimizes the cost function while considering both real-time disturbances and the uncertainties/errors of the GP prediction.
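The article's controller uses MATLAB's SQP solver; the sketch below mirrors the same single-shooting receding-horizon loop in Python with SciPy's SLSQP. Here `predict` stands in for the moment-matching propagation of Equation (10) and `cost` for the expected one-step cost; both are placeholders, not the article's code:

```python
import numpy as np
from scipy.optimize import minimize

def mpc_step(x_mean, x_var, predict, cost, H, u_dim, u_bounds, gamma=0.95):
    """Optimize an H-step control sequence and return only the first control.

    predict(mean, var, u) -> (next_mean, next_var)  # Equation (10)
    cost(mean, var) -> scalar expected one-step cost
    """
    def long_term_cost(u_flat):
        u_seq = u_flat.reshape(H, u_dim)
        mean, var, total = x_mean, x_var, 0.0
        for h in range(H):
            mean, var = predict(mean, var, u_seq[h])  # propagate uncertainty
            total += gamma ** h * cost(mean, var)     # discounted expected cost
        return total

    u0 = np.zeros(H * u_dim)  # a warm start from the previous solution also works
    res = minimize(long_term_cost, u0, method="SLSQP", bounds=list(u_bounds) * H)
    return res.x[:u_dim]      # receding horizon: execute u_t* only
```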

| RL process of SPMPC
In this study, SPMPC employs an RL process that iteratively improves the control performance through exploration and therefore adapts to the changing ocean environment. Starting from a GP model initialized with random samples, each RL iteration rolls out the MPC controller, aggregates the observed transitions into the training set, and retrains the model.
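Algorithm 1 itself is not reproduced in this excerpt; the following sketch shows the overall loop as described (random bootstrap samples, then alternating GP retraining and MPC rollouts). `env`, `fit_gp_model`, and `task_cost` are hypothetical placeholders, and `mpc_step` is the sketch from the previous section:

```python
import numpy as np

def spmpc_rl(env, fit_gp_model, task_cost, n_trials, rollout_len, n_init_random):
    """Sketch of the SPMPC RL process: the dataset of (state, control) -> next-state
    transitions grows every rollout, and the GP model is retrained each trial."""
    data, x = [], env.reset()
    for _ in range(n_init_random):           # bootstrap with random controls
        u = env.sample_random_control()
        x_next = env.step(u)
        data.append((np.r_[x, u], x_next))   # GP input is x~ = (x, u)
        x = x_next
    model = None
    for _ in range(n_trials):                # N_trial RL iterations
        model = fit_gp_model(data)           # retrain on all samples so far
        x, var = env.reset(), np.zeros((x.size, x.size))
        for _ in range(rollout_len):         # L_rollout MPC steps per trial
            u = mpc_step(x, var, model.predict, task_cost,
                         H=3, u_dim=env.u_dim, u_bounds=env.u_bounds)
            x_next = env.step(u)
            data.append((np.r_[x, u], x_next))
            x = x_next
    return model
```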

| USV system
As shown in Figure 1, the boat used in this study is a Nissan JoyFisher 25 (length: 7.93 m, width: 2.63 m, height: 2.54 m) fitted with a single steerable SUZUKI DF150AP outboard engine and two sensors: a Furuno SC-30 GPS position/direction/speed sensor and a Furuno WS200 wind sensor. Tables 1 and 2 present all observed states and control signals in this system. The observed states include the vessel's position, velocity, and direction, the relative wind speed and direction, and the real values of the engine throttle and rudder angle. The control signals are the rudder steering angle and the engine throttle/gear shifts, which respectively alter the vessel's angular and linear velocities. Note that the vessel is not equipped with a water-current sensor. Therefore, both the unobservable ocean current and the observable but unpredictable wind strongly affect navigation in real settings. These disturbances are alleviated by the following SPMPC system, which predicts the vessel dynamics over multiple steps with consideration of environmental uncertainties.

| SPMPC system
The SPMPC system for the autonomous boat is designed based on the RL process introduced in Section 2.4. Following Table 2, the state $x$ and target $y$ of the GP model are built from the observed boat and wind states. Building upon our previous work (Cui et al., 2019), $x$ and $y$ are extended with the real engine speed and rudder angle to further consider the delay between the control signals and the actual status of the boat's hardware. In the multistep prediction in Equations (6) and (10), we assume the wind states do not change during prediction by fixing $v_{rw}$ and $\psi_{rw}$ from the first step, since the wind is difficult to predict in a real ocean environment.

| Bias compensation
Following the ideal situation described in Algorithm 1, at each time step $t$ SPMPC first observes the state $x_t$, then optimizes the control sequence $u_t^*, \ldots, u_{t+H-1}^*$ and applies $u_t^*$ to the system before observing the target $y_t$. However, in the real-world implementation SPMPC optimizes the control sequence while continuously sending $u_{t-1}^*$ to the system. The observed state $x_t$ is therefore biased by $u_{t-1}^*$ during the optimization, which may worsen the controller's performance, especially when the optimization time is lengthy.

Figure 5: The architecture of the autonomous boat driving system. The algorithm runs in the SPMPC system, while the USV system reads data from sensors and sends control signals to the boat. Three plugins are implemented between the two systems for improved control capabilities. Green and blue arrows represent flows of input state and output control-signal data, respectively.
In our previous work, the biased position was estimated by dead reckoning with the current velocity and direction held fixed:

$$\hat{p}_t = p_t + \Delta t \, v_s \, (\cos \psi_s, \sin \psi_s)^\top, \quad (12)$$

where $\Delta t$ is the time spent executing the previous control signal $u_{t-1}^*$. The state with the biased position is then used as the input of the multistep optimization.
Although this model could improve the control performance according to the results of our previous work, it remains limited, since the boat's velocity $v_s$ and direction $\psi_s$ are fixed in Equation (12); that is, neither acceleration nor turning can be predicted by such a bias compensation.
In this study we update the bias compensation by employing the GP model learned in SPMPC to predict the biased state via an additional one-step GP prediction, $\hat{x}_t = \mathbb{E}[f(x_t, u_{t-1}^*)]$, which naturally considers the full dynamics of the boat, including position, velocity, and direction. Note that this is a normal GP regression without input uncertainty, since only a one-step prediction is required. Although in the early stages of SPMPC the initialized GP model may have large errors compared with the model in Equation (12), we postulate that the GP model iteratively captures the boat's dynamics through the RL process and eventually gives a reliable estimate of the biased state.
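As a sketch, the updated bias compensation is a single call to the learned GP (placeholder names; the mean prediction suffices because no input uncertainty is involved):

```python
import numpy as np

def compensate_bias(gp_model, x_t, u_prev):
    """Predict the biased state x^_t: while the optimizer runs, the boat keeps
    executing u*_{t-1}, so the multistep optimization is fed the GP's
    one-step-ahead prediction instead of the stale observation x_t."""
    mean, _ = gp_model.predict(np.r_[x_t, u_prev])  # plain GP regression
    return mean
```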

| Parallelized communications node
The role of the communication node in Figure 4 is to transfer the states read from the sensors and the control signals optimized by SPMPC. It has been improved with a parallel computation structure to further improve overall control capabilities. To reduce $\Delta t$, the communication and the bias compensation/optimization are processed on different CPU cores. As shown in Figure 6b, at time step $t$, CPU 1 is set to receive the state $x_t$ and send the control signal $u_{t-1}^*$. Unlike the previous structure, which directly uses $x_t$ in the bias compensation and optimization of the current step, $x_t$ is stored for step $t+1$. In parallel, CPU 2 predicts the biased state $\hat{x}_t = \mathbb{E}[f(x_{t-1}, u_{t-1}^*)]$ and searches for the optimal control signal $u_t^*$, where $x_{t-1}$ and $u_{t-1}^*$ were stored at step $t-1$ (we define $x_0$ as the initial state and $u_0^*$ as a zero vector).
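A minimal threaded sketch of this pipeline is shown below, with queues standing in for the inter-core communication; `io` (the sensor/actuator interface) and `optimize_control` (e.g., `mpc_step` above, applied to the compensated state) are hypothetical placeholders:

```python
import threading
import queue
import numpy as np

def run_parallel_node(io, gp_model, optimize_control, x0, u0):
    """Worker 1 exchanges data with the boat every step while worker 2
    predicts the biased state from the stored (x_{t-1}, u*_{t-1}) and
    optimizes u*_t, so communication never waits on optimization."""
    state_q, control_q = queue.Queue(maxsize=1), queue.Queue(maxsize=1)
    state_q.put(x0)      # x_0: initial state
    control_q.put(u0)    # u*_0: zero vector

    def comm_worker():   # role of CPU 1
        while True:
            x_t = io.read_state()             # receive x_t
            io.send_control(control_q.get())  # send u*_{t-1}
            state_q.put(x_t)                  # store x_t for step t+1

    def optim_worker():  # role of CPU 2
        u_prev = u0
        while True:
            x_prev = state_q.get()            # x_{t-1} stored last step
            x_hat, _ = gp_model.predict(np.r_[x_prev, u_prev])  # biased state
            u_prev = optimize_control(x_hat)  # search for u*_t
            control_q.put(u_prev)

    for worker in (comm_worker, optim_worker):
        threading.Thread(target=worker, daemon=True).start()
```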

| Quadrant-based action search rule
According to the results of the real-world target reaching task in our previous work (Cui et al., 2019), holding the boat's position near the target is more challenging than reaching the target, because a control signal optimized for the reaching task can be excessive once the boat is already close to the target. The experimental settings are listed in Table 3.
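The precise rule is defined in the article's Section 3.5, which is not fully reproduced in this excerpt; purely as an illustration of the idea of gating the search range by the target's quadrant in the boat frame, one plausible reading is:

```python
import numpy as np

def quadrant_throttle_bounds(p_boat, psi_boat, p_target, full=(-1.0, 1.0)):
    """Hypothetical illustration (not the article's exact rule): restrict the
    throttle search range depending on whether the target lies ahead of or
    behind the boat, so optimization cannot pick excessive opposing thrust."""
    d = p_target - p_boat
    heading = np.array([np.cos(psi_boat), np.sin(psi_boat)])
    ahead = float(heading @ d) > 0.0
    return (0.0, full[1]) if ahead else (full[0], 0.0)
```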
Training was conducted at 11:00, and testing followed at around 14:00. Note that although the training and testing areas are not far from each other, their local disturbances (wind, current, etc.) were very different according to the weather information shown in Figure 9. These conditions can therefore be used to evaluate the generalization ability of SPMPC.

| Real experiment results
In this section, we present the results of the real experiments. Two experiments were conducted. The first investigated the performance of the proposed SPMPC system with all plugins (Section 4.2.1). The second investigated whether the additional states of the engine and rudder contribute to better performance (Section 4.2.2).

| Evaluation of SPMPC system with additional states
The second experiment was conducted on July 30, 2019 to investigate whether adding the real engine speed ($\eta_r$) and real rudder angle ($\delta_r$) to the state (and target) of the GP model improves control performance. Without considering variance, SPMPC could not predict the effect of wind correctly and, as a result, preferred to further accelerate the boat with 25% throttle against the wind. On the other hand, with consideration of variance, SPMPC decided to slow down the boat and utilized the wind to keep its position.

| Real experiment summary
In this section, the proposed autonomous boat driving system was implemented on a real boat and tested in an autopilot task: position holding in an open ocean environment. The first experiment demonstrated the learning capability and sample efficiency of the SPMPC system with bias compensation, a parallelized communication node, and the quadrant-based action search rule. The second experiment showed the positive effect of the real engine speed and rudder angle on control performance. Finally, a study of SPMPC's behavior in a test trajectory lasting more than 8 min indicated that the proposed system is capable of learning a suitable controller against disturbances with good sample efficiency.

| SIMULATION EXPERIMENT
In this section, simulation results of SPMPC with different settings are presented as a complement to the experimental results of the previous section.

| Simulation experiment setup
The simulator used in this section was jointly developed by NAIST and Furuno Electric Co., Ltd. It approximates the boat dynamics in the ocean, with disturbances including wind and current, based on expert knowledge and real driving data. A position holding task similar to the real experiment was conducted. The system and experimental settings follow Sections 3 and 4, respectively. In simulation, the velocity and direction of both wind and current are parameters that change between steps. One time step of SPMPC in this simulation is about 2 s. See Appendix A.2 for more details.

| Convergence test
The first simulation experiment investigates the RL behaviors of SPMPC. We also investigated the effect of $N_{trial}$ and $L_{rollout}$ in the RL process (Figure 19) using the setting "SPMPC with rule, 500 initial samples." These results indicate that a suitable balance between $N_{trial}$ and $L_{rollout}$ is necessary for SPMPC to perform well: a limited $N_{trial}$ results in poor generalization due to less diversity in the experienced environmental settings, while an insufficient $L_{rollout}$ can prevent SPMPC from learning boat dynamics subject to longer-term disturbances.
All results in this subsection empirically confirm that SPMPC's RL behavior of updating the model over numerous iterations can successfully learn the position holding task in simulation and outperform the baseline. Furthermore, the proposed quadrant-based action search rule greatly improved the control performance of SPMPC.

| Control performance test
The second test evaluates whether the prediction length and the uncertainties of the predicted state contribute to better control results.

| Rule and cost function test
In Section 3.5, a quadrant-based action search rule was proposed to limit the range of control signals for the position holding task. One alternative solution to limit excessive control signals in MPC is to directly add a penalty term on the control signal to Equation (14):

$$l_s = \frac{1}{2}\, \| p_s - p_{target} \|_2^2 + \alpha\, \| u_s \|_2^2, \quad (15)$$

where $p_s$ and $u_s$ are the position and control signal at the $s$th step, $p_{target}$ is the target position, and $\alpha$ is the weight of the penalty term.
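As reconstructed above, Equation (15) is straightforward to implement; a short sketch (names are ours):

```python
import numpy as np

def penalized_cost(p_s, u_s, p_target, alpha):
    """One-step cost of Equation (15): squared distance to the target
    plus an alpha-weighted squared-control penalty."""
    return 0.5 * np.sum((p_s - p_target) ** 2) + alpha * np.sum(u_s ** 2)
```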
The third test evaluates the control performance of the quadrant-based action search rule compared with the penalty term in the cost function defined in Equation (15). Five configurations of SPMPC were tested: using the cost function of Equation (15) with $\alpha = 1, 10, 100$, and using the cost function of Equation (14) with and without the rule. To respect the boat's range of control inputs, only the throttle signal, with range [−100%, 100%], is limited in Equation (14). Following the settings in Section 5.2.2, we set $H = 3$ steps with variance enabled in the level 1 environment.
The results for the average position offset are shown in Figure 21, where SPMPC with the proposed action search rule outperformed all other configurations. These results indicate that the proposed quadrant-based action search rule is a suitable approach for SPMPC to limit excessive control signals in the position-keeping task: it dynamically determines a search range of controls while fairly considering all candidates in the optimization. In comparison, adding a corresponding penalty term to the cost function could not balance control capability against excessive control signals. The static parameter $\alpha$ in the penalty term encouraged lower control signals that are usually insufficient to counteract the uncertain disturbances, while expanding it dynamically is difficult.

| Sparse GP test
In this test, the effect of the sparse GP's parameter (the number of pseudo-inputs) in SPMPC was investigated.

| Target reaching and position holding task
In the last test, we explored the potential of applying SPMPC with the quadrant-based action search rule to a combined target reaching and position holding task, as a first step towards unifying the real-world experiments of our previous work (Cui et al., 2019) and this article. SPMPC with and without the quadrant-based action search rule is compared in Figure 24, and the corresponding test trajectories are shown in Figure 25. These results clearly show that SPMPC with the quadrant-based action search rule is capable of learning this task with high sample efficiency and robustness against the disturbances of a changing ocean environment.

| Simulation experiment summary
In this section, the proposed autonomous boat driving system was evaluated in simulation with different algorithm and environment settings. The results empirically confirmed that: (1) SPMPC iteratively improves both control performance and model prediction capability through the RL process; (2) the multistep prediction with uncertain input plays an important role, especially in environments with large disturbances; (3) the number of sparse GP pseudo-inputs should be selected to balance performance and calculation cost; and (4) SPMPC can be applied to a task combining target reaching and position holding.

| DISCUSSIONS
Since the proposed SPMPC is a general RL approach that iteratively learns a model of the vessel in the ocean environment without requiring prior model knowledge, it can easily be applied to a wide range of USVs with different engines and sensors. In this study, SPMPC was evaluated in a position-keeping task in both real-world experiments and simulation. It is straightforward to extend SPMPC to more complicated scenarios, for example, by extending the cost function in Equation (14) to additionally keep the boat's direction in the position-keeping task, or by introducing the boat's rotational motions into the cost function to improve driving suitability. Furthermore, with additional radar devices, environmental sensors, and other pattern recognition technologies, the proposed method could be a potential solution to other challenging tasks such as collision avoidance and auto-docking.
For future work, in regard to implementation, a real-world task combining goal reaching and position holding will be conducted based on the simulation results introduced in Section 5.2.4. A current sensor will be added to the boat to detect the state of the water current, and the software will be moved to C++ and CUDA for a higher control frequency and better optimization ability. It would also be of interest to directly learn expert driving skills by building a GP model from human demonstrations instead of randomly generated samples, to potentially achieve more human-like autonomous driving.
Algorithmically, since a separate GP model with an SE kernel is trained for each target dimension in this study, both multidimensional-output GPs (Álvarez, Luengo, Titsias, & Lawrence, 2010) and advanced kernel approximation approaches such as Fastfood (Le, Sarlós, & Smola, 2013) could improve computational efficiency.