An intelligent course keeping active disturbance rejection controller based on double deep Q‐network for towing system of unpowered cylindrical drilling platform

Towing is a widely used mode of transportation in offshore engineering, and towing of unpowered platforms is of particular significance. However, the addition of unpowered facilities has increased the difficulty of ship maneuvering. Moreover, the marine environment is complex and changeable, and sudden winds or waves can have unpredictable effects on the towing process. Therefore, it is of great significance to overcome the influence of the harsh marine environment while navigating the towing system following a planned course to a target sea area. To tackle the time‐varying disturbances, a course control method for a towing system of unpowered cylindrical drilling platform is designed based on double deep Q‐network (DQN) optimized linear active disturbance rejection control (LADRC). To be specific, to tackle the difficulty of LADRC tuning, double DQN is applied to select the best parameters of the LADRC at any time according to the states of the system, without relying on the specific information of the model and the controller. The course control performance of the towing system is evaluated in a simulation environment under various disturbances. Moreover, the Monte Carlo experiment is used to test the robustness of the controller when the ship's mass changes and the robustness of the proposed method is verified by testing with various heading angles. The results show that the LADRC with adaptive parameters optimized by double DQN performs well under external interference and inherent uncertainty, and compared with the traditional LADRC, the proposed method has better course control effects.


INTRODUCTION
Towing serves two main applications. One is to tow non-self-propelled ships involved in a maritime accident to a safe area so that other ships are not blocked. The other is to tow large structures such as offshore drilling platforms and floating docks, supporting marine resource extraction tasks. 1 In other words, the unpowered facility requires the assistance of tugboats to move to the target area. For a long time, towing operations have mainly relied on the experience of the captain and pilot, which makes it challenging to set and control the towing course accurately and may increase the risk of towing navigation. For example, when ships are affected by special weather such as sea wind, waves, or fog, human experience has severe limitations. Therefore, studying the course control of the towing system is of great significance for ensuring the safety of towing operations, reducing maritime towing accidents, and preventing marine pollution. 2 Currently, research on ship maneuverability mainly focuses on single ships, while studies on the maneuverability of towing systems are relatively few. Moreover, the main difficulties in maneuverability control of the towing system are that the dynamic model is very complex, the navigation environment is difficult to predict and overcome, and actuators such as the rudder saturate, to name a few. As the dynamic model of the towing system is highly complex, related research has mainly focused on simplified models, 3 where the "separate" modeling method proposed by the maneuvering modeling group (MMG) 4-6 is commonly used. For example, Bo Woo Nam 7 utilized a three degrees-of-freedom (DOF) maneuvering mathematical model to describe the nonlinear dynamics of a towed vessel in a calm sea, covering the surge, sway, and yaw motions of the ship, and the simulation results were directly compared with model test data. Marco et al.
additionally considered the rolling motion and established a four DOF towing system model. 8 Hongbo Sun et al. established the six DOF motion equations of tugs and barges and proposed a towline-barge coupled motion model. 9 This article mainly considers the motion control of the towing system in a plane; therefore, a three DOF dynamic model is sufficient.
It is well known that a controller needs to overcome adverse environmental effects, such as wind or waves, during the navigation of a ship. As far as we know, there are few studies on the motion control of towing systems, where the PID method is still widely used today. For example, Liang et al. used a PD controller to control the tug rudder, and their research showed that changing the PD parameters can effectively reduce the oscillation amplitude of the tug's heading angle. 10 Pang et al. applied a fuzzy-PID controller to the towing system, thus achieving the planned path. 11 However, PID can only produce control actions after the disturbance has already affected the system. Therefore, Han 12,13 proposed active disturbance rejection control (ADRC), which can estimate unknown disturbances and eliminate them. However, ADRC has many parameters that need to be adjusted, which hinders engineering applications. Later, Gao proposed linear ADRC (LADRC) on the basis of ADRC, which significantly simplifies the design of the controller. 14,15 LADRC does not depend on accurate model information: it uses a linear extended state observer (LESO) to estimate all unknown disturbances of the system and a PD control combination to obtain the control input, thereby suppressing the influence of disturbances acting on the system. Compared with other disturbance attenuation control methods, such as H∞ control and stochastic control, ADRC has more potential for industrial applications, with the advantages of being model-free and having a rather simple design process. In terms of motion control of the towing system, Tao et al. 16 used LADRC to achieve straight-line path tracking control, and their simulation results showed that the control performance of LADRC is better than that of PID. Besides, LADRC has also shown good empirical results in other fields, such as hydraulic systems, 17 unmanned aerial vehicles (UAVs), 18 and power systems. 19
Parameter adjustment has always been a non-negligible part of the controller design process. Compared with ADRC, LADRC is significantly simplified, but parameter adjustment is still challenging. Generally speaking, parameter adjustment methods fall into two categories. One is to use heuristic algorithms, such as the genetic algorithm (GA), 20 particle swarm optimization (PSO), 21 and the whale optimization algorithm (WOA), 22 to optimize a set of fixed parameters. The other is to adjust the controller parameters in real time, for example, using fuzzy adaptive control. 23 However, the former can only obtain one set of parameters for a specific situation; when the environment suddenly changes, this set of parameters may no longer achieve a good control effect. Moreover, the second type of method relies on model information or human experience. To make up for the deficiencies of these methods, this article uses reinforcement learning (RL) algorithms for controller parameter tuning.
RL is an algorithm for solving sequential decision problems, which selects the optimal action according to the reward value obtained from the interaction between the agent and the environment, thereby completing the final task. Due to this feature, RL has shown good results in robot path planning, 24 multiagent systems, 25 and computer games. 26 However, the application of RL in the traditional control field is still at the research stage. As a matter of fact, the adjustment of controller parameters can be regarded as a process of finding an optimal strategy; therefore, it is theoretically feasible for a reinforcement learning algorithm to optimize the controller parameters. At present, scholars have used the Q learning algorithm of RL to optimize the parameters of LADRC. For example, Chen et al. 27 used LADRC to control the heading angle of a ship without considering dynamics and used Q learning to optimize the controller parameters. Zheng et al. 28 applied Q-learning-optimized LADRC to the power system and obtained a good control effect. However, in the optimization process of Q learning, the system state must be discretized, which may generate a large state space and is inconvenient for the storage and computation of the Q table. Based on Q learning, the deep Q-network (DQN) was developed, which is no longer limited by state discretization. DQN combines Q learning with deep learning and uses a deep neural network to map the state-action value function of RL, 29 where a replay buffer is added to train the deep neural network and a target network is designed to calculate the loss function. However, DQN tends to overestimate the action-state value function during training, which affects the final decision and prevents obtaining the optimal strategy. To solve this problem, many scholars have put forward improved algorithms such as double DQN, 30 dueling DQN, 31 and rainbow DQN. 32 To the best of our knowledge, this article is the first to employ double DQN to tune the parameters of LADRC.
Aiming at the course control of a towing system with an unpowered cylindrical drilling platform, we first build a three DOF motion model, then apply the double DQN algorithm to adjust the LADRC parameters in real time for course control, and finally carry out course control simulations of the towing system. The contributions of this article can be summarized as follows: (1) A three DOF motion equation for a towing system of an unpowered cylindrical drilling platform is established. The LADRC controller is designed to realize course control despite the nonlinear characteristics and large inertia of the system as well as the disturbances in the marine environment.
(2) Double DQN is applied to optimize adaptive parameters of LADRC, in which a multilayer fully connected neural network is designed, and the error-based reward function is defined according to the system model.
(3) The robustness of the proposed double DQN optimized LADRC is verified.

The remainder of this article is organized as follows. Section 2 describes the mathematical model of the towing system with an unpowered cylindrical drilling platform under environmental disturbances. Section 3 designs the double DQN optimized LADRC-based course controller of the towing system. Simulation results are reported and discussed in Section 4, and Section 5 concludes the article.

TOWING SYSTEM MODELING
The towing system studied in this article consists of a tugboat and a cylindrical drilling platform without self-propelled capability. The system is modeled based on the 3-DOF MMG model, and the motion of the tugboat and the drilling platform is coupled only through the towline, which satisfies the catenary model. Considering the ship's hydrodynamic forces, propeller thrust, rudder force, towing force, and their corresponding moments, we model the towing system as follows.

Motion mathematical model of towing system
Since the heave, pitch, and roll motions of the ship have a relatively small influence on the course, they are usually ignored when designing the ship course controller. The plane motion coordinate system of the towing system is shown in Figure 1. As we can see, OXY is the inertial coordinate system, O1X1Y1 is the towing coordinate system, and O2X2Y2 is the towed coordinate system. The movement of the tugboat and the unpowered platform is described by the forward speeds u1, u2, lateral drift speeds v1, v2, and yaw angular velocities r1, r2 in the body-fixed frames. Assuming each ship is a rigid body, the position coordinates and heading angle of ship i (i = 1 for the tugboat, i = 2 for the platform) evolve according to the standard kinematic relations:

dx_i/dt = u_i cos ψ_i − v_i sin ψ_i, dy_i/dt = u_i sin ψ_i + v_i cos ψ_i, dψ_i/dt = r_i, (1)

where ψ_i is the heading angle of the ship. It can be seen that obtaining the position and heading angle of the tug and the drilling platform requires knowing their respective speeds u_i, v_i and yaw rates r_i. According to the 3-DOF MMG model, the motion equations of the tugboat and the cylindrical drilling platform in the towing system can be expressed as:

(m_i + m_xi) du_i/dt − (m_i + m_yi) v_i r_i = X_Hi + X_Pi + X_Ri + X_Ti + X_Ei,
(m_i + m_yi) dv_i/dt + (m_i + m_xi) u_i r_i = Y_Hi + Y_Pi + Y_Ri + Y_Ti + Y_Ei, (2)
(I_zzi + J_zzi) dr_i/dt = N_Hi + N_Pi + N_Ri + N_Ti + N_Ei,

where m_i is the mass of the corresponding ship, and the ship's additional masses m_xi, m_yi and additional moment of inertia J_zzi are given in Equation (4). In addition, X and Y on the right side of the equations represent the forces acting along the x and y axes, respectively, and N represents the moment acting on the ship. The subscripts H, P, R, T, and E denote hull hydrodynamic force, propeller force, rudder force, towline pulling force, and external environmental forces, respectively. It should be noted that the towed platform is driven only by the pulling force of the towline; thus, X_P2 = X_R2 = 0, Y_P2 = Y_R2 = 0, and N_P2 = N_R2 = 0. The additional inertial masses and the additional moment of inertia in Equation (4) are estimated by empirical formulas in terms of C_bi and d, the block coefficient and average draft of the ship, and B and L, the width and length of the ship.
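The kinematic relations of Equation (1) can be sketched numerically. The following minimal Python snippet (an illustration with a simple Euler step, not the article's simulation code) integrates the body-fixed velocities into earth-fixed position and heading:

```python
import math

def kinematics(x, y, psi, u, v, r, dt):
    """One Euler step of the standard 3-DOF kinematics:
    dx/dt = u*cos(psi) - v*sin(psi),
    dy/dt = u*sin(psi) + v*cos(psi),
    dpsi/dt = r."""
    x += (u * math.cos(psi) - v * math.sin(psi)) * dt
    y += (u * math.sin(psi) + v * math.cos(psi)) * dt
    psi += r * dt
    return x, y, psi

# A ship moving straight ahead at 6 m/s with zero heading advances along the x-axis.
x, y, psi = 0.0, 0.0, 0.0
for _ in range(10):
    x, y, psi = kinematics(x, y, psi, u=6.0, v=0.0, r=0.0, dt=1.0)
```

With nonzero yaw rate r, the heading ψ rotates the body-fixed velocities, producing a curved earth-fixed trajectory.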

Force and moment of the towing system

Hull hydrodynamics
In this article, the Kijima model 33 is used to estimate the viscous hydrodynamic forces and moments, which can be expressed by:

X_Hi = X_i(u) + X_vvi v_i^2 + X_vri v_i r_i + X_rri r_i^2 + X_vvvvi v_i^4,
Y_Hi = Y_vi v_i + Y_ri r_i + Y_|v|vi |v_i| v_i + Y_|r|ri |r_i| r_i + Y_vvri v_i^2 r_i + Y_vrri v_i r_i^2,
N_Hi = N_vi v_i + N_ri r_i + N_|v|vi |v_i| v_i + N_|r|ri |r_i| r_i + N_vvri v_i^2 r_i + N_vrri v_i r_i^2,

where X_i(u) is the hull resistance of the ship during a straight-ahead voyage; X_vvi, X_vri, X_rri, X_vvvvi are the longitudinal nonlinear hydrodynamic derivatives; Y_vi, Y_ri, Y_|v|vi, Y_|r|ri, Y_vvri, Y_vrri are the lateral linear and nonlinear hydrodynamic derivatives; and N_vi, N_ri, N_|v|vi, N_|r|ri, N_vvri, N_vrri are the rotational linear and nonlinear hydrodynamic derivatives.

Hydrodynamic estimation of tugboat propeller and rudder
In the MMG model, 34 the longitudinal thrust X_P1 generated by the propeller is calculated as:

X_P1 = (1 − t_p) ρ n^2 D_p^4 K_T(J_P),

where t_p refers to the propeller thrust deduction factor; ρ, n, and D_p are the sea water density, the rotational speed, and the diameter of the propeller; and K_T and J_P represent the propeller thrust coefficient and the advance coefficient, respectively. The tugboat rudder forces can be estimated as 34 :

X_R1 = −(1 − t_R) F_N sin δ,
Y_R1 = −(1 + a_H) F_N cos δ,
N_R1 = −(x_R + a_H x_H) F_N cos δ,

where δ is the actual rudder angle of the tugboat; t_R represents the deduction factor of the rudder resistance; a_H denotes the coefficient of the influence of steering on the lateral force of the hull; x_R and x_H denote the longitudinal coordinate of the rudder center and the distance from the center of the hull lateral force to the center of gravity of the ship, respectively; and F_N is the rudder normal force.

Modeling of the towline
In this article, the catenary model is used to establish the towline model, and the towline force is decomposed into the motion coordinate systems of the tugboat and the towed cylindrical drilling platform, where T_H and R_t are the horizontal towing cable tension and the horizontal resistance of the towing cable at the towing point, respectively; x_pi denotes the longitudinal distance from the towing point to the center of gravity of the respective ship; and the towline angle γ_i can be derived from Figure 1.

Disturbance dynamic model
The marine environment dramatically influences the maneuverability of ships, and many shipwrecks occur in harsh seas. Under these circumstances, overcoming the impact of wind and waves is of great significance. Since the data required for accurate wind or wave models are difficult to obtain, disturbances are generally modeled based on wind direction angle (wave direction) or wind speed (wave speed). In order to verify the anti-interference performance of the proposed method through simulation, this article models winds and waves as follows.
Wave: Assuming the relative flow speed is U_c and the relative flow direction angle is ψ_c, the relative speed of ship motion is:

u_ri = u_i − U_c cos(ψ_c − ψ_i), v_ri = v_i − U_c sin(ψ_c − ψ_i).

In this case, Equations (1) and (2) can be evaluated with u_i and v_i replaced by the relative speeds u_ri and v_ri. Wind: For wind with a constant direction and speed, the model is generally established as:

X_Ewi = (1/2) ρ_a C_wx A_f (u_rwi^2 + v_rwi^2),
Y_Ewi = (1/2) ρ_a C_wy A_s (u_rwi^2 + v_rwi^2),
N_Ewi = (1/2) ρ_a C_wn A_s L (u_rwi^2 + v_rwi^2),

where ρ_a is the air density; A_f and A_s are the frontal projected area and the side projected area above the ship's waterline; u_rwi and v_rwi denote the relative wind speeds; and C_wx, C_wy, and C_wn are the wind coefficients on the x-axis and y-axis and the wind moment coefficient around the z-axis, respectively. It is worth noting that there are mismatched disturbances and uncertainties in the towing system, such as unmodeled dynamics, external winds and waves, and parameter perturbations. Therefore, it is very important to eliminate or suppress the influence of these disturbances and uncertainties on the system. Moreover, since the drilling platform itself has no driving force, controlling the towing system is all the more challenging.

DOUBLE DQN OPTIMIZED LADRC
The LADRC has significant advantages in suppressing the influence of disturbances on the system: it can estimate and compensate for disturbances without knowing the specific model of the system. Therefore, this article applies the LADRC to the course control of the cylindrical drilling platform's towing system. The purpose of the controller is to obtain a suitable rudder angle δ_1 so that the heading angles ψ_1 and ψ_2 of the towing system follow the planned course.

LADRC controller design
Because only the tugboat is subject to rudder force in the towing system, the controller is designed only for the tugboat. According to Equations (1) and (2), the yaw dynamics of the tugboat can be derived as a second-order relation between the heading angle ψ_1 and the rudder angle δ_1, which can be deformed into:

d^2ψ_1/dt^2 = f + b_0 δ_1, (13)

where f can be regarded as the total disturbance containing both the model dynamics and external disturbances of the system, and b_0 is the nominal input gain. A core idea in LADRC is to add a state to estimate the disturbance; defining the states x_1 = ψ_1, x_2 = dψ_1/dt, and the extended state x_3 = f, the following state space equation can be obtained:

dx_1/dt = x_2, dx_2/dt = x_3 + b_0 δ_1, dx_3/dt = df/dt.

Then the extended state observer equation can be constructed as:

dx̂_1/dt = x̂_2 + β_1 (x_1 − x̂_1),
dx̂_2/dt = x̂_3 + b_0 δ_1 + β_2 (x_1 − x̂_1),
dx̂_3/dt = β_3 (x_1 − x̂_1),

where L = [β_1, β_2, β_3]^T is the observer gain. The value of L has a great influence on the accuracy of the state estimation. Gao 35 converted the adjustment of L into the tuning of the observer pole, that is,

β_1 = 3ω_o, β_2 = 3ω_o^2, β_3 = ω_o^3,

where −ω_o is the observer pole. Another important structure in LADRC is the PD control combination, which eliminates the influence of the estimated disturbance x̂_3. Take δ_1 of Equation (13) as:

δ_1 = (u_0 − x̂_3)/b_0. (18)

Under the premise of f ≈ x̂_3, substituting Equation (18) into Equation (13) gives:

d^2ψ_1/dt^2 ≈ u_0.

Let

u_0 = k_p (ψ_d − ψ_1) − k_d dψ_1/dt,

where ψ_d is the planned value of the heading angle and k_p, k_d are the state feedback control gains. Similarly, the pole placement method can be used to obtain:

k_p = ω_c^2, k_d = 2ω_c,

where −ω_c is the state feedback control pole. To sum up, LADRC uses the LESO to estimate the total disturbance f, then suppresses the disturbance by the PD control combination. The parameters that need to be adjusted in the entire control system are ω_o, ω_c, and b_0.
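The LESO and PD combination above can be sketched in a few lines of Python. This is a minimal illustration on a hypothetical second-order plant y'' = f + b_0·u with a constant disturbance f = −2, not the towing-system model; the bandwidth values are arbitrary:

```python
class LADRC:
    """Minimal second-order LADRC: a linear extended state observer (LESO)
    estimating the total disturbance f as z3, plus a PD combination.
    Bandwidth parameterization: observer poles at -wo, control poles at -wc."""
    def __init__(self, wo, wc, b0, dt):
        self.b1, self.b2, self.b3 = 3*wo, 3*wo**2, wo**3   # observer gains
        self.kp, self.kd = wc**2, 2*wc                     # PD gains
        self.b0, self.dt = b0, dt
        self.z1 = self.z2 = self.z3 = 0.0                  # observer states
        self.u = 0.0

    def step(self, y, ref):
        e = y - self.z1
        # LESO update (Euler discretization); b0*u enters the z2 equation
        self.z1 += self.dt * (self.z2 + self.b1 * e)
        self.z2 += self.dt * (self.z3 + self.b0 * self.u + self.b2 * e)
        self.z3 += self.dt * (self.b3 * e)
        # PD combination with disturbance compensation
        u0 = self.kp * (ref - self.z1) - self.kd * self.z2
        self.u = (u0 - self.z3) / self.b0
        return self.u

# Hypothetical plant y'' = f + b0*u with constant disturbance f = -2.
dt, b0 = 0.01, 1.0
ctrl = LADRC(wo=20.0, wc=5.0, b0=b0, dt=dt)
y = yd = 0.0
for _ in range(3000):            # 30 s of simulated time
    u = ctrl.step(y, ref=1.0)
    ydd = -2.0 + b0 * u          # plant acceleration
    yd += ydd * dt
    y += yd * dt
```

After the transient, z3 converges to the constant disturbance (here −2) and the output settles at the reference, which is exactly the estimate-then-compensate mechanism described above.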

Main principle of double DQN optimization
RL is an intelligent algorithm that experts and scholars have favored in recent years. Its characteristic is that the optimal strategy can be obtained through constant interaction between the agent and the environment, without knowing the specific structure of the environment or the controlled system. Q learning, a classic RL algorithm, has a good optimization effect for systems with discrete actions and discrete states. However, in practice, the state of the system is often continuous, or the number of states is very large. In this case, Q learning shows certain limitations: the dimension of the Q table is too large, resulting in a "dimensional disaster," or the Q table is difficult to converge. DQN is an algorithm developed from the combination of Q learning and deep learning, where the Q table is expressed by a deep neural network so that the system state is no longer required to be discrete. Nevertheless, DQN is prone to overestimation, so this article uses the double DQN algorithm to optimize the LADRC controller parameters. The reward value r is the most essential datum in the interaction process: through it, the value of taking action a in a certain state s can be evaluated, and a numerical basis for the choice of actions is provided. Usually, we use the cumulative reward R_c, which accounts for future reward values, as the reference for final parameter selection. That is:

Q(s, a) = E[R_c | s, a], where R_c = Σ_{t=0}^∞ γ^t r_{t+1},

and γ is the discount factor, which reflects the importance of estimated future reward values. The larger the Q value, the closer the corresponding a is to the optimal value. In Q learning, each state-action pair corresponds to a Q value, but in double DQN, the Q table is replaced by a deep neural network, as shown in Figure 2. Assume that at a certain time the system has n state values forming a group of states and m action values forming a group of actions.
There are a total of j groups of actions. Then, in the Q learning algorithm, the action-state value Q(s_i, a_j) can be obtained from the ith group of states and the jth group of actions. In double DQN, the ith group of states is input to the deep neural network, and j Q values are output, each corresponding to a group of action values. This can be expressed as:

[Q(s_i, a_1; θ), Q(s_i, a_2; θ), …, Q(s_i, a_j; θ)] = F(s_i; θ),

where θ stands for the weights of the neural network. In other words, when the network is well trained, it can output the Q values corresponding to all action values in a certain state; the action values corresponding to the maximum Q value are then the optimal action values to be selected in that state. The schematic diagram of optimizing the parameters of LADRC through double DQN is shown in Figure 3.
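The mapping from a state group to one Q value per action group can be illustrated with a tiny fully connected network. The layer sizes and random weights below are placeholders, not the trained network of Figure 2:

```python
import math
import random

random.seed(0)

def mlp_forward(s, W1, b1, W2, b2):
    """Forward pass of a small fully connected network: state -> one Q value
    per discrete action group (tanh hidden layer, linear output layer)."""
    h = [math.tanh(sum(w * x for w, x in zip(row, s)) + b)
         for row, b in zip(W1, b1)]
    return [sum(w * x for w, x in zip(row, h)) + b
            for row, b in zip(W2, b2)]

n_state, n_hidden, n_actions = 2, 8, 6   # e.g. 3 x 2 candidate parameter pairs
W1 = [[random.uniform(-1, 1) for _ in range(n_state)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
W2 = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_actions)]
b2 = [0.0] * n_actions

s = [0.3, -0.1]                          # [error, error derivative]
q = mlp_forward(s, W1, b1, W2, b2)       # one Q value per action group
best = max(range(n_actions), key=lambda j: q[j])
```

Selecting the index of the maximum output then picks the parameter pair to apply in the current state, as described above.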

Working process of double DQN

As mentioned above, the ultimate goal of double DQN is to train a deep neural network to fit Q values so that the optimal solution can be obtained from the network output. The process diagram of double DQN is shown in Figure 4. There are two deep neural networks with the same structure but different weights: the Q-network is used to estimate the Q value, and the Q̄-network is used to get the Q value of the next moment. It is worth noting that the Q̄-network does not train its weights, which are assigned by the Q-network every T_n steps. That is to say, only the Q-network updates its weights through training, and the training involves the loss function in Equation (24) and the replay memory. The replay memory, also called the experience pool or replay buffer, is used to store the transition data for training the neural network.

F I G U R E 4 Internal relationship structure diagram of double DQN
Loss(θ) = E[(r + γ Q̄(s′, argmax_{a′} Q(s′, a′; θ); θ̄) − Q(s, a; θ))^2], (24)

where r is the instant reward value obtained by performing an action in RL, and the design of r in this article is shown in Equation (28). According to the loss function, the updated weights can be obtained by using the gradient descent method:

θ ← θ − α ∂Loss(θ)/∂θ,

where α denotes the learning rate. During the interaction between the agent and the environment, the selection of action a in the current state s is based on the ε-greedy policy: with probability 1 − ε, choose the action that maximizes the Q value; otherwise, choose an action randomly. Similarly, it can be seen from Equation (24) that when the state at the next moment is s′, the corresponding action a′ is the one that maximizes Q(s′, a′; θ).
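The double DQN target and the ε-greedy policy described above can be sketched as follows; the toy q_net and target_net functions stand in for the trained Q-network and Q̄-network:

```python
import random

def double_dqn_target(r, s_next, q_net, target_net, gamma, done):
    """Double DQN target: the online Q-network selects the argmax action at s',
    while the target network evaluates it, reducing overestimation bias."""
    if done:
        return r
    q_online = q_net(s_next)
    a_star = max(range(len(q_online)), key=lambda a: q_online[a])
    return r + gamma * target_net(s_next)[a_star]

def epsilon_greedy(q_values, eps, rng=random):
    """With probability eps explore randomly; otherwise pick the greedy action."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Toy check: the online net prefers action 1; the target net supplies its value.
q_net = lambda s: [0.2, 0.9, 0.1]
target_net = lambda s: [1.0, 0.5, 2.0]
y = double_dqn_target(r=1.0, s_next=None, q_net=q_net,
                      target_net=target_net, gamma=0.9, done=False)
a = epsilon_greedy(q_net(None), eps=0.0)
```

Note that plain DQN would use max(target_net(s')) = 2.0 here; decoupling selection from evaluation is exactly what curbs the overestimation mentioned in the text.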

Double DQN based LADRC parameters optimization
In this article, double DQN is used to solve the problem of adaptive parameter tuning of LADRC in the presence of uncertainties. Therefore, the towing system, including the LADRC, is regarded as the environment. After interacting with the environment, the agent obtains the state values and reward values, so that the deep neural network of double DQN can be trained. Before that, we need to preprocess the states and actions for the environment.

Definition of states and discretization of actions
The state is the direct characteristic expression of the towing system, and it should reflect the movement trend of the system. In this article, the state at time k is defined by the heading error and the derivative of the error:

s_1(k) = ψ_r(k) − ψ_1(k), s_2(k) = ds_1(k)/dt. (27)

As can be seen from Equation (27), at each sampling moment, a state vector s = [s_1, s_2] is generated. Then the number of neurons in the input layer of the neural network in Figure 2 can be determined as two. Based on the state, the reward function can be designed. The reward function should encourage the agent to adopt the optimal parameters, which are reflected in the system's state: the closer the system is to the target state, the greater the reward value should be. Regarding the state, we can derive the following rules: (1) the smaller |s_1| is, the closer the system is to the planned course; (2) when s_1 ⋅ s_2 ≤ 0, the towing system is approaching the target heading angle; otherwise, it is moving away. Therefore, the instant reward function of Equation (28) is designed so that the closer the heading angle error is to 0, the greater the reward. Moreover, to avoid excessive overshoot in the system, a sign function of s_1 ⋅ s_2 is used to generate an additional reward or punishment signal.
As mentioned earlier, the parameters ω_o, ω_c, and b_0 that the controller needs to optimize are the action values in double DQN. Studies have pointed out that when b_0 is within a certain range, 36 the stability and convergence of the LESO can be guaranteed. Therefore, ω_o and ω_c are the parameters mainly adjusted in this article. In addition, double DQN can only handle discrete action spaces, so ω_o and ω_c are discretized into n_1 and n_2 candidate values with step sizes h_1 and h_2, respectively. Therefore, there are a total of n_1 × n_2 action vectors forming the final action space, and the number of neurons in the output layer of the neural network in Figure 2 can be determined as n_1 × n_2.
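An illustrative reward and action-space construction consistent with the rules above might look as follows; the coefficients k1, k2 and the candidate parameter grids are assumptions for the sketch, not the article's tuned values:

```python
import itertools

def reward(s1, s2, k1=1.0, k2=0.5):
    """Illustrative reward (coefficients k1, k2 are assumptions): larger when
    the heading error s1 is small, with a sign-based bonus when the system
    approaches the target (s1*s2 <= 0) and a penalty when it moves away."""
    approach = 1.0 if s1 * s2 <= 0 else -1.0
    return -k1 * abs(s1) + k2 * approach

# Discretize the two LADRC parameters into a grid of n1*n2 candidate actions
# (the grid values below are hypothetical placeholders).
wo_grid = [0.1, 0.2, 0.3]   # n1 candidate observer poles
wc_grid = [0.05, 0.1]       # n2 candidate control poles
action_space = list(itertools.product(wo_grid, wc_grid))
```

Each element of action_space is one (ω_o, ω_c) pair, so the Q-network's output layer has len(action_space) = n1*n2 neurons, matching the text.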

Parameter tuning
To help the reader understand the whole process more clearly, the flowchart is given in Figure 5. As can be seen in Figure 5, the whole process is divided into three stages. The first is the observation period, which mainly collects the required data; the second is the training period, which trains the neural network weights on these data; and the last is the online phase, in which the well-trained neural network finds the optimal action values at each moment.
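The three phases can be sketched as a skeleton loop; env_step, choose_action, and train_step below are hypothetical stand-ins for the towing-system environment, the greedy policy, and the double DQN update, respectively:

```python
import random
from collections import deque

random.seed(1)

def run_phases(env_step, choose_action, train_step, n_actions,
               observe_steps=50, train_steps=100, online_steps=20):
    """Sketch of the three stages: observation (random actions fill the replay
    buffer), training (sample minibatches, update the Q-network), and online
    operation (act greedily with the trained network)."""
    buffer = deque(maxlen=1000)
    s = (0.0, 0.0)
    # 1) observation period: gather transitions with random actions
    for _ in range(observe_steps):
        a = random.randrange(n_actions)
        s2, r = env_step(s, a)
        buffer.append((s, a, r, s2))
        s = s2
    # 2) training period: learn from sampled minibatches
    for _ in range(train_steps):
        batch = random.sample(buffer, min(8, len(buffer)))
        train_step(batch)
    # 3) online phase: greedy actions only, no exploration
    for _ in range(online_steps):
        s, _ = env_step(s, choose_action(s))
    return len(buffer), s

# Toy environment: the error state decays; reward favors a small error.
env = lambda s, a: ((s[0] * 0.9, s[1] * 0.9), -abs(s[0]))
n, final = run_phases(env, choose_action=lambda s: 0,
                      train_step=lambda batch: None, n_actions=6)
```

In the article's setting, env_step would run one control interval of the towing system with the selected (ω_o, ω_c) pair and return the state of Equation (27) plus the reward of Equation (28).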
In general, the double DQN is similar to an agent with self-learning ability. It learns some rules through data, so that it can take optimal actions in different states. Therefore, the proposed method is of great significance for the towing system with a changeable and unknown environment.

SIMULATION EXAMPLES
In this section, simulation experiments on the towing system with a specific cylindrical drilling platform are carried out, where the towed platform has no self-propelled capability. The relevant parameters of the towing system are shown in Tables 1-3. The course of the towing system is driven by the tugboat's rudder angle; therefore, the purpose of the control is to obtain a suitable rudder angle δ_1 so that the course of the towing system follows the prescribed value ψ_d, which is softened according to Equation (27). In addition, for safety reasons, the actual rudder angle is subject to saturation restrictions, and ψ_r denotes the planned heading angle after softening.

Training of double DQN
The double DQN used in this article for adaptive adjustment is initialized from manually obtained parameters. For the towing system, the action space is given in the format of Equation (31).

F I G U R E 6 Course control of towing system with wind disturbances
And b 0 is fixed to 0.00009. In addition, the parameters that need to be customized in the double DQN algorithm are given in Table 4.
The following shows the control effects of the proposed method for the towing system under wind and wave disturbances, where the initial speeds of the ships are taken as u_i = 6 m/s, v_i = 0 m/s, and r_i = 0 rad/s. Moreover, the initial heading angles of the tugboat and the cylindrical drilling platform are both set to 0°.

Course control of towing system with wind disturbances
It can be seen from Figure 6 that both the tugboat and the cylindrical drilling platform can finally reach the planned heading angle. On the one hand, compared with the LADRC with fixed parameters, the proposed double DQN-LADRC reaches the set value with smaller overshoot and undershoot and a shorter settling time; this conclusion is also verified in Table 5. On the other hand, Figure 7 gives the adaptive parameters obtained by double DQN, which are selected according to the system states defined in Equation (27). As for the wind disturbance, the partially enlarged view in Figure 6 shows the response curves of the towing system when it is subjected to wind disturbance, in which the tuned parameters do not change because the state values change little. It can also be seen that the system has a long response time, which is affected by the speed of the ship on the one hand and determined by the characteristics of the towing system itself on the other. Actually, without the drilling platform, the tugboat alone can quickly stabilize at the planned value; for the entire towing system with the unpowered cylindrical drilling platform, if the tugboat stabilized at the planned value too quickly, the drilling platform would be difficult to stabilize.

Course control of towing system with water flow disturbances
In this section, simulation experiments on the towing system under uniform water flow disturbances are carried out. Assume that the towing system is affected by water flow with a direction of 45° and speeds of 4 and 8 m/s, respectively, within the period of 10,000-11,000 s. The control results and parameters are illustrated in Figures 8 and 9. It can be seen from Figure 8 that the greater the water flow velocity, the greater the disturbance to the towing system and the more severe the changes in navigational state. In addition, under the influence of larger disturbances, the agent adjusts the parameters according to the state values in reinforcement learning, as shown in Figure 9. Comparing Figures 8 and 9, we can observe that in the case of a flow disturbance of 8 m/s, the actual heading angle deviates greatly from the set value after 10,000 s; at this time, the parameters ω_o and ω_c are adjusted in real time. Table 6 reports the performance indices, including overshoot, undershoot, settling time T_s, and the integral of absolute error (IAE). The above two experiments show that, on the one hand, the proposed double DQN-LADRC overcomes the limitation that optimal results are difficult to obtain by manually adjusting the parameters, since it can find the optimal parameters through the reward function; its superiority is proved by comparison with the control effect of the conventional LADRC. On the other hand, the proposed method can adjust the parameter values adaptively according to the defined system states. Generally, a heuristic algorithm can only obtain one set of optimal parameters under certain conditions; when the operating conditions change, that set of parameters cannot achieve the desired effect. The method proposed in this article only needs to randomly initialize the system state when training the agent, so that the agent can deal with more situations.

Robustness test
As ships are a principal means of marine transportation, their mass will inevitably undergo certain changes. Robustness is therefore an important indicator for evaluating the performance of the proposed method.

System model parameter uncertainty
Taking the towing system with wind disturbances as an example, we keep the controller parameters shown in Figure 7 unchanged and randomly vary the masses of the tug and the drilling platform simultaneously within ±30% of their nominal values, repeated 50 times. The simulation results are shown in Figure 10. It can be seen that the heading angles of the tugboat and the cylindrical drilling platform can be stabilized at the planned value within this range, which demonstrates the robustness of the controller. The results show that the controller parameters obtained by double DQN can still keep the course stable even when the masses of the tugboat and the platform change, and the towing system realizes adaptive control under various planned heading angles. That is to say, the designed double DQN optimized LADRC controller has good robustness.

CONCLUSIONS
In this article, course control for the towing system of an unpowered cylindrical drilling platform based on double deep Q-network (DQN) optimized linear active disturbance rejection control (LADRC) was studied. First, a three DOF dynamic model of the towing system under various disturbance conditions was built based on the MMG model and the catenary model. Then, the LADRC was designed for the tugboat, controlling its rudder angle to drive the towing system along the planned course. To determine the parameters of the LADRC, we applied the double DQN algorithm for real-time tuning of the controller parameters. Unlike heuristic algorithms that obtain parameters offline or fuzzy control algorithms that change parameters online based on expert rules, the proposed method adopts the Markov decision framework of reinforcement learning, so that the optimal parameters of the controller can be trained without specific model and controller information. Finally, we performed simulation experiments to validate the proposed method. The simulation results show that the double DQN-LADRC achieves better course control performance of the towing system under wind disturbances compared with the LADRC with fixed parameters. Moreover, the simulation results under different velocities of water flow disturbance reflect the proposed method's ability to deal with disturbances. In addition, we verified the robustness of the proposed method through model parameter perturbation and different target values. At present, ship path tracking control is also a hot research issue, and path tracking control of towing systems is in urgent need of research, which will be considered in our future work.