Optimization control of the double‐capacity water tank‐level system using the deep deterministic policy gradient algorithm

Process control systems are subject to external factors such as changes in working conditions and perturbation interference, which can significantly affect the system's stability and overall performance. The application and promotion of intelligent control algorithms with self‐learning, self‐optimization, and self‐adaptation characteristics have thus become a challenging yet meaningful research topic. In this article, we propose a novel approach that incorporates the deep deterministic policy gradient (DDPG) algorithm into the control of a double‐capacity water tank‐level system. Specifically, we introduce a fully connected layer on the observer side of the critic network to enhance its expression capability and processing efficiency, allowing for the extraction of important features for water‐level control. Additionally, we optimize the node parameters of the neural network and use the ReLU activation function to ensure the network's ability to continuously observe and learn from the external water tank environment while avoiding the issue of vanishing gradients. We enhance the system's feedback regulation ability by adding the PID controller output to the observer input, alongside the liquid level deviation and height. This integration with the DDPG control method effectively leverages the benefits of both, resulting in improved robustness and adaptability of the system. Experimental results show that our proposed model outperforms traditional control methods in terms of convergence, tracking, anti‐disturbance, and robustness performances, highlighting its effectiveness in improving the stability and precision of double‐capacity water tank systems.


INTRODUCTION
Machine learning is a rapidly evolving interdisciplinary field that draws on the expertise of multiple disciplines, including computer science, robotics, statistics, psychology, and other related areas. Its main focus is on finding patterns in data to forecast unidentified trends. The intersection of robotics and psychology in machine learning has a significant effect on the development of reinforcement learning (RL). 1 Reinforcement learning surpasses traditional control methods due to its ability to learn and improve through an interactive trial-and-error approach that relies on observations obtained from the dynamic environment. 2 In recent years, there has been a trend toward implementing comprehensive intelligence in industrial production. The surge of cloud computing and communication networks connects the related industries, unleashing the full potential of the industrial network orchestrated by machine learning and AI due to their capability to collect and generate large volumes of network data. 3 Moreover, AI-based control algorithms, characterized by robust autonomous learning and the ability to handle complexity, have revitalized various control schemes and led to new research avenues. 4 Consequently, a very important technological orientation of industrial intelligence is the systematic study of intelligent control. 5 As a significant subdivision of intelligent control, the incorporation of RL into system control technology is anticipated to pave the way for novel research paths, presenting tremendous research potential and promising application prospects.
Reinforcement learning and traditional control methods are commonly employed in the development of system control strategies for industrial production and management processes. However, certain proposed models may be susceptible to variations in working conditions, leading to performance limitations. Abe et al. 6 employed RL in the continuous decision-making process for optimizing the phase of microperistaltic pumps. The findings indicated that RL performed effectively in the optimization of the pump actuation sequence. Nevertheless, the real-time operational delay in the actuation sequences affected the state transition in RL. Dey et al. 7 achieved high precision and stable convergence in controlling the water level in a tank system using a Fuzzy logic method. However, the method was not able to accurately detect errors caused by delays in the pump environment. Efheij et al. 8 implemented a PID controller based on the Arduino ATmega328P to monitor an industrial water-level system. The response of the controller showed slight overshoot and almost zero steady-state error. However, in the face of dynamic changes in the water level and associated characteristics, the controller necessitated frequent parameter adjustments.
Another critical factor for evaluating a successful RL control method is its robustness. The ability to withstand and adapt to changes in the environment is essential for training and upgrading the actor network and critic network. 9 Harish and Peter 10 proposed a solution to enhance the robustness of RL training by imposing random perturbations on the system input, which is known as the Linear Quadratic Gaussian (LQG) method. While this method was shown to effectively improve robustness during RL training, it came at the cost of reduced model performance. Model convergence is also a critical measure of a system's stability and ability to resist interference. While Q-learning is a widely used model-free RL algorithm, it can have limited convergence capacity. Qi et al. 11 developed a real-time energy management system using a deep Q-network, which combines Q-learning with a deep neural network to provide optimal control decisions in a continuous environment. Nevertheless, the model's convergence performance is suboptimal, resulting in system instability and susceptibility to crashes. Cheng et al. 12 proposed a multi-agent deep deterministic policy gradient (MADDPG) offloading algorithm for mobile devices that maximizes long-term utility in terms of execution latency and energy consumption. However, the large number of mobile devices can make the training process unstable, which can limit the effectiveness of the proposed algorithm. In summary, numerous existing RL algorithms and traditional control methods are limited by weak anti-interference capability, poor convergence performance, and low robustness.
Recent studies have highlighted the potential of RL algorithms, including the deep deterministic policy gradient (DDPG) algorithm, for the control of nonlinear systems. Mendiola-Rodriguez et al. 13 conducted a study on the anaerobic digestion systems of Tequila vinasses, utilizing the DDPG algorithm to reduce chemical oxygen demand (COD). Their findings demonstrated the algorithm's ability to handle plant-model mismatch, as well as its robustness against disturbances and uncertainties, indicating its potential for applications in nonlinear system control. Yoo et al. 14 introduced a pioneering methodology that integrates RL and optimal control techniques to address the non-stationary and irreversible nature of batch processes. The approach employs a Monte-Carlo deep deterministic policy gradient with phase segmentation (MC-DDPG) and has demonstrated remarkable efficacy in managing substantial uncertainties and intricate nonlinear dynamics. While these studies successfully applied DDPG and its variants to directly control nonlinear systems, our work focuses on employing the DDPG algorithm to control a linearized version of the nonlinear double-capacity water tank system. The linearization of the system may lead to a more computationally efficient implementation, while still capturing the essential dynamics of the nonlinear system.
This article focuses on the control of a double-capacity tank-level system, a complex system that exhibits nonlinearity and time delay. This system finds wide applications in industries such as chemical and power plants, where even minor deviations can lead to significant financial loss and potential accidents. 7 However, due to the underlying complexity in the control mechanism of the double-capacity water tank, classic physical models or traditional control methods can hardly be applied to effectively control system parameters like water level and level deviation.
Given the challenges associated with controlling complex and dynamic systems, this article proposes the use of an advanced RL model, specifically the DDPG algorithm, to effectively regulate the double-capacity water tank system. By leveraging the power of artificial intelligence and machine learning, we aim to address the limitations of traditional control methods and achieve optimal control of system parameters such as water level and level deviation. The double-capacity water tank is classified as a continuous action-space system, making DDPG an ideal control method due to its deterministic action output. This property contributes to stabilizing policy updates and enhances the efficiency of directional exploration within the tank environment. In our work, to improve the processing efficiency of the system, a fully-connected layer is incorporated into the observer side during the construction of the critic network, enhancing its feature extraction performance. Considering the scenario of a continuous water-tank system, node parameters are optimized and a ReLU activation function is added to the design of the actor-critic network, ensuring that it can continuously react to changes in the environment while minimizing gradient loss. Robustness and convergence performances are key concerns in a water tank system, so we incorporate the PID controller output into the observer side of the DDPG pure control system so as to achieve better feedback performance.
In this study, we have developed DDPG pure control and DDPG adaptive compensation control systems for the control of a double-capacity water tank. Through a comparative analysis with proportional-integral-derivative (PID) control and Fuzzy PID control, our results demonstrate that DDPG surpasses these traditional control methods, showcasing superior adaptability, tracking performance, disturbance resistance, and robustness. The DDPG adaptive compensation control system has the best control effect, combining the adaptability and convergence of the pure DDPG method and the robustness of the PID method. Overall, the DDPG algorithm demonstrates superior performance metrics and holds promising potential for application in industrial process control systems.
The rest of this article is organized as follows. Section 2 demonstrates the design process of the DDPG-based control methods with proper tank environment building and innovative network construction. The present study focuses on the double-capacity water tank system and entails the development of two control systems through Simulink simulation. Specifically, the DDPG pure control system and the DDPG adaptive compensation control algorithm are employed to evaluate their respective performance in controlling the water tank system. Section 3 analyzes the design logic of the refined DDPG framework and demonstrates the control process of the DDPG algorithm. Section 4 conducts a comprehensive comparative analysis of four distinct control methods, namely, PID control, Fuzzy control, DDPG pure control, and DDPG adaptive compensation control. This comparative study encompasses four fundamental aspects: convergence, tracking, anti-disturbance, and robustness performances. The principal objective of this analysis is to discern and appraise the control efficacy of different methods. Section 5 summarizes the main work of this article and discusses the future research prospects of DDPG-based control methods.

Controlled object model
Due to its special properties in the water-level control scenario, the double-capacity water tank makes a good controlled object for researching DDPG control algorithms. The inflow and outflow rates of the tank, which are continuous in nature, are managed by the system in an effort to adjust the water level inside the tank. Meanwhile, the water level also serves as a continuous state variable. Therefore, DDPG is an RL technique that is appropriate for this problem given its continuous state and action space. In our study, we have linearized the double-capacity tank model to enable a more thorough exploration of its performance in the context of continuous process control.
In the simulation, it is assumed that the tank model satisfies the conditions listed in Table 1.

FIGURE 1 Structure diagram of the double-capacity water tank-level system.
Figure 1 shows a double-capacity tank-level control system composed of two single tanks in series, with the input quantity being valve opening variation Δu of the regulating valve u and the output quantity being the level increment Δf of tank 2.
According to the material balance equation, a relationship between the inflow, the outflow, and the level of each tank can be obtained. Based on the continuity equation, the relationships in (2) can be derived through first-order Taylor expansion in terms of the level increment Δf. Combining the equations listed in (2) gives the differential equation of the double-capacity water tank. Under zero initial conditions, and considering that there is a delay time of τ s for the change in water volume caused by the change in the opening of the regulating valve u, the transfer function of the double-capacity tank-level system can be obtained by the Laplace transform. Drawing from the assumptions outlined in Table 1, the transfer function of the double-capacity tank-level system is obtained by considering the operating condition in which the level height of tank 2 is maintained at a final value of 10 dm.
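For orientation, the textbook linearization of two non-interacting tanks in series can be written as below. This is a generic sketch under assumed notation, not a reproduction of the article's own equations: the tank cross-sections C_1 and C_2, the linearized valve resistances R_1 and R_2, the valve gain K_u, the level increment Δh_1 of tank 1, and the lumped constants T_1, T_2, and K are symbols introduced here for illustration only, while Δu, Δf, and the delay τ follow the text above.

```latex
\begin{aligned}
C_1 \frac{d\,\Delta h_1}{dt} &= K_u\,\Delta u - \frac{\Delta h_1}{R_1}, \\
C_2 \frac{d\,\Delta f}{dt} &= \frac{\Delta h_1}{R_1} - \frac{\Delta f}{R_2}, \\
T_1 T_2 \frac{d^2 \Delta f}{dt^2} + (T_1 + T_2)\frac{d\,\Delta f}{dt} + \Delta f &= K\,\Delta u,
\qquad T_1 = R_1 C_1,\quad T_2 = R_2 C_2,\quad K = K_u R_2, \\
G(s) = \frac{\Delta F(s)}{\Delta U(s)} &= \frac{K}{(T_1 s + 1)(T_2 s + 1)}\, e^{-\tau s}.
\end{aligned}
```

The first two lines are the material balances of the two tanks with linearized outflows, the third line eliminates Δh_1, and the last line applies the Laplace transform under zero initial conditions with the transport delay τ.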
Based on the transfer function, a double-capacity water tank-level system model is built in Simulink and encapsulated in the "water tank system" module, as shown in Figure 2.

2.2 DDPG pure control system model

System architecture
In the architecture of the DDPG pure control system, the DDPG intelligent agent itself is applied to the control loop as a pure controller, while the components other than the intelligent agent and the double-capacity water tank serve as the external environment. The observer inputs are the tank-level height f and the level deviation e, which reflect the state of the control system. The key component of the DDPG pure control system design is the neural network, which can effectively train on and utilize multiple input parameters. This helps to reduce the impact of disturbances and enhance control accuracy. The DDPG pure control method is applied to the control system and its structure block diagram is shown in Figure 3.

Environmental model
In the environmental model, the observer inputs are the initialized tank level height f and level deviation e. The range of the tank level height f is from 0 to 20 dm. The termination symbol d is designed to reflect whether training is completed. When the level height f goes beyond the upper limit of 20 dm or below the lower limit of 0 dm, the termination symbol d equals 1. In all other cases, d equals 0.
The reward value requires consideration of several parameters, mainly the level deviation e and the termination symbol d, where the absolute value of the deviation is used in the calculation. The level height is the control target, and its deviation from the set value is the priority in the design of the reward value. The reward is designed to be higher when the deviation is smaller, which incentivizes the agent to bring the level height closer to the target value. The specific reward function is given in Equation (5).
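The termination rule and the shape of such a reward can be sketched as follows. This is a minimal illustration, not the article's Equation (5): the 0 to 20 dm range comes from the text, whereas the function names and the numeric values `bonus`, `step_cost`, `tolerance_dm`, and `overflow_cost` are hypothetical placeholders chosen only to show a reward that grows as |e| shrinks and drops sharply on termination.

```python
def termination_flag(level_dm: float, low: float = 0.0, high: float = 20.0) -> int:
    """Return d = 1 when the level leaves the [low, high] dm range, else 0."""
    return 1 if (level_dm < low or level_dm > high) else 0


def reward(level_dm: float, setpoint_dm: float,
           bonus: float = 10.0, step_cost: float = 1.0,
           tolerance_dm: float = 0.1, overflow_cost: float = 100.0) -> float:
    """Hypothetical shaped reward: positive near the setpoint, penalized on termination."""
    e = abs(setpoint_dm - level_dm)          # absolute level deviation
    d = termination_flag(level_dm)           # 1 if the tank overflowed or emptied
    r = bonus if e < tolerance_dm else -step_cost
    return r - overflow_cost * d


if __name__ == "__main__":
    for level in (10.05, 12.0, 20.5):
        print(level, termination_flag(level), reward(level, setpoint_dm=10.0))
```

Keeping the termination penalty much larger than the per-step terms is a common design choice so that the agent learns to stay inside the admissible level range before it refines tracking accuracy.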

Network model
The DDPG control model consists of four deep neural networks: two critic networks and two actor networks, with networks of the same type having identical structures. 15 To simplify the system, a single critic network and a single actor network can be constructed, respectively. The critic network model consists of two parts, the observer side and the action side. We add an additional fully connected layer on the observer side of the critic network and use ReLU as the activation function. The input layer of the critic network receives inputs from both the observer side (s) and the action side (a).
Behind the fully connected layer lies the hidden layer, while the superposition layer combines the outputs of the fully connected layers on both sides. Finally, the output layer produces the evaluation value of the current policy. The actor network, which has a similar structure, receives state information from the environment and outputs the corresponding action policy. The overall network model is shown in Figure 4.
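The two-sided critic structure described above can be sketched in plain Python. This is a toy forward pass under assumed sizes, not the article's Simulink network: the layer widths, the weight initialization range, and the names `Critic`, `dense`, and `init_layer` are all illustrative assumptions.

```python
import random


def relu(v):
    """Element-wise ReLU activation."""
    return [max(0.0, x) for x in v]


def dense(v, weights, bias):
    """Fully connected layer: weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class Critic:
    """Observer path (two FC layers) plus action path (one FC layer), added, then mapped to Q."""

    def __init__(self, obs_dim=2, act_dim=1, hidden=8, seed=0):
        rng = random.Random(seed)
        self.obs_fc1 = init_layer(obs_dim, hidden, rng)
        self.obs_fc2 = init_layer(hidden, hidden, rng)   # additional observer-side layer
        self.act_fc = init_layer(act_dim, hidden, rng)
        self.out = init_layer(hidden, 1, rng)

    def q_value(self, obs, action):
        h_obs = relu(dense(obs, *self.obs_fc1))
        h_obs = dense(h_obs, *self.obs_fc2)
        h_act = dense(action, *self.act_fc)
        merged = relu([a + b for a, b in zip(h_obs, h_act)])  # superposition layer
        return dense(merged, *self.out)[0]


if __name__ == "__main__":
    critic = Critic()
    # Observation [level f, deviation e] and a scalar valve action.
    print(critic.q_value(obs=[10.0, 0.5], action=[0.2]))
```

The observer path is deliberately one layer deeper than the action path, mirroring the extra fully connected layer that the text places on the observer side.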

System model
Using the DDPG pure control as the control method and "water tank system" as the controlled object, the model of double-capacity water tank level control system is built in Simulink and the control system model is shown in Figure 5.

System architecture
In the DDPG adaptive compensation control system, the parts other than the intelligent agent and water tank are regarded as the external environment, while the DDPG intelligent agent is used as the front controller. To increase the system's capacity to regulate itself, PID is used as a feedback controller in the control loop. The observer inputs become the tank level height f, the level deviation e, and the output value of the feedback controller, which together reflect the state of the control system. The DDPG adaptive compensation control method is applied to the control system and its structural block diagram is shown in Figure 6.

Environmental model
The construction method of the environment model is roughly the same as that shown in Section 2.2.2. The main differences are (i) the observer inputs have three channels, which are the tank level height f, the level deviation e, and the output value u of the feedback controller; (ii) during the reward function calculation, when |e| ≥ 0.1 dm, the environment imposes a penalty on the intelligent agent with a negative reward value of −5.
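A minimal sketch of the three-channel observation and the deviation penalty is given below. The 0.1 dm threshold and the −5 penalty are taken from the text; the discrete PID form, its gains, the sample time, and the names `PID`, `observation`, and `deviation_penalty` are assumptions made for illustration and do not reflect the article's tuning.

```python
class PID:
    """Discrete PID feedback controller with illustrative (not the article's) gains."""

    def __init__(self, kp: float, ki: float, kd: float, dt: float):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error: float) -> float:
        """Advance the controller by one sample and return its output u."""
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def observation(level_dm: float, setpoint_dm: float, pid: PID):
    """Three-channel observation [f, e, u] for the adaptive compensation agent."""
    e = setpoint_dm - level_dm
    u = pid.step(e)
    return [level_dm, e, u]


def deviation_penalty(e: float, tolerance_dm: float = 0.1, penalty: float = -5.0) -> float:
    """Penalty applied when |e| >= tolerance, using the values stated in the text."""
    return penalty if abs(e) >= tolerance_dm else 0.0


if __name__ == "__main__":
    pid = PID(kp=1.0, ki=0.5, kd=0.0, dt=1.0)
    obs = observation(level_dm=9.0, setpoint_dm=10.0, pid=pid)
    print(obs, deviation_penalty(obs[1]))
```

Feeding u to the agent lets it see what the feedback loop is already doing, so the learned action can act as a compensation on top of the PID output rather than replacing it.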

Network model
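The building technique of the network model is roughly the same as that described in Section 2.2.3, with two key differences: (i) the inputs of the critic network model are the input s on the observer side and the input a′ on the action side, with an additional component of the output u of the feedback controller; (ii) the inputs of the actor network model become a combination of state s on the observer side and action a on the action side. The overall network model is shown in Figure 7.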

System model
Using DDPG adaptive compensation control as the control method and "water tank system" as the controlled object, a double-capacity water tank-level control system model is built in Simulink and the control system model is shown in Figure 8.

DDPG-BASED CONTROL ALGORITHM
Deep deterministic policy gradient was proposed by the DeepMind team in 2016 as a strategy algorithm that incorporates deep learning neural networks into the deterministic policy gradient (DPG). 16 The DDPG algorithm employs an actor-critic network to approximate the policy function μ and utilizes the DQN algorithm to train the network function Q, which enables the computation of temporal difference errors and the implementation of gradient updates from the online network to the target network. 17 The Q function in the critic network represents the expected value R_t obtained after executing action a_t, output by the actor network and policy μ in state s_t, with a discount factor of γ.

FIGURE 8 DDPG adaptive compensation control system model.

FIGURE 9 Design block diagram of the DDPG control algorithm.
The Q network in DDPG is obtained by approximating the Q function with the critic network, whose parameters are denoted as θ^Q.
The performance of the strategy μ is measured by the objective function G_β(μ), the expected Q value under the state distribution, where s denotes the environmental state and ρ^β is its distribution function. The design block diagram of the DDPG control algorithm is shown in Figure 9.
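For reference, the quantities named in this section are usually written as follows in the standard DDPG formulation. This is a sketch in assumed notation rather than a reproduction of the article's typeset equations: the objective is written as G to match the text, and the target networks Q′ and μ′ with parameters θ^{Q′} and θ^{μ′} are the conventional symbols, introduced here as an assumption.

```latex
\begin{aligned}
Q^{\mu}(s_t, a_t) &= \mathbb{E}\!\left[\, r(s_t, a_t) + \gamma\, Q^{\mu}\!\big(s_{t+1}, \mu(s_{t+1})\big) \right], \\
G_{\beta}(\mu) &= \int_{S} \rho^{\beta}(s)\, Q^{\mu}\!\big(s, \mu(s)\big)\, ds
               = \mathbb{E}_{s \sim \rho^{\beta}}\!\left[ Q^{\mu}\!\big(s, \mu(s)\big) \right], \\
L(\theta^{Q}) &= \mathbb{E}\!\left[ \big( y_t - Q(s_t, a_t \mid \theta^{Q}) \big)^{2} \right],
\qquad y_t = r_t + \gamma\, Q'\!\big(s_{t+1}, \mu'(s_{t+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big).
\end{aligned}
```

The first line is the Bellman recursion for a deterministic policy, the second is the performance objective that the actor maximizes, and the third is the temporal-difference loss that the critic minimizes.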
DDPG is an RL algorithm that uses a neural network to learn a policy function, which outputs deterministic actions based on the agent's observations. The agent then interacts with the environment to receive feedback in the form of a reward signal, which is used to update the policy network and the Q-value function. 18 It aims to maximize G_β(μ) and minimize the loss of the Q network, finally completing the data mapping from input to output 19 through continuous policy-based and value-based training within the actor-critic network. It creates an experience replay buffer, similar to that used in the DQN method, to store the experiences of the intelligent agent during the previous t time steps. At the same time, it randomly samples from the experience memory to facilitate the transfer of gradient information from the evaluation network (critic network) to the action network (actor network), aiming to update the parameters of the online network and target network through backpropagation and avoid overfitting. The efficiency of experience replay and gradient descent is improved by introducing a fully-connected layer on the observer side of the critic network so that it can better extract key features for water-level control from the tank environment. Refined node parameters and the use of the ReLU function prevent the issue of vanishing gradients and thus guarantee the quality of network parameter updates.
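The experience replay mechanism and the online-to-target parameter transfer can be sketched as follows. This is an illustrative sketch only: the class and function names are hypothetical, and the soft update rule with rate `tau` is the conventional DDPG choice, assumed here because the text does not state how the target network is refreshed.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity experience pool; the oldest samples are overwritten when full."""

    def __init__(self, capacity: int, seed: int = 0):
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        """Store one transition (s, a, r, s', d)."""
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, k: int):
        """Uniformly sample k stored transitions without replacement."""
        return self.rng.sample(list(self.memory), k)

    def __len__(self):
        return len(self.memory)


def soft_update(target_params, online_params, tau: float = 0.001):
    """Move each target parameter a small step tau toward its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=3)
    for i in range(5):
        buffer.add(state=i, action=0.0, reward=1.0, next_state=i + 1, done=0)
    print(len(buffer), buffer.sample(2))
    print(soft_update([0.0, 1.0], [1.0, 1.0], tau=0.5))
```

Random sampling breaks the temporal correlation between consecutive transitions, and a small `tau` keeps the target values slowly varying, both of which tend to stabilize critic training.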
The algorithm design process is shown in Algorithm 1. In each update step, k samples are randomly drawn from the experience pool and input to the actor-critic network; the online value is then calculated, followed by the target value.
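The online-value and target-value steps can be illustrated with a short sketch. It assumes the conventional DDPG target y = r + γ(1 − d)Q′(s′, μ′(s′)) and a mean squared error critic loss; the function names, the transition layout (s, a, r, s′, d), and the default γ are hypothetical, and the networks are passed in as plain callables.

```python
def td_targets(batch, target_actor, target_critic, gamma: float = 0.99):
    """Compute y = r + gamma * (1 - d) * Q'(s', mu'(s')) for each sampled transition."""
    targets = []
    for _state, _action, reward, next_state, done in batch:
        next_action = target_actor(next_state)
        bootstrap = target_critic(next_state, next_action)
        targets.append(reward + gamma * (1 - done) * bootstrap)
    return targets


def critic_loss(batch, critic, targets):
    """Mean squared error between the online values Q(s, a) and the targets y."""
    errors = [(y - critic(s, a)) ** 2 for (s, a, _r, _s2, _d), y in zip(batch, targets)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Toy stand-ins for the networks: constant actor and critic outputs.
    batch = [((0.0,), 0.0, 1.0, (1.0,), 0), ((1.0,), 0.0, -5.0, (2.0,), 1)]
    y = td_targets(batch, lambda s: 0.0, lambda s, a: 2.0, gamma=0.5)
    print(y, critic_loss(batch, lambda s, a: 1.0, y))
```

Multiplying the bootstrap term by (1 − d) drops the future value for transitions that ended the episode, for example when the level left the admissible range.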

RESULT ANALYSIS
The effectiveness of a control system is generally evaluated through quantitative measures. To ensure the desired performance criteria, it is essential to compare and analyze different control algorithms through simulation experiments under the same initial conditions. Here, we assume that the target water level of tank 2 is fixed at 10 dm. In the simulation experiments, a traditional PID controller and a Fuzzy controller are compared with the DDPG control algorithm to verify the performance of the DDPG-based control framework. The PID controller is an extensively employed regulator in industry. 20 To eliminate the steady-state error of the system and, ultimately, stabilize the system output, it works with pre-defined parameters and applies the results of its operation to the controlled object through the actuator. Fuzzy controllers are considered by researchers to be an excellent choice for studying complex systems as opposed to ordinary linear controllers. 21 To achieve better control outcomes, Fuzzy control involves fuzzifying the precise values of the PID controller parameters, creating inference rules, and performing defuzzification.

General parameters
The training parameters of the intelligent agent under the DDPG pure control algorithm (a) and the DDPG adaptive compensation control algorithm (b) are listed in Table 2. Server and programming configurations are presented in Table 3.

Convergence performance testing
The efficacy of RL training is directly measured by convergence performance, which can determine whether a system has a provable convergence guarantee to a globally optimal and feasible strategy. 22 To assess this characteristic, we utilize the reward curves obtained from the training of the intelligent agent.
During the training of the intelligent agent, the training ends when any of the following conditions is met: (i) the simulation distance exceeds 30 steps; (ii) the cumulative average reward of the intelligent agent is higher than 2500. Assuming that the training is performed under the same parameters, the cumulative reward curves of the DDPG-based methods are shown in Figure 10.
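The two stopping conditions can be expressed as a small monitor, sketched below. The limits of 30 steps and 2500 come from the text; the averaging window, the interpretation of "cumulative average reward" as a moving average over recent episodes, and the name `TrainingMonitor` are assumptions made for illustration.

```python
from collections import deque


class TrainingMonitor:
    """Tracks a moving average of rewards and decides when training should stop."""

    def __init__(self, reward_threshold: float = 2500.0, window: int = 5, max_steps: int = 30):
        self.reward_threshold = reward_threshold
        self.max_steps = max_steps
        self.recent = deque(maxlen=window)   # rewards used for the moving average
        self.steps = 0

    def record(self, episode_reward: float) -> bool:
        """Store one reward; return True when either stopping condition is met."""
        self.recent.append(episode_reward)
        self.steps += 1
        average = sum(self.recent) / len(self.recent)
        return average > self.reward_threshold or self.steps > self.max_steps


if __name__ == "__main__":
    monitor = TrainingMonitor(window=2)
    for r in (1000.0, 2000.0, 3200.0):
        print(r, monitor.record(r))
```

Averaging over a window rather than using the latest reward alone avoids stopping on a single lucky episode.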

FIGURE 10 Cumulative reward curves under different DDPG control methods.
Figure 10 shows that, under the same cumulative average reward threshold of 1200, the DDPG pure control method requires 30 training steps and takes 5 min and 43 s, while the DDPG adaptive compensation control method requires 24 training steps and takes 4 min and 18 s. This indicates that the DDPG adaptive compensation control method reaches the cumulative reward threshold more quickly, converges faster, and exhibits better convergence performance.
Upon examining the cumulative reward curves, we observe that both the DDPG pure control and DDPG adaptive compensation control methods exhibit a consistent upward trend, characterized by a small fluctuation range and good stability. However, the two differ in whether their exploration is directional. The exploration of the DDPG pure control method is non-directional, with constant trial and error in the early stages of training, while the exploration of the DDPG adaptive compensation control method is directional, which guarantees that its total reward is always positive and allows it to continuously get closer to the desired value. The positive reward values lead to a shorter training time, allowing the intelligent agent to quickly obtain the optimal strategy.

Tracking performance testing
Tracking performance is a crucial parameter that has been extensively investigated in various fields, and its optimization in system control holds great significance. 23 Achieving exceptional quality in a system implies that the actual value of the system index should quickly follow the set value. In this section, we focus on exploring the tracking performance of the double-capacity water tank system by adjusting the input conditions. Assuming the level input is set to −1 dm at 55 s and 2 dm at 85 s during system operation, we conduct simulations under the same initial conditions and present the level output curve in Figure 11.
Figure 11 shows that when the set value of the tank level changes, the DDPG-based control method responds faster than the PID-based control method, with a smaller fluctuation range and consequently superior tracking performance.
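A set-point tracking test of this kind can be reproduced in miniature with a simple simulation. The sketch below is not the article's Simulink model: it assumes a delay-free second-order plant K / ((T1 s + 1)(T2 s + 1)) with hypothetical values K = 1, T1 = T2 = 5 s, a PI controller with illustrative gains, and it interprets the level input changes as set-point steps of −1 dm at 55 s and +2 dm at 85 s around the 10 dm target.

```python
def simulate_tracking(t_end: float = 150.0, dt: float = 0.05,
                      gain: float = 1.0, t1: float = 5.0, t2: float = 5.0,
                      kp: float = 2.0, ki: float = 0.4):
    """Euler simulation of a PI loop around two first-order lags in series."""
    h1 = 0.0                 # state of the first lag (dm)
    h2 = 0.0                 # state of the second lag, the controlled level (dm)
    integral = 0.0           # integral of the tracking error
    times, levels, setpoints = [], [], []
    steps = int(round(t_end / dt))
    for i in range(steps + 1):
        t = i * dt
        setpoint = 10.0
        if t >= 55.0:
            setpoint -= 1.0  # assumed -1 dm set-point step at 55 s
        if t >= 85.0:
            setpoint += 2.0  # assumed +2 dm set-point step at 85 s
        error = setpoint - h2
        integral += error * dt
        u = kp * error + ki * integral          # PI control signal
        h1 += dt * (gain * u - h1) / t1         # first tank dynamics
        h2 += dt * (h1 - h2) / t2               # second tank dynamics
        times.append(t)
        levels.append(h2)
        setpoints.append(setpoint)
    return times, levels, setpoints


if __name__ == "__main__":
    times, levels, setpoints = simulate_tracking()
    print(f"final set point {setpoints[-1]:.2f} dm, final level {levels[-1]:.3f} dm")
```

Replacing the PI law with the action of a trained agent, while keeping the plant and the set-point profile fixed, gives a like-for-like comparison of response speed and fluctuation range.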

Anti-disturbance performance testing
In the actual control process, the system is often affected by disturbances from the external tank environment. Anti-disturbance performance, a crucial control system parameter, provides a direct indication of the water tank system's robustness and its capacity to operate in highly dynamic control settings. 24 In this part, internal and external disturbances are imposed on the tank system to assess its performance.

FIGURE 12 Level output curve of different control systems under disturbance.
Assume that when the system runs to 52.5 s, there is an internal level disturbance of 1 dm between the controller and the double-capacity tank-level system, and that at 90 s, there is an external level disturbance of 1 dm between the double-capacity tank-level system and the feedback loop. Simulation is performed under the same initial conditions and the level output curve is shown in Figure 12.
Figure 12 shows that, for the same disturbance, the DDPG-based control method takes less time to return to the steady-state value and has a smaller fluctuation range than the PID-based control method. In the DDPG-based control method, the intelligent agent is trained to converge to the optimal value with minimal fluctuation, even in the presence of internal and external disturbances. This approach effectively alleviates the impact of stochastic external factors and significantly enhances the self-adaptive performance of the double-capacity tank system.

FIGURE 13 Level output curve of different control systems when the valve coefficient changes.

Robust performance testing
Robustness is a critical parameter that measures a system's ability to adapt to the external environment and handle a wide range of testing scenarios. 25 Scholars have been actively involved in constructive endeavors aimed at improving the robustness of system control. Mendiola-Rodriguez et al. 26 emphasize the importance of employing an integrated approach to improve sustainability in control processes. By considering multiple aspects of the process and using suitable sustainability metrics, decision-making can be streamlined. The study underscores the advantages of implementing an integrated approach to increase the controllability of the system. In the development of our DDPG adaptive compensation control model, we incorporated a PID controller to enhance the system's robustness against the uncertainties imposed by the tank environment. In addition, a fully connected layer is introduced on the observer side of the critic network to mitigate the impact of uncertainty by improving the system's feature extraction capabilities. Random factors, like sudden changes in valve coefficients, can affect the dynamic characteristics of the tank system and thus impact its robustness in practical working conditions. In this section, to evaluate the robustness of the system, we introduce variations in the load valve coefficient K_w and retrain the agent accordingly.
Assume that the condition of the system changes, that is, the coefficient K_w of the load valve w in the double-capacity tank system varies by 5% at a certain moment, which makes the outflow rate increase by 5%. The simulation is performed under the same initial conditions with a change in the valve condition, and the level output curve is shown in Figure 13.
Comparing the system state before and after the environmental change in Figure 13, it can be observed that the DDPG pure control method exhibits a sluggish response and inadequate robustness when the valve coefficient fluctuates within a specific range, which can be exacerbated in scenarios with higher degrees of uncertainty. Specifically, after running for 40 s, there is an error of 0.15 dm. The PID-based control method takes less time to recover to the steady-state value, and the system responds faster with a smaller fluctuation range, reflecting its good robustness. It is worth highlighting that the DDPG adaptive compensation control algorithm demonstrates exceptional robustness, which can be attributed to our deliberate efforts to design it to accommodate uncertainties.

CONCLUSION
In this article, DDPG pure control and DDPG adaptive compensation control methods are proposed to adjust the water level of a double-capacity tank-level system by building a comprehensive tank system model with enhanced feedback regulation capability and training DDPG-based intelligent agents using an actor-critic network with refined structure and parameters. The performances of the DDPG-based control methods are compared with a traditional PID controller and a Fuzzy controller in terms of the system's convergence, tracking, anti-disturbance, and robustness performances. Simulation results indicate that our proposed DDPG adaptive compensation control algorithm combines the adaptive performance of the DDPG method with the robustness of the PID method, outperforming conventional schemes in the control of the double-capacity tank-level system. Under the same conditions, it converges faster, responds more quickly, and tracks the target water level more precisely. Furthermore, compared with traditional control methods, the DDPG adaptive compensation control system demonstrates superior self-adaptive performance in the presence of random factors in the water tank environment. The increasing performance demands and wider application scenarios of process control have prioritized the use of DDPG adaptive compensation control, which has been found to efficiently address complex problems with continuous action, and provide data-driven control of dynamic systems with strong robustness. 27
In our upcoming research, we will use the DDPG adaptive compensation control method to investigate multi-tank-level systems that call for sophisticated nonlinear constraints. Moreover, we hope to achieve autonomous control over the tank-level system with the DDPG algorithm. One potential solution to maximize the strengths of DDPG adaptive compensation control algorithms is to perform supervised learning on the optimized control actions to build new neural network controllers, allowing for algorithm migration. The self-updating of controllers after migration is likely to enhance the effectiveness of the DDPG adaptive compensation control approach.

FIGURE 2 "Water tank system" module.
FIGURE 3 DDPG pure control system structure diagram.
FIGURE 4 Network model diagram of DDPG pure control algorithm. (i) Critic network. (ii) Actor network.

FIGURE 5 DDPG pure control system model.
FIGURE 6 DDPG adaptive compensation control system structure diagram.

FIGURE 7 Network model diagram of DDPG adaptive compensation control algorithm. (i) Critic network. (ii) Actor network.

TABLE 2 Parameters of the intelligent agent in the DDPG control algorithm.
TABLE 3 Server and programming configurations.

FIGURE 11 Level output curve of different control systems when the set value changes.
TABLE 1 Parameters of the double-capacity water tank-level system.

ALGORITHM 1
Randomize the reference signal h and the initial height h_0
Initialize the random noise signal UO
Initialize the experience pool T and its capacity t
Initialize the actor parameters μ(s|θ^μ) and the critic parameter Q
Initialize the environment state s
for each episode = 1, …, N, or until the maximum reward is reached, do
    Read control action u_t
    Execute action a_t = μ(s_t|θ^μ)
    Obtain the final control action u_final as the weighted feedback output u_t plus the agent action a_t
    Obtain the reward r_t, the state s_{t+1} at the next moment, and the terminator d
    u_{t+1} = u_final
    s_{t+1} = s_t
    Record the sample (a_t, s_t, u_t, r_t, s_{t+1}, u_{t+1}, d) to the experience pool T, and overwrite the previous records if the capacity is insufficient
    for t = 1, …, T do
        Randomly sample k data from the experience pool and input them to the actor-critic network
        Calculate the online value
        Calculate the target value