Distributed deep reinforcement learning for optimal voltage control of PEMFC

Funding Information: National Natural Science Foundation of China (51777078).

Abstract: Proton exchange membrane fuel cells (PEMFCs) are promising components in the renewable energy field due to their high energy efficiency and low pollution output. However, these cells are also characterized by considerable nonlinearity, which in turn adversely affects the PEMFC output voltage. Conventional control algorithms cannot guarantee sufficient output voltage control, as they lack the robust adaptive ability required to cope with these fluctuations in PEMFCs. In this paper, an optimal output voltage controller based on distributed deep reinforcement learning, which controls the output voltage by regulating the fuel input of the PEMFC, is proposed. In addition, an ensemble intelligence exploration multi-delay deep deterministic policy gradient (EIM-DDPG) algorithm is proposed for this controller. An ensemble intelligence exploration policy and a classification experience replay mechanism are included within the EIM-DDPG algorithm to improve the exploration ability of the algorithm and thus increase the robustness and adaptive capability of the controller. As a result, the model-free optimal output voltage controller offers high robustness and adaptability. The simulation results in this paper demonstrate that the proposed optimal controller can realize effective control of the PEMFC output voltage.


INTRODUCTION
Fuel cell research and development has proliferated in recent years in response to increasing global competition over conventional energy sources. The proton exchange membrane fuel cell (PEMFC), which runs on hydrogen, is the most widely studied fuel cell type. PEMFCs have several advantages, including short start-up times, high specific power, long average lifetimes and low working temperatures, which together render this type of fuel cell a potentially suitable candidate for automobile power generation [1][2][3].
In a conventional PEMFC, hydrogen from the anode is used as the fuel, and the oxygen from the cathode is used as the oxidant. These gases react to form water, releasing energy in the process. It has been indicated that the output energy of a PEMFC is directly affected by the hydrogen supply stream. When the load current changes rapidly, the overall voltage of the PEMFC will decline if its fuel flow is insufficient. Prolonged fuel shortages lead to electrode reversal and carbon corrosion, catalyst shedding, and permanent stack underperformance. To stabilize the PEMFC output voltage, reasonable control of the fuel flow is required. It is therefore necessary to devise a control strategy for PEMFC hydrogen flow [4].
Scholars in this field have investigated PEMFC fuel control and proposed various approaches, including classic proportion integration differentiation (PID) control, feedforward-feedback algorithms, adaptive control, model predictive control (MPC), neural network control (NNC), and compound control. Table 1 summarizes the advantages and disadvantages of these algorithms, which are discussed in detail below.

TABLE 1 Advantages and disadvantages of existing PEMFC control algorithms

Feedforward-feedback control and PID (observer feedforward [5]; PI controller [6]; coefficient-optimized FOPID [7]; PSO-PID controller [8]). Advantages: low cost; simple structure and easy modification. Disadvantages: difficult to cope with nonlinearity; long regulating time and narrow range of control policies.

Adaptive control (T2-FLS controller [9]; MRAC [10]). Advantages: ability to cope with nonlinear systems; anti-interference; excellent robustness. Disadvantages: lack of systematic design; low control accuracy; poor dynamic quality.

Model predictive control (off-line robust MPC [11]; DMPC [12]). Advantages: excellent dynamic performance; excellent stability and strong anti-interference ability. Disadvantages: complex controller; large computational burden; over-reliance on an accurate mathematical model.

Neural network control (neural network feedforward [13]; adaptive NNC [14]). Advantages: avoids modelling; learning ability; more accurate models and controllers as data accumulate. Disadvantages: low robustness; over-reliance on samples.

Compound control (type-II fuzzy-PID [15]; optimal PID [16]; PID plus fuzzy controller [17]). Advantages: the integration between controllers compensates for individual deficiencies. Disadvantages: the design is difficult; error correction is hard.

1. PID and feedforward-feedback algorithms. Chen et al. applied an observer-based static feedforward controller to achieve fuel flow control for the PEMFC, which led to a high response speed [5]. Ou et al. relied on a PI controller to adjust the hydrogen flow, which produced satisfactory control performance [6]. Zhao et al. proposed the coefficient-optimized fractional-order PID (FOPID) controller for regulating PEMFC air flow, which exhibited adaptability superior to PID [7]. Ahmadi et al. put forward a PID controller with coefficients optimized by particle swarm optimization (PSO) to control the PEMFC output voltage [8]. Although each of these algorithms possesses a simple structure and is relatively easy to implement, they are difficult to adapt to the nonlinear characteristics of the PEMFC because of their poor adaptive ability. In addition, these algorithms require a long time for regulation and stabilization, which makes optimal control performance almost unattainable.
2. Adaptive control. Recently, a robust adaptive controller has been proposed on the basis of type-2 fuzzy logic systems (T2-FLS), which controls the PEMFC gas flow in line with fuzzy rules [9]. This controller is characterized by excellent robustness and adaptability. However, it suffers from a design flaw: accurate control is not achievable with algorithms based on fuzzy rules. Model reference adaptive control (MRAC), proposed by Han et al., is capable of adaptive control through model prediction, which yields better control performance than traditional linear controllers [10]. However, the MRAC algorithm is impractical because of its demanding requirements on modelling accuracy.
3. Model predictive control. Chatrattanawet et al. proposed an off-line robust MPC algorithm for the effective control of oxygen and hydrogen flows in a PEMFC [11]. Liu et al. put forward a decentralized model predictive control (DMPC) policy in which a pair of DMPC controllers regulate the gas supply and temperature of the PEMFC, leading to excellent control performance [12]. However, although such predictive control algorithms can cope with disturbances in complex systems and address parameter uncertainty within such systems, they remain difficult to apply in practice because of their heavy reliance on modelling and the complexity of their design.
4. Neural network control. Vinu et al. [13] designed a voltage output feedback controller that controls the output voltage through a neural network feedforward controller optimized with the harmony search algorithm, achieving a high response speed. Abbaspour et al. [14] designed a robust algorithm for controlling the gas flow within the PEMFC; their design incorporates online learning for timely adjustment of the control quantity according to the state of the PEMFC, producing high control performance. However, despite the excellent adaptability of these neural network control algorithms, each requires a large number of samples for training, which incurs a high training cost, and their control performance depends heavily on sample quality.
5. Compound control. Aliasghary [15] applied an interval type-II fuzzy-PID controller to control the PEMFC fuel flow, aiming to improve the controller's ability to cope with various uncertainties and thus reduce the risk of fuel starvation; this controller achieves a higher feedback speed than conventional fuzzy-PID controllers. Baroud et al. [16] put forward a fuzzy adaptive controller whose performance, measured by key performance indexes, is superior to a normal fuzzy logic controller and the classical PID controller. To eliminate the possibility of oxygen deprivation and maximize PEMFC energy efficiency, Beiram et al. [17] combined an optimal PID with a fuzzy controller; compared with a feedforward controller and a feedforward PID controller, their algorithm produced better control performance. However, the coupling of the optimal PID and fuzzy controllers involves a complicated arrangement, which adds to the difficulty of ensuring satisfactory control performance.
According to Table 1, for most control algorithms better control performance is accompanied by higher complexity, so a simple PEMFC controller design with excellent adaptability and robustness remains difficult to achieve. In addition, the working principles of the PEMFC are complicated, encompassing the laws of thermodynamics, electrochemistry and materials science, and its internal workings are equally complex. The factors influencing the output voltage performance of the PEMFC include material parameters (e.g. catalyst type), model parameters (current density, system structure parameters, etc.) and operating parameters (e.g. internal temperature, partial pressures at the cathode and anode). The model and material parameters of the PEMFC are difficult to measure because they change over time, which hinders PEMFC research. Moreover, various operating parameters can significantly affect the performance of PEMFC output voltage control. To maintain optimal PEMFC performance, the reference voltage must vary continuously with the load current. With multiple inputs and multiple outputs, the PEMFC shows nonlinear and time-varying characteristics. To date, the various controller designs proposed for such nonlinear systems have been too complex to achieve simple and accurate control of the PEMFC output voltage.
Thus, it is imperative to develop a model-free algorithm for output voltage control that features a simple design, excellent adaptability, high robustness and high compatibility with the nonlinear PEMFC system. As a member of the deep reinforcement learning (DRL) family, the deep deterministic policy gradient (DDPG) is a model-free algorithm [18][19][20]. Capable of direct control based on the input data, the DDPG algorithm is characterized by structural simplicity and outstanding adaptability. Unlike traditional control algorithms, DDPG determines its control policy entirely through interaction with the environment [21,22], without requiring a model, and can thus address the uncertainties arising from nonlinear systems [23][24][25]. Because of its poor robustness, however, DDPG is rarely applied in PEMFC systems, which require accurate control. To solve this problem, Fujimoto et al. [26] proposed TD3, a technique embracing delayed policy updates, to improve the robustness of the algorithm. APE-X DDPG, proposed by Horgan et al. [27], introduces a distributed reinforcement learning framework that prevents the algorithm from falling into local optima, thereby improving the exploration ability and, ultimately, the robustness. Schaul et al. [28] proposed prioritized experience replay, which lowers the training cost by selecting high-value samples, making it easier for the algorithm to converge and thereby improving robustness. However, these algorithms improve robustness only by enhancing their own exploration capacity or training efficiency so as to obtain more diversified samples. It remains difficult for DDPG to obtain effective guidance and gather more valuable samples during training, so the resulting improvement in robustness is limited.
In light of this, a new algorithm termed ensemble intelligence exploration multi-delay deep deterministic policy gradient (EIM-DDPG) is proposed in this paper. EIM-DDPG improves on DDPG as a distributed DRL innovation with high exploration ability and training efficiency. EIM-DDPG adopts an ensemble intelligence policy to overcome the low exploration ability of DDPG, and the Q-value overestimation issue encountered by DDPG is addressed with several techniques. As a model-free control algorithm with a simple design, EIM-DDPG is effective in ensuring adequate control of the output voltage once its deep neural network is fully trained. The algorithm is characterized by a low computational burden, a simple structure and independence from models; moreover, the control action adapts to the changing state of the PEMFC, so the algorithm demonstrates excellent adaptability and robustness. The EIM-DDPG design provides a comprehensive solution to the weaknesses of the algorithms discussed above: the low control accuracy of adaptive control algorithms, the reliance of MPC on models, the inability of neural network control to self-correct, and the complex structure of compound control algorithms.
This paper makes three significant contributions: 1. A dynamic output voltage control framework that considers the nonlinear characteristics of PEMFCs is proposed. 2. Taking advantage of the high adaptability and model-free features of deep reinforcement learning, an intelligent controller based on deep reinforcement learning with higher adaptability and robustness is proposed. 3. The EIM-DDPG algorithm, which equips this controller with an ensemble intelligence exploration policy and a classification experience replay mechanism, is proposed to improve the robustness of the controller.
The rest of this paper is structured as follows. Section 2 explains the PEMFC model. Section 3 elaborates on the EIM-DDPG innovation. Section 4 reviews and analyses the simulation results of the online application. Section 5 summarizes the findings.

Model of PEMFC
The PEMFC voltage is the combination of three elements: the thermodynamic electromotive force, the polarization over-voltage and the ohmic over-potential. The standard potential is 1.229 V. When the PEMFC is operating, the eventual output voltage is less than the standard potential because of the polarization effects inside the PEMFC; the polarization effects are thus responsible for the voltage loss during operation. The lost voltage consists of the activation polarization over-voltage, the ohmic polarization over-voltage and the concentration polarization over-voltage [35]. Therefore, the output voltage V_cell of a single cell can be calculated as follows:

V_cell = E_nernst - η_act - η_ohm - η_con (1)

For N single cells connected in series, the stack voltage V is

V = N V_cell (2)

Thermodynamic electromotive force

The thermodynamic electromotive force is the electric potential at which the entire electrochemical system is in equilibrium when no current flows inside the PEMFC. Under this circumstance, the thermodynamic electromotive force is related to the temperature of the system and the pressure of the reaction gases. According to the electrochemical reaction mechanism of the gases, the thermodynamic electromotive force (i.e. the Nernst electromotive force) of a single cell is

E_nernst = ΔG/(2F) + ΔS/(2F) (T - T_ref) + R T/(2F) [ln(p_H2) + (1/2) ln(p_O2)] (3)

Substituting the standard values of the physical constants, formula (3) can be rewritten as

E_nernst = 1.229 - 0.85 × 10^-3 (T - 298.15) + 4.3085 × 10^-5 T [ln(p_H2) + (1/2) ln(p_O2)] (4)

Activation overvoltage
With a catalyst, the hydrogen at the anode of the PEMFC dissociates into hydrogen ions and electrons. When the hydrogen ions pass through the proton exchange membrane to reach the cathode, activation energy must be consumed to complete the process, which produces a voltage drop. This voltage is called the activation polarization voltage. The activation overvoltage of the PEMFC includes two parts, the anode overvoltage and the cathode overvoltage. The total activation overvoltage η_act is

η_act = -[ξ1 + ξ2 T + ξ3 T ln(c_O2) + ξ4 T ln(I)] (5)

where ξ1 to ξ4 are empirical coefficients and c_O2 is the concentration of dissolved oxygen at the cathode catalyst interface,

c_O2 = p_O2 / (5.08 × 10^6 e^(-498/T)) (6)

Ohmic polarization overvoltage
The ohmic polarization overvoltage is largely constituted by the equivalent membrane resistance R_m of the proton membrane and the resistance R_c that opposes the passage of protons through the membrane. The resulting voltage drop η_ohm is

η_ohm = I (R_m + R_c) (7)

The internal resistance of a cell can be expressed analytically as

R_m = ρ_m l / A (8)

where ρ_m is the specific resistivity of the membrane, l is the membrane thickness and A is the active cell area.

Concentration overvoltage
The concentration overvoltage arises from the mass transfer that affects the concentrations of hydrogen and oxygen. The concentration overvoltage η_con can be described as

η_con = -B ln(1 - J/J_max) (9)

where B is a constant depending on the cell type, J is the current density and J_max is the maximum current density.

Charge double-layer
In the PEMFC there is a charge double-layer phenomenon: hydrogen ions gather on the surface of the electrolyte while electrons accumulate on the surface of the electrode. The voltage they generate is equivalent to a capacitor C connected in parallel with the polarization resistance R_d, so that the electrode and electrolyte surfaces and the nearby charge layer store charge and energy. Taking the polarization voltage across R_d as V_d, the voltage dynamics of a single cell can be written as a differential equation:

dV_d/dt = I/C - V_d/(R_d C) (10)

The output voltage of a single cell then follows as

V_cell = E_nernst - V_d - η_ohm (11)

Cathode model
Considering the continuity balance of the masses of oxygen and nitrogen in the cathode, the continuity equations derived from the law of conservation of mass are

dm_O2,ca/dt = W_O2,in - W_O2,out - W_O2,rct, with W_O2,rct = N M_O2 I/(4F) (12)

dm_N2,ca/dt = W_N2,in - W_N2,out (13)

In addition, since the control of the air flow and the action of the air compressor both involve a series of transmission processes, there is a time constant T_O2, which is given in Table 2.

Anode model
Considering the mass of hydrogen in the anode, the state equation for m_H2,an is

dm_H2,an/dt = W_H2,in - W_H2,out - W_H2,rct, with W_H2,rct = N M_H2 I/(2F) (14)

In addition, the hydrogen flow rate also has a time constant T_H2, which is given in Table 2.

Air compressor
The air compressor model captures the dynamic characteristics of the compressor [29] by means of a rotational model. The air compressor characteristic curve determines the air mass flow rate of the compressor and is shown in Figure 1(a).

Output voltage control of PEMFC
The hydrogen pressure at the hydrogen inlet can be calculated with formula (16) [29]. The proposed controller manipulates the flow rate of hydrogen gas and thus controls the output voltage so as to keep it at the reference value; the overall system structure is shown in Figure 1(b). To control the relevant variables, we elaborate on the influence of the hydrogen flow rate on the output voltage and discuss the control algorithm. The DC/DC converter is excluded from this model, so the output voltage equals the stack voltage. Consequently, it is necessary to study the impact of the hydrogen flow rate and to analyse the algorithm (with the DC/DC converter excluded) in order to control the variables in the PEMFC model.

Online application of EIM-DDPG controller
The EIM-DDPG controller is used to control the hydrogen flow so that the output voltage of the PEMFC attains the reference voltage. The control interval of the controller is 0.1 s. The objective of control is to rigorously maintain the voltage and precisely track the reference value. The concrete control framework is shown in Figure 2. During online application, the state of the PEMFC is input into the actor neural network, which yields the current control quantity according to π(s_t | θ^π). Hence, EIM-DDPG works out the optimal action for the current state of the PEMFC, with the feedback time being shorter than the control interval (0.1 s), thereby realizing real-time control of the PEMFC output voltage during online application.
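To make the online application stage concrete, the following is a minimal sketch of the control loop described above, assuming a trained `actor` network and a `pemfc` simulation object; these names and their interfaces are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical online control loop: the trained actor acts as the controller.
CONTROL_INTERVAL = 0.1  # seconds, as stated in the text

def online_control(actor, pemfc, v_ref, steps):
    """Run the trained actor as a real-time voltage controller (sketch)."""
    state = pemfc.observe()                  # current PEMFC state s_t (assumed API)
    for _ in range(steps):
        action = actor.predict(state)        # a_t = pi(s_t | theta_pi), no exploration noise
        pemfc.set_hydrogen_flow(action)      # regulate the fuel input
        pemfc.advance(CONTROL_INTERVAL)      # let the plant evolve one control interval
        state = pemfc.observe()
    return abs(pemfc.output_voltage() - v_ref)
```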

Action space
The action determines the hydrogen flow, and the real hydrogen flow rate L(t) is calculated from it, where L(t) is the hydrogen flow and L_max is the upper limit of the hydrogen flow.

State space
To capture the status of the PEMFC more comprehensively, three variables are used to form the state in this paper, and together they constitute the state space.

Selection of reward function
The reward function includes three items. The reward item β is set to help the algorithm converge correctly, the term μ1 e²(t) drives the completion of the control objective, and the term μ2 a²(t-1) reduces the fluctuation of the hydrogen flow. Because the controller is data driven and has a comparatively large control interval, the μ2 a²(t-1) item is added to the reward function to limit changes in the hydrogen flow and prevent voltage oscillations. The reward function is

r(t) = -μ1 e²(t) - μ2 a²(t-1) + β
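The sketch below illustrates the MDP interface described in this section. The exact three state variables, the action-to-flow mapping and the coefficients μ1, μ2, β are not given explicitly here, so the choices below (voltage error, stack voltage and load current as the state; a [-1, 1] actor output scaled onto [0, L_max]; unit-order coefficients) are illustrative assumptions only.

```python
import numpy as np

L_MAX = 1.0                        # upper limit of hydrogen flow (placeholder value)
MU1, MU2, BETA = 1.0, 0.1, 0.05    # assumed reward coefficients

def get_state(v_ref, v_out, i_load):
    e = v_ref - v_out                      # control error e(t)
    return np.array([e, v_out, i_load])    # assumed 3-variable state

def scale_action(a_raw):
    # Map an actor output in [-1, 1] onto the physical flow range [0, L_MAX];
    # the paper only states that L(t) is bounded by L_max.
    return 0.5 * (a_raw + 1.0) * L_MAX

def reward(e_t, a_prev):
    # r = -mu1*e^2(t) - mu2*a^2(t-1) + beta: penalize tracking error and
    # hydrogen-flow fluctuation, with beta aiding convergence.
    return -MU1 * e_t**2 - MU2 * a_prev**2 + BETA
```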

Deep reinforcement learning
Reinforcement learning (RL) [18] is a model-free algorithm.
With the aim of maximizing the accumulated long-term reward, RL makes the agent continually learn the optimal actions under different states. The deep reinforcement learning process is usually divided into two stages: the pre-learning stage and the online application stage.

DDPG
DDPG was first proposed by Lillicrap et al. [20]. The parameters of the policy (actor) network are updated according to the gradient in formula (21):

∇_θπ J ≈ E[∇_a Q(s, a | θ^Q)|_{a=π(s)} ∇_θπ π(s | θ^π)] (21)

The parameters of the value (critic) network are updated by minimizing the loss function L(θ^Q) in formula (22):

L(θ^Q) = E[(y_t - Q(s_t, a_t | θ^Q))²], with y_t = r_t + γ Q′(s_{t+1}, π′(s_{t+1} | θ^π′) | θ^Q′) (22)

DDPG imports the gradient information (based on the Q-value function) into the policy update, and implements the policy guided by that information.
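A minimal single-step implementation of updates (21)-(22) is sketched below, assuming `actor`, `critic` and their targets are PyTorch modules and the batch tensors come from a replay buffer; the discount factor and all module names are illustrative.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # assumed discount factor

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, s, a, r, s_next):
    # Critic: minimize L(theta_Q) = E[(y - Q(s, a))^2], formula (22).
    with torch.no_grad():
        y = r + GAMMA * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: follow the deterministic policy gradient, formula (21),
    # by minimizing -Q(s, pi(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```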
Although DDPG can address continuous action spaces, it has the following imperfections: (1) the algorithm falls into local optima owing to Q-value overestimation; (2) its exploration ability is insufficient; (3) the algorithm cannot obtain correct convergence guidance. The low robustness of the algorithm is ascribed to these three problems.

Pre-learning of EIM-DDPG
EIM-DDPG is an extension of DDPG [20]. Three tricks are employed to fix the Q-value overestimation problem encountered by DDPG: (i) clipped multiple Q-learning, (ii) delayed policy updating, and (iii) target policy smoothing regularization, which together lead to better stability and training performance. These tricks are employed in the leader of EIM-DDPG. During exploration, the conventional DDPG algorithm adds noise to the action; however, a single actor network cannot ensure sufficient sample diversity during exploration of the environment. This problem is solved by employing a further three tricks: (iv) a distributed reinforcement learning training framework, (v) classification experience replay, and (vi) an ensemble intelligence exploration policy.
EIM-DDPG adopts a distributed DRL training scheme. It has multiple RL-explorers, multiple HA-explorers, one leader and two public experience pools. The leader consists of three critic networks and one actor network. Each RL-explorer has one actor network with its own network parameters and its own environment. Each HA-explorer has a particular PID or FPID controller and uses a heuristic-algorithm optimizer as a tuner to optimize the coefficients of that controller. The different explorers explore the environment in parallel. First, each RL-explorer generates training samples e_i^RL = (s_t^i, a_t^i, r_t^i, s_{t+1}^i) from its environment, while each HA-explorer applies its controller output (the hydrogen flow) to the environment and generates demonstration samples e_i^HA = (s_t^i, a_t^i, r_t^i, s_{t+1}^i) according to the unified state and reward functions (Figure 3). All the samples are added, in accordance with the classification experience standard, into the two public experience pools. The leader then samples a mini-batch of mixed samples from the experience pools in accordance with the classification experience replay mechanism and learns continually. Finally, the actor network of each RL-explorer refreshes its parameters from the updated actor network of the leader. The EIM-DDPG framework is demonstrated in Figure 3, and the EIM-DDPG flow in Figure 4.
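A structural sketch of one such training iteration follows, run serially for clarity although the paper's explorers run in parallel; `rl_explorers`, `ha_explorers` and `leader` are illustrative objects standing in for the components of Figure 3, not the authors' code.

```python
from collections import deque

pool1, pool2 = deque(maxlen=100_000), deque(maxlen=100_000)  # two public pools

def training_iteration(rl_explorers, ha_explorers, leader, median_reward):
    # 1. Every explorer interacts with its own copy of the environment.
    for ex in rl_explorers + ha_explorers:
        s, a, r, s_next = ex.step()              # (s_t, a_t, r_t, s_{t+1})
        # 2. Classification experience replay: route by reward vs. the median.
        (pool1 if r > median_reward else pool2).append((s, a, r, s_next))
    # 3. The leader learns from a mixed mini-batch drawn from both pools.
    batch = leader.sample_minibatch(pool1, pool2)
    leader.update(batch)
    # 4. RL-explorers copy the leader's refreshed actor parameters.
    for ex in rl_explorers:
        ex.actor.load_state_dict(leader.actor.state_dict())
```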

RL-explorer
The proposed algorithm consists of multiple RL-explorers with different exploration algorithms and HA-explorers with different heuristic algorithms (Figure 3). The RL-explorers use three exploration strategies: the ε-greedy policy, the Gaussian noise strategy and the OU noise strategy.

Six of the explorers are ε-RL-explorers, each guided by the ε-greedy principle: with probability ε a random action is selected, otherwise the action given by the actor network π(s_t) is executed. Another six explorers are OU-RL-explorers, which add temporally correlated Ornstein-Uhlenbeck noise to the actor output, a_t = π(s_t) + N_OU. A further six explorers are Gaussian-RL-explorers, which add zero-mean Gaussian noise, a_t = π(s_t) + N(0, σ).
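The three exploration rules can be sketched as follows; the values of ε, σ and the OU parameters are illustrative, as the paper does not reproduce them here.

```python
import numpy as np

EPS, SIGMA, THETA, DT = 0.1, 0.1, 0.15, 0.1  # assumed exploration parameters

def epsilon_greedy(pi_s, low=-1.0, high=1.0, eps=EPS):
    # With probability eps take a uniformly random action, else the policy's.
    if np.random.rand() < eps:
        return np.random.uniform(low, high)
    return pi_s

def gaussian_noise(pi_s, sigma=SIGMA):
    # a_t = pi(s_t) + N(0, sigma^2)
    return pi_s + np.random.normal(0.0, sigma)

class OUNoise:
    # Ornstein-Uhlenbeck process: temporally correlated noise for smooth actions.
    def __init__(self, theta=THETA, sigma=SIGMA, dt=DT):
        self.theta, self.sigma, self.dt, self.x = theta, sigma, dt, 0.0
    def __call__(self, pi_s):
        self.x += (-self.theta * self.x * self.dt
                   + self.sigma * np.sqrt(self.dt) * np.random.randn())
        return pi_s + self.x
```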

HA-explorer and ensemble intelligence policy
Based on ensemble learning, the ensemble intelligence policy is introduced to improve the exploration efficiency of the algorithm. In ensemble intelligence, optimizers running different algorithms are adopted to solve the same problem and find the optimal solution, and the different samples acquired from the different optimizers enrich the sample diversity available to the leader. Each HA-explorer, whose optimizer tunes its controller according to a specific rationale, produces a solution to the problem from its optimization results and stores the corresponding samples in the public experience pools. The leader can therefore train and update its parameters through random sampling.
A large number of control algorithms have been employed to improve controller performance [30][31][32], and a large number of optimization algorithms can be used to optimize the coefficients of such controllers [33][34][35][36]. The HA-explorers adopt different heuristic algorithms as the tuner for the PID and FPID controllers to optimize their coefficients. The optimization algorithms employed include the genetic algorithm (GA) [37,38], particle swarm optimization (PSO) [39], the bat algorithm (BA) [40], chicken swarm optimization (CSO) [41] and grey wolf optimization (GWO) [42]. The controllers with different coefficients obtained from the different optimization mechanisms provide the leader with a diverse group of samples. HA-explorers are only responsible for sampling and do not receive any information from the leader. One GA-PID-HA-explorer, one GA-FPID-HA-explorer, one PSO-PID-HA-explorer, one PSO-FPID-HA-explorer, one BA-PID-HA-explorer, one BA-FPID-HA-explorer, one CSO-PID-HA-explorer, one CSO-FPID-HA-explorer, one GWO-PID-HA-explorer and one GWO-FPID-HA-explorer are set up in this paper. During the coefficient optimization stage, all the algorithms share the same optimization objective. Before the offline training of EIM-DDPG, each HA-explorer optimizes the coefficients of its PID or FPID controller, which it then uses during offline training, as sketched below.
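The sketch below illustrates one HA-explorer: a PID controller whose gains are tuned before pre-learning. Plain random search stands in for GA/PSO/BA/CSO/GWO here, and the fitness function (integral of squared voltage error) is an assumption, since the paper's exact objective is not reproduced in this extract.

```python
import numpy as np

class PID:
    def __init__(self, kp, ki, kd, dt=0.1):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_e = 0.0, 0.0
    def control(self, e):
        self.integral += e * self.dt
        deriv = (e - self.prev_e) / self.dt
        self.prev_e = e
        return self.kp * e + self.ki * self.integral + self.kd * deriv

def tune(simulate, n_trials=200):
    """simulate(pid) -> integral of squared voltage error over one episode
    (assumed fitness). Random search stands in for the heuristic optimizers."""
    best_gains, best_cost = None, np.inf
    for _ in range(n_trials):
        gains = np.random.uniform(0.0, 10.0, size=3)  # candidate (kp, ki, kd)
        cost = simulate(PID(*gains))
        if cost < best_cost:
            best_gains, best_cost = gains, cost
    return best_gains
```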

Leader
The leader consists of three critic networks, one actor network and two public experience pools. Three techniques are applied to the leader.

Clipped multiple Q learning
In the leader, the current actor network selects the action and the target critic networks evaluate the policy. Clipped multiple Q-learning is adopted by EIM-DDPG to compute the target values:

y = r + γ min_{i=1,2,3} Q′_i(s′, ã | θ^Q′_i)

Taking the minimum over the three target critics curbs the overestimation of the Q value.
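A minimal sketch of this target computation follows, combined with the target policy smoothing regularization described in Section 3.3.3; the discount factor, σ and c are illustrative values, and `target_critics` is assumed to be a list of three PyTorch modules.

```python
import torch

GAMMA, SIGMA, C = 0.99, 0.2, 0.5  # assumed hyperparameters

def clipped_target(target_critics, target_actor, r, s_next):
    with torch.no_grad():
        a_next = target_actor(s_next)
        # eps ~ clip(N(0, sigma), -c, c): smooth the target action (Section 3.3.3).
        noise = (torch.randn_like(a_next) * SIGMA).clamp(-C, C)
        a_next = (a_next + noise).clamp(-1.0, 1.0)
        # Minimum over the three target critics curbs Q-value overestimation.
        q_next = torch.min(
            torch.stack([q(s_next, a_next) for q in target_critics]), dim=0
        ).values
        return r + GAMMA * q_next
```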

Policy delayed updating
In the leader, the actor network is updated once after the critic networks have been updated d times, which ensures that the actor network is updated under a lower Q-value error and thus improves the efficiency of the actor update.

Target policy smoothing regularization
The leader introduces a regularization approach to decrease the deviation of the target values: the target action is perturbed with clipped noise,

ã = π′(s′ | θ^π′) + ε, with ε ~ clip(N(0, σ), -c, c) (31)

and the perturbation is averaged over the mini-batch, which smooths the target.

Public experience pool and classification experience replay

Guided by the ε-greedy idea from Q-learning, the experience samples in EIM-DDPG are kept in two separate pools, classified around the median reward. Whenever a fresh sample is generated, the median reward r_a is recomputed, and the sample's reward is compared with this median value. If the sample's reward is smaller than or equal to the median value r_a, the sample is stored in pool 2.
Otherwise, the sample is stored in pool 1.

FIGURE 5 Training chart
During training, n·ξ samples are drawn from pool 1 (i.e. with probability ξ) and n·(1-ξ) samples are drawn from pool 2 (with probability 1-ξ), so that each mini-batch of n samples mixes high-value and ordinary experience, as sketched below.
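The following is a hedged sketch of the classification experience replay mechanism just described; the value of ξ and the use of a running median are illustrative assumptions.

```python
import random
import statistics

XI = 0.6  # assumed mixing ratio between the two pools

class ClassifiedReplay:
    def __init__(self):
        self.pool1, self.pool2, self.rewards = [], [], []

    def add(self, transition):
        s, a, r, s_next = transition
        self.rewards.append(r)
        r_a = statistics.median(self.rewards)     # running median reward r_a
        # High-reward samples go to pool 1, the rest to pool 2.
        (self.pool1 if r > r_a else self.pool2).append(transition)

    def sample(self, n, xi=XI):
        n1 = min(round(n * xi), len(self.pool1))  # ~n*xi high-value samples
        n2 = min(n - n1, len(self.pool2))         # ~n*(1-xi) remaining samples
        return random.sample(self.pool1, n1) + random.sample(self.pool2, n2)
```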

Training process

1. The critic networks Q_1, Q_2, Q_3, the actor network π, the target networks Q′_1, Q′_2, Q′_3, π′ and the public experience pools are initialized. 2. In every step of every episode, an action a is selected for each RL-explorer; each explorer implements action a to obtain the immediate reward r and the next state s′, thereby producing a training sample, while each HA-explorer generates an expert sample. The samples (s, a, r, s′) are accumulated in the pools according to the classification experience replay guideline. 3. A mini-batch of samples is drawn from the pools according to the replay mechanism. 4. The leader updates the critic network parameters, as shown in formula (22). 5. After the critic network parameters in the leader have been updated d times, the actor network parameters are updated according to the gradient in formula (21). 6. The critic networks in the leader update the target networks. 7. The training repeats until termination, as outlined in the skeleton below.
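The steps above can be tied together in the following skeleton, reusing the illustrative helpers sketched earlier; `leader`, `replay` and the explorer objects are assumptions, and d is the policy-delay period of Section 3.3.2.

```python
D_DELAY = 2  # assumed policy-delay period d

def pretrain(leader, rl_explorers, ha_explorers, replay, episodes, steps):
    for _ in range(episodes):
        for t in range(steps):
            # Step 2: every explorer acts; its transition is classified and stored.
            for ex in rl_explorers + ha_explorers:
                replay.add(ex.step())
            # Steps 3-4: the leader updates its three critics on a mini-batch.
            batch = replay.sample(n=64)
            leader.update_critics(batch)
            # Step 5: delayed actor update; step 6: soft target-network refresh.
            if t % D_DELAY == 0:
                leader.update_actor(batch)
                leader.soft_update_targets()
        # RL-explorers resynchronize with the leader's actor between episodes.
        for ex in rl_explorers:
            ex.sync(leader)
```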

CASE STUDIES
In this Section, pre-learning and online applications are discussed separately. Section 4.1 discusses the pre-learning of the EIM-DDPG controller, and Section 4.2 discusses the online application of the EIM-DDPG controller.
The model parameters applied are displayed in Table 2 [29]. Meanwhile, several RL controllers (an APE-X DDPG controller [27], a TD3 controller [26] and a DDPG controller [20]) as well as several conventional controllers (a PSO fuzzy PID controller [8], a fuzzy PID controller [16], a Fuzzy-FOPID controller [7], a FOPID controller [7], a PSO-PID controller [8], a PID controller [6], an NNC controller [14], a TS-Fuzzy-PID controller [43] and an MPC controller) are adopted for comparison. The PSO fuzzy PID, fuzzy PID, Fuzzy-FOPID and TS-Fuzzy-PID controllers belong to the adaptive control family; the FOPID, PSO-PID and PID controllers belong to the family of PID and its derivatives. The NNC controller is a neural network control algorithm.

Pre-learning
A step load within 0-200 A at different amplitudes is adopted for training, and the training duration of each episode is 10 s. The training chart is illustrated in Figure 5, and the parameters are listed in Table 3. In Figure 5, each curve exhibits the median value of the reward for one algorithm. EIM-DDPG, which adopts classification replay and the ensemble intelligence policy, can learn from numerous high-value samples early in training; hence, it converges faster and reaches a higher reward than the compared algorithms.

Increased/decreased load condition
In the simulations, each run lasts 4 s, and the resulting data are used for analysis. Specifically, under the increased load condition, the current rises from 100 to 120 A at 1 s; under the decreased load condition, the load current drops immediately from 100 to 70 A when the load change occurs at 1 s. The results are illustrated in Figures 6(a-d) and 7(a-d). The stabilization time is the time required for the control error e to settle within 0.01% of the reference voltage, computed as sketched below.
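For concreteness, the two reported metrics can be computed from a simulated voltage trace as follows; the array-based interface and sampling step `dt` are assumptions, while the 0.01% settling band follows the text.

```python
import numpy as np

def settling_time(v, v_ref, dt, band=1e-4):
    """Time after which |v - v_ref|/v_ref stays within the band (0.01%)."""
    err = np.abs(v - v_ref) / v_ref
    outside = np.where(err > band)[0]          # samples still outside the band
    return 0.0 if outside.size == 0 else (outside[-1] + 1) * dt

def overshoot(v, v_ref):
    """Peak deviation from the reference, as a percentage of the reference."""
    return np.max(np.abs(v - v_ref)) / v_ref * 100.0
```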
(1) Comparison between the EIM-DDPG controller and the RL controllers. As shown in Table 4, under the increased load condition, the output voltage stabilization time of the EIM-DDPG controller is 0.03 s and its overshoot is 0.06014%, while the output voltage stabilization times of the other RL controllers are 0.22, 0.21 and 0.2 s (with overshoots of 0.17605%, 0.23511% and 0.24639%, respectively). The stabilization time and overshoot of the EIM-DDPG controller are both lower than those of the other RL controllers; this also holds under the decreased load condition. In addition, according to Figures 6(a) and 7(a), the control curve of the EIM-DDPG controller is smoother and more stable, indicating that its control performance is better than that of any other RL controller. This is because the EIM-DDPG controller operates on the basis of the ensemble intelligence exploration policy and thus receives better guidance from the demonstration samples of the HA-explorers during pre-training, which leads to an optimal control policy. By contrast, the other RL controllers receive no effective guidance and are not fully optimized during training, so their algorithms fall into local optima, leading to sub-optimal control policies. Furthermore, as shown in Figures 6(c) and 7(c), the stable output voltage enables the EIM-DDPG controller to achieve a more stable net power.
(2) Comparison between the EIM-DDPG controller and conventional controllers. As shown in Table 4, under the increased load condition, the output voltage stabilization time of the EIM-DDPG controller is 0.03 s and its overshoot is 0.06014%, while the best stabilization time and overshoot among the conventional controllers are 0.34 s and 0.16249%, respectively. Under the decreased load condition, the stabilization time and overshoot of the EIM-DDPG controller are 0.04 s and 0.05518%, respectively, while the best values among the conventional controllers are 0.05 s and 0.07900%. The stabilization time and overshoot of the EIM-DDPG controller are both lower than those of the conventional controllers, indicating that its control performance is superior. In particular, the MPC algorithm, as a model-based algorithm, has an outstanding response speed and little overshoot because it builds an explicit PEMFC mathematical model; however, since the real PEMFC cannot be represented exactly by the model, its performance is still inferior to that of EIM-DDPG: its stabilization time is slightly longer and its overshoot is larger. According to Figures 6(d) and 7(d), EIM-DDPG also achieves better output power control. As can be seen in Figures 6(b) and 7(d), the adaptive control algorithms based on fuzzy rules (PSO-FPID, FPID, FFOPID and TS-FPID) have poor control accuracy, which causes poor control performance and considerably different behaviour under different disturbances; under the increased load condition, their overshoot is 13.275 times that of the EIM-DDPG controller. The control performances of PID and its derived algorithms (PSO-PID, FOPID and PID) are stable, but their stabilization times are generally long because they fail to adapt to nonlinear systems; under the increased load condition, their minimum stabilization time is 0.37 s, which is 12 times that of the EIM-DDPG controller. The control performance of NNC is stable, but the algorithm relies excessively on sample quality because it has no independent exploration ability (a property of deep reinforcement learning algorithms), and its control policy is still sub-optimal; under the increased load condition, the overshoot of NNC is 3.2355 times that of the EIM-DDPG controller. EIM-DDPG overcomes the defects of the conventional controllers: it has better adaptability, adjusts its control policy according to the state of the PEMFC so as to cope with the nonlinear characteristics of the fuel cell without relying on models, and ultimately attains the best output voltage control performance.

Stochastic load condition
A stochastic load disturbance is introduced to the PEMFC to validate the robustness and performance of the EIM-DDPG controller (the load is shown in Figure 8(c)). The duration of each disturbance is 5 s, and the total running time is 40 s. The outcomes are presented in Figures 8(a-d). The relationship between the reference voltage and the stack current is given in formula (32), and the relationship between the reference voltage and time is shown in Figure 9.
(1) Comparison between the EIM-DDPG controller and the other RL controllers. The adaptability and robustness of the algorithms can be tested under random step loads. As shown in Figure 8(a), the control performance of EIM-DDPG is stable under different step loads, with minimal stabilization time and an overshoot close to 0. Since the exploration ability of the TD3 and DDPG algorithms is low, these two controllers show markedly different performance under different load conditions. Under the load conditions during 1-6 and 11-16 s, DDPG has excellent control performance and its overshoot is close to 0, while under the conditions during 20-25 and 30-35 s, its overshoot rises sharply to 0.298% and 0.15%, respectively. TD3 shows sharply degraded control performance during 20-25 s, when its overshoot rises to 0.27%. Therefore, these two algorithms have low robustness. The EIM-DDPG modification solves the low-robustness problem of DDPG, so the algorithm maintains excellent control performance under every load condition.
(2) Comparison between the EIM-DDPG controller and conventional controllers. It can be seen in Figure 8(b) that the control accuracy of the adaptive control algorithms is relatively low: both Fuzzy-PID and PSO-Fuzzy-PID show fluctuations under different load conditions. Because Fuzzy-FOPID combines the adaptability of its fuzzy control algorithm with the universality of PID, it is the most effective algorithm in the adaptive control group. The stabilization times of PID and its derived algorithms are long, but they show good robustness and similar control performance under different load conditions. The MPC algorithm has excellent robustness and shows similar control performance under every load condition, but because the algorithm relies too heavily on an accurate model, its control accuracy is low despite its faster response speed. The performance of NNC is still inferior to that of the EIM-DDPG controller, but its robustness is greater than that of TD3 and DDPG, indicating that NNC training with a large amount of correct data leads to better robustness than DDPG. EIM-DDPG borrows this training characteristic of NNC: it exploits both the training data gathered through its own exploration and the correct demonstration data generated by a large number of conventional controllers (the HA-explorers), which is the source of its robustness.

CONCLUSION
In this paper, an EIM-DDPG-based controller is proposed that achieves effective control of the PEMFC output voltage through the adjustment of the hydrogen flow. Its control performance is compared against adaptive control algorithms, other deep reinforcement learning algorithms, the derived algorithms of PID, and NNC. The main findings of this study can be summarized as follows. First, EIM-DDPG improves on the DDPG algorithm, being characterized by greater robustness and adaptability. Second, the EIM-DDPG controller has a higher response speed and lower overshoot than the other RL controllers, which translates into superior adaptability and robustness. Third, in comparison with the derived algorithms of PID, EIM-DDPG produces significantly better control performance; in particular, EIM-DDPG outperforms the adaptive control algorithms (TS-Fuzzy-PID, fuzzy-FOPID and PSO-fuzzy-PID), NNC and MPC in minimum response time and output voltage overshoot. Lastly, as revealed by the simulation results, EIM-DDPG effectively solves the low robustness of DDPG and produces control performance superior to the existing advanced algorithms. It is expected that the findings presented in this paper will contribute to further research on the stability of the output voltage of PEMFCs.
APPENDIX [44]

From Assumption 1 it can be seen that Ψ and D_xx are bounded. Without loss of generality, write ‖Ψ‖ ≤ B_Ψ and ‖D_xx‖ ≤ B_xx for positive constants B_Ψ and B_xx. Theorem 1: Let the control input be as in Equation (A3) and let the update laws for the critic and actor networks be as in Equation (A4). Assume that Assumption 1 holds and that the initial state lies in the set over which the NN approximation error is bounded as assumed above. Then, for each iteration step i, the weight error W̃^[i] is uniformly ultimately bounded (UUB).
The EIM-DDPG controller is designed as described above; the complete structure of the scheme is displayed in Figure 3. Based on it, the following theorems can be obtained, where w is a positive constant. This yields an online approach for simultaneously updating the weights of the critic network and the actor network.
Proof: Choose a Lyapunov function candidate Σ (its constituent terms are defined in [44]). As f is locally Lipschitz, there exists B_f > 0 such that ‖f(x)‖ ≤ B_f ‖x‖. Bounding each term of the derivative of Σ in turn, one obtains Σ̇ ≤ 0. Thus, the weight error W̃ is UUB. Remark 2: It is assumed that the compact set in Assumption 1 is larger than the set to which the mode is confined. This can be guaranteed by a mild condition on the initial states, as shown in [44], Section 4.2.1.
This result demonstrates that the closed-loop scheme is stable irrespective of the unknown disturbance d. The next theorem studies the convergence of H(x, û^[i]). Proof: Here, neural network (NN) approximation structures are introduced for J^[i]_x and u^[i]; they are termed the critic NN and the actor NN, respectively. For off-policy learning, these two structures are updated simultaneously using the off-policy Bellman equation. Let the ideal critic network expression be

J^[i](x) = (W^[i]_c)ᵀ σ_c(x) + ε^[i]_c

where W^[i]_c ∈ R^{n1×1} is the ideal weight of the critic network, σ_c ∈ R^{n1×1} is the activation function, and ε^[i]_c is the residual error. Let Ŵ^[i]_c denote the estimate of W^[i]_c, and let Δσ_c = σ_c(x(t)) - σ_c(x(t - T)); the Bellman residual can then be expanded term by term. Similarly, let the ideal action network expression be

u^[i](x) = (W^[i]_a)ᵀ σ_a(x) + ε^[i]_a

where W^[i]_a ∈ R^{n2×m} is the ideal weight of the action network, σ_a ∈ R^{n2×1} is the activation function, and ε^[i]_a is the corresponding residual error (A19).