Trajectory Planning for Multiple Autonomous Vehicles at Short‐Distance Tandem Signalized Intersections Based on Rule‐Free Framework

High-level autonomous vehicles (AVs) offer more possibilities for improving traffic efficiency. The improvement of traffic efficiency for mixed flow at near-saturated short-distance tandem signalized intersections (STSI) needs attention. Most existing studies design a generalized control rule for AVs, ignoring the heterogeneity among different AVs. Herein, a multivehicle trajectory planning framework based on a multiagent reinforcement learning (MRL) algorithm is designed to heuristically explore the optimal traffic efficiency of mixed flow at STSI. The core algorithm of the framework improves on the classical MRL algorithm multi-agent proximal policy optimization using the idea of the virtual group instead of designing control rules. The trajectories planned by the framework show outstanding performance in improving throughputs and reducing emissions at the global system level, compared with natural driving, a classic adaptive cruise control (ACC) model, and a cooperative adaptive cruise control (CACC) model. The framework can be used to explore the optimal traffic efficiency of mixed flow and better heterogeneous rules for high-level AVs.

between upstream and downstream intersections of the STSI under rule-free designs. Based on the framework, we discussed the ability of multiple AVs to improve the traffic efficiency of mixed flow without introducing subjective rules.

Review of Related Work
Many studies have discussed the effect of AVs on traffic efficiency through simulations of various traffic scenes. As early as 1997, Chang and Lai [6] concluded that fully ACC flow and fully CACC flow can increase freeway capacity by 11% and 200%, respectively. This huge difference shows the potential of high-level AVs in improving traffic capacity. Besides fully AV flow, mixed flow under different market penetration rates (MPR) of AVs has been compared. The related traffic scenes include freeways, [7-9] merging areas, [8-11] and urban roads, [12] where it is easy to model the behavior of mixed flow and summarize reasonable rules. The traffic efficiency problems of mixed flow at signalized intersections are complex, and the rules for AVs are difficult to specify consistently. Various related works focus on customizing control strategies and planning optimal trajectories for AVs to compensate for the loss of traffic efficiency caused by signal control. The ideas include trajectory smoothing, priority reservation, [13] coordination rule design, [14] and two- [15-17] and multilayer coordination frameworks that fuse multiple ideas. [18] The influence of CAVs and their applications on signalized intersections has been studied under subdivided aspects of traffic efficiency, and more surveys of related works can be found in refs. [19] and [20].
Overall, in analyzing and optimizing the traffic efficiency of mixed flow at signalized intersections, most studies take one of two approaches: first, selecting a classic model (such as the popular CACC model) or smoothing the trajectory (such as eco-driving [21]) for a single AV; second, independently designing cooperation strategies as microcontrol rules for several AVs, for example, platoons, [6,22] groups, [23] clusters, [24] bubbles, [25] and so on. These approaches ignore the impact of the heterogeneity among AVs, and the findings may not represent the full impact of mixed flow. [26] Several studies have revealed the importance of considering vehicles equipped with heterogeneous strategies when studying mixed flow. Talebpour and Mahmassani [27] proposed a simulation framework to evaluate the impact of different types of AVs on stability and throughputs, which uses different models and assumptions about communication capabilities. By adding an adaptive fundamental diagram to a mesoscopic simulation tool, Fakhrmoosavi et al. [28] observed the impact of heterogeneous AVs on network flow at the global system level. Considering different levels of AVs and cooperation makes the mixed traffic problem more complex. We note that there are no works on trajectory planning in the STSI scene that fully analyze and tap the potential of the heterogeneity among all AVs.
Another challenge [29] of the multi-intersection trajectory planning problem is finding the system-level optimal solution with fast search efficiency. Relatively few studies have discussed multi-intersection planning and decision-making technologies using AVs, and few of them were able to synchronously plan the trajectories of multiple AVs. To simplify the problem, these works usually developed specific coordination rules for multiple AVs or multiple intersections. Guerra and Elefteriadou [30] optimized the trajectories of AVs along an arterial under the assumption of fully automated traffic. The trajectories are adjusted to form a platoon at saturation headway to ensure that vehicles reach the downstream intersection within the green phase. Chalaki and Malikopoulos [15] formulated a scheduling problem and set scheduling goals for each AV; they then proposed a two-layer decentralized coordination framework for multiple adjacent CAVs to minimize energy consumption and improve throughputs. However, coordinating a limited number of vehicles cannot be considered system-level optimization.
Multiagent reinforcement learning (MRL) simulation is a credible instrument for simultaneously planning the trajectories of multiple AVs in mixed flow at STSI. MRL algorithms are model-free, and all agents learn joint strategies rather than executing fixed strategies, which better captures the heterogeneity among AVs. Therefore, discussing mixed flow with different MPR under an MRL framework is meaningful. Most related research has focused on optimizing an isolated intersection and training single AVs using single-agent RL. Single-agent RL simulation has been used for single-vehicle trajectory optimization with the aim of utilizing the potential of high-level AVs and learning rules such as car-following behaviors. Angah et al. [31] designed a method based on single-agent deep reinforcement learning (DRL) to achieve green waves along signalized arterial roads. The fuel consumption of a single vehicle can be reduced by 46% without increasing travel time. Jayawardana and Wu [32] proposed an RL method to learn eco-driving strategies at signalized intersections. The results showed that when the MPR of CAVs reaches 100%, fuel consumption can be reduced by 18%, emissions by 25%, and travel speed can even be increased by 20%. In addition, even a 25% MPR of AVs yielded at least 50% of the total fuel and emission benefits. Extending from single-agent RL to MRL in complex environments faces challenges including sparse rewards, [33] complex action spaces, [34] learning speed, and convergence performance.

Contributions of This Paper
In this study, we designed a rule-free trajectory planning framework for multiple AVs to explore the optimal traffic efficiency of mixed flow at STSI. In more detail, it achieves this through the optimal multiobjective mixed flow distribution between the upstream and downstream intersections of the STSI. The core algorithm of the framework is virtual group (VG)-based multi-agent proximal policy optimization (MAPPO), obtained by improving the classical MRL algorithm MAPPO with the idea of VGs. To focus on the dynamic distribution between the intersections within the system, we used a loop road with traffic lights to simulate the STSI scenario. In terms of algorithm convergence, compared with MAPPO, [35] the rule-free framework shows faster and more stable convergence in complex STSI scenarios with multiple intersections and high MPR. In addition, in terms of the superiority of the trajectory strategies, the strategies optimized by the proposed framework show outstanding performance in improving throughputs and reducing emissions at the global system level under saturated traffic flow, with maxima of 45.3% and 79.4%, respectively, compared with natural driving, a classic linear ACC (LACC) model, [36] and a CACC model. [37] The framework preserves the heterogeneity among AVs through its rule-free design. From the planned trajectories, we can analyze the heterogeneous behaviors of the AVs in detail to summarize better heterogeneous rules. Compared with the various methods of designing rules subjectively, the proposed trajectory planning framework may be more suitable for future high-level mixed flow in a specific traffic scenario.

MAMDP Problem Statement
We assumed that the intersections within the STSI have the same fixed signal cycle and only one straight lane. In addition, the STSI system is equipped with sufficient conditions to collect the real-time states and control the real-time actions of all AVs. The set of intersections is expressed as I = {I_1, I_2, ..., I_K}. The set of vehicles is expressed as V = {HDVs} ∪ {AVs}. The longitudinal control of human-driven vehicles (HDVs) adopts the intelligent driver model (IDM) shown in Equation (1). Each AV is regarded as an agent, and an RL algorithm decides the longitudinal behaviors of the AVs.
Equation (1) takes the standard IDM form, a_IDM = a[1 − (v/v_0)^δ − (d*(v, Δv)/d)^2], where a_IDM is the acceleration of an HDV, a is the maximum acceleration of the HDV, v is the speed, v_0 is the expected speed, δ is the acceleration index, d is the distance from the front vehicle, and d*(v, Δv) is the expected following distance.
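Assuming Equation (1) takes the standard IDM form just described, the HDV controller can be sketched as below. The parameter values (maximum acceleration, comfortable deceleration b, desired speed, time headway T, and minimum gap d0) are illustrative defaults, not the paper's calibrated settings.

```python
import math

def idm_acceleration(v, d, dv, a_max=2.0, b=3.0, v0=15.0, delta=4.0,
                     d0=2.0, T=1.5):
    """Acceleration of an HDV under the Intelligent Driver Model.

    v  : current speed [m/s]
    d  : gap to the front vehicle [m]
    dv : approach rate v - v_front [m/s]
    """
    # Expected following distance d*(v, dv)
    d_star = d0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    # Free-road term minus interaction term
    return a_max * (1.0 - (v / v0) ** delta - (d_star / d) ** 2)
```

With a large gap and zero speed the model accelerates; at the desired speed with a short gap it brakes, reproducing the qualitative car-following behavior the HDMs exhibit in the simulation.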
As shown in Figure 2, a multiagent central trajectory planner is set up to plan the trajectories of all AVs at the global system level. To explore the maximum traffic-efficiency improvement ability of mixed flow, the primary objective of the planning is to maximize the throughputs of all intersections within the STSI system during a cycle, as shown in Equation (2). In addition, taking into account the need for multiobjective traffic-efficiency optimization at STSI, the secondary objective of the planning is to minimize emissions. In Equation (2), Ψ_{I_k} = ∫_{t_s}^{t_f} φ_{I_k}(t) dt is the throughput at intersection I_k over a cycle T = t_f − t_s, where t_f is the end time of the cycle, t_s is the start time of the cycle, and φ_{I_k}(t) is the number of vehicles passing intersection I_k at time step t.
We modeled the planning problem as a multi-agent Markov decision process (MAMDP): {S, A, P, R, γ}. S = {S^n | n ∈ N} is the state space of all agents, s^n_t ∈ S^n is the state of agent n at t, and N is the number of agents. A = {a^n | n ∈ N} is the joint action space of all agents, and a^n_t ∈ a^n is the action of agent n at t. P = {P^n | n ∈ N} → [0, 1] is the state-transition probability matrix; P^n = P{s_{t+1} | s_t, a^n_t} is the probability that the state transitions from s_t to s_{t+1} after agent n performs action a^n_t. R = Σ_t γ r_t is the reward function, and γ = {γ^n | n ∈ N} is the discount factor. The common learning objective of all agents is to learn reasonable joint actions A that acquire the optimal policy π*. The policy can be fitted with a neural network, and the objective can be transformed into finding the optimal parameters θ* of π*_θ, as shown in Equation (3).
where π_θ = (π_{θ^n} | n ∈ N) represents the policies of the N agents, which are fitted by N neural networks with parameters θ = (θ^n | n ∈ N), respectively; r_t(s_t, a_t) is the reward obtained after performing action a_t in state s_t; and a_t = (a^n_t | n ∈ N) is the action set of the N agents at t.
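Given these terms, a plausible reconstruction of the learning objective in Equation (3) is:

```latex
\theta^{*} = \arg\max_{\theta}\,
\mathbb{E}_{a_t \sim \pi_{\theta}}\!\left[\sum_{t} \gamma^{t}\, r_t(s_t, a_t)\right],
\qquad \pi^{*} \approx \pi_{\theta^{*}}
```

This is the standard discounted-return maximization form implied by the surrounding definitions; the paper's exact Equation (3) was lost in extraction.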

Description of Basic-MAPPO Algorithm
The update gradient of the policy for agent n can be expressed as Equation (4). According to the Bellman optimality expectation equation, the state-action-value function Q and the state-value function V are defined in Equations (5) and (6), where the discount factor γ^n expresses that the further a reward lies in the future, the less impact it has. The proximal policy optimization (PPO) algorithm usually performs well on problems with continuous action spaces. PPO achieves higher generalization performance and stronger robustness because its outputs are probability distributions. MAPPO [35] is an extension of PPO to the multiagent environment, which has recently proved effective in solving complex cooperation problems. MAPPO is based on the actor-critic [38] architecture and includes two types of networks: the actor-network approximates the policy π_θ with parameters θ, and the critic-network approximates the state-value function V_φ with parameters φ. The critic-network evaluates the policy π_θ and guides the actor-networks to update.
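From the Bellman optimality expectation equation, a plausible reconstruction of Equations (5) and (6) for agent n, holding the other agents' policies π_{θ^{−n}} fixed, is:

```latex
Q^{n}_{\pi_{\theta^{n}},\,\pi_{\theta^{-n}}}(s^{n}_{t}, a^{n}_{t})
  = \mathbb{E}\!\left[r_{t} + \gamma^{n} \max_{a^{n}_{t+1}}
      Q^{n}_{\pi_{\theta^{n}},\,\pi_{\theta^{-n}}}(s^{n}_{t+1}, a^{n}_{t+1})\right],
\qquad
V^{n}_{\pi_{\theta^{n}},\,\pi_{\theta^{-n}}}(s^{n}_{t})
  = \max_{a^{n}_{t}} Q^{n}_{\pi_{\theta^{n}},\,\pi_{\theta^{-n}}}(s^{n}_{t}, a^{n}_{t})
```

Only the fragment of the V definition survived extraction, so this follows the standard Bellman optimality form rather than the paper's exact notation.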
In MAPPO, each agent has one actor-network and one critic-network. The actor-networks differ among agents, while the critic-networks of all agents share the same parameters. The objective function of the actor-network is shown in Equation (7), and the gradient update rule of each network follows Equation (4).
where ε is a predefined small value that controls the scope of the clip function, and A(·) is the advantage function, which measures the quality of the action at t. Generalized advantage estimation (GAE) [39] is usually used to estimate the advantage Â_t^{GAE(γ,λ)}, as shown in Equations (8) and (9). The objective function of the critic-network is shown in Equation (11). The pseudocode of the MAPPO algorithm is shown in Table 1.
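The clipped surrogate of Equation (7) and the GAE estimator of Equations (8)-(10) can be sketched as follows. This is a minimal NumPy illustration of the standard PPO/GAE formulas, not the paper's network code; the hyperparameter values are conventional defaults.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation.

    rewards: r_t for t = 0..T-1
    values : V(s_t) for t = 0..T (the last entry bootstraps the final state)
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Clipped surrogate: ratio = pi_theta(a|s) / pi_theta_old(a|s)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(np.minimum(ratio * adv, clipped * adv).mean())
```

The clip keeps the new policy from moving too far from the old one in a single update, which is the mechanism behind the stability MAPPO inherits from PPO.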
To distinguish it from the improved algorithm in Section 3, we refer to MAPPO as Basic-MAPPO in this article.

Improved Trajectory Planning Framework Using VG-based MAPPO
In the multiobjective trajectory planning problem at STSI, the primary objective is to maximize the cumulative total throughput at the global system level. The total throughput can only be obtained at the end of a training episode.
In Basic-MAPPO, if the policy is measured by the total throughput, the agents receive delayed rewards. This affects the efficiency of the algorithm in two respects. First, it is difficult for the agents to get immediate rewards associated with increased throughput, especially in the early stages of training; instead, the agents stagnate due to the continuous emission penalty. Second, as described in Figure 1, with a large number of intersections, the algorithm has difficulty precisely evaluating the mixed flow distribution among multiple intersections and ignores the heterogeneity among AVs. To solve these problems, we applied VGs to improve the trajectory planning framework based on Basic-MAPPO. The improved algorithm is named VG-based MAPPO in this article. By introducing VGs into the trajectory planning framework, the evaluation of VGs can replace the evaluation of the throughput of a single intersection; in other words, a delayed reward is transformed into an immediate reward. The VGs we designed do not include specific communication structures or cooperation rules, so as to avoid impairing the heterogeneity among AVs.
We selected an STSI case with three intersections to describe the improved trajectory planning framework in detail. As shown in Figure 3a, the number of VGs equals the number of intersections, and the VGs change dynamically throughout training.

Throughputs Contribution Evaluation
In the improved framework we proposed, the throughput contribution ρ_{i,k}(χ) is the key parameter deciding whether vehicle i belongs to the VG ξ_k related to intersection I_k. For each episode χ in the replay buffer Π, ρ_{i,k}(χ) is designed as Equation (12), where ρ^h_{i,k}(χ) is the throughput contribution of the experiences, designed as Equation (13); ρ^d_{i,k}(χ) is the throughput contribution of the distance, designed as Equation (14); and δ_{i,k}(χ) is the function judging whether vehicle i passes intersection I_k, designed as Equation (15). Here, d^χ_T(x_{I_k}, x_i) is the Euclidean distance between vehicle i and intersection I_k in χ at the end of training, L is the total length of a continuous intersection, and Γ_1(·) and Γ_2(·) are both positive correlation functions.
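Because Γ_1(·) and Γ_2(·) are specified only as positive correlation functions, one hypothetical instantiation of Equations (12)-(15) is sketched below; the weights and the exponential distance decay are our assumptions, not the paper's exact forms.

```python
import math

def throughput_contribution(passed_k, dist_to_k, L, w_h=0.7, w_d=0.3):
    """Hypothetical blend of an experience term and a distance term.

    passed_k : whether vehicle i passed intersection I_k in episode chi
               (the indicator delta_{i,k} of Equation (15))
    dist_to_k: Euclidean distance d_T(x_{I_k}, x_i) at the end of training
    L        : total length of a continuous intersection
    """
    rho_h = 1.0 if passed_k else 0.0   # experience term (Equation (13))
    rho_d = math.exp(-dist_to_k / L)   # distance term, closer -> larger (Equation (14))
    return w_h * rho_h + w_d * rho_d   # weighted combination (Equation (12))
```

Under this sketch, a vehicle that passed I_k or ended the episode near it receives a high contribution, which is the property the grouping logic in Section 3.2 relies on.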
When estimating the throughput contribution ρ̂_{i,k}(χ′) for a new episode χ′, we designed an estimate ρ̂_{i,k}(χ′, t) for each time step and considered all episodes in the replay buffer Π, as shown in Equation (16),
where λ^χ_ρ is the historical discount factor, γ^t is the Markov discount coefficient, and κ_{χ′} is the conversion coefficient.
Moreover, normalization is needed whenever ρ_{i,k}(χ) or ρ̂_{i,k}(χ′, t) is updated, as shown in Equation (17).

Dynamic VG Logic and Rules Design
The set of RCS {v*_k | I_k ∈ I} comprises the speed parameters designed for the VGs {ξ_k | I_k ∈ I}. In various studies on coordinated AV frameworks, introducing speed-related parameters is a popular method, as in ref. [18]. Similar definitions include cruising velocity, [40] recommended speed, [41,42] optimal velocity, [15,43] and so on. These speed parameters have usually been set to the maximum speed limit. We believe that this approach cannot achieve optimization for mixed flow, as the conclusions of ref. [44] show. Therefore, v*_k in our framework records the speed that achieves the maximum throughput.
The number of VGs {ξ_k | I_k ∈ I} is set equal to the number of intersections. We designed the logic and rules of the dynamic VGs during training, as shown in Figure 4. The optimization of VGs is divided into episode updates and time-step updates. At the end of each training episode, the algorithm records the maximum throughput contribution ρ^max_{i,k} of vehicle i related to intersection I_k, calculated from all episodes in the replay buffer Π. v*_k is the average speed of all vehicles in the VG ξ_k when ρ^max_{i,k} is updated. The grouping rules in Figure 4 are given in Equations (18)-(20); they allow one vehicle to belong to two VGs.
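The episode-level grouping step can be sketched minimally, assuming a simple threshold rule in place of the exact conditions of Equations (18)-(20) shown in Figure 4. A vehicle whose contribution exceeds the threshold for two intersections joins both VGs, matching the dual-membership possibility noted above.

```python
def assign_virtual_groups(contributions, threshold=0.5):
    """Assign vehicles to virtual groups from throughput contributions.

    contributions: dict mapping (vehicle_id, intersection_id) -> rho_{i,k}
    Returns: dict mapping intersection_id -> set of vehicle_ids in VG xi_k
    """
    groups = {}
    for (veh, inter), rho in contributions.items():
        groups.setdefault(inter, set())
        if rho >= threshold:  # assumed threshold rule standing in for Eq. (18)-(20)
            groups[inter].add(veh)
    return groups
```

Because membership is recomputed from the contributions at every episode update, the groups change dynamically during training without imposing any fixed communication structure on the AVs.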

Design for Reward
A reward designed according to a monotonic greedy objective such as Equation (2) performs poorly in tasks with a large number of agents. We designed an episode reward function R(χ) and a step reward function r_t(χ, t) to replace the delayed rewards and the immediate rewards, respectively. The episode reward R(χ) is designed as Equation (21), where introducing ρ_{i,k}(χ) represents feedback from the evaluation of the VGs. The step reward r_t(χ, t) is designed mainly around the parameters v*_k, as shown in Equation (24). To encourage exploration, we weighted emissions and average acceleration in the step reward r_t(χ, t),
where ā is the average acceleration, P is the emission, and Γ(·) is a positive correlation function. The instantaneous emission P is defined by Equation (25), [45] where C_D is the aerodynamic drag coefficient, A_f is the frontal area, f_r is the rolling resistance coefficient, M is the vehicle mass, g is the gravitational acceleration, ρ_a is the air mass density, v_wind is the wind speed, δ is the rotational inertia factor (mass factor), and a regenerative braking factor accounts for energy recovered during deceleration. In Equations (21) and (24), Ψ* and v*_k can be manually assigned specific values to achieve a precise traffic distribution, which is equivalent to incorporating expert opinions.
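The listed terms suggest a standard longitudinal power-demand model. The sketch below combines them in the common form (inertial force plus rolling resistance plus aerodynamic drag, multiplied by speed); the exact combination in Equation (25), the regenerative-braking treatment, and the coefficient values are assumptions, not the paper's calibrated model.

```python
def instantaneous_power(v, a, M=1500.0, g=9.81, f_r=0.005, rho_a=1.2,
                        C_D=0.3, A_f=2.2, v_wind=0.0, delta_m=1.05):
    """Approximate instantaneous power demand [W] of a vehicle.

    delta_m scales the inertial force for rotational inertia (mass factor).
    Negative net power (braking) is clipped to zero here as a crude stand-in
    for the regenerative braking factor.
    """
    F_inertia = delta_m * M * a                           # accelerating force
    F_roll = M * g * f_r                                  # rolling resistance
    F_aero = 0.5 * rho_a * C_D * A_f * (v + v_wind) ** 2  # aerodynamic drag
    return max((F_inertia + F_roll + F_aero) * v, 0.0)
```

Summing this power over a cycle gives the emission proxy that the step reward penalizes, so smoother trajectories (smaller |a|, fewer stop-and-go waves) directly reduce the penalty.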

Simulation Experiment Settings
1) We adopted a loop road environment to obtain continuous rewards. The traffic lights were evenly distributed along the circular road. Four different loop road experimental scenarios were established, with the experimental parameters set as shown in Table 2. The environments are in the saturation state under natural driving. [46]

2) To improve the internal road resources of the STSI system, we set a flow control point for each environment to limit the boundary flow. This setting avoids having to consider resources outside the system, such as an infinite access lane. The input flow of each experiment was controlled to the same value; to explore the upper limit of throughput, we set the input flow to saturated traffic. The positions of the flow control points are shown in Figure 5.

3) The states include the spatial positions, speeds {v_i}_{i∈AV}, distances from the preceding vehicle, and signal control state (period and phase) of all AVs in the environment. The actions consist of the acceleration vectors of all AVs in the environment; at each time step, each agent can modify its real-time acceleration {a_i}_{i∈AV} within its limit range. The reward designed in Section 3.3 encourages higher throughputs, lower accelerations, and fewer emissions. The parameters of Equation (25) and their values are described in Table 3.

4) Vehicles were evenly distributed on the loop and accelerated from a standstill at the initial time of training. To eliminate randomness, we took the traffic volume in the second signal cycle as the optimization objective and observed the vehicles' behavior in that cycle. Each intersection on the circular road has the same traffic signal cycle; the signal parameters are shown in Table 4. The yellow light time is included in the green light time.

5) Other simulation parameters are listed in Table 4, and Table 5 shows the hyperparameters of the networks used for training.
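The state described above can be made concrete with a small sketch of how a joint observation might be flattened for the planner; the field names and ordering are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def build_joint_state(positions, speeds, gaps, phase, time_in_cycle):
    """Flatten AV positions, speeds, gaps, and signal state into one vector.

    positions, speeds, gaps: per-AV lists of equal length
    phase        : current signal phase index
    time_in_cycle: seconds elapsed in the current signal cycle
    """
    signal = np.array([float(phase), float(time_in_cycle)])
    return np.concatenate([np.asarray(positions, dtype=float),
                           np.asarray(speeds, dtype=float),
                           np.asarray(gaps, dtype=float),
                           signal])
```

For n AVs this yields a vector of length 3n + 2, which a centralized critic of the kind described in Section 2 could consume directly.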

Results and Discussion
To evaluate the performance of the improved trajectory planning framework for STSI, we discuss the superiority of both algorithm convergence and trajectory strategies. The superiority of algorithm convergence is demonstrated by comparing the average reward curves of VG-based MAPPO and Basic-MAPPO in Section 5.1. In Section 5.2, the superiority of the trajectory strategies is demonstrated by the significant throughput and emission benefits of the proposed framework over multiple benchmarks. Furthermore, to discuss the sensitivity of the proposed framework, we explore the throughput and emission benefits of the trajectory strategies under varying numbers of vehicles and boundary flows in Section 5.3.
Finally, to highlight the significance of considering heterogeneity in improving traffic efficiency at STSI, Section 5.4 analyzes the heterogeneous behavior of AVs using example trajectories obtained from the proposed framework.

Superiority of Algorithm Convergence
To discuss the superiority in algorithm convergence performance, we compared the average reward curves of VG-based MAPPO and Basic-MAPPO across the four scenarios and MPR levels, as shown in Figure 6.

[Table 3: Coefficients of Equation (25) [44], e.g., rolling resistance coefficient f_r = 0.005; the 4-Intersection scenario denotes the STSI scenario with four intersections.]

The advantage is more prominent in the 4-Intersection scenarios, as shown in Figure 6e,f. Basic-MAPPO can no longer converge within 100 iterations when MPR exceeds 40%, while VG-based MAPPO converges stably when MPR is 40%, 60%, and 80%.

Superiority of Trajectory Strategies
To discuss the superiority of the trajectory strategies of the proposed framework, we observed the variation of throughputs and emissions under different trajectory strategies. Throughput is the primary objective of the trajectory planning and emission the subsidiary one, so their multiobjective benefits directly reflect the performance of the framework. All comparisons cover the full MPR range from 0% to 80%. For brevity, the trajectory strategies designed by the proposed framework are denoted "RL-Control." Meanwhile, we selected natural driving (MPR = 0%), a linear ACC (LACC) model, [36] and a classic CACC model [37] as the AVs' following-strategy benchmarks.

Discussion on the Throughputs
Figure 7 shows the throughputs and throughput benefits in the saturation state under RL-Control, CACC, and LACC. Figure 7a-c shows the total throughputs for the STSI, and Figure 7d-g shows the converted throughput benefits of a single intersection.
As MPR and the number of intersections increase, the throughputs in the RL-Control experiments grow nearly linearly (Figure 7a). In the LACC experiments, the throughputs increase significantly only when MPR is 80%, and the number of intersections has essentially no effect on the throughputs (Figure 7b); that is, LACC cannot make good use of the internal road resources of the STSI system to improve throughput. In the CACC experiments, the throughputs are not significantly improved when MPR is less than 60%, and the number of intersections causes fluctuations in the throughputs (Figure 7c); that is, CACC cannot obtain effective communication benefits [47] at medium or low MPR. In general, LACC and CACC have difficulty meeting throughput-growth demands across the full MPR range, while RL-Control increases throughputs more effectively.
Compared with the natural driving scenario (MPR = 0%) (Figure 7d) and LACC (Figure 7e), the throughput benefits under RL-Control increase nearly linearly with increasing MPR and number of intersections, reaching up to 45.3% and 28.3%, respectively. There are almost no throughput benefits under RL-Control in the 1-Intersection experiments, because the flow control points limit the input flow, leaving very limited road resources for AVs to optimize traffic efficiency in the saturation state. Multiple intersections offer more internal road resources for globally optimal planning, and the throughputs under RL-Control increase more fully. It is reasonable to believe that the more intersections and the higher the MPR, the more throughput benefit RL-Control can obtain.
According to Figure 7f, when LACC is used as the benchmark, CACC obtains significant throughput benefits only at 80% MPR. However, compared with the CACC scenario, the throughput benefits under RL-Control are obvious when MPR is less than 80%, reaching 22.8% in the best case (Figure 7g). Therefore, RL-Control can make up for the poor performance of CACC in the medium- and low-MPR experiments.

Discussion on the Emissions
Figure 8 shows the emissions and emission benefits under saturation flow for RL-Control, CACC, and LACC. Figure 8a-c shows the average emissions of a single vehicle for the STSI, and Figure 8d shows the converted emission benefits of a single vehicle. Note that the experimental data in Figures 7 and 8 come from the same experiments with consistent trajectory planning objectives.
Compared with the LACC experiments (Figure 8b), the emissions under RL-Control (Figure 8a) were significantly lower. RL-Control was able to keep emissions within a fixed range, as was CACC (Figure 8c). This indicates that RL-Control can take full advantage of the traffic-efficiency improvement ability of mixed flow to reduce stop-and-go waves while maintaining total throughput.
As shown in Figure 8d, the emission benefits grow as MPR increases and reach a higher level when MPR is greater than 40%. In the experiments with different numbers of intersections, the growth trends of the emission benefits are essentially the same, suggesting that the optimization mechanisms of RL-Control are similar across the experiments.
In addition, the emission benefits under high MPR do not decrease significantly as the number of intersections increases. When MPR is 80%, the benefits for 1-Intersection, 2-Intersection, 3-Intersection, and 4-Intersection are 79.4%, 61.4%, 61.1%, and 52.9%, respectively. In general, increasing throughputs sacrifices emission benefits. Combining Figures 7 and 8 shows that RL-Control can optimize the throughputs of the STSI system in the saturation state without sacrificing too much emission benefit, and the emission benefits can still be maintained at a high level.

Sensitivity Analysis of Trajectory Strategies
The number of experimental vehicles and the boundary flow in Section 5.2 equal those of the natural driving benchmark in the saturation state. Here, we discuss the sensitivity of the trajectory strategies planned by the proposed framework to these two parameters: the number of vehicles and the boundary flow.

Number of Vehicles
To analyze the sensitivity of the proposed framework to the number of vehicles, we conducted experiments in which the number of vehicles was reduced by 20% relative to the natural driving benchmark in Section 5.2, with the boundary flows kept the same as that benchmark.
Figure 9 shows the throughputs and throughput benefits with a 20% reduction in the number of vehicles under RL-Control and LACC. According to Figure 9a, the throughputs under RL-Control increase nearly linearly with MPR and the number of intersections, with growth trends similar to those in Figure 7a. However, the magnitude of the growth in Figure 9a is clearly smaller due to the reduced number of vehicles, which shows that the throughputs under RL-Control are sensitive to the number of vehicles.
Figure 9b shows the throughputs under LACC, and Figure 9c shows the throughput benefits under RL-Control relative to the LACC benchmark. The maximum throughput benefit is 27.1% in Figure 9c versus 28.3% in Figure 7e, which shows that the throughput benefits under RL-Control are not sensitive to the number of vehicles when the benchmark is LACC.
In general, the throughput growth under both LACC and RL-Control is limited by the number of vehicles. Moreover, the improvement of RL-Control over LACC in throughputs is similar across numbers of vehicles, which indirectly shows that the trajectory strategies under RL-Control are sufficiently reliable in expressing the heterogeneity among AVs.
Figure 10 shows the emissions and emission benefits with a 20% reduction in the number of vehicles under RL-Control and LACC. According to Figure 10a, the emissions under RL-Control remain essentially within a fixed range as MPR and the number of intersections increase. Compared with Figure 8a, the range of emission variation in Figure 10a is narrower, which shows that the emissions under RL-Control are sensitive to the number of vehicles.
Figure 10b shows the emissions with a 20% reduction in the number of vehicles under LACC. According to Figure 10c, the maximum emission benefit under RL-Control relative to the LACC benchmark is 48.9%. Comparing Figures 8d and 10c, when MPR lies in [40%, 80%], the emission benefits range over [23.9%, 48.9%] in the 20%-reduced-vehicle state, versus [34.5%, 79.4%] in the saturation state. This shows that the emission benefits under RL-Control are sensitive to the number of vehicles.
Combining Figures 9 and 10, we conclude that reducing the number of vehicles sacrifices more emission benefits under a high MPR than under a low MPR when the primary objective of trajectory planning is to maintain the same throughput benefits as in the saturation state.

Boundary Flow
To analyze the sensitivity of the proposed framework to the boundary flow, we conducted experiments that increased the boundary flow at the flow control point by 10% while keeping the same number of vehicles as the natural driving benchmark in Section 5.2. Increasing the boundary flow can be understood as improving road conditions.
Figure 11 shows the throughputs and throughput benefits with a 10% increase of the boundary flow at the flow control point under RL-Control. Figure 11b shows that the throughput benefit under RL-Control ranges from a minimum of 1.3% to a maximum of 6.5% relative to the natural driving benchmark in Figure 7a. The throughput benefits do not change significantly with MPR and are thus not sensitive to it, indicating that a certain throughput benefit can be obtained by improving road conditions through increased boundary flow.
Figure 12 shows the emissions and emission benefits with increased 10% flow at the flow control point under RL-Control.Comparing Figure 8a and 12a, it can be found that the emission benefits are similar with different MPR at different boundary flow conditions.Figure 12b shows the emission benefits under RL-Control based on the natural driving benchmark shown in Figure 8a.It could be intuitively found that there is a significant decrease in emission benefits at high MPR.It illustrates that when increasing the boundary flow, the control ability in emissions of RL-Control would be weakened under the premise of ensuring the throughputs.
According to all the experiments in Section 5.3.1 and 5.3.2, both reducing the vehicle number of the experiments and increasing the boundary flow are effective methods to improve  the traffic efficiency of the mixed flow STSI system.We can give priority to adopt reducing the vehicle number to improve traffic efficiency.The reason is that the throughput benefits obtained by the method are higher and it is not at the expense of emission benefits.In addition, the benefits under RL-Control with a low MPR are always higher than those with a high MPR.

Analysis of the Heterogeneity
The discussion in Sections 5.1 and 5.2 shows that the proposed framework has obvious superiority in optimizing the mixed-flow assignment of STSI. To fully explain the framework's ability to express heterogeneity in improving traffic efficiency, two trajectory strategies from the 3-Intersection experiments shown in Figure 13 are selected for detailed discussion.
Scenario-2, shown in Figure 13b, exhibits an unbalanced distribution and better traffic efficiency than scenario-1, shown in Figure 13a. The total throughputs of the two scenarios are the same: the per-intersection throughputs are 16, 14, and 16 in scenario-1 and 17, 13, and 16 in scenario-2, respectively. The average emission of a single vehicle is 499 J/veh/0.1 s in scenario-1 and 429 J/veh/0.1 s in scenario-2, so scenario-2 obtains a 14.0% emission benefit relative to scenario-1. Figure 13 shows that mixed flow can improve traffic efficiency at STSI through the expression of heterogeneity, specifically through different cooperation strategies for the AVs. The throughput lost at Intersection-2 in Figure 13b is made up by Intersection-1, and the trajectories in this scenario are smoother.
For more detail, we analyzed the heterogeneous rules among the AVs in Figure 13 through the vehicle speed profiles. Figure 14 shows the first ten vehicles passing through Intersection-2 in each of the two scenarios during the second signal cycle. The reference vehicle is the front car of the selected ten vehicles. In Figure 14a,b, the passing rule of AV1 and AV2 is to guide the following HDVs into a platoon. AV3 in Figure 14a also follows this passing rule. By contrast, AV3 in Figure 14b is more flexible and autonomous in its passing decision: it joins AV2's platoon instead of becoming the leader of a new platoon, which fully demonstrates its heterogeneity compared with the other AVs.
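The benefit percentages quoted above follow the usual relative-improvement convention. A minimal sketch (the helper function name is ours, not the paper's) reproduces the 14.0% emission benefit of scenario-2 from the two per-vehicle emission averages:

```python
def relative_benefit(baseline, value):
    """Percentage benefit of `value` relative to `baseline`.

    For emissions (lower is better) this is the relative reduction;
    throughput benefits use the relative gain analogously.
    """
    return (baseline - value) / baseline * 100.0

# Average per-vehicle emissions from Figure 13 (J/veh/0.1 s).
scenario_1 = 499.0
scenario_2 = 429.0
print(f"{relative_benefit(scenario_1, scenario_2):.1f}%")  # → 14.0%
```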

Conclusions and Future Work
This article proposed an improved trajectory planning framework based on VG-based MAPPO to explore the optimal traffic efficiency of mixed flow at STSI. The framework can optimize the multiobjective mixed-flow distribution between the upstream and downstream intersections of STSI. It can simultaneously plan the trajectories of multiple AVs in the traffic system under rule-free designs, and the heterogeneity among AVs can be fully expressed. We believe the framework is suitable for traffic efficiency optimization of high-level mixed flow at the global system level. The main conclusions are as follows: 1) According to the results of the superiority analysis, the total throughput benefits can be significantly increased by the proposed framework without sacrificing too much emission benefit. In the scenario where the STSI with four intersections is in the saturation state with 80% MPR, the throughputs can be increased by 45.3%, 28.3%, and 22.8% compared with the natural driving, LACC, and CACC benchmarks, respectively. Meanwhile, the emission benefits can be maintained at a high level with a high MPR compared with the LACC benchmark: when the MPR is 80%, the growth rates of the emission benefits are 79.4%, 61.4%, 61.1%, and 52.9%, respectively. The more intersections and the higher the MPR, the more traffic efficiency benefits the framework can obtain. 2) According to the results of the sensitivity analysis, both reducing the number of vehicles and increasing the boundary flow are effective methods to improve the traffic efficiency of the mixed-flow STSI system. Reducing the number of vehicles should be given priority, because the throughput benefits it obtains are higher and do not come at the expense of emission benefits. 3) Through analyzing the optimal trajectories of the experiments, we concluded that heterogeneous rules for AVs yield higher traffic efficiency benefits than consistent rules, matching the scenarios proposed in Figure 1. Based on the proposed framework, better heterogeneous rules for AVs under mixed flow can be explored to obtain traffic efficiency benefits.

Figure 2 .
Figure 2. Trajectories planning at the global system level.

) on τ
Split trajectory τ into chunks and save them to D
end for
for mini-batch k = 1, ..., K do
    Data b ← random mini-batch from D with all agent data

Figure 3 .
Figure 3. Improved trajectory planning framework for STSI: a) the schematic of VG and b) the configuration of the VG-based MAPPO.

Figure 3b shows the configuration of VG-based MAPPO, which applies the VG to ensure that multiple agents are trained quickly and well. In more detail, when vehicle i ∈ V is added to VG-k ξ_k, it means that i contributes more to the throughput optimization at intersection I_k than to any other intersection. The quantitative method for the contribution is given in Section 3.1. At each training time step, the VGs {ξ_k | I_k ∈ I} need to be updated based on the training experiences; the logic and rules for the dynamic VG are given in Section 3.2. We designed a new parameter named "recommended cruising speed" (RCS) v*_k to help the algorithm evaluate the immediate rewards. Vehicles in the same VG-k ξ_k share a common v*_k until the VGs change at the next time step. In Section 3.3, we used v*_k to design the episode reward and the step reward that feed the evaluation of the VG back to MAPPO.
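The VG update and the RCS-based reward described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `contribution(v, k)` stands in for the quantitative method of Section 3.1, and the normalized-deviation form of `step_reward` is an assumed stand-in for the reward actually defined in Section 3.3.

```python
def update_virtual_groups(vehicles, intersections, contribution):
    """Assign each vehicle to the VG of the intersection to which it
    contributes most (placeholder for the Section 3.1 method)."""
    groups = {k: [] for k in intersections}
    for v in vehicles:
        best = max(intersections, key=lambda k: contribution(v, k))
        groups[best].append(v)
    return groups


def step_reward(speed, rcs):
    """Penalize deviation from the VG's shared recommended cruising
    speed v*_k (assumed reward shape, for illustration only)."""
    return -abs(speed - rcs) / rcs


# Dummy contribution scores: vehicle -> per-intersection contribution.
scores = {"veh1": [0.9, 0.1, 0.0], "veh2": [0.2, 0.7, 0.1], "veh3": [0.1, 0.2, 0.7]}
groups = update_virtual_groups(
    ["veh1", "veh2", "veh3"], [0, 1, 2], lambda v, k: scores[v][k]
)
print(groups)  # {0: ['veh1'], 1: ['veh2'], 2: ['veh3']}
```

Because every member of VG-k shares the same v*_k, the step reward pushes the whole group toward one cruising speed until the next VG update reassigns vehicles.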

Figure 4 .
Figure 4. Updating logic for dynamic VG while training.

Figure 7 .
Figure 7. Throughputs and throughput benefits under the saturation flow: a) throughputs under RL-Control, b) throughputs under LACC, c) throughputs under CACC, d) throughput benefits under RL-Control (based on natural driving), e) throughput benefits under RL-Control (based on LACC), f) throughput benefits under CACC (based on LACC), and g) throughput benefits under RL-Control (based on CACC).

Figure 8 .
Figure 8. Emissions and emission benefits under the saturation flow: a) emissions under RL-Control, b) emissions under LACC, c) emissions under CACC, and d) emission benefits under RL-Control (based on LACC).

Figure 9 .
Figure 9. Throughputs and throughput benefits with a 20% reduction in the number of vehicles: a) throughputs under RL-Control, b) throughputs under LACC, and c) throughput benefits under RL-Control (based on LACC).

Figure 10 .
Figure 10. Emissions and emission benefits with a 20% reduction in the number of vehicles: a) emissions under RL-Control, b) emissions under LACC, and c) emission benefits under RL-Control (based on LACC).

Figure 11 .
Figure 11. Throughputs and throughput benefits with a 10% increase of the boundary flow: a) throughputs under RL-Control and b) throughput benefits under RL-Control (based on Figure 7a).

Figure 12 .
Figure 12. Emissions and emission benefits with a 10% increase of the boundary flow: a) emissions under RL-Control and b) emission benefits under RL-Control (based on Figure 8a).

Figure 14 .
Figure 14. The speed profiles of the first ten vehicles at Intersection-2 in Figure 13: a) the speed profile of scenario-1 in Figure 13a and b) the speed profile of scenario-2 in Figure 13b.
Algorithm 1 Base-MAPPO
Initialize actor network π_θ and critic network V_ϕ
Set learning rate γ
for episode do
    Set data buffer D = {}
    for i = 1 to batch_size do
        τ = [] (empty list)
        Execute actions a_t; observe r_t, S_{t+1}
        τ += [S_t, a_t, r_t, S_{t+1}]
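The data-collection loop of Algorithm 1 can be sketched in a few lines of Python. This is a minimal skeleton of the rollout-chunk-sample pattern only, under stated assumptions: `env_step` and `policy` are hypothetical callables standing in for the simulation environment and the MAPPO actor, and the mini-batch size of 4 chunks is arbitrary.

```python
import random


def split_into_chunks(trajectory, chunk_len):
    """Split a trajectory [(s, a, r, s'), ...] into fixed-length chunks,
    mirroring the 'split trajectory τ into chunks and save them to D'
    step of Algorithm 1."""
    return [trajectory[i:i + chunk_len]
            for i in range(0, len(trajectory), chunk_len)]


def collect_and_batch(env_step, policy, batch_size, horizon,
                      chunk_len, num_minibatches, seed=0):
    """Skeleton of Algorithm 1's data loop: roll out `batch_size`
    trajectories, chunk them into buffer D, then draw K random
    mini-batches from D."""
    rng = random.Random(seed)
    D = []
    for _ in range(batch_size):
        tau, s = [], 0
        for _ in range(horizon):
            a = policy(s)
            s_next, r = env_step(s, a)
            tau.append((s, a, r, s_next))
            s = s_next
        D.extend(split_into_chunks(tau, chunk_len))
    # for mini-batch k = 1..K: sample chunks with all agent data
    return [rng.sample(D, min(len(D), 4)) for _ in range(num_minibatches)]
```

In the full algorithm each sampled mini-batch would then drive one PPO update of π_θ and V_ϕ; that optimization step is omitted here.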