TMPF: A Two‐Stage Merging Planning Framework for Dense Traffic

Planning for autonomous vehicles to merge into high-density traffic flows within a limited mileage is quite challenging. Specifically, the driving trajectory will inevitably intersect with those of other vehicles whose driving intentions cannot be directly observed. Herein, a two-stage algorithm framework, decomposed into longitudinal and lateral planning processes, is proposed for online merging planning. An improved particle filter is used to estimate the driving models of surrounding vehicles in order to predict their future driving intentions. Based on Monte Carlo tree search (MCTS), different action spaces are evaluated for longitudinal merging gap selection and lateral interactive merging operation, while heuristic pruning is used to reduce the computation cost. Moreover, coefficients related to driving styles are introduced, and their influences on merging performance are analyzed. Finally, the proposed algorithm is implemented in a two-lane simulation environment. The results show that the proposed framework outperforms other baseline methods.


Introduction
Merging in dense traffic is quite a challenging task for an autonomous vehicle (AV), such as in the scenarios shown in Figure 1. [1] However, many merging algorithms are designed solely from the perspective of the AV and often ignore the interactions with surrounding vehicles (SVs) in the environment. [2] This often prevents them from actively creating more opportunities to change lanes, which eventually leads to longer merging times or driving distances. To implement the lane merging operation, especially in dense traffic flow, estimating the future intentions of SVs is thus necessary. [3] If no merging gap is directly available, the AV should actively seek opportunities for merging and execute the merging operation. [4] In general, the main challenge in this kind of problem lies in how to evaluate the most suitable merging gap and how to perform a safe merging operation by interacting with the vehicles in the target lane, whose future trajectories may intersect with that of the AV. [5] Notably, how to ensure an efficient and stable merging process has also become an urgent issue to be solved.
For the lane change decision-making problem of the AV, mandatory lane change (MLC) was among the first classic strategies adopted. An MLC point is set according to the traffic density before changing lanes. Once the AV passes the MLC point, the lane change action is executed immediately. [6] However, the lane change performance of this method is greatly limited by the selection of MLC points. Some attempts considering both the success rate and traffic efficiency of lane change have been made. Cao et al. proposed an adaptive lane change strategy based on MLC, in which the currently most suitable lane change gap is selected and the best gap is eventually evaluated based on lane utility. [7] However, since these earlier methods rely heavily on lane change conditions already existing in the environment, it is difficult for the AV to create lane change opportunities to realize merging in heavy traffic flow. Alternatively, based on game theory and related theories, more progress has been made in the research of interactive decision-making strategies. Multiagent reinforcement learning and multiobjective optimization have been used to solve on-ramp merging problems. [8,9] Under the assumption of rational decision-making, the optimal decisions of all participants in a multiagent game can be determined by the Nash equilibrium strategy. [10] However, in scenarios mixing AVs and human-driven vehicles (HDVs), realistic strategies often deviate from the equilibrium strategy due to the different cognitive abilities and personalities of human drivers. [11] To address this, a game-theoretic planning framework integrated with an iterative reasoning model has been utilized to reflect human driver characteristics. The SVs' intelligence levels and rationalities are modeled as latent states, and on this basis, the lane changing process can be realized actively and safely.
[12] Despite considerable advances in merging gap selection and interactive decision-making strategies, immense obstacles remain in the observability of the driving intentions and driving styles of SVs. Without tackling this difficulty, it would still be hard for the AV to actively perform merging tasks, especially in relatively dense traffic scenarios.
Consequently, the framework of the partially observable Markov decision process (POMDP) has been utilized for modeling the uncertainty of observation, state transition, and interaction. [13] Although it is difficult to find the optimal solution of a POMDP due to the curse of dimensionality and the curse of history, [14] an approximately optimal solution can be evaluated in practice. In particular, offline methods have been implemented for dealing with complex scenarios. For example, Ma et al. used a graph neural network (GNN) to infer the intentions of traffic participants and then implemented the proximal policy optimization (PPO) algorithm to calculate behavior decisions. [15] Bouton et al. updated the belief state based on a recurrent neural network (RNN) and then utilized a deep Q-network (DQN) to ensure probabilistic security. [16] The concept of scene decomposition has been proposed to reduce the cost of offline training. For instance, a recursive Bayesian filter algorithm is used to update the belief and infer the cooperation degree of SVs, and then a DQN is used for the decision-making of ramp merging. [1] Li et al. evaluated the driving styles of SVs from a data set by a logistic regression classifier, and in their work the hidden variables in decision-making problems are processed based on the principles of maximum likelihood estimation and supervised learning. [17] However, this kind of method requires deep learning training, which is limited by the data set. Its generalization ability also needs to be improved; otherwise, the interactions between vehicles cannot be learned and verified from actually collected data sets. [18] On the other hand, online methods have also been widely deployed recently. Researchers have used an improved Monte Carlo tree search (MCTS) method, called the determinized sparse partially observable tree (DESPOT), for the scenario where the AV drives through a crowd.
[19,20] Asynchronous backtracking (ABT) has also been used to solve the problem of turning at intersections in automated driving. [21][22][23][24] Generally, an online planning algorithm simulates the environment forward to a certain horizon and then selects the behavior with the maximum return. However, the computational efficiency of online methods remains to be improved.
Based on the above analysis, this article presents a two-stage merging planning framework (TMPF) to solve the problem of merging within limited driving distances. The main contributions of our work include the following. 1) By decomposing a merging task into two subtasks, the planning framework consists of longitudinal and lateral processes, which are respectively responsible for selecting a merging gap appropriately and executing the merging action actively. In particular, interactive merging planning is adopted in the lateral process to actively contend for the right to merge into the target lane and eventually realize smooth merging; and 2) Semantic-based heuristic pruning is incorporated into the MCTS to reduce computational loads while ensuring convergence. Also, a double-weight particle filter is applied to estimate the SVs' driving intentions based on their driving models.
The remainder of this article is organized as follows. The problem formulation is presented in Section 2. Section 3 depicts the modeling of the POMDP framework and analyzes the merging process. The TMPF is explained in Section 4, where the application of the particle filter and the search strategies of MCTS are developed. Experimental verification of our proposal is presented in Section 5, followed by the conclusion in Section 6.

Problem Statement
The TMPF proposed in this article aims to accomplish the following merging task: the AV has to merge into the other lane, while the SVs' driving intentions or styles are not available and there is no sufficient merging gap that can be used directly. Accordingly, the AV must take into account the uncertainties existing in the environment. Here, certain assumptions need to be specified: 1) The perception range is limited to 50 m around the AV, within which the nearest vehicles can be accurately observed by sensors, and occluded vehicles will be sensed by Vehicle-to-Everything (V2X); 2) All SVs have their own driving styles, which can be formulated by driving models and initialized randomly; and 3) Gaussian noise is considered in the driving model of each vehicle to represent the randomness of its behavior. In this article, the merging task is modeled as a POMDP due to the unobservability of the SVs' driving models and the unpredictability of their future behaviors.

POMDP Modeling
This section presents the process of modeling a merging task which can be formulated as a POMDP problem.
A POMDP problem can be defined as a tuple (S, A, O, T, R, γ), where S is the state space, s ∈ S is the state of the environment, A is the action space, a ∈ A is the action taken by the agent, O is the observation space, o ∈ O refers to the observation of the agent, T is the state transition model, R is the reward function, and γ is a discount coefficient. At each time t, the goal of the agent is to find a policy that maximizes the expected future cumulative return at time t, which can be represented by

G_t = E[ Σ_{k=0}^{∞} γ^k r_{t+k} ]  (1)

where r_{t+k} denotes the reward received k steps after time t.
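As a concrete illustration of this objective, the discounted return can be accumulated backward over a reward sequence (a minimal Python sketch; the reward values below are purely illustrative):

```python
def discounted_return(rewards, gamma=0.95):
    """Backward accumulation of G_t = sum_k gamma^k * r_{t+k}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```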

State Space
The state space describes the driving states of all vehicles within the perception range. As shown in Figure 2, the green vehicle is the AV, and the rest are SVs. The goal of the AV is to merge into the lane located to its right. The states in S at time t can be defined as

S_t = (q_t^e, {q_t^1, q_t^2, ···, q_t^8}, {CPs})  (2)

where q_t^e denotes the state of the AV, {q_t^1, ···, q_t^8} denotes the states of the SVs, and {CPs} denotes the critical points in the adjacent, front, and rear merging gaps in the target lane. Projections of these points generate three merging zones in the current lane. Through {CPs}, the merging task is transformed from a continuous-time problem into a discrete-time one. [25] The structure of a gap is illustrated in Figure 3. There are two critical points, one for the front car and one for the rear car: Critical point F and Critical point B. The space between the two critical points is a merging window in which the AV can execute a merging action. The critical point setting is based on the Gipps car-following model. [26] The merging window is determined such that the AV, driving at its current speed, keeps the front safe distance (FSD) and behind safe distance (BSD) from the front-side vehicle and the rear-side vehicle in the target lane, respectively. Since the Gipps model is too conservative, the evaluation of the critical points is discounted by an attenuation coefficient α. A gain factor β is also adopted according to the deviation err_est between the estimated and observed accelerations of the SVs. Both α and β are adjustable, and their influences on the overall merging performance will be discussed in Section 6. The safe distances SD_Gipps and SD_CP are evaluated based on the Gipps model and the critical points, respectively, as shown below.

SD_Gipps = max(v_behind Δt + v_behind^2 / (2 b_max^behind) − v_front^2 / (2 b_max^front), PSD)  (3)

SD_CP = max(α β SD_Gipps, d_safe)  (4)

where v_behind and v_front denote the speeds of the rear and front vehicles respectively, b_max^behind and b_max^front represent the maximum braking decelerations of the rear and front vehicles respectively, Δt denotes the car-following time interval, PSD denotes the minimum safe distance set as the lower limit of the car-following distance, α specifies the attenuation coefficient, and β represents the gain factor. The following distance calculated by the Gipps model is multiplied by the attenuation coefficient α and the gain factor β to evaluate the safe distance SD_CP, which is used as the basis for determining the locations of the critical points. To prevent the following distance from becoming too small due to the attenuation, a minimum safety distance d_safe is generally preset as its lower limit. The default values for the state space are shown in Table 1.
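The critical-point placement described above can be sketched as follows; the exact Gipps expression is reconstructed from the variables defined in the text, so the formula inside gipps_safe_distance should be read as an assumption rather than the paper's verbatim equation:

```python
def gipps_safe_distance(v_behind, v_front, b_behind_max, b_front_max, dt, psd):
    """Gipps-style safe following distance: reaction distance plus the
    difference of braking distances, floored at the minimum distance PSD.
    (A sketch consistent with the variables in the text.)"""
    sd = v_behind * dt + v_behind ** 2 / (2 * b_behind_max) \
        - v_front ** 2 / (2 * b_front_max)
    return max(sd, psd)

def critical_point_distance(sd_gipps, alpha, beta, d_safe):
    """Attenuated and compensated distance used to place the critical points."""
    return max(alpha * beta * sd_gipps, d_safe)

# Equal speeds and braking abilities: only the reaction term remains.
sd = gipps_safe_distance(10.0, 10.0, 4.0, 4.0, dt=0.75, psd=2.0)
print(critical_point_distance(sd, alpha=0.8, beta=1.1, d_safe=2.0))
```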
In Equation (2), q_t^i = (x_t^i, y_t^i, v_t^i, tendency_t^i, model_t^i, particles_t^i) respectively denotes the position, speed, driving tendency, and actual driving model of the i-th SV, together with a particle set used to estimate the actual driving model. The driving tendency judges whether the SV shows a tendency to change lanes by observing the vehicle's heading angle. For example, if a vehicle is close to the right lane line and its heading points toward the right, its current driving tendency is to change to the right lane. The driving tendency and the driving intention need to be distinguished: the former can be obtained from observed information, while the latter cannot be observed directly.
In Equation (2), {q_t^1, q_t^2, ···, q_t^8} indicates the six neighbor vehicles around the AV and two farther vehicles in the target lane, as shown in Figure 2. These eight vehicles are the most important because they will interact directly with the AV. In contrast, other unperceived vehicles, which are regarded as part of the environment, are not treated here.

Observation Space
The sensing devices or V2X can only provide the physical information of vehicles. Therefore, the driving intentions of SVs, such as future longitudinal speeds and lane change behaviors, cannot be directly observed. In this article, sensor noise is not considered; instead, we focus on the driving intention uncertainty caused by the unobservable driving models. (x_t^i, y_t^i, v_t^i, tendency_t^i, particles_t^i) represents the observation of an SV, and only the position, speed, driving tendency, and the estimated driving model particle set will be used as the input of the planner.

Action Space
The AV controls its motion by changing its acceleration and lane change intention. Actions in the action space A can be defined as

a = (acc, y_lc ≡ 0), in the longitudinal process
a = (acc, y_lc, intention_fake, intention_abortion), in the lateral process  (5)

where acc denotes the acceleration, y_lc denotes the lateral lane change displacement, and intention_fake and intention_abortion are Boolean flags. When intention_fake is true, the AV approaches the lane line to express its merging intention without actually crossing it; this action will be called fake merging in the following sections. Otherwise, the AV will perform a normal lane-changing action and drive across the lane line. During the merging process, if an emergency occurs that causes the AV to give up the lane change halfway, intention_abortion will be set to true.

Reward Function
Reward functions are designed for the longitudinal and lateral processes separately; all parameters are shown in Table 2. The reward function for the longitudinal process can be evaluated by

R_1 = R_11 + R_12 + R_13 + R_14  (6)

In Equation (6), R_11 is a negative reward for punishing the deviation between the speed of the AV and the desired speed, and R_12 is a collision penalty: if a collision occurs, a −100 penalty will be given to R_12. A result reward, R_13 = R_131 + a_132 R_132, will be activated if the AV successfully enters the target merging zone. R_131 is set to 10 and R_132 is the length of the merging zone that the AV achieves. Note that, if there is no front vehicle or rear vehicle in the target lane, the merging zone tends to be infinite due to the absence of one of the critical points. In order to limit the impact of this item on the cumulative return, the maximum of R_132 is set to 20. R_14 is a guiding item to encourage the continuity of decision results: if the current decision continues the intention of the previous planning step, R_14 is set to 2; otherwise, a punishment of −1 is attached. The reward function for the lateral process can be calculated by

R_2 = R_21 + R_22 + R_23 + R_24  (7)

where R_21 is consistent with R_11 in the longitudinal reward function to punish the deviation from the expected speed. In Equation (7), R_22 = a_221 R_221 + a_222 R_222 is a result reward of the lateral process. When the AV enters the target lane, the distance from the front vehicle is R_221 and the distance from the rear vehicle is R_222. The merging time can be adjusted to adapt to specified safety distances after merging. As in the longitudinal process, in order to avoid an oversized influence on the cumulative return, the maxima of R_221 and R_222 are set to 20. An immediate reward R_23 will be given according to the subsequent state: if the AV successfully enters the target lane, R_23 is set to 20; if there is a collision, a penalty of −100 is given to R_23. R_24 is a reward for the behavior.
If the AV starts to perform the lane change action, R_24 is set to 4. If the merging is abandoned during the merging process, a penalty of −2 is given to R_24. During the lane change process, when the AV attempts fake merging, R_24 is set to 2; if it continues to cross the lane line, R_24 is set to 5.
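The lateral reward terms above can be combined as in the following sketch; the weighting coefficients a21, a221, and a222 are illustrative placeholders for the values in Table 2:

```python
def lateral_reward(speed_dev, d_front, d_rear, entered, collided, behavior,
                   a21=-1.0, a221=0.5, a222=0.5):
    """Sketch of the lateral-process reward R_2 = R_21 + R_22 + R_23 + R_24."""
    r21 = a21 * abs(speed_dev)                               # speed-deviation penalty
    r22 = a221 * min(d_front, 20) + a222 * min(d_rear, 20)   # result reward, capped at 20
    r23 = -100 if collided else (20 if entered else 0)       # immediate outcome reward
    # Behavior reward: start lane change / abort / fake merge / cross the line.
    r24 = {"start": 4, "abort": -2, "fake": 2, "cross": 5}.get(behavior, 0)
    return r21 + r22 + r23 + r24
```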

State Transition Model
The action evaluated by the planner determines the desired position, posture, and speed of the AV for the next step:

x*_{t+1} = x_t + v_t Δt + (1/2) acc Δt^2,  v*_{t+1} = v_t + acc Δt,  y*_{t+1} = y_t + y_lc y*  (8)

where w denotes the lane width, chosen as 4 m, and y* = w/3 is the lateral step size. y* = w/3 implies that the whole lane change process needs three steps in space, but it may take more than three steps in time due to fake merging. Δt is chosen as 0.75 s in this article, which is determined to balance the accuracy and the computational efficiency of planning. If the decision step size is too small, more decisions need to be made within the same time, which is unnecessary for the upper-level intention decision. If the decision step size is too long, the decision algorithm may not be able to respond in time to changes in the surrounding environment, thus reducing the safety and success rate of automated driving decisions.
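The kinematic update over one planning step can be sketched as follows (a simplified transition under the stated values w = 4 m and Δt = 0.75 s):

```python
W = 4.0         # lane width [m]
DT = 0.75       # planning step [s]
Y_STEP = W / 3  # lateral displacement per merging step, y* = w/3

def transition(x, y, v, acc, y_lc):
    """Desired AV state after one planning step: constant-acceleration
    longitudinal update plus a one-third-lane lateral shift per step."""
    x_next = x + v * DT + 0.5 * acc * DT ** 2
    v_next = v + acc * DT
    y_next = y + y_lc * Y_STEP  # y_lc in {-1, 0, +1}
    return x_next, y_next, v_next
```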

Approach
After the POMDP problem is formulated, a particle filter is used to estimate the unobservable parameters, and a double-layer MCTS is adopted to realize the planning, as shown in Figure 4. The overall framework of the TMPF is illustrated in Figure 5.

Traffic Simulation
The driving strategy of the environmental vehicles is designed based on the intelligent driver model (IDM) [27] and the minimizing overall braking induced by lane changes (MOBIL) [28] model. The IDM car-following model is used for the longitudinal acceleration decision, which can be defined as

a_IDM = a [1 − (v_t / v_0)^4 − (g* / g)^2],  g* = g_0 + v_t T + v_t Δv / (2 √(ab))  (9)

where v_t is the current velocity, a is the maximum acceleration, b is the desired deceleration, v_0 is the desired speed, T is the desired time gap, g_0 is the minimum distance from the front car, g is the current gap, and Δv is the speed difference to the front car. The MOBIL lane change criterion is

a'_e − a_e + p (a'_n − a_n + a'_o − a_o) > Δa_th  (10)

where a' and a represent the accelerations after and before lane changing, subscript e denotes the ego vehicle, subscripts n and o denote the rear vehicles after and before the lane change, and p specifies the aggressiveness degree. If the inequality holds, the lane-changing behavior is allowed. However, whether this lane change will actually be performed depends on whether the behavior produces a posterior state closer to the desired speed. The simulation parameters in this article are listed in Table 3. The driving model of the AV is consistent with that of the timid vehicle.
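Both driving models can be sketched directly from the equations above (standard IDM and MOBIL forms; the parameter values in any call are illustrative):

```python
import math

def idm_acceleration(v, dv, gap, v0, T, g0, a, b):
    """IDM acceleration: a * (1 - (v/v0)^4 - (g*/gap)^2),
    with desired gap g* = g0 + v*T + v*dv / (2*sqrt(a*b))."""
    g_star = g0 + v * T + v * dv / (2 * math.sqrt(a * b))
    return a * (1 - (v / v0) ** 4 - (g_star / gap) ** 2)

def mobil_ok(ae_new, ae_old, an_new, an_old, ao_new, ao_old, p, da_th):
    """MOBIL criterion: ego gain plus politeness-weighted follower
    effects must exceed the switching threshold da_th."""
    return (ae_new - ae_old) + p * ((an_new - an_old) + (ao_new - ao_old)) > da_th
```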

Particle Filter
Differently parameterized driving models will produce different behavior decisions even under the same surrounding environment. However, the model parameters can't be directly perceived by sensors. Therefore, the AV needs to estimate the driving intentions of neighbor vehicles to improve the accuracy and efficiency of traffic deduction during the planning process. Otherwise, if we only assume that all SVs follow the same driving model, an approximately optimal policy cannot be realized.
Generally, traditional particle filters treat all parameters with the same weight update model. [2] Here, the parameters of the driving model are estimated by a particle filter algorithm, and the driving strategy of each vehicle is estimated independently. For the i-th vehicle in the environment, there is a particle set consisting of 500 groups of driving model parameters:

P^i = {(θ_k^i, W_k^i) | k = 1, 2, ···, 500}  (11)

where θ_k^i denotes the k-th particle of the i-th vehicle and W_k^i indicates the weight of the particle, which describes the belief of the particle.
Since each particle represents a different driving style, the vehicle will behave differently in the current environment. For the i-th environmental vehicle, the results generated by the k-th particle are acc_k^i and y_lck^i, and the weight represents the confidence of the driving model represented by the particle relative to the real driving model. After calculating the results of each particle and observing the actual behavior of the environmental vehicle, acc^i and y_lc^i, the weight of each particle is updated according to the estimated and observed values. In this work, the driving model parameters are decomposed into longitudinal-related and lateral-related ones according to their functionalities so as to avoid mutual influences. Specifically, based on the SIR particle filter algorithm, we make an improvement with a laterally and longitudinally decoupled model with double-weight updating. According to their functions, the parameters of the driving model presented in Table 3 are decomposed into longitudinal driving model parameters θ_lon = {v_0, T, g_0, a, b} and lateral driving model parameters θ_lat = {p, Δa_th, σ_vel}. These two groups of parameters have their own independent weights W_lon and W_lat, and their corresponding particle sets, each composed of 500 particles, can be represented by

P_lon^i = {(θ_lon,k^i, W_lon,k^i) | k = 1, ···, 500},  P_lat^i = {(θ_lat,k^i, W_lat,k^i) | k = 1, ···, 500}  (12)

where θ_lon,k^i and θ_lat,k^i represent the k-th particles of the i-th vehicle, and their weights W_lon,k^i and W_lat,k^i are defined independently. For every particle, the longitudinal and lateral parameters are not strongly correlated; hence, groups of parameters temporarily formed based on the serial numbers in the particle set are utilized for estimation. In this way, an 8D problem is reduced to a 5D and a 3D problem, reducing the computational difficulty of the particle filter. At the same time, mutual interference between the two groups of parameters is avoided.
The weights are updated as follows: the longitudinal weight W_lon of a particle is updated according to the deviation between its estimated acceleration and the observed acceleration, while the lateral weight W_lat is multiplied by a penalty factor Θ whenever the estimated lane change behavior disagrees with the observed one, where Θ is an empirical value of 0.2 used to punish incorrect lane change estimates. After updating the weights of each group of particles in the whole particle set, the whole particle set is resampled based on the new weights, and the new 500 groups of longitudinal and lateral parameter particles are combined again according to their sequence numbers. In order to avoid particle deprivation and ensure the diversity of particles, 50 groups of particles are reinitialized every 10 steps. [29] The particles of each vehicle are averaged according to their weights to obtain an equivalent particle as the estimated driving model, which is used in the following planning process. The real behaviors and the behaviors estimated by the equivalent particles of the SVs are shown in Figure 6. After about ten update steps, the estimated actions fit the real driving models well. In order to reduce the interference of inaccurate parameter estimation on planning, the AV will keep its current lane for five steps for a preliminary estimation of the SVs once it enters a new lane.
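The double-weight update and resampling can be sketched as follows; the Gaussian likelihood and its width sigma are assumptions made for illustration, while the penalty factor theta = 0.2 follows the text:

```python
import math
import random

def update_weights(particles, est_acc, obs_acc, est_lc, obs_lc,
                   sigma=0.5, theta=0.2):
    """Double-weight update: longitudinal weights follow an (assumed)
    Gaussian likelihood of the acceleration error; lateral weights are
    multiplied by theta when the predicted lane change is wrong."""
    for p, ea, el in zip(particles, est_acc, est_lc):
        p["w_lon"] *= math.exp(-(ea - obs_acc) ** 2 / (2 * sigma ** 2))
        p["w_lat"] *= 1.0 if el == obs_lc else theta
    for key in ("w_lon", "w_lat"):  # normalize each weight group
        total = sum(p[key] for p in particles)
        for p in particles:
            p[key] /= total

def resample(particles, key):
    """Weighted resampling of the particle set by one weight group."""
    weights = [p[key] for p in particles]
    return random.choices(particles, weights=weights, k=len(particles))
```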

Longitudinal Planning Process
As shown in Figure 2, the TMPF will find the most promising merging gap among the adjacent, front, and rear gaps in the longitudinal planning process. Then the corresponding acceleration will be executed to enter the target merging zone. The observation of the current state serves as the input of MCTS as the root state node. Three driving intentions {a_1, a_2, a_3} are defined to map to the three different merging gaps, where the executed acceleration is bounded by a_rule, the regulatory speed limit, and a_approach is a fine-grained acceleration chosen from {−1, 0, 1} so that the AV can enter the target gap smoothly. If any of the three gaps does not exist, the corresponding branch will be pruned, as shown in Figure 7a. In TMPF, we do not explore a_1 under the branch of a_3, or vice versa. Although MCTS outputs the acceleration, its practical purpose is the intention selection of merging gaps.
The alternation of decision trajectories between a_1 and a_3 indicates that the decision intention of the AV oscillates strongly over time. This implies that the actual acceleration oscillates rapidly, and it is unnecessary to consider this phenomenon; therefore, the trajectories related to it are pruned in our TMPF. [30] The rollout of MCTS will continue the deduction of the leaf node intention to quickly initialize the value of the leaf state node and the state-action value Q(s, a) of its parent action node. At the end of the longitudinal planning process, the output of the planner is required to be the adjacent gap. However, the longitudinal process does not require the AV to enter the adjacent merging zone. When the traffic density is too high, the adjacent merging window can easily offer no space. In this case, the AV only needs to move within our preset range around the midpoint between critical point 2 and critical point 3, and the merging procedure will proceed to the lateral planning process.

Lateral Planning Process
The lateral planning process will guide the AV to execute the merging action until it successfully drives into the target lane. During this stage, the acceleration will keep the vehicle in the current merging zone all the time. The action space of the lateral process can be defined as

a_1 = (a_approach, 0, 0, 0)
a_2 = (a_approach, ±1, 0, 0)
a_3 = (a_approach, ±1, 1, 0)
a_4 = (a_approach, ±1, 0, 1)

whose components correspond to (acc, y_lc, intention_fake, intention_abortion) in Equation (5). As mentioned earlier, the size of the merging window is not always sufficient for the AV to perform the merging action because of the high traffic density. Therefore, it is necessary to strengthen the intention of merging, forcing the rear car in the target lane to slow down, which corresponds to the fake merging described in Section 4.3. Otherwise, the rear vehicle may not give way and leave enough merging space within a short time, owing to the SV's aggressive driving style. If the AV fails to create a useful merging window within several steps, the merging attempt is abandoned and the planner returns to the longitudinal process. In TMPF, the maximum number of merging attempt steps is set to 7. The idea of creating merging windows by interaction is based on the k-level model of game theory. [31] We assume that the AV is an agent with a higher intelligence level than the others. When the adjacent merging window is sufficient for the lane change operation, the ego vehicle can enter the target window to merge into the target lane.
It is noted that not only must the merging window length be sufficient, but the speed condition must also be satisfied. The AV should maintain a speed advantage over the rear car in the target lane, while the merging speed must conform to the IDM car-following model with respect to the front car in the target lane.
The search strategy of the lateral process is similar to the longitudinal process, which inputs the current state into MCTS as shown in Figure 7b. The expansion process of MCTS selects action nodes based on upper confidence bounds (UCB), while the rollout process follows the principle of fastest merging to have a quick deduction of the process.
The parameter settings of MCTS in TMPF are illustrated in Table 4, and the convergence results of MCTS in the two stages are shown in Figure 8. The number of substate nodes of one action node is limited to kN(s, a)^α, where k and α are set to 1 and 0.3, respectively. [32]
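The UCB selection and the node-limiting rule kN(s, a)^α can be sketched as follows (the exploration constant c is an assumed value; only the widening rule with k = 1 and α = 0.3 comes from the text):

```python
import math

def ucb_select(children, n_parent, c=1.4):
    """Select the child action maximizing Q + c * sqrt(ln(N_parent) / n)."""
    return max(children,
               key=lambda ch: ch["q"] + c * math.sqrt(math.log(n_parent) / ch["n"]))

def may_expand(n_children, n_sa, k=1.0, alpha=0.3):
    """Progressive widening: an action node may add a new child state
    only while its child count stays below k * N(s, a)^alpha."""
    return n_children < k * n_sa ** alpha
```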

Results
For the merging scenario in dense traffic flow, the AV is required to enter the target lane within a short driving distance. In this section, various scenario tests are conducted to verify our approach, and the merging performance is analyzed by comparison with other mainstream methods. We adopt simulation of urban mobility (SUMO) for modeling the simulation environment and utilize Python for the implementation. For the hardware platform, our computer is configured with an Intel Core i5 processor, an AMD Radeon 625 graphics card, and 8 GB of RAM, and runs Ubuntu Linux 18.04.
Figure 7. The illustration of the MCTS searching policy of TMPF: a) Blue "F," orange "A," and green "B" denote the front, the adjacent, and the behind gaps, respectively. b) Blue "M," orange "F," gray "A," and green "K" represent merging action, fake merging, abortion, and keeping the lane, respectively.

Influence of CP Calculation
As described in Section 4.1, the determination of the critical points is based on the attenuation of the Gipps model and the compensation for estimation error. Both the attenuation coefficient α and the gain factor β are empirical values. In the experiments, we consider differently scaled α and β and analyze their influences on the overall merging performance. Traffic flow with an average density of 10 m veh−1 is initialized in both the target and the current lane. The driving model of each SV in each test is initialized randomly according to Table 3. If the AV cannot merge within 100 m, the test is marked as a failure. For each group of parameters, 50 random independent trials are conducted to ensure objectivity and universality.
The results with different values of α are shown in Figure 9a and Table 5. A lower α corresponds to a more aggressive driving style of the AV: although the merging windows become shorter, the AV can catch a window more easily and merge into the target lane more quickly.
In practical applications, we can flexibly adjust the acceptance of safe distances according to the requirements, so as to balance traffic efficiency against the aggressiveness of the driving style. Although the results in Table 5 do not vary much from each other, a high attenuation scale may lead to a high probability of a long driving distance being needed for merging, and the risk of merging failure increases correspondingly, as shown in Figure 9a. When there is a discrepancy between the estimated and the observed driving behaviors of the SVs, the AV compensates the safe distance to address the risk caused by inaccurate estimation. The gain factor β of the estimation error is affected by the risk acceptance of the AV. If the AV is unwilling to take high risks, the corresponding β can be set higher, producing a longer distance between the critical points and the SVs and hence a smaller merging window.
The AV then needs to spend more time and distance searching for an appropriate merging window to realize the merging task. From Figure 9b and Table 6, it can be seen that the probability that merging cannot be completed within an acceptable distance can also be high.
In summary, the time and mileage costs are related to the driving style of the AV. A more aggressive driving style can lower the merging costs but inevitably involves higher risks. However, in spite of these uncertain factors, the AV can still maintain a stable merging speed once the density and speed of the traffic flow are determined.

Behavior Analysis of Merging Process
In the comparison tests, the AV tries to merge into the target lane with the traffic flow set to 10 m veh⁻¹. The AV starts the merging planning process after driving in the starting lane for 5 steps (i.e., 175 m on the map) to acquire a preliminary estimation of the SVs. To evaluate the merging behavior of TMPF, we use two existing algorithms for comparison. 1) Longitudinal adaptive planner (LAP): The concept of LAP originates from the research work in ref. [7]. Similar to the longitudinal planning process, this algorithm guides the AV to evaluate the most valuable merging gap. If the current condition meets the safety constraints of lane changing, the merging operation is carried out directly without considering the interaction with SVs; otherwise, the AV continues to seek another chance for merging. In LAP, the merging behavior is finished in three steps. 2) Basic search planning (BSP): the acceleration is chosen from {−1, 0, 1} through an MCTS. When the AV chooses the adjacent merging gap, the lane merging is performed. It also takes three steps to complete a merging process.
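BSP's selection of an acceleration from {−1, 0, 1} can be sketched as follows. This is a toy flat Monte Carlo stand-in for the full MCTS, assuming a simple kinematic model, a static lead vehicle, and an illustrative target gap of 20 m; the dynamics, horizon, and scoring are assumptions, not the paper's planner.

```python
import random

ACTIONS = (-1.0, 0.0, 1.0)  # candidate accelerations (m/s^2)
DT = 1.0                    # planning step (s), assumed

def step(state, accel):
    """state = (position, velocity); simple kinematic update."""
    pos, vel = state
    vel = max(0.0, vel + accel * DT)
    return (pos + vel * DT, vel)

def rollout_value(state, lead_pos, depth=3):
    """Random rollout scored by deviation from a ~20 m gap to the
    (static) lead vehicle; the target gap is purely illustrative."""
    for _ in range(depth):
        state = step(state, random.choice(ACTIONS))
    gap = lead_pos - state[0]
    return -abs(gap - 20.0)

def choose_accel(state, lead_pos, n_rollouts=200):
    """Pick the first-step acceleration with the best mean rollout value."""
    best_a, best_v = None, float("-inf")
    for a in ACTIONS:
        value = sum(rollout_value(step(state, a), lead_pos)
                    for _ in range(n_rollouts)) / n_rollouts
        if value > best_v:
            best_a, best_v = a, value
    return best_a
```

With a distant lead vehicle the sketch picks acceleration; when the gap is already tight it picks deceleration, which is the qualitative behavior the comparison relies on.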
The same initial speeds and driving models are adopted to ensure the fairness and validity of the comparison. From Table 7 and Figure 10, it can be seen that under the test scenario, all three methods can realize the merging task within 250 m. TMPF and LAP are both able to create the most valuable merging gap for executing the merge, so their mileage consumptions are significantly lower than that of BSP. Although BSP can realize the merging task in a shorter time, the result of its velocity planning is unacceptable: in actual scenarios, other SVs in the starting lane would limit the speed planning of the AV.
In the whole merging process, TMPF is more stable in velocity planning than the other two methods. As shown in Table 7, the velocity variance of TMPF is the lowest among all. This advantage results from the variable merging time: the AV has more time to adjust its speed in the lateral process so as to enter the target lane smoothly.

To evaluate the effectiveness of the proposed algorithm, we can check whether the AV enters the target lane and completes the merging task within a reasonable time and space cost. However, we also need to pay attention to the impact of merging on the SVs, especially the interaction with the rear vehicle in the target lane. Among all the algorithms, BSP realizes the lane merging task by accelerating toward a farther merging gap in front; this behavior has little impact on the SVs, but it lacks efficiency when a sufficient merging gap is already available in the environment. TMPF and LAP achieve almost the same space and time consumption; however, TMPF has less impact on the driving speed of the rear vehicle and leads to a shorter deceleration time. LAP directly performs the merging operation upon finding an available merging gap, whereas TMPF gives the AV more time to adjust the merging speed and establish a speed advantage over the rear vehicle.
In addition, the distance between the AV and the rear vehicle is also considered in the lateral process of TMPF, as it is included in the reward function in Section 3.4. To sum up, TMPF enables the AV to perform the merging action at a higher speed and to keep a longer distance from the rear vehicle, thus reducing the disturbance to the rear vehicle.
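A lateral-process reward of the kind described above can be sketched as a weighted trade-off between merging progress and the disturbance imposed on the rear vehicle. The exact reward in Section 3.4 is not reproduced here; the terms and weights below are assumptions chosen only to illustrate the trade-off.

```python
def lateral_reward(rear_gap, rear_decel, speed_advantage,
                   w_gap=0.5, w_decel=2.0, w_speed=1.0):
    """Illustrative lateral reward: higher for a larger gap to the rear
    vehicle (m), a smaller deceleration forced on it (m/s^2), and a
    speed advantage of the AV over it (m/s). Weights are assumptions."""
    return w_gap * rear_gap - w_decel * rear_decel + w_speed * speed_advantage

# Merging with a speed advantage and an ample rear gap scores higher
# than cutting in close and forcing the rear vehicle to brake hard.
smooth = lateral_reward(rear_gap=15.0, rear_decel=0.2, speed_advantage=2.0)
abrupt = lateral_reward(rear_gap=5.0, rear_decel=2.5, speed_advantage=-1.0)
assert smooth > abrupt
```

Under such a reward, the planner naturally favors the behavior reported in the comparison: merging at a higher speed while keeping a longer distance from the rear vehicle.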

Typical Scenarios Tests
On the basis of Section 5.2, some scenarios closer to actual tasks are designed, namely a ramp-merging scenario and a main-road-exit scenario, as shown in Figure 11.
In the ramp merging task, the AV tries to merge from the main road into the traffic flow on the ramp. The ramp flow is relatively slow, with the expected speed v₀ set in the range [10, 14] m s⁻¹, while the main-road flow is relatively fast, with v₀ set in the range [14, 17] m s⁻¹. The results of 50 tests of the three algorithms are compared in Table 8. TMPF realizes the merging task within a significantly shorter driving distance. Compared with the other two methods, its speed when merging into the target lane is also closer to the desired speed of the target lane, indicating that TMPF exerts less influence on the speeds and decelerations of the SVs in the target lane. Moreover, the speed variance of the AV remains within an acceptable range over the whole merging process.
The main road-exit scenario is the opposite of the ramp merging scenario described above: the AV needs to exit from the high-speed flow and merge into a low-speed traffic flow. Similarly, 50 groups of independent random tests have been conducted, and the results are shown in Table 9.

In the main road-exit scenario, TMPF still maintains a good performance in the consumption of driving distance. However, the merging speed and the average driving speed over the whole merging task are lower than those of the baseline methods. This phenomenon is due to the high initialized density of the traffic flow, which keeps the speeds of all vehicles in the environment below the expected speed; the overall traffic flow speed will not rise until the average spacing of the traffic flow widens. TMPF is more inclined to make the AV perform the merging operation at an earlier time, when the speed of the traffic flow in the target lane is still relatively low, resulting in a relatively lower merging speed.
According to the analysis in Section 5.2, the speed of the AV planned by BSP varies greatly when there are no SV restrictions in the same lane. However, in the two scenarios of Section 5.3, BSP shows a better performance in terms of velocity variance. This is because, under the restriction of the SVs, the speed of the AV during merging cannot deviate much from the desired speed. Additionally, since the action space of BSP is designed at a finer granularity than those of TMPF and LAP, it does not result in excessive accelerations or decelerations of the AV during merging. This is also why an acceleration-based action space is adopted in TMPF.

Conclusion
In this article, we propose TMPF, an online planning framework for realizing merging tasks in dense traffic. The proposed method can actively create merging opportunities through interaction and realize merging smoothly within an acceptable driving distance, even when the driving intentions of the SVs are unobservable.
In our work, a double-weight particle filter is used to estimate the driving models of the SVs. The whole merging task is decomposed into separate longitudinal and lateral subtasks to make the planning more rational, and a heuristic search strategy is implemented to improve the planning efficiency.
To verify our proposal, two baseline planners, LAP and BSP, are used for comparison. The effects of different empirical parameters on TMPF are analyzed, allowing the AV to be adapted to different driving preferences. Results from multiple simulation scenarios show that TMPF outperforms the baseline methods in time and space consumption. Moreover, with TMPF, the merging process can be planned with a more stable velocity distribution and less influence on the SVs.
For future work, we will mainly focus on two aspects: 1) based on online search planning, an offline learning method will be introduced to improve the efficiency and generalization ability of our algorithm; and 2) through effective estimation of the belief state, more rational heuristic pruning of the MCTS will be investigated, so that the limited computing resources are concentrated on the evaluation of valuable branches.