Automatic generation of optimal road trajectory for the rescue vehicle in case of emergency on mountain freeway using reinforcement learning approach

Funding information: National Key Research and Development Program of China, Grant/Award Number: 2020YFB1600400; Key Research and Development Program of Shaanxi Province, Grant/Award Number: 2020GY-020; Fundamental Research Funds for the Central Universities, CHD, Grant/Award Number: 300102320305

Abstract: Rapid rescue response has the highest priority when an emergency occurs at a random location on a freeway network, which leaves rescue vehicles with many trajectory options. Searching for the fastest route within a short time after a traffic accident is not easy, especially in mountainous areas with special characteristics such as limited traffic capacity and enclosed internal space. Here, a road segment model is proposed to determine the smallest set of road segments covering the possible rescue routes. In contrast to traditional optimal-search methods, a modified reinforcement learning approach is introduced to find the optimal road trajectory. The proposed methods are tested on the Qinling Tunnel group of the Xihan Freeway, Shaanxi Province, China, as a case study. Compared with the traditional shortest-path method, the arrival time of the rescue vehicle at the accident location is shortened from 22.9 to 6.5 min, and the queue dissipation time is shortened from 52.4 to 25.6 min. Both results show that the proposed road trajectory can improve rescue effectiveness and reduce the influence on the road network. The successful application in these case studies suggests that the approach could be extended to other scenarios and contribute to intelligent transportation systems.


INTRODUCTION
Emergency rescue after an accident on a tunnel freeway is more difficult than on an ordinary freeway. Many freeways in China are built in mountainous areas and have a high ratio of bridges and tunnels, up to 90% in some areas. By the end of 2019, China had 60,634,600 m of tunnels and bridges [1]. Tunnels are therefore one of the most significant road components of mountain freeways. The most famous tunnel group consists of Qinling Tunnels No. 1, No. 2 and No. 3, about 18 km long in total, the largest tunnel group in the world, see Figure 1.

FIGURE 1 Freeway surveillance and control system of the tunnel group in the Qinling freeway management centre

Rescue delay is likely to lead to more severe accidents and higher losses than on an ordinary freeway, because freeway tunnels have special spatial characteristics such as limited traffic capacity, enclosed internal space, lack of natural lighting, restricted ventilation and a monotonous driving environment. In tunnel groups especially, accidents occur more frequently and have more serious consequences, so they usually have a greater impact on the mountain freeway. In recent years, traffic accidents on mountain freeways have been reported to cause casualties and property damage [2,3]. Moreover, if a traffic accident cannot be dealt with in time, it may cause wider traffic congestion, may even cause a secondary accident, and continuously damages the reputation of the road authority. Therefore, how to prepare properly for emergencies and improve the response speed to them has become an important issue for transportation management departments. Many scholars have investigated how to reduce the loss caused by traffic accidents in freeway tunnels as much as possible. Some work on early warning before traffic accidents to reduce their occurrence, such as Wang et al.
[4], who predicted traffic conditions to warn drivers to drive carefully and prevent accidents from happening [5,6]. Other researchers have paid attention to emergency rescue after an accident [7]. These studies show that a shorter rescue time after a traffic accident is an effective way to reduce accident losses. The most effective measure is for the rescue department to arrive at the accident scene as quickly as possible to rescue trapped people and dispose of the scene of the accident. Therefore, the road trajectory of the rescue department to the accident point deserves study.
Generally, each segment of a mountain freeway is managed by a segment management office. The office immediately dispatches rescue personnel and vehicles to the accident site once it determines that a traffic accident has occurred within its jurisdiction. The rescue vehicle generally departs from the freeway segment management office directly to the accident site. Rescue personnel and vehicles have three main tasks. First, rescue vehicles need to load the rescue materials the traffic accident demands. Second, rescue personnel must control the situation at the accident site to prevent it from spreading and causing more serious consequences. Third, rescue personnel need to reach the corresponding control points in time to implement traffic control, prepare for the passage of rescue vehicles, and prevent vehicle jams and interruption of road segments. Because of uncertain factors on the freeway, the rescue commander needs to formulate the optimal traffic control strategy and rescue vehicle trajectory within a short time under emergency conditions, dynamically considering emergencies the rescue vehicle may encounter, such as emergency lanes being occupied. These are key research problems for transportation scholars. Furthermore, the road trajectory of a rescue vehicle is a special vehicle routing problem (VRP). Compared with the traditional VRP, vehicle routing under emergency events has the following characteristics. First, the traditional VRP usually minimizes cost under an economic-priority principle, whereas emergency rescue takes minimum travel time as the target under an efficiency-first principle. Second, the rescue vehicle road trajectory is more complex.
It must consider special situations such as vehicle congestion so that rescue vehicles reach the accident site as quickly as possible. Third, rescue vehicle routing under emergency conditions must consider the degree of impact on the entire road network: sometimes, to let rescue vehicles reach the accident site quickly, traffic police implement traffic control at particular points in the road network.
Considering the particularity and importance of rescue vehicle path planning, our research targets the least driving time for rescue vehicles together with the least impact on the road network. Using traffic trajectory and other data, this paper further quantifies the impact of traffic control measures as one of the important parameters for obtaining the rescue vehicle road trajectory. Reinforcement learning algorithms can quickly compute the optimal road trajectory to the accident site while keeping the impact on the road network minimal. Unlike traditional algorithms, a reinforcement learning algorithm lets the agent continuously interact with the external environment, obtaining environmental information and reward values to guide the generation of the trajectory. The agent constantly interacts with practical traffic data of the freeway and uses the calculated result as a reward function to estimate the optimal road trajectory of rescue vehicles.
The remainder of this article is organized as follows. Section 2 reviews algorithms for solving the VRP. Section 3 presents the resolution method, including data preparation and a description of our algorithms. Section 4 analyses the case study, and Section 5 presents conclusions and opens lines for further research.

LITERATURE REVIEW
The VRP can be described as the problem of designing optimal routes from one or several origins to a number of geographically scattered locations, subject to side constraints [8]. Many researchers have explored algorithms for this type of problem, and various VRP methods and a large body of literature exist. Taking minimum time as the target, Nikoo et al. [9] proposed a model for the emergency transportation network design problem, solved by a branch-and-cut method, using travel length, travel time and the number of paths as metrics to identify optimal routes for emergency vehicles. Min et al. [10] achieved path searching for a reliable estimated time of arrival in ever-changing traffic conditions, with elastic signal preemption to reduce the negative impact on overall traffic flow introduced by prioritizing the emergency vehicles. Refs. [9,10] study emergency rescue routes in emergency situations; although factors such as distance and flow are considered, neither considers the impact of road traffic control information on route selection. Shen et al. [11] proposed a novel reliable path-finding algorithm for a stochastic road network with uncertain travel times, applicable to real-life road traffic networks. Prakash et al. [12] proposed algorithms to determine the most reliable routes on stochastic, time-dependent networks. Tharinee et al. [13] proposed a two-phase algorithm to solve a minimax optimization problem with uncertain travel time; the results show the proposed method finds robust solutions in all conditions. Deng et al. [14] and Jiang et al. [15] improved the classical Dijkstra algorithm to solve the shortest route (SR) problem, although the improved algorithms were not applied to actual path-planning cases. Lou et al.
[16] introduced consistent traffic information prediction, as provided by advanced traveller information systems, into a dynamic system to investigate its influence on travellers' route choice behaviour and the corresponding day-to-day network flow evolution, so as to obtain the optimal path. Lin et al. [17] addressed the problem that manipulation of traffic state data measured or generated by compromised vehicles leads to erroneous traffic state predictions and improper determination of guided routes; their result secures the route guidance process, enabling traffic efficiency and safety. In route selection, some travellers or drivers do not consider time cost and instead take the minimum route distance from origin to destination as the target. For example, Cao et al. [18] proposed a data-driven approach that directly exploits the big data generated in traffic to solve the stochastic SR problem in vehicle routing, the objective being to determine an optimal path. Qu et al. [19] proposed a profitable taxi route recommendation method called adaptive shortest expected cruising route (ASER); ASER takes the load balance between passengers and taxis into consideration and introduces the shortest expected cruising distance to formulate the potential cruising distance of taxis. Dapia et al. [20] proposed a branch-and-price algorithm to solve the time-dependent VRP with time windows. Chen et al. [21] proposed two efficient algorithms to solve the forward and backward TD-RSPP exactly, with the optimality of the proposed algorithms rigorously proved. Chen et al. [22] formulated the spatially dependent reliable shortest path problem (SD-RSPP) as a multicriteria SR-finding problem and proposed a new multicriteria A* algorithm to solve the SD-RSPP in an equivalent two-level hierarchical network. Zhang et al. [23] presented a robust shortest path (RSP) model to obtain the SR.
Companies or travellers with plenty of time regard the minimum total route cost as the target. Li et al. [24] developed Lagrangian-relaxation-based insertion heuristics to solve the vehicle rescheduling problem. Zhao et al. [25] proposed a method that minimizes a certain objective cost, relying on accurate prediction of the traffic network states and estimation of route costs that are not readily available. Diao et al. [26] investigated the problem of speed planning and tracking control of a platoon of trucks on highways. Zeng et al. [27] proposed a method that finds the most energy-efficient route for driving a vehicle from origin to destination within a certain period of time. Hong [28] proposed an improved large neighbourhood search algorithm to solve the dynamic VRP with hard time windows. Colpaert et al. [29] studied the data published by the Department of Transport and Public Works in Flanders for maximum reuse, with the purpose of reducing transportation cost.
In summary, domestic and foreign scholars' research on VRPs with minimum time, minimum route distance and minimum total route cost as goals has been reviewed. Planning the road trajectory of a rescue vehicle is more complicated than the traditional VRP and must consider more factors: time, cost and distance alone are not enough. The impact on the road network and the rational use of freeway infrastructure must also be considered.
In recent years, many machine learning algorithms have been applied in the traffic and transportation fields, and scholars have tried to solve VRPs by combining reinforcement learning with actual traffic demand conditions. Nazari et al. [30] presented an end-to-end framework for solving the VRP using reinforcement learning; the model in the framework represents a parameterized stochastic policy, and by applying a policy gradient algorithm to optimize its parameters, the trained model produces the solution as a sequence of consecutive actions in real time. Sang et al. [31] showed how to apply a reinforcement learning method to optimize the route of a single vehicle in a network, with promising results in finding the best route and avoiding congested paths. Huang et al. [32] drew on the experience of taxi drivers to propose a constrained deep reinforcement learning (CDRL) algorithm to calculate the fastest route between ODs in different time periods; experiments show that CDRL outperforms the SR method in travel time and differs little from the fastest path (FR) method.
This paper calculates the optimal road trajectory between the starting point of the rescue vehicle and the accident site based on reinforcement learning. The research content and framework of this paper are shown in Figure 2. The deep Q-learning algorithm combines deep learning with reinforcement learning [33]. In the first phase, massive freeway vehicle-detection data are used to calculate the average travel time and traffic volume of each road segment. In the second phase, an improved DQN algorithm is established to select the optimal road trajectory. The agent of the improved DQN algorithm continuously explores and learns from the given traffic data and finally obtains the optimal rescue vehicle road trajectory between the two points. The algorithm considers both the fastest travel time of the rescue vehicle and the influence of traffic control on the road network, so that rescue vehicles can quickly reach the scene of the accident while the impact of the traffic accident on the road network is minimized. In high-risk segments such as continuous tunnels, sharp turns and long longitudinal slopes, traffic flow varies greatly, so the traffic information of different road segments on the same freeway may differ considerably. To reduce the deviation introduced by using one section's traffic data to represent the entire road, we set a limited number of nodes on the freeway to divide it into a limited number of segments. A reasonable division of road segments can objectively and truly reflect actual road traffic conditions; the guiding principle is that traffic conditions within the same segment should be basically the same.

Building node model
The principles for selecting the nodes that divide road segments are: (1) where traffic information changes (such as interchanges, ramps, toll stations and service areas); (2) where the number of lanes changes (for example, two lanes become three); (3) where road alignment (straight, curved, S-curve segments etc.) or road gradient changes; (4) places such as tunnels and bridges that are likely to change vehicle speed or driving behaviour; (5) the length of the segment between two adjacent nodes is kept between 0 and 4 km. After a node is selected, it is recorded as K_i according to the freeway stake number. For example, if a ramp end is selected as a node and its stake number is K2295+336, then K_i = 2295.336. The node model can be written as

K = {K_i, i = 1, 2, ..., n}, (1)

where n represents the number of nodes.
Building road segment model

According to the node model, the road segment model can be obtained as

R = {R_{i,j} = |K_j - K_i|, K_i and K_j adjacent}, (2)

where R_{i,j} represents the distance between adjacent nodes K_i and K_j.
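As a minimal illustration of the node and segment models above, the sketch below converts stake numbers (in the K2295+336 format described earlier) into node values K_i and computes the segment lengths between adjacent nodes. The stake numbers used here are hypothetical examples, not values from the case study.

```python
# Build the node and road-segment models from freeway stake numbers.
# Stake numbers such as "K2295+336" map to kilometres (2295.336).

def stake_to_km(stake: str) -> float:
    """Convert a stake number such as 'K2295+336' to kilometres."""
    km_part, m_part = stake.lstrip("K").split("+")
    return float(km_part) + float(m_part) / 1000.0

def build_segments(stakes):
    """Return the segment model: lengths R_ij between adjacent nodes."""
    nodes = sorted(stake_to_km(s) for s in stakes)
    return [round(nodes[i + 1] - nodes[i], 3) for i in range(len(nodes) - 1)]

stakes = ["K1159+000", "K1161+130", "K1163+830"]   # hypothetical nodes
print(build_segments(stakes))                      # segment lengths in km
```

Each segment length stays within the 0-4 km bound recommended by principle (5).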

Traffic flow selection mode
Freeway segment flow data can be automatically collected by detectors installed on the freeway, collected manually, or obtained by mining freeway toll station data. The basic data used in this paper are obtained primarily from the detection equipment. However, when road traffic data are seriously incomplete because of equipment failure or other factors, a deep learning algorithm can be used to predict segment traffic flow. Wang et al. [34] proposed the temporal convolutional network, a deep learning algorithm, to forecast the unknown traffic volume at any designated cross-section on a freeway. To make the obtained traffic volume data more practical, at least one month of toll data is collected to forecast the traffic volume and evaluate its prediction accuracy. For a single road segment, the inbound traffic of the upstream toll station travels downstream; if a vehicle does not exit the freeway, its flow is passed to the downstream road segments in turn, so the traffic flow on a single road segment comes mainly from the inbound flows of multiple upstream toll stations. Given the upstream toll stations associated with a known target road segment and their inbound traffic at the corresponding times, a traffic flow estimation model for any target road segment can be constructed. The steps are as follows:
1. Count the traffic flow of the upstream toll stations associated with the target road segment;
2. Determine the demand time period for the target road segment's traffic flow and, combining it with the travel time along the road, deduce backwards the time period of the upstream toll station traffic demand;
3. Obtain the traffic volume of the upstream toll stations in the corresponding time period, and the proportion of traffic between the target road segment and the inbound traffic of each toll station;
4. Obtain the traffic volume contributed by each toll station associated with the target road segment, and add them up to calculate the traffic volume of the target road segment.
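The four estimation steps above can be sketched as follows. The toll station names, inbound counts, travel times and split ratios are all hypothetical illustration values.

```python
# Estimate the traffic volume of a target road segment from the inbound
# counts of its upstream toll stations (steps 1-4 above).

def segment_volume(inbound, travel_time, ratio, t):
    """Sum, over upstream stations, the inbound volume at the back-shifted
    time period multiplied by the share of that station's traffic that
    reaches the target segment."""
    total = 0.0
    for station, counts in inbound.items():
        t_up = t - travel_time[station]               # step 2: back-shift
        total += counts.get(t_up, 0) * ratio[station]  # steps 3-4
    return total

inbound = {"A": {8: 400, 9: 450}, "B": {8: 300, 9: 320}}  # veh per hour-slot
travel_time = {"A": 1, "B": 0}   # hours from station to target segment
ratio = {"A": 0.8, "B": 0.6}     # share of inbound traffic reaching segment
print(segment_volume(inbound, travel_time, ratio, 9))  # 400*0.8 + 320*0.6
```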

Vehicle queue length model
When a freeway road is blocked by an accident, road capacity drops and vehicles queue. A series of traffic control measures can help maintain traffic order in the tunnel and enable vehicles to pass through safely and efficiently.
The main traffic control methods for common freeway tunnels are entrance control, lane use control, variable speed control and road network area control. In this paper, the duration of an abnormal event is composed of four stages. The first stage is the time from the occurrence of the traffic anomaly to its confirmation by the road managers. The second stage is the response stage, from confirmation of the incident until the rescue vehicle reaches the accident site. The third stage is the accident disposal stage, from the rescue vehicle arriving at the accident scene to leaving it. The fourth stage is the traffic recovery stage, from the end of accident disposal to the complete dissipation of the queued social vehicles, when traffic returns to normal. The total time of the first three stages is the duration of the accident.
We derive the formula for the queue length of affected social vehicles over time by analysing traffic wave theory [35].
Firstly, suppose that when a traffic accident occurs, the upstream flow of the lane containing the accident point is q_1 with corresponding traffic density k_1, and the road capacity at the accident point decreases to s_1 with corresponding density k_s1. After the accident site is cleared, the queued vehicles discharge at the saturation flow rate s with corresponding density k_s. Let t_1 be the estimated duration of the accident. Next, the relationship between the queue length and the elapsed time during the queue formation and dissipation process is drawn, as shown in Figure 3. Here t = 0 is the moment the accident begins and y represents the queue length of social vehicles; OB is a straight line and BCD is a curve. OB is the trajectory of the return wave after the accident, and its slope is the wave speed of the return wave, as shown in Equation (3):

w_OB = (q_1 - s_1) / (k_1 - k_s1). (3)
Using the Greenshields model, the flow-density relationship is shown in Figure 4, q = u_f k (1 - k/k_j), where u_f is the free-flow speed, a constant related to the capacity of the road segment, and k_j is the jam density. Suppose that state (q_1, k_1) belongs to the fluent regime of high speed and low density, and state (s_1, k_s1) belongs to the congested regime. Finally, according to the above analysis, the formula relating the maximum vehicle queue length to the corresponding time can be obtained.
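Under the assumptions of this section, the maximum queue length follows from standard shockwave analysis: the queue tail moves upstream at the return-wave speed until the dissipation wave released at clearance catches it. The sketch below uses the case-study values of q_1, k_1, s_1 and k_s1 given later in this paper; the saturation flow s and density k_s are hypothetical placeholders.

```python
# Shockwave sketch of the queue at the accident point.
# q1/k1: upstream flow and density; s1/ks1: capacity and density at the
# accident; s/ks: saturation flow and density after clearance; t1: accident
# duration (hours).

def max_queue(q1, k1, s1, ks1, s, ks, t1):
    w1 = abs((q1 - s1) / (k1 - ks1))   # growth (return) wave speed, km/h
    w2 = abs((s1 - s) / (ks1 - ks))    # dissipation wave speed, km/h
    t_star = w2 * t1 / (w2 - w1)       # time at which the two waves meet
    return w1 * t_star                 # maximum queue length, km

q1, k1, s1, ks1 = 1571, 30, 933, 95    # case-study values
s, ks = 1800, 50                       # hypothetical saturation values
print(round(max_queue(q1, k1, s1, ks1, s, ks, t1=10 / 60), 3))  # km
```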

Impact on the road capacity model
We use the time spent dissipating the queued vehicles as the only indicator to measure the impact on the road network in this paper.
The shorter the time spent evacuating queued vehicles, the smaller the impact on the road network; the longer the time, the greater the impact. According to Figure 3, when y_BCD = 0, the vehicle dissipation time can be obtained.

Estimated arrival time model

It takes a long time to clear emergency lanes, so traffic control strategies are needed to clear some road segments so that rescue vehicles can reach the scene of the accident as quickly as possible. The estimated arrival time (EAT) of the vehicle is

T_distance = S_distance / V_average,

where T_distance represents the EAT of the vehicle, S_distance the distance travelled by the rescue vehicle to the accident point, and V_average the average speed of the rescue vehicle.
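A minimal sketch of the dissipation time and EAT calculations follows. The dissipation time here is obtained by equating cumulative arrivals (q_1 t) with cumulative post-clearance departures (s (t - t_1)), an assumption consistent with the wave analysis above; all numeric inputs are hypothetical.

```python
# Dissipation time and estimated arrival time (EAT) sketches.

def dissipation_time(q1, s, t1):
    """Time (same unit as t1) until the queued vehicles fully dissipate,
    from q1 * t = s * (t - t1)."""
    return s * t1 / (s - q1)

def estimated_arrival_time(distance_km, avg_speed_kmh):
    """T_distance = S_distance / V_average, returned in minutes."""
    return 60.0 * distance_km / avg_speed_kmh

print(round(dissipation_time(q1=1571, s=1800, t1=10), 1))   # minutes
print(round(estimated_arrival_time(8.5, 60.0), 1))          # minutes
```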

Markov decision processes
Based on the content of Section 3.1.1, we abstract different segments of the freeway into different nodes. The rescue vehicles travel from the starting position to the end point via a limited number of segments, where the starting position is the location of the rescue vehicles and the ending position is the location of the traffic accident. On the map we treat this process as moving from one node to another. Many factors influence the choice of road segment, such as the average speed of the segment and the probability of congestion. The time it takes a vehicle to pass a road segment and the degree of influence on the road network can be converted into reward values through data processing. Viewing rescue vehicles as agents, an agent obtains reward values by selecting different actions, and finally selects the path from the starting position to the ending position composed of the road segments with the maximum cumulative reward value. This process resembles a Markov decision process (MDP), so reinforcement learning algorithms can be used to solve such problems. Reinforcement learning consists of four necessary parts: state S, action A, reward R and discount rate γ.
The state S represents the position of the rescue vehicle in the road network; we define the state in this paper as a road network node, S ∈ {L_i, i = 1, 2, 3, 4, ...}. The action A_{i,i+1} (i = 1, 2, 3, 4, ...) represents moving from one state to the next, that is, the rescue vehicle selecting a road segment. The reward function R(S, A_{i,i+1}) (i = 1, 2, 3, 4, ...) equals the benefit the agent obtains by switching from one state to the next, that is, the reward of the rescue vehicle passing through a limited number of road segments, as shown in Equation (9).
The discount rate γ ∈ (0, 1) indicates the extent to which later rewards influence the current action selection: the closer γ is to 1, the greater this influence.
The agent learns from the basic data, exploring from one state to another until it reaches the target state. Each such exploration is called an episode: the process by which the agent reaches the target state from the initial state.
The final result of this MDP is the ordering of the nodes from the initial position to the end point, which can also be regarded as the optimal road segment set of the rescue vehicle from the initial position to the end point. The Markov process can also be regarded as the process of finding the optimal strategy π. In this paper, the optimal strategy π selects the road segments with the largest cumulative reward value.
By executing strategy π and interacting with the environment, the rescue vehicle obtains episodes consisting of states, actions and rewards, as shown in Equation (10):

π_ik = (S_i, A_{i,i+1}, R_i, S_{i+1}, A_{i+1,i+2}, R_{i+1}, ..., S_k). (10)

G(π_ik) refers to the cumulative reward value of an episode and can be expressed as

G(π_ik) = R_i + γR_{i+1} + γ²R_{i+2} + ... . (11)

When selecting the optimal road trajectory, the rescue vehicle needs to find the optimal strategy, that is, the episode with the largest cumulative reward value. Assume a greedy strategy is adopted, which always chooses the action with the largest G(π_{i+1,k}).
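The discounted cumulative reward of one episode, as used above to rank candidate trajectories, can be sketched directly; the reward sequence here is hypothetical.

```python
# Cumulative (discounted) reward G of one episode.

def episode_return(rewards, gamma=0.8):
    """G = r0 + gamma*r1 + gamma^2*r2 + ... for one episode."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(episode_return([1.0, 2.0, 3.0]))   # 1 + 0.8*2 + 0.64*3 = 4.52
```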
Define Q(S_i, A_{i,i+1}) as the maximum cumulative reward value of the rescue vehicle when selecting road segments, so the greedy strategy can be expressed as Equation (12):

π(S_i) = argmax_{A_{i,i+1}} Q(S_i, A_{i,i+1}). (12)
If the value of Q(S_i, A_{i,i+1}) is known, the optimal strategy can be obtained by the above formula, but in practical applications the value of Q(S_i, A_{i,i+1}) is unknown. Fortunately, Equation (9) satisfies the property of the Bellman equation, so Q(S_i, A_{i,i+1}) can be estimated by iterating from back to front, as shown in Equation (13):

Q(S_i, A_{i,i+1}) = R(S_i, A_{i,i+1}) + γ max_A Q(S_{i+1}, A). (13)

The reinforcement learning algorithm iteratively updates the Q(S_i, A_{i,i+1}) values. In the initial training there is a big difference between the estimated and actual values of Q(S_i, A_{i,i+1}), but with each iteration the estimate becomes more and more accurate.
To prevent the agent from always selecting the action with the maximum return value and thereby missing the optimal set of actions, this study uses an ε-greedy strategy. Unlike the pure greedy strategy, the current best action is selected with probability ε and a random action with probability 1 − ε. This lets the rescue vehicle complete successful episodes by choosing the current optimal strategy while continuing to explore possibly better strategies.
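A sketch of the ε-greedy selection as described above (note this paper's convention that ε is the exploitation probability, the reverse of many textbooks); the action names and Q-values are hypothetical.

```python
import random

# ε-greedy action selection: with probability eps exploit the current best
# action, otherwise explore a random feasible action.

def epsilon_greedy(q_values, feasible_actions, eps=0.9, rng=random):
    """q_values: dict action -> Q estimate; returns the chosen action."""
    if rng.random() < eps:
        return max(feasible_actions, key=lambda a: q_values.get(a, 0.0))
    return rng.choice(feasible_actions)

q = {"A12": 5.0, "A13": 2.0}
print(epsilon_greedy(q, ["A12", "A13"], eps=1.0))  # always greedy -> "A12"
```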

DQN algorithm application
In traditional reinforcement learning, the Q-value estimate is usually expressed as a Q-value table or a function approximation. However, solving the optimal road trajectory for the rescue vehicle involves many states: storing such a table consumes a lot of time and computer memory, and simple function approximation cannot solve the real-time optimal road trajectory problem. Therefore, deep reinforcement learning is introduced to solve such problems. This article uses the deep Q-learning algorithm to estimate the Q-value. In this algorithm, a deep neural network serves as a function approximator mapping states to Q-values: the input is the agent's current state, and the output is a vector of Q-values for the actions the current state permits. Using the neural network output as an approximation makes it possible to solve the optimal road trajectory problem with a large, nearly continuous state space.
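As a rough stand-in for the deep Q-network described above, the following sketch trains a tiny one-hidden-layer network (tanh activation) mapping a one-hot state vector to one Q-value per action. The layer sizes, learning rate and target value are hypothetical, and a real DQN would add experience replay and a target network.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, hidden = 14, 4, 16
W1 = rng.normal(0.0, 0.1, (n_states, hidden))
W2 = rng.normal(0.0, 0.1, (hidden, n_actions))

def q_forward(state_idx):
    """One-hot encode the state; return (x, h, q) with q a value per action."""
    x = np.zeros(n_states)
    x[state_idx] = 1.0
    h = np.tanh(x @ W1)                  # hidden layer
    return x, h, h @ W2

def q_update(state_idx, action, target, lr=0.05):
    """One SGD step on the squared error (Q(s,a) - target)^2 / 2."""
    x, h, q = q_forward(state_idx)
    err = q[action] - target
    gh = err * W2[:, action] * (1.0 - h ** 2)  # back-prop through tanh
    W2[:, action] -= lr * err * h              # output-layer gradient
    W1[:] -= lr * np.outer(x, gh)              # input-layer gradient

before = q_forward(3)[2][1]
for _ in range(50):
    q_update(3, 1, target=1.0)   # push Q(state 3, action 1) toward 1.0
after = q_forward(3)[2][1]
print(before, "->", after)       # the estimate moves toward the target
```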
In this paper, the starting position of the rescue vehicle is taken as the input, and the output is the sequence of nodes from the starting position to the end point together with the estimated Q-values. ALGORITHM 1 gives the pseudocode for solving the optimal road trajectory.
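A runnable sketch of the tabular value iteration behind ALGORITHM 1 follows, on a hypothetical miniature of the road-segment graph (node names echo those in the case study; the negative rewards standing in for travel time and control impact are invented for illustration).

```python
import random

# Tabular Q-learning on a small node graph: node -> {neighbour: reward}.
graph = {
    "P9": {"P1": -2.0, "P10": -8.0},   # direct segment is slow (reward -8)
    "P1": {"P2": -1.0},
    "P2": {"P10": -1.0},
    "P10": {},                         # accident location (terminal)
}
GOAL, GAMMA, EPS = "P10", 0.8, 0.2
Q = {(s, a): 0.0 for s, nbrs in graph.items() for a in nbrs}

random.seed(1)
for _ in range(500):                          # episodes
    s = "P9"
    while s != GOAL:
        acts = list(graph[s])
        if random.random() < EPS:             # explore
            a = random.choice(acts)
        else:                                 # exploit
            a = max(acts, key=lambda x: Q[(s, x)])
        nxt = a
        best_next = max((Q[(nxt, b)] for b in graph[nxt]), default=0.0)
        Q[(s, a)] = graph[s][a] + GAMMA * best_next   # Bellman update, Eq. (13)
        s = nxt

# Greedy roll-out of the learned policy.
path, s = ["P9"], "P9"
while s != GOAL:
    s = max(graph[s], key=lambda a: Q[(s, a)])
    path.append(s)
print(path)
```

With these rewards the agent learns to avoid the heavily penalized direct segment, choosing the longer but cheaper route, which mirrors how the rescue vehicle trades distance against segment-level delay.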

Case description
Case 1 was a traffic accident in the Qinling No. 1 Tunnel. The stake number of the accident location was K1161+130 on the Xihan Freeway, Xi'an direction, at about 9:38 AM on 1 November 2019. The accident was a rear-end collision between two light trucks; the bodies of the two vehicles were deformed and unable to move, blocking the road. Goods were scattered over an area of 30 m² at the scene of the accident. There were no casualties on the scene, but the emergency lane was occupied for more than 50 m. Under normal circumstances, the road capacity of the Qinling segment of the Xihan Freeway is q_1 = 1571 pcu h⁻¹ with k_1 = 30 pcu km⁻¹. After the traffic accident, the capacity dropped to s_1 = 933 pcu h⁻¹ with density k_s1 = 95 pcu km⁻¹. The estimated duration of this traffic accident is 10 min. The parameters of the traffic flow model of this freeway segment are shown in Table 1.

ALGORITHM 1 Pseudocode for the optimal road trajectory:
1. Initialize the Q matrix and set initial values to 0;
2. Set the parameter γ = 0.8 and the score matrix R, which consists of the reward values of the rescue vehicle segment selections;
3. Loop through all states:
(1) Assign the starting position of the rescue vehicle to the initial state S_i;
(2) If the initial state of the rescue vehicle is not the location of the accident point (the initial state is assigned to the current state the first time), loop the following steps:
① In the current state S_i, select an action A_{i,i+1} at random;
② Execute the action A_{i,i+1} and get the reward R(S_i, A_{i,i+1}); the state after performing this action is S_{i+1};
③ Update the Q value using Equation (13);
④ Set S_{i+1} as the current state and repeat until the accident point is reached.

Performance of reinforcement learning approach application
According to the location of the nearest rescue point and the special geographical environment, the Qinling Management Office is the closest rescue point to the accident site. Rescue vehicles can approach the accident from two directions. In the first, the rescue vehicles leave the Qinling Management Office, depart along the Xihan Freeway in the Xi'an direction, and go straight to the accident site. In the second, traffic control is first applied at the upstream points of the lane opposite the accident site; after the rescue vehicle departs from the management office, it drives against the normal direction along the opposite lane to the accident point. The approximate accident location is shown in Figure 5. According to the node selection principles, the network can be abstracted as the road segment model shown in Figure 6. The driving road trajectory of the rescue vehicle can be regarded as the route from the management office point to the accident point. Because of the connectivity of the freeway, the road trajectory from the rescuers to the accident site is not unique; moreover, in accordance with traffic management and control strategies, rescue routes are diverse. Three typical schemes applied to the traffic accident are listed in Table 2. The first and second schemes consider the particularity of tunnel freeways and adopt traffic control measures that let rescue vehicles drive in reverse so that they reach the accident point as quickly as possible. The third scheme selects the rescue vehicle road trajectory based on the SR method. The traffic control strategy includes the selection of traffic control locations, changes of lane driving direction and so on.
According to the details of the accident, the queue length at the accident point, the queue length at point L3 and the estimated arrival time of the rescue vehicles can be obtained for each scheme; under scheme 1, the road trajectory is P9-P1-P2-P3-P4-P5-P6-P14-P13-P12 in case 1 and P9-P1-P2-P3-P4-P5-P6-P14-P13-P12-P11-P10 in case 2.

To verify the applicability of the algorithm, we randomly changed the location of the accident point as an extended case. Suppose in case 2 that a traffic accident B occurred at K1163+830 m in the Xi'an direction; the accident duration is 12 min, and the other traffic information is the same as in case 1. Table 2 describes the three schemes corresponding to Figure 6.2(a)-(c) in case 2. Under scheme 1, the prohibition-passing traffic control measure is carried out at L1, L2 and L3, and the rescue vehicle drives in the lanes opposite the accident point; the road trajectory of the rescue vehicle is therefore P9-P1-P2-P3-P4-P5-P6-P14-P13-P12-P11-P10. Under scheme 2, the traffic control at L1, L2 and L3 is intermittent passing rather than prohibition of passing. Different colours are used to distinguish the traffic control measures in Figure 6. Figure 6.2(c) represents the road segment model of case 2 under scheme 3, which selects the rescue vehicle road trajectory based on the SR method; the road trajectory of the rescue vehicle is P9-P10. The parameter values of the different schemes in case 2 are shown in Figure 9, and the reward values obtained by applying the RL-based optimal road trajectory approach under the three strategy schemes are shown in Figure 10. According to Figures 9 and 10, the optimal strategy in case 2 is scheme 1, and P9-P10 is the optimal road trajectory of the rescue vehicle. The road trajectory obtained by the RL method can make the rescue vehicle reach the accident point as soon as possible.
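The queue length at the accident point can be estimated from the capacity drop reported for the Qinling segment using the standard kinematic (shock) wave relation from LWR traffic flow theory; this is a sketch of that calculation with the paper's Table 1 values, not the paper's full traffic flow model:

```python
def shockwave_speed(q_up, k_up, q_dn, k_dn):
    """Speed (km/h) of the interface between two traffic states:
    w = (q_dn - q_up) / (k_dn - k_up), per LWR kinematic wave theory."""
    return (q_dn - q_up) / (k_dn - k_up)

# Values for the Qinling segment of the Xihan Freeway from the case study.
q1, k1 = 1571.0, 30.0     # normal conditions: pcu/h, pcu/km
s1, ks1 = 933.0, 95.0     # after the accident: pcu/h, pcu/km
duration_h = 10.0 / 60.0  # estimated accident duration: 10 min

w = shockwave_speed(q1, k1, s1, ks1)  # negative: the queue grows upstream
queue_km = abs(w) * duration_h
print(f"shockwave speed: {w:.2f} km/h, queue after 10 min: {queue_km:.2f} km")
```

With these inputs the queue tail propagates upstream at roughly 9.8 km/h, so a queue on the order of 1.6 km forms over the 10 min accident duration, which is why trajectory choice and traffic control location matter so much here.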

DISCUSSION
This paper introduces a modified reinforcement learning approach, based on the deep Q-learning algorithm, for choosing the optimal road trajectory. Taking an accident that occurred in the Qinling NO.1 tunnel as an example, the feasibility of the algorithm is tested. The resulting arrival time at the accident location is shortened, and the dissipation time of the congestion caused by the accident is also shortened. Both show that the proposed road trajectory could improve rescue effectiveness and reduce the influence on the road network.
As studied, the shortest way is not necessarily the fastest way; in particular, under heavy traffic flow the shortest-distance road trajectory may not reach the accident point quickly. Choosing the shortest road trajectory may cause more vehicle congestion and, in severe cases, even secondary accidents. Furthermore, driving fast while other vehicles give way to the rescue vehicle and reducing the impact on the road network are both important considerations, and a balance between them should be found.
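One way to encode this balance is a reward that weighs arrival time against network disruption. A minimal sketch follows, using the arrival times reported in the paper (22.9 min for the shortest path, 6.5 min for the RL route) but with hypothetical weights and hypothetical network-delay figures, since the paper's actual reward calibration is not reproduced here:

```python
def reward(travel_time_min, network_delay_veh_min, w_time=1.0, w_net=0.01):
    """Negative weighted cost: faster arrival and less network disruption
    both raise the reward; the weights set the balance between the two.
    w_time and w_net are illustrative, not the paper's calibrated values."""
    return -(w_time * travel_time_min + w_net * network_delay_veh_min)

# Shortest-distance route: 22.9 min arrival (from the paper), with a
# hypothetical 1500 veh-min of extra delay imposed on the network.
shortest = reward(travel_time_min=22.9, network_delay_veh_min=1500.0)
# RL-selected route: 6.5 min arrival (from the paper), hypothetical 300 veh-min.
rl_route = reward(travel_time_min=6.5, network_delay_veh_min=300.0)
print(shortest, rl_route)
```

Under any weighting of this form, a route that is both faster to the scene and less disruptive dominates, which is the trade-off the RL agent is rewarded for finding.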
Compared with the traditional optimal road trajectory method, in the reinforcement learning method the agent interacts with the environment by selecting a road segment, receives value feedback from the environment on the benefit of the selected segment, and adjusts its actions according to the reward given by the environment. This allows the driving trajectory of rescue vehicles to be adjusted dynamically in real time. Through the agent's interaction with the road environment, dynamic traffic data are obtained and the driving road trajectory of rescue vehicles is continuously adjusted, so that rescue vehicles can reach the accident point as soon as possible with less impact on the road network. Continuous interaction with the environment is the main feature of reinforcement learning, which can be used to solve many modern traffic and transportation problems in the future. The reinforcement learning method, which learns from experience to provide trajectory planning, is initially tested with satisfactory effect in this paper; its extension to more complex mountain freeway tunnel groups and other accident scenarios should be further studied in future work.