Survey on vehicle map matching techniques

With the development of location-based services and big data technology, vehicle map matching techniques are growing rapidly; they are fundamental to the study of global positioning system (GPS) data. Preprocessed GPS data provide the guarantee of high-quality data for research on mining passengers' points of interest and on urban computing services. Existing surveys mainly focus on map-matching algorithms, with little description of the key phases of sampling-data acquisition and of floating car and road data preprocessing in vehicle map matching systems. To address these limitations, the contributions of this survey on map matching techniques lie in the following aspects: (i) the background knowledge, function and system framework of vehicle map matching techniques; (ii) a description of floating car data and road network structure to clarify the detailed phases of map matching; (iii) the preprocessing rules, specific methodologies and significance for floating car and road data; (iv) a classification of map matching algorithms by sampling frequency and data information, together with an introduction to open-source GPS sampling data sets and the evaluation measurements of map-matching approaches; and (v) suggestions on data preprocessing and map matching algorithms for future work.


| INTRODUCTION
In this era of big data, a huge volume of data is produced every day and is utilized in research studies. After extracting and filtering the data, researchers apply specific rules and design mathematical models to analyze and discover knowledge from the data [1], in order to obtain information useful for people's daily activities. With the rapid development of location-based services, trajectory data [2,3], which represent the continuous movement of objects and include information such as latitude and longitude, are growing exponentially. These continuously growing trajectory data support intelligent transportation in several application scenarios. As shown in Figure 1(a), trajectory data can serve the upper application layer in intelligent transportation research, for example taxi passenger points-of-interest mining [4], urban road network updating [5], vehicle abnormal behaviour detection [6] and ride-sharing taxi scheduling [7]. Inevitably, the first step is to correct the deviation of trajectory data with map-matching techniques. The corrected data are then analyzed and applied in the application layer. In addition, the data generated by the upper applications can also be fed back to the data layer. In practice, there is a certain deviation between the actual driving routes and global positioning system (GPS) sampling data because of satellite positioning error [8]. Since vehicle navigation systems were proposed [9-11], the problems of inaccurate positioning and system abnormality have always existed, and several scholars have reported that a navigation system initially needs to perform GPS point correction on the electronic map so that the navigation function can be used effectively [12].
In the research of map matching, a Kalman filter was proposed for vehicle navigation [13], which mainly uses the covariance matrix to determine the specific location of the vehicle when the signal is blocked, thus completing the map-matching process for the vehicle position. Subsequently, based on the Kalman filter, a multihypothesis technique for map matching was proposed, built on the principle that a likelihood function is used to infer the target position of the vehicle [14]. Figure 1(b) describes the workflow of a vehicle intelligent navigation system. Intuitively, after the map-matching step, navigation functions such as route guidance, velocity control, route visualization and position location can work properly. As shown in Figure 1, map-matching technique is the core of navigation technology and plays an essential role in intelligent transportation and intelligent navigation systems [15,16].
In recent years, with the rapid development of wireless network technology, map-matching techniques have wide application scenarios, such as indoor navigation [17], unmanned aerial vehicle flight [18], robot training [19] and so on. Therefore, it is necessary to use map-matching techniques to rectify the deviation of positions w.r.t. trajectory data. As for the use of trajectory data, we focus on the floating car data (FCD) collected by moving vehicles in the map-matching environment due to its popularity in transportation systems.

As shown in Figure 2, map matching is a complete process including FCD preprocessing, road data preprocessing and selection algorithms [20]. The data preprocessing phase aims to process the original FCD and road data with a variety of methods so as to obtain ideal experimental results. The selection algorithms phase focuses on choosing an appropriate algorithm among many according to the actual situation. In the map matching phase, the task is to run the chosen map-matching algorithms on the processed experimental data. In practical applications, omitting this process significantly degrades the experimental results. The visualized results output phase is an additional step after map matching that visualizes the experimental results. Using geographic information system (GIS) tools [21], we can check whether the trajectory data are correctly mapped to the actual paths on the electronic map.
The research methodology of software engineering is very popular: it establishes the review goal and research method, and then draws conclusions [22-24]. In [25], researchers argued that a systematic framework can effectively assign plans and strategies to assist practitioners. Ghazi et al. [26] reported that the demand for empirical research in software engineering is growing, and followed software engineering research methods to identify the problems researchers face in investigation design, together with mitigation strategies. To review map-matching techniques comprehensively, we examine them from a new perspective, covering data preprocessing, selection algorithms, open source data sharing and evaluation measurements. From a technical point of view, we aim to summarize the original FCD and road data preprocessing rules, and to classify map-matching algorithms from different aspects. Overall, our contributions in this survey are as follows: (i) we give a detailed review of the data preprocessing process, which is rarely mentioned in map-matching studies; the operation methods and significance of the rules are presented in Tables 1, 2 and 3; (ii) we categorize map-matching algorithms according to GPS sampling frequency and data information; (iii) we provide open source trajectory data sets and common evaluation measurements: we not only describe the data sets but also provide their download websites, and we explain the evaluation measurements; and (iv) based on current trends in map-matching techniques, we outline future development directions.
The rest of the study is organized as follows: Section 2 provides the necessary definitions of map matching and analyzes the existing research work. Sections 3 and 4 present the methodology. Finally, Section 5 proposes future directions of map matching and draws conclusions.

Table 1. Rules of FCD noise filtering.

| Rules | Operation methods | Significance | References |
| --- | --- | --- | --- |
| Filtering redundance | Set a sliding window and delete the latter identical piece of data inside the window. | Delete redundant data to avoid duplicate points in subsequent map matching. | [36-40] |
| Removal of abnormality | Set the scope of trajectory areas and remove trajectory points beyond that area; setting speed thresholds to filter abnormal data is also popular. | Remove meaningless points to avoid increasing the matching error. | [36, 39-42] |

Table 2. Rules of FCD machining.

| Rules | Operation methods | Significance | References |
| --- | --- | --- | --- |
| Coordinate transformation | Convert the WGS-84 coordinate system into another coordinate system, such as Beijing 54; the main task is the conversion of parameters between different coordinate systems. | Place the experimental data in the same projection coordinate system. | [47-50] |
| Filling missing values | The average method fills with the mean value; the median method uses the median value; linear interpolation applies a linear formula from the previous point to calculate the next point; forward and backward interpolation fills directly with adjacent data; temporal interpolation focuses on time-series interpolation; the multiple filling method predicts several different values. | Make full use of data with missing values in order to improve map-matching accuracy. | [43, 51-54] |
| Reconstruction trajectories | Combine an inference model with the existing trajectory information to restore the target trajectory route. | The reconstructed trajectory is closer to the actual driving route and reduces the difficulty of map matching. | [41, 55-58] |
| Trajectory segmentation | Long trajectories are segmented according to rules of trajectory distance, time period, geometric topology, trajectory semantics, etc. | Segmented trajectories facilitate parallel map matching. | [28, 35, 59-61] |
| Geometric information indexing | Set index marks for position information; after the map is meshed, each grid can also be located quickly. | Accelerate the search of trajectory points in map matching. | [35, 62-65] |
| Data arrangement | Arrange the data according to the timestamps. | Restore disordered data to the original trajectory order. | [48, 66-68] |

Table 3. Rules of road data preprocessing.

| Rules | Operation methods | Significance | References |
| --- | --- | --- | --- |
| Road extraction | Carry out road extraction according to the road data distribution and the experimental environment. | Extract the required roads as the experimental data. | [69-73] |
| Road correction | Set a threshold to determine whether two segments need to merge. | Ensure the correctness of road data. | [69, 74-77] |
| Grid partition | Divide the map into grids of the same size; the mesh size can be customized according to requirements. | Speed up the search of candidate segments. | [78-82] |
| Topological construction | Construct a topological graph recording the connections between road segments and junctions. | Topology can better express the connection relationship between roads. | [83-87] |

vehicle, and the GPS trajectory of the vehicle can be determined according to the time information. The more fields FCD contains, the more conditions the matching algorithms can take into consideration, and the higher the matching accuracy.
Definition 2 (Road network structure) Because of the complexity of urban road network structure, a topology model can be used to clearly express the relationships between roads [27]. The original road data must be constructed into a topological graph; please refer to Section 3.2. Assuming that a point is a road junction and a line represents a road segment, all road network structures can be represented as shown in Figures 4 and 5. For example, consider two interlaced roads. One road, 'Beijing Road', contains five road segments and six road junctions numbered from 1 to 6. The other road, 'Nanjing Road', consists of four road segments and five road junctions numbered 3, 7, 8, 9 and 10. Straightforwardly, the intersection of 'Beijing Road' and 'Nanjing Road' is the third road junction. Given urban road network data, the network can be expressed as an undirected graph G = (V, E), where V is a set of vertices representing road segment intersections and E is a set of edges representing the road segments. In the road data preprocessing phase, the road segments can be stored by the methods proposed in [28,29], while the road segment intersections are stored by the method introduced in [30].
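As a minimal sketch of Definition 2, the undirected graph G = (V, E) can be stored as an adjacency list. The snippet below is illustrative Python (not from the surveyed works), with the junction numbering taken from the 'Beijing Road'/'Nanjing Road' example:

```python
from collections import defaultdict

def build_road_graph(segments):
    """Build an undirected graph G = (V, E) from road segments.

    Each segment is a pair (u, v) of junction IDs; the adjacency list
    maps every junction to the set of directly connected junctions.
    """
    graph = defaultdict(set)
    for u, v in segments:
        graph[u].add(v)
        graph[v].add(u)
    return graph

# 'Beijing Road': junctions 1-6; 'Nanjing Road': junctions 7, 8, 3, 9, 10
beijing_road = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
nanjing_road = [(7, 8), (8, 3), (3, 9), (9, 10)]
graph = build_road_graph(beijing_road + nanjing_road)
```

With this representation, junction 3, the intersection of the two roads, is the only vertex with four neighbours.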
Definition 3 (Candidate segment) In map matching, we need to find several segments as candidate objects for trajectory points, which are likely to be the actual path where the trajectory points are located. Typically, we set the search radius centred on the current trajectory point and search candidate segment within circular regions. The red dot is the trajectory point, and the circle is the search area. Trajectory segments searched by trajectory points include r 1 , r 2 , r 3 and r 4 . Correspondingly, the green dot is the matching point of the candidate segments, also known as the candidate matching point.
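The radius search of Definition 3 can be sketched as follows. This is an illustrative example assuming planar (projected) coordinates in metres; the helper names are our own, not from the surveyed works:

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to segment ab, plus the projection
    of p clamped onto the segment (the candidate matching point)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        proj = a  # degenerate segment: a single point
    else:
        t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
        t = max(0.0, min(1.0, t))
        proj = (ax + t * dx, ay + t * dy)
    return math.dist(p, proj), proj

def candidate_segments(p, segments, radius):
    """Return (segment_id, matching_point) pairs within the search radius."""
    hits = []
    for seg_id, (a, b) in segments.items():
        d, proj = point_segment_distance(p, a, b)
        if d <= radius:
            hits.append((seg_id, proj))
    return hits
```

A spatial index (see Section 3.1.2) would normally prune the segments before this distance test; the linear scan here keeps the sketch short.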

| Related work
Intuitively, when a vehicle is driving on a road, its complete trajectory should be correctly displayed on that road. However, owing to errors in satellite positioning systems, there are deviations between vehicle trajectories and roads. The map matching problem has been studied extensively over the last two decades, but there are few surveys of map-matching techniques, and most existing work has focused on designing map-matching algorithms. From a methodological perspective, Quddus et al. [31] reviewed map-matching approaches, covering geometric analysis, topological analysis, probabilistic map matching and advanced map-matching algorithms. Geometric analysis utilizes the distance relationship between trajectory points and roads for map matching. Topological analysis focuses on the connectivity of trajectories and adjacent roads. Probability-based map-matching algorithms use different mathematical models to find a trajectory that approximates the real path with maximum probability. Advanced map-matching algorithms can reduce positioning errors from multiple aspects, including the Kalman filter, fuzzy logic, the multiple hypothesis technique and so on. Based on the classification proposed in [31], Hashemi et al. [32] classified real-time map-matching algorithms, comparing and analyzing a variety of matching algorithms as well as the disadvantages of weight-based and advanced map-matching algorithms. Song [33] introduced a more comprehensive classification of map-matching algorithms, categorized by information, application environment and sampling frequency. Different from previous classification methods, Chao et al. [34] proposed an innovative classification based on matching models and real scenarios, covering similarity measurements and the state-transition, candidate-evolving and scoring models.
The similarity model contains two types, distance based and pattern based, corresponding to the distance of the points or trajectory segments projected to the nearest road. Based on a weighted topological graph of all possible vehicle driving routes, the main task of the state-transition model is to find an optimal path, as in the hidden Markov model, conditional random field and weighted graph technique. The candidate-evolving model maintains a set of candidates during map matching, evolving from the first point by constantly adding new candidates while pruning irrelevant ones; the particle filter and multiple hypothesis techniques are the most classical methods. The scoring model first partitions the road network into grids, scores the grids according to the timestamps and returns the grid result with the maximum score. Taking into consideration the noise characteristics of trajectory data, Gao et al. [35] introduced trajectory data preprocessing methods including data cleaning, trajectory compression, trajectory segmentation and map matching. Noise filtering and stay point detection methods can effectively remove outliers in the data cleaning phase. To reduce storage requirements and improve computational efficiency, both distance-based and offline compression methods work well for trajectory compression. Trajectory segmentation is beneficial for reducing computational complexity and commonly uses three strategies, based on time thresholds, geometric topology and trajectory semantics. In the map matching phase, the processed trajectory data and the electronic map are combined to transform the sampled sequences of GPS coordinates into road network coordinates. Real-time online matching and offline matching of historical static data are the two most popular approaches.
Although the map-matching problem has been studied for many years, many surveys have deficiencies in some aspects. For example, many researchers treat map matching purely as an algorithmic problem, ignoring the data preprocessing process. Data preprocessing is the initial phase of map matching and should be explored in detail. In addition, the classification of existing map-matching algorithms is not state-of-the-art, so it is necessary to propose a new classification method. The studies in [31-33] focus on different classifications of map-matching algorithms, but are not comprehensive studies of map matching. Moreover, open source GPS sampling data sets and evaluation measurements are rarely mentioned in map-matching surveys. Although they are not the main research content of this study, we believe that map-matching surveys should provide such information to facilitate the reader's understanding; the studies [34,35] in particular could provide more detail. Different from the existing work, we first make a comprehensive summary of data preprocessing, including FCD preprocessing and road data preprocessing. Then, we propose a new classification method for map matching. Finally, we introduce open source GPS sampling data sets and evaluation measurements. In summary, we cover the map-matching problem as comprehensively as possible.
In this section, we first give the necessary definition of map-matching techniques. Then, after a comprehensive review of the existing work, we propose a new classification method for map-matching techniques.

3.1 | FCD preprocessing

3.1.1 | Noise filtering
Results obtained by directly using the original, noisy FCD in experiments are often not ideal. In the data acquisition process, owing to the acquisition equipment's frequency or acquisition rules, it is inevitable that records regarded as noise are produced, such as redundant, abnormal and incomplete records. We introduce the noise filtering approaches summarized in Table 1.
Filtering redundance. When the speed is low or the vehicle stops, trajectory points may overlap and redundant data will be generated. Yan et al. [36] combined the speed and the distance between adjacent points to determine whether there are redundant points: if the distance between adjacent points does not exceed a threshold and the speed is less than 3 m/s, the current trajectory point is ignored. Huo et al. [37] argued that the probability of GPS producing identical records in a short time is very small, so identical records are deleted within a fixed time interval. Hui et al. [38] assumed that a position change of no more than 50 m within 30 min indicates parking behaviour and deleted redundant sampling points with the same timestamp. GPS records may contain data with the same time but different speeds, so only one set of such data is retained in [39,40].
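The distance-plus-speed rule of [36] can be sketched as follows. This is illustrative Python assuming projected coordinates in metres and speeds in m/s; the 3 m/s speed threshold follows [36], while the 10 m distance threshold and the field layout are hypothetical:

```python
def filter_redundant(points, min_dist=10.0, min_speed=3.0):
    """Drop a point when it is both close to the previously kept point
    and slow, following the distance-plus-speed rule described in [36].

    points: list of (x, y, speed_mps) tuples in projected metres.
    """
    if not points:
        return []
    kept = [points[0]]
    for x, y, v in points[1:]:
        px, py, _ = kept[-1]
        dist = ((x - px) ** 2 + (y - py) ** 2) ** 0.5
        if dist <= min_dist and v < min_speed:
            continue  # redundant: vehicle barely moved and is nearly stopped
        kept.append((x, y, v))
    return kept
```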
Removal of abnormality. In extreme situations, abnormal data will be generated. For example, if fewer than four satellites are visible or the vehicle speed exceeds 200 km/h, the records are considered abnormal and removed in [41]. In [36], FCD beyond the longitude and latitude limits are regarded as abnormal and removed, since valid points can only appear within the fixed area of the electronic map; it is only necessary to determine the coordinates of the lower-left and upper-right corners, which serve as the latitude and longitude limits of the electronic map. Speed limits can also be exploited, since urban roads have stringent limits, generally 40 km/h. Liu et al. [40] assumed that travel time follows a lognormal distribution and used the Laida (3σ) criterion to clean outliers. Wen et al. [39] deleted data with a speed of 0 because they cannot reflect driving behaviour. Wang et al. [42] filtered data with negative speeds and replaced them with zero for subsequent adaptive tracking of vehicles.
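A combined bounding-box and speed-cap filter in the spirit of [36, 41] might look like the following sketch (the record layout and thresholds are illustrative; note that [42] replaces negative speeds with zero rather than dropping the record):

```python
def remove_abnormal(records, bbox, max_speed_kmh=200.0):
    """Keep records inside the map bounding box with plausible speed,
    following the area and speed rules of [36, 41].

    records: (lon, lat, speed_kmh) tuples.
    bbox: (min_lon, min_lat, max_lon, max_lat), i.e. the lower-left and
    upper-right corners of the electronic map.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    return [
        (lon, lat, v) for lon, lat, v in records
        if min_lon <= lon <= max_lon
        and min_lat <= lat <= max_lat
        and 0 <= v <= max_speed_kmh
    ]
```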
Removing incomplete data. Ye et al. [43] considered that any records containing missing values should be removed, using only the complete records. When an attribute variable in one column of the data has missing values, the entire column attribute is deleted. Although attribute deletion can impact the integrity of the data set, this method is very effective when the amount of deleted data is very small relative to the whole data set. From the viewpoint of coordinates, Dong et al. [44] removed records whose latitude and longitude are zero. Records with missing values are deleted in [45,46] because they lack valid information and are ineffective.

| Machining data
Different machining methods can be selected for the raw data according to the actual situation so that the map-matching algorithms obtain positive results. Note that not every machining method applies to every real data set; it is advisable to choose the machining method according to the structure of the data. As shown in Table 2, we summarize some rules of machining data.
Coordinate transformation: To reduce the conversion error of the latitude and longitude data received by GPS navigation systems when transforming them onto the electronic map, Tan et al. [47], building on the seven-parameter transformation model, analyzed the effect of the distribution of public point positions on the transformation accuracy, establishing the relationship between the distribution density of the data and the transformation accuracy. The key to coordinate transformation is to solve for the transformation parameters and convert into the coordinate parameters of the corresponding region. Zhang et al. [48] noted that there are positioning errors across coordinate projection systems, and transformed the GPS WGS-84 coordinate system into the Beijing 54 or Xi'an 80 coordinate system using spatial coordinate system transformation. Li et al. [49] first transformed WGS-84 spatial coordinates into BJ-54 spatial coordinates, and then determined the plane coordinates by Gaussian projection. Zhang et al. [50] observed that errors arise when converting GPS measurement results to BJ-54 or Xi'an 80 coordinates, and proposed using the Gauss-Kruger projection to convert to plane coordinates.
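The seven-parameter model mentioned above can be sketched as a small-angle Bursa-Wolf transformation between geocentric Cartesian frames. This is an illustrative implementation only: rotation sign conventions vary between references, and real datum shifts such as WGS-84 to Beijing 54 require officially published parameters, which are not reproduced here.

```python
def helmert_transform(xyz, dx, dy, dz, rx, ry, rz, m):
    """Small-angle seven-parameter (Bursa-Wolf) transformation: three
    translations (metres), three rotations (radians) and a scale factor
    m (unitless, e.g. ppm * 1e-6), applied to geocentric coordinates.

    Follows the common position-vector rotation convention.
    """
    x, y, z = xyz
    s = 1.0 + m
    x2 = dx + s * (x - rz * y + ry * z)
    y2 = dy + s * (rz * x + y - rx * z)
    z2 = dz + s * (-ry * x + rx * y + z)
    return (x2, y2, z2)
```

With all rotations and the scale set to zero, the model degenerates to a pure translation, which is a convenient sanity check.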
Filling missing values: Missing values, as a factor affecting the quality of data mining, cannot be ignored. In [51], five filling methods are summarized, mainly using interpolation and filling functions: (a) average method: replace missing values with the average of the entire sequence; (b) median method: use the median of a certain span or of all valid values to replace the missing values; (c) linear interpolation: interpolate between the positions before and after the missing value using a linear formula; (d) forward and backward interpolation: fill in the missing value with the previous or the following value; and (e) temporal interpolation: apply the value closest in time to a linear model in order to fill the missing values. Ye et al. [43] argued that multiple filling has advantages over single-value filling: it constructs multiple estimates for each missing value and forms several complete data sets, simulating the distribution of the missing data as closely as possible to the real situation while maintaining the original relationships between variables. Cai et al. [52] formulated rules for the missing distance and time of trajectories, and interpolated the missing trajectory points at equal distances according to the number of missing points. Yin et al. [53] proposed a model using long short-term memory networks to address the poor effectiveness of traditional completion methods. By comparing various interpolation methods, Wang et al. [54] found that a regularized algorithm has a better interpolation effect.
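Rules (c) and (d) can be sketched in a few lines of illustrative Python (a `None` entry marks a missing value; the function name is our own):

```python
def fill_missing(series):
    """Fill None gaps by linear interpolation between the nearest valid
    neighbours (rule c); edges fall back to forward/backward filling
    (rule d)."""
    vals = list(series)
    n = len(vals)
    valid = [j for j in range(n) if series[j] is not None]
    for i in range(n):
        if series[i] is not None:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is not None and right is not None:
            t = (i - left) / (right - left)  # position within the gap
            vals[i] = series[left] + t * (series[right] - series[left])
        elif left is not None:
            vals[i] = series[left]   # fill forward from the previous value
        elif right is not None:
            vals[i] = series[right]  # fill backward from the next value
    return vals
```

For example, `fill_missing([1.0, None, None, 4.0])` interpolates the gap to approximately `[1.0, 2.0, 3.0, 4.0]`.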

HUANG ET AL.
Reconstruction trajectories: When the acquisition time interval is long or signal shielding occurs, discontinuities appear in the trajectory points. To improve the uniformity and standardization of the trajectory data, and to obtain the complete motion trajectory of the vehicle from limited trajectory data, Sheng et al. [55] designed an improved Hermite interpolation algorithm to perform temporal interpolation on the trajectory points after correcting the errors. The key of this method is to use a polynomial function to restore the trajectories while adding restrictions on vehicle speed and driving distance. Zeng et al. [41] compensated the data using the dead reckoning method, which uses the latitude, longitude, velocity, angle and positioning time interval of the previous point to calculate the position of the next point. Zhao et al. [56] considered the effect of errors between measurement points on the reconstructed trajectory, and combined the 3σ criterion with inverse distance weighted interpolation to generate a new trajectory. Shi et al. [57] used Lagrange fifth- and third-order polynomial interpolation to estimate the trajectory points, and then used the Kalman filter algorithm to remove trajectory noise. Wang et al. [58] first calculated statistical information of the driving mode with a probability distribution, then used maximum likelihood estimation to reconstruct the trajectory.
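The dead reckoning step of [41] can be sketched as follows. This is illustrative: planar coordinates in metres and a heading measured clockwise from north are assumed, whereas [41] works directly with latitude and longitude:

```python
import math

def dead_reckon(x, y, speed_mps, heading_deg, dt_s):
    """Estimate the next position from the previous point's speed,
    heading and the sampling interval (dead reckoning, as in [41]).

    Planar coordinates in metres; heading clockwise from north (y axis).
    """
    theta = math.radians(heading_deg)
    dist = speed_mps * dt_s
    return (x + dist * math.sin(theta), y + dist * math.cos(theta))
```

A point moving east (heading 90 degrees) at 10 m/s for 5 s thus advances 50 m along the x axis.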
Trajectory segmentation: By dividing a trajectory into several segments, the data can be displayed in detail and clearly, which is especially beneficial for heuristic matching algorithms. Zhang et al. [28] required that the trajectory does not self-intersect and that its length is greater than a threshold, and partitioned the trajectory into multiple segments without loops. According to speed variables, Yang et al. [59] divided long time-series trajectories into three states: high speed, medium speed and low speed. The partitioned trajectory segments are then used to find homomorphic trajectories and merge them into subtrajectories. In addition, partitioning methods based on time thresholds, geometric topology and trajectory semantics are also mentioned in [35]. Zhang et al. [60] used a classification and regression tree to segment the original trajectory while calculating the synchronized Euclidean distance (SED) between adjacent trajectory points; the subtrajectories are then merged and simplified by SED. Zhang et al. [61] calculated the change of velocity and direction between adjacent points and segmented the trajectory when a threshold is exceeded.
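The speed-and-direction rule of [61] can be sketched as follows (illustrative Python; the thresholds and the sample layout are hypothetical):

```python
def segment_trajectory(points, max_dv=5.0, max_dheading=45.0):
    """Split a trajectory where the change in speed or heading between
    adjacent points exceeds a threshold, following the rule of [61].

    points: list of (speed, heading_deg) samples; returns lists of indices.
    """
    if not points:
        return []
    segments, current = [], [0]
    for i in range(1, len(points)):
        dv = abs(points[i][0] - points[i - 1][0])
        dh = abs(points[i][1] - points[i - 1][1]) % 360
        dh = min(dh, 360 - dh)  # smallest angular difference
        if dv > max_dv or dh > max_dheading:
            segments.append(current)  # close the current segment
            current = [i]
        else:
            current.append(i)
    segments.append(current)
    return segments
```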
Geometric information indexing: Finding a trajectory directly on an electronic map is difficult. A geographic location is a tuple containing longitude and latitude that can be converted into other representations for localization. Kang et al. [62] used the Geohash technique to encode the longitude and latitude of the trajectory points in order to find candidate segments. This method only needs to search the nine nearby mesh cells, which greatly reduces the time complexity. Gao et al. [35] introduced index structures such as the R-tree, B-tree and K-D tree. Other researchers proposed gridding methods to index the trajectory data [63,64]. Hong et al. [65] used MapInfo software to establish a spatial index search based on a single-layer grid, using two staggered mesh division methods.
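The 'nine nearby cells' search can be sketched with a plain grid hash instead of Geohash. This is an illustrative sketch; the 0.002-degree cell size echoes the grid unit used in [78], and all names are our own:

```python
def grid_key(lon, lat, cell=0.002):
    """Map a coordinate to its grid cell after the map is meshed into
    equal cells of `cell` degrees (cf. [63, 64, 78])."""
    return (int(lon // cell), int(lat // cell))

def build_grid_index(points, cell=0.002):
    """Index point IDs by grid cell."""
    index = {}
    for pid, (lon, lat) in points.items():
        index.setdefault(grid_key(lon, lat, cell), []).append(pid)
    return index

def nearby(lon, lat, index, cell=0.002):
    """Collect points from the query cell and its eight neighbours,
    the 'nine mesh' search mentioned above."""
    cx, cy = grid_key(lon, lat, cell)
    hits = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            hits.extend(index.get((cx + dx, cy + dy), []))
    return hits
```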
Data arrangement: Sorting is necessary because processed data can become disordered. The most popular way to sort trajectory data is to arrange it according to the timestamps. The studies [48,66,67] sorted the data in ascending order of sampling time, so that the trajectory of each vehicle runs from departure to destination. Alternatively, Zhong et al. [68] sorted the trajectory points according to the principle of nearest neighbour distance to form an ordered trajectory.

| Road data preprocessing
Road data are necessary for map matching because the trajectory points are ultimately displayed on an electronic map containing road data. Most electronic maps, such as 'Baidu Map' and 'Google Map', are built on road data. Inevitably, road data can also be missing or wrong owing to slow road renewal and staff mapping errors. As with FCD, road data need to be preprocessed. As shown in Table 3, we summarize the methods of road data processing.
Road extraction: The goal is to extract the required paths. Luo et al. [69] argued that path planning algorithms need to work on a connected road network, and combined high-grade and low-grade networks to ensure connectivity, covering national highways, provincial roads, county roads, expressways and so on. Zhang et al. [70] used regular expressions to quickly match the node information of road files to obtain the road networks; if a road contains multiple segments, each segment is treated as an object. Yu et al. [71] explored common road network models and constructed five simplified models, such as the rectangle and tree branch. Huang et al. [72] extracted road data such as highways and ramps, and carried out operations such as deduplication, data integration and data segmentation. Zhang et al. [73] classified OpenStreetMap data by tags, and then used regular expressions to match node information to obtain multiple road objects.
Road correction: Before constructing the topological graph of the road data, we need to check whether the road data are correct. The work in [69] focused on repairing road separation: after setting a distance threshold, two separated segments are merged when the distance between their end points is less than the threshold. Considering the shortcomings of map information in electronic maps, Sun et al. [74] proposed a buffer analysis method that repairs the road data with a specified distance threshold; instead of changing the raw road data, they added supplementary information to the property tables to generate new layers. To meet the needs of traffic management, Wang et al. [75] simplified the original road data, performing multilane fragment removal, off-ramp removal on highways and multilane merging. Zeng et al. [76] deleted pseudo nodes and isolated arcs from the road data in order to obtain a connected road network. Deng et al. [77] argued that traditional methods of extracting road intersections are not fine-grained enough, and proposed a trajectory point clustering method that uses local spatial autocorrelation analysis to detect trajectory steering-angle hotspots, then applies adaptive spatial point clustering to identify the locations of intersections.
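The threshold-based merge test of [69, 74] can be sketched as follows (illustrative Python; planar coordinates in metres and a 5 m threshold are assumed):

```python
import math

def should_merge(seg_a, seg_b, threshold=5.0):
    """True when the closest pair of end points of two separated road
    segments lies within the distance threshold, the merge rule of [69].

    Each segment is given by its two end points in projected metres.
    """
    gap = min(math.dist(p, q) for p in seg_a for q in seg_b)
    return gap < threshold
```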
Grid partition. Grid division focuses on partitioning the map into equivalent blocks. Fan et al. [78] partitioned the longitude and latitude of the map equally in units of 0.002 degrees. Wang et al. [79] used a tree structure to divide the map step by step until the grids can no longer be partitioned. Li et al. [80] proposed quickly and accurately locating candidate road segments: they divided the road data into grid areas of the same size to build a grid index, using the grid ID numbers to quickly identify all segments within the grids. Zhao et al. [81] proposed an adaptive basic grid division algorithm, which can effectively solve the problem of road search errors in equidistant grid division. Liu et al. [82] proposed using multiprocess technology to cut the original raster image to improve the speed of map division.
Topological construction: Topological construction aims to establish the connectivity relationships between roads. In [83,84], the roads are broken at intersection points and divided into several small segments through the SuperMap GIS platform. In [85], road data are partitioned at segment intersections into 'node-segments', and the topology is constructed using breadth-first traversal. Chen et al. [86] proposed a dual structure consisting of a data storage layer and a dynamic topological layer, noting that the topological relations of object space change dynamically. To express the relationship between segments and nodes, Li et al. [87] first recorded segments and nodes separately in tables, and then merged nodes with the same ID to construct the road topology.
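The node-merging step described in [87] can be sketched as building a node-segment adjacency map. This is an illustrative sketch, not the cited implementation; using rounded coordinates as node IDs is our assumption (the paper merges by explicit node ID tables).

```python
from collections import defaultdict

def build_topology(segments):
    """Construct a node-segment adjacency graph: endpoints that resolve
    to the same node ID are merged into one intersection node."""
    def node_id(p):
        return (round(p[0], 5), round(p[1], 5))   # snap endpoints together

    graph = defaultdict(set)
    for seg_id, pts in segments.items():
        a, b = node_id(pts[0]), node_id(pts[-1])
        graph[a].add(seg_id)
        graph[b].add(seg_id)
    return graph

segs = {"s1": [(0.0, 0.0), (1.0, 0.0)],
        "s2": [(1.0, 0.0), (1.0, 1.0)],
        "s3": [(1.0, 0.0), (2.0, 0.0)]}
topo = build_topology(segs)
# the shared endpoint (1.0, 0.0) becomes one intersection node of degree 3
assert topo[(1.0, 0.0)] == {"s1", "s2", "s3"}
```

Once this adjacency map exists, the breadth-first traversal of [85] or any shortest-path search runs directly on it.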
In this section, we first classify FCD preprocessing rules and introduce each processing rule in detail. Then, we summarize the rules of road data preprocessing.

| SURVEY OF SELECTION ALGORITHMS
As map-matching techniques grow rapidly, algorithms based on geometric principles, topological structure, probability statistics and machine learning have been applied to map matching and constitute its key techniques [88][89][90][91]. In this section, we classify map-matching algorithms by sampling frequency and by data information. In addition, open source GPS sampling data sets and evaluation measurements are given below.

| Classification of map-matching algorithms based on sampling frequency
In general, vehicles with GPS positioning devices record GPS information at a certain frequency. To limit energy consumption and communication cost, GPS information is usually recorded continuously at a low frequency. As shown in Figure 6, Yuan et al. [92] analyzed the data generated by more than 10,000 taxis in Beijing over a week: data with a sampling interval within 1 min account for 34% of the whole and are regarded as high-frequency sampling data, while data with a sampling interval of more than 1 min are considered low-frequency sampling data, with a proportion of 66%. As described in Table 4, Gao et al. [93] viewed sampling intervals between 1 s and 10 s as high-frequency sampling data, more than 30 s as low-frequency sampling data, and more than 2 min as ultra-low-frequency sampling data.
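The frequency bands of [93] can be applied to a trajectory as a simple classifier. A minimal sketch, assuming the trajectory is labelled by its median sampling interval (the paper does not prescribe the statistic) and that intervals of 10-30 s, which fall between the published bands, are left unclassified:

```python
def classify_sampling(timestamps):
    """Label a trajectory by its median sampling interval in seconds,
    following the bands of [93]: 1-10 s high, >30 s low, >2 min ultra-low."""
    gaps = sorted(b - a for a, b in zip(timestamps, timestamps[1:]))
    median = gaps[len(gaps) // 2]
    if median <= 10:
        return "high"
    if median > 120:
        return "ultra-low"
    if median > 30:
        return "low"
    return "unclassified"   # 10-30 s falls between the published bands

assert classify_sampling([0, 5, 10, 15]) == "high"
assert classify_sampling([0, 60, 120, 180]) == "low"
assert classify_sampling([0, 180, 360]) == "ultra-low"
```

Such a label is useful as a dispatcher: it decides whether a trajectory is routed to the high-frequency or low-frequency algorithms surveyed below.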

| High-frequency sampling data-based algorithm
High-frequency sampling data have the characteristics of dense points, small changes of heading angle and smooth curves, which can be fully exploited to improve the accuracy of map matching. Based on the topological relationship and connectivity of the road network, Zeng et al. [94] exploited the density of trajectory points and used a shortest-path search algorithm to obtain candidate road segments, because taxi drivers prefer the shortest route. The denser the trajectory points, the fewer candidate sections need to be searched and the higher the map-matching accuracy. Chen et al. [95] used the relationship between the heading angle of adjacent trajectory points and the candidate segments to find road junctions: if the angle is greater than 45 degrees, the point is regarded as a road junction. Because the variation of heading angle of high-frequency sampling data is small, the intersection can be found accurately in their experiments, but performance is poor on low-frequency sampling data. Liu et al. [96] proposed two different matching algorithms for maps with road segments. When road complexity is considered, the trajectory points are input into the algorithm model to get the results, which are fed back to the model; otherwise, the prediction results are obtained directly by fitting a formula. Both methods effectively solve map matching on Y-shaped zigzag roads.

TABLE 4 Classification of sampling data by sampling frequency

| Type of sampling | Sampling frequency |
| --- | --- |
| High-frequency sampling | 1-10 s |
| Low-frequency sampling | More than 30 s |
| Ultra-low-frequency sampling | More than 2 min |
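The 45-degree junction rule attributed to [95] can be sketched on planar coordinates. This is an illustration of the rule only, not the authors' implementation; real GPS data would first be projected to a planar frame.

```python
import math

def detect_junctions(points, angle_threshold_deg=45.0):
    """Flag indices where the heading change between consecutive moves
    exceeds the threshold, marking likely road junctions."""
    def heading(p, q):
        return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0]))

    junctions = []
    for i in range(1, len(points) - 1):
        h1 = heading(points[i - 1], points[i])
        h2 = heading(points[i], points[i + 1])
        turn = abs((h2 - h1 + 180) % 360 - 180)   # smallest angle difference
        if turn > angle_threshold_deg:
            junctions.append(i)
    return junctions

# straight east, then a 90-degree turn north at index 2
track = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
assert detect_junctions(track) == [2]
```

The modulo trick normalises heading differences into (-180, 180], so a turn from 350° to 10° correctly counts as 20°, not 340°; this is exactly why the rule degrades on low-frequency data, where large gaps between fixes make consecutive headings unreliable.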
Zhang et al. [97] concentrated on weighted formulas for angle, distance and slope; because this approach essentially relies on the smoothness of the trajectory curve, high-frequency sampling data are required. If ultra-low-frequency sampling data are used instead, a 'zigzag' phenomenon appears.
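A weighted candidate score in the spirit of [97] might look as follows. The weights and the normalisation constants are illustrative assumptions, not values from the paper.

```python
def candidate_score(dist_m, angle_diff_deg, slope_diff,
                    w_d=0.5, w_a=0.3, w_s=0.2):
    """Weighted combination of distance, angle and slope agreement
    (illustrative weights; each term is normalised into [0, 1])."""
    s_dist = max(0.0, 1.0 - dist_m / 50.0)          # zero beyond 50 m
    s_angle = max(0.0, 1.0 - angle_diff_deg / 90.0)  # zero beyond 90 degrees
    s_slope = max(0.0, 1.0 - abs(slope_diff))
    return w_d * s_dist + w_a * s_angle + w_s * s_slope

# a nearby, well-aligned segment outscores a distant, misaligned one
near = candidate_score(5, 10, 0.1)
far = candidate_score(40, 60, 0.8)
assert near > far
```

The candidate segment with the highest score is then selected; the reliance of the angle and slope terms on smooth, densely sampled curves is what restricts this family of methods to high-frequency data.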

| Low-frequency sampling data-based algorithm
Low-frequency sampling data have the disadvantages of point sparsity and large curve fluctuation, so map-matching algorithms for them generally take an incremental, global or other perspective. Sheng et al. [55] found that the accuracy of the topology searching method decreases as the sampling interval grows. They improved the Hermite interpolation method and used single-zero and multi-zero-segment linear interpolation to obtain matching results by restoring the vehicle route. The concept of reliable points for screening candidate segments is proposed in [98]: by comparing the ratio of the top two candidate road segments ranked by transmission probability, the remaining candidate road segments are removed when the ratio is greater than an empirical value, thus reducing the calculation of the transition probability. In addition, the A* algorithm and a sliding window are used for real-time matching. Lou et al. [99] proposed the spatio-temporal (ST) matching algorithm, which considers both spatial and temporal factors. The spatial analysis function takes into account the distance from the trajectory point to the road segments and to the adjacent candidate matching points, while the temporal analysis function combines the factors of road segment length and constraint speed. The product of the two functions is taken as the score of a candidate segment, and the segment with the highest score is taken as the matching result. Based on the work in [99], Yuan et al. [92] adopted an interactive voting approach in which the candidate matching points vote with each other to eliminate matching errors on adjacent segments. Compared with the ST-matching algorithm, the proposed algorithm is more robust. As shown in Table 5, we summarize the different characteristics of high-frequency and low-frequency sampling data. Although there are many kinds of algorithms, sampling data share similarities, so we summarize the algorithm classification from the point of view of data characteristics and advantages.

TABLE 5 Classification of map-matching algorithms by sampling frequency

| Classification | High-frequency sampling data-based algorithm | Low-frequency sampling data-based algorithm |
| --- | --- | --- |
| Features of data sets | Dense trajectory points, small change of heading angle and smooth curve. | Sparse trajectory points and large curve fluctuation. |
| Advantage | Dense trajectory points can be used to screen candidate segments and reduce computation; the small change in heading angle can be used to find road junctions; curve smoothness preserves the integrity of the trajectory and reduces the difficulty of map matching. | Map matching from the perspective of probability statistics performs better on sparse trajectories. |

TABLE 6 Classification of map-matching algorithms by data information

| Algorithm | Citations | Principle | Scope |
| --- | --- | --- | --- |
| Geometric theory | [88] | Comparing the trajectory curve and the slope of the candidate segment | Incremental |
| | [100] | Construct reasonable search ranges to obtain candidate segments and then weight the distance and angle | Incremental |
| | [107] | Calculating the Frechet distance with discrete distance | Incremental |
| Topology | [113] | The trajectory is divided into segments without loops, and matching results are obtained by combining the road structure and custom rules | Incremental |
| | [87] | Build a topology on nodes, roads and intersections to achieve map matching using four different states | Incremental |
| | [89] | The trajectory points are matched to subpaths by a linear model | Incremental |
| Probability statistics | [108] | Set inference rules and weight them | Global |
| | [90] | Construct a search tree for the candidate sections and then evaluate the results using spatial and temporal information | Global |
| | [104] | A probability formula is used to search the candidate segments and the results are obtained by the correlation coefficient method | Global |
| | [105] | Using the Bayesian model to obtain the highest-score path | Global |
| Advanced model | [109] | Using position and direction as reasoning evidence and establishing a recognition function | Global |
| | [91] | Input information such as angle and position labels into the model and adjust the number of nodes in the hidden layer | Global |
| | [110] | Build historical trajectory experience bases and use the reward function to find the optimal path | Global |
| | [111] | Using the weighted combination of spatial similarity, shortest path distance and direction similarity as the fitness function to output the optimal results | Global |
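The score-as-product idea of ST-matching [99] can be sketched compactly. This is a simplification under stated assumptions: the paper's temporal analysis uses cosine similarity over speed vectors, whereas here the temporal factor is just the ratio between the implied speed and the segment's constraint speed.

```python
def st_score(spatial_prob, segment_length_m, travel_time_s, speed_limit_mps):
    """ST-matching style score: spatial analysis value multiplied by a
    temporal factor comparing implied speed with the constraint speed
    (simplified sketch, not the exact formulation of [99])."""
    implied_speed = segment_length_m / max(travel_time_s, 1e-9)
    temporal = (min(implied_speed, speed_limit_mps) /
                max(implied_speed, speed_limit_mps))
    return spatial_prob * temporal

# a candidate whose implied speed matches the road's limit scores higher
plausible = st_score(0.8, segment_length_m=300, travel_time_s=20,
                     speed_limit_mps=15)
implausible = st_score(0.8, segment_length_m=2000, travel_time_s=20,
                       speed_limit_mps=15)
assert plausible > implausible
```

The temporal factor is what rescues sparse data: even when two candidates are spatially equidistant, a path that would require an implausible speed between consecutive fixes is demoted.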

| Classification of map-matching algorithms based on data information
As shown in Table 6, according to the data information contained in the data sets, map-matching algorithms can be categorized into four types: geometric principle [88,100,101], topology [28,89,102,103], probability statistics [90,104-106] and advanced model [91,107-112]. These algorithms can also be regarded as incremental or global methods according to the number of sampling points used in map matching: the incremental method first determines the previous point and then predicts the next point, while the global method takes the characteristics of the whole trajectory into account to determine the output results.

| Geometric principle
In general, map-matching algorithms based on geometric principles analyze the point-line and line-line relationships, using distance, similarity and other factors to find the optimal path. In [88], by comparing the trajectory curve and the slope of the candidate segment, the least squares method is used to fit the adjacent four trajectory points to the candidate segment. To improve the accuracy of real-time matching in navigation systems, Teng et al. [100] proposed a new algorithm integrating ST proximity and an improved weighted circle. The product of vehicle speed and travel time is used as the search radius: an elliptical region is constructed with the adjacent trajectory point as the centre, and the roads in the intersecting region are the candidate segments. Because vehicle speed changes dynamically, the search region also changes with speed. Then, in combination with the established rules, the distance from the trajectory point to a candidate segment and the angle between the driving direction and the candidate segment can be determined. Zheng et al. [107] calculated the Frechet distance with discrete distance, which can determine the relationship between the trajectory curve and the road curve. Taking distance as a factor, uncertainty inference in a cloud model is used to obtain the matching similarity of road segments.
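The point-line relationship underlying all of these geometric methods reduces to the distance from a GPS fix to a road segment. A minimal sketch on planar coordinates (real systems project longitude/latitude first):

```python
def point_to_segment_distance(p, a, b):
    """Distance from GPS point p to road segment ab, with the
    projection clamped to the segment's extent."""
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    # parameter of the projection, clamped so the foot stays on the segment
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    cx, cy = ax + t * dx, ay + t * dy          # closest point on segment
    return ((px - cx) ** 2 + (py - cy) ** 2) ** 0.5

assert point_to_segment_distance((1, 1), (0, 0), (2, 0)) == 1.0
assert point_to_segment_distance((3, 0), (0, 0), (2, 0)) == 1.0   # past the end
```

The clamp on `t` matters near intersections: without it a fix just past a segment's end would report its perpendicular distance to the segment's infinite extension and be matched to the wrong road.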

| Topology
The topology-matching algorithm builds the road network and combines it with a local search algorithm to find the optimal path. Zhang et al. [113] split the trajectory into segments without loops and used the stack-stored road topology to find candidate segments; a custom rule then filters the candidate segments to obtain the matching results. Li et al. [87] took nodes, roads and intersections as objects to establish the spatial topological relationship between points and lines. They divided the matching process into four different states and implemented a corresponding matching algorithm for each state. In [89], sensors are used to receive road network data from nearby vehicles; after the map data of multiple vehicles are used to obtain a subpath, the matching position is computed by a linear model. Because the approach was only simulated in an ideal environment, sensor delay and GPS accuracy will affect the algorithm in practice.

| Probability statistics
Probability statistics algorithms are designed to find the matching path with the maximum probability. Tuo et al. [90] used candidate segments to construct map search trees, which have multiple branch paths from root to end and cannot be traced back. Then, considering the spatial and temporal information of the trajectory, they evaluated each branch in the map-searching tree. Li et al. [104] proposed a probability formula to establish an elliptical region for searching candidate segments. According to the characteristics of the roads, the correlation coefficient method is used to select the candidate segment, and the deviation of the matching points is rectified on the road segment with the largest coefficient. Cheng et al. [105] observed a probability relationship between the vehicle heading angle at a trajectory point and the road segment. Applying Bayes' theorem, they designed mathematical models that take the highest-scoring candidate segment as the matching result. Yao et al. [108] used membership functions to fuzzify three input variables (average error, projection distance and velocity) to obtain fuzzy subsets, and designed inference rules for average angle and speed. They weighted the inference results and took the highest score as the matching path.

| Advanced model

In [91], LIBSVM is used to establish multiclassification models for the processed road network data and GPS sampling data, transforming map matching from a real-position prediction problem into a multiclassification problem in which the road network data contain label information such as road segment angle and position. These labelled data are input into the extreme learning machine algorithm, and the number of hidden layer nodes is then adjusted to get the output results. Sun et al.
[110] proposed a History Markov Decision Processes Q-Learning algorithm based on reinforcement learning and historical trajectories, which includes searching for the candidate point set, selecting the candidate path set, and solving for the optimal path. First, the candidate point set is formed by projecting each trajectory point onto the corresponding candidate segments. Then, if a historical track exists between adjacent candidate matching points, it is taken as the candidate path; otherwise, the shortest path between adjacent candidate points is used as the candidate path. Finally, distance, historical trajectory weight and connectivity are taken as reward functions, and reinforcement learning is used to solve for the optimal path. Gu et al. [111] proposed a map-matching method based on a multicriteria genetic algorithm. They used the weighted combination of spatial similarity, shortest path distance and direction similarity as the fitness function of the genetic algorithm, and iteratively minimized the objective function to output the optimal results.
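The global methods above, whether probabilistic or model-based, ultimately select the highest-scoring sequence over per-point candidate sets. A minimal Viterbi-style dynamic-programming sketch of that selection step (a generic illustration, not any single cited algorithm; `emit` and `trans` are placeholder scoring functions supplied by the caller):

```python
def best_path(candidates, emit, trans):
    """Select the highest-score candidate sequence by dynamic programming.
    candidates: list of candidate-segment lists, one per trajectory point;
    emit(i, c) scores point i on candidate c; trans(a, b) scores moving
    from candidate a to candidate b."""
    score = {c: emit(0, c) for c in candidates[0]}
    back = [{}]
    for i in range(1, len(candidates)):
        new, bp = {}, {}
        for c in candidates[i]:
            prev = max(score, key=lambda p: score[p] + trans(p, c))
            new[c] = score[prev] + trans(prev, c) + emit(i, c)
            bp[c] = prev
        score, back = new, back + [bp]
    # trace back from the best final candidate
    cur = max(score, key=score.get)
    path = [cur]
    for bp in reversed(back[1:]):
        cur = bp[cur]
        path.append(cur)
    return path[::-1]

cands = [["A", "B"], ["A", "B"], ["B", "C"]]
emit = lambda i, c: 1.0 if c == "B" else 0.5
trans = lambda a, b: 1.0 if a == b else 0.0
assert best_path(cands, emit, trans) == ["B", "B", "B"]
```

The cost is O(n·k²) for n points and k candidates per point, which is why the candidate-pruning tricks discussed earlier (reliable points, grid search) matter so much for global methods.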

| Introduction of open source GPS sampling data sets
In this section, we introduce some open source trajectory data sets, including related descriptions, download addresses and registration requirements, as shown in Table 7. We have performed download tests on all of these data sets and found each of them to be usable.

| Evaluation measurements
Map-matching algorithms are generally measured by the correctly matched percentage (CMP) and the time efficiency (TE). In addition, some researchers also use Recall, LenCMP and Circuity to measure the performance of map-matching algorithms.

CMP = (Num_c / Num_all) × 100%

where Num_c is the number of correctly matched points and Num_all is the total number of points to be matched.

TE = T_o − T_s

where T_o represents the time at which the application ends, and T_s is the application start-up time.

Recall = (Num_c / (Num_c + Num_l)) × 100%

where Num_l is the number of missed matches.

LenCMP = (Len_c / Len_all) × 100%

where Len_c is the length of the correctly matched path and Len_all is the length of the path actually driven by the vehicle.

Circuity = (|Len_e| / |Len_c + Len_e|) × 100%

where Len_e represents the length of the mismatched path. It is worth noting that the choice of measurement depends on whether the matching object is a trajectory point or a path length: CMP and Recall are recommended if trajectory points are the study object, whereas LenCMP and Circuity are often used for trajectory path measurement. Whichever the research object is, TE always appears in experiments as a basic metric.
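These measurements can be computed directly from matching output. A minimal sketch under stated assumptions: `matched`/`truth` hold per-point segment IDs with `None` marking a missed match, and Recall is read as Num_c / (Num_c + Num_l) from the definitions above.

```python
def evaluate(matched, truth, len_correct, len_total, len_error,
             t_start, t_end):
    """Compute CMP, Recall, LenCMP, Circuity and TE from raw counts,
    path lengths (metres) and wall-clock times (illustrative names)."""
    num_all = len(truth)
    num_c = sum(1 for m, t in zip(matched, truth)
                if m is not None and m == t)
    num_l = sum(1 for m in matched if m is None)     # missed matches
    cmp_pct = 100.0 * num_c / num_all
    recall = 100.0 * num_c / (num_c + num_l)
    len_cmp = 100.0 * len_correct / len_total
    circuity = 100.0 * abs(len_error) / abs(len_correct + len_error)
    te = t_end - t_start
    return cmp_pct, recall, len_cmp, circuity, te

m = ["r1", "r2", None, "r9"]               # one miss, one wrong match
t = ["r1", "r2", "r3", "r3"]
cmp_pct, recall, len_cmp, circuity, te = evaluate(m, t, 800, 1000, 200,
                                                  0.0, 2.5)
assert cmp_pct == 50.0 and len_cmp == 80.0
assert circuity == 20.0 and te == 2.5
```

Note how the two families diverge on the same run: the point-based CMP charges both the miss and the wrong match, while the length-based Circuity only sees the mismatched path length.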
In this section, we first classified map-matching approaches into sampling-frequency-based and data-information-based categories, and then introduced the open source trajectory data sets and evaluation measurements.

| DISCUSSIONS AND CONCLUSION
We conduct a comprehensive survey of the map-matching problem, covering data preprocessing and selection algorithms. We first point out the shortcomings of previous studies, and then propose a new classification method for the existing techniques. In addition, we introduce the open source data sets and evaluation measurements. Finally, we present new challenges for map-matching techniques. In conclusion, this study summarizes the current trends in map-matching techniques and provides guidance for future research directions.
Map-matching techniques are the basis of trajectory mining. With the development of the Internet, map-matching techniques, including data preprocessing and matching algorithms, face new challenges. To further advance vehicle map-matching techniques, the following aspects can be studied in the future.

| Challenge of data preprocessing
(i) Incomplete FCD is the most common problem in data preprocessing and increases the difficulty of map matching. Most researchers prefer to use vehicle speed and heading angle variables for map matching, but the deletion method and adjacent-value filling are inappropriate for them. Instead, we can analyze the overall trajectory curve and interpolate the missing values; if the amount of data is large enough, we can train a model on historical data to predict them. FCD with more attributes, such as vehicle slope and altitude, should be further explored and applied to map-matching algorithms in order to improve matching accuracy. (ii) Currently, the road network structure mostly appears in point-line form, which inevitably loses the integrity of the road. On three-dimensional roads, highways and other complex roads, this lack of road integrity increases matching error. When building a road network structure, we can add additional space to store complete roads. In addition, because open road data lag behind reality, we can cluster massive floating car data in order to restore the missing roads.

| Challenge of map-matching algorithms
Although most map-matching algorithms are based on high-frequency sampling data, map-matching algorithms based on low-frequency sampling data have attracted a lot of attention in recent years. We can focus on the following directions: (i) Due to the sparsity of low-frequency data, most studies are based on hidden Markov probability or advanced models, and there is little work on geometric principles and topological structures. We can apply algorithms based on geometric principles and topology by combining more attributes of low-frequency sampling FCD and road data. (ii) When searching for candidate segments, grid search is popular because it is fast. When candidate segments appear in extreme positions, most approaches continue searching the eight neighbouring grids to avoid missing candidates, but this increases the computational overhead. It is reasonable to construct a circular region, with the product of vehicle speed and driving time as the radius, to prune the search results of the nine grids and reduce the computational overhead. (iii) Increasingly mature distributed computing platforms can efficiently support large-scale data analysis and modelling. We can use cluster computing capabilities to improve the efficiency of map matching.
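The circular-region pruning suggested in (ii) can be sketched as follows, assuming planar coordinates in metres and illustrative parameter names: only neighbour cells that intersect the reachable circle of radius speed × time are kept.

```python
def filter_nine_grid(candidate_cells, centre, speed_mps, dt_s, cell_size_m):
    """Keep only the grid cells that intersect the circle of radius
    speed*dt around the last GPS fix (planar geometry for brevity)."""
    radius = speed_mps * dt_s
    kept = []
    for cx, cy in candidate_cells:
        # distance from the circle centre to the nearest point of the cell
        nx = min(max(centre[0], cx * cell_size_m), (cx + 1) * cell_size_m)
        ny = min(max(centre[1], cy * cell_size_m), (cy + 1) * cell_size_m)
        if ((centre[0] - nx) ** 2 + (centre[1] - ny) ** 2) ** 0.5 <= radius:
            kept.append((cx, cy))
    return kept

cells = [(dx, dy) for dx in range(3) for dy in range(3)]    # a 3x3 block
centre = (150.0, 150.0)                                      # middle cell
kept = filter_nine_grid(cells, centre, speed_mps=10, dt_s=6, cell_size_m=100)
assert (1, 1) in kept and len(kept) < 9
```

With a 60 m reachable radius and 100 m cells, the four corner cells are unreachable and drop out, so the candidate search touches five cells instead of nine; the saving grows when cells are large relative to the vehicle's reachable distance.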