Towards trustworthy multi-modal motion prediction: Holistic evaluation and interpretability of outputs

Predicting the motion of other road agents enables autonomous vehicles to perform safe and efficient path planning. This task is very complex, as the behaviour of road agents depends on many factors and the number of possible future trajectories can be considerable (multi-modal). Most prior approaches proposed to address multi-modal motion prediction are based on complex machine learning systems that have limited interpretability. Moreover, the metrics used in current benchmarks do not evaluate all aspects of the problem, such as the diversity and admissibility of the output. In this work, we aim to advance towards the design of trustworthy motion prediction systems, based on some of the requirements for the design of Trustworthy Artificial Intelligence. We focus on evaluation criteria, robustness, and interpretability of outputs. First, we comprehensively analyse the evaluation metrics, identify the main gaps of current benchmarks, and propose a new holistic evaluation framework. We then introduce a method for the assessment of spatial and temporal robustness by simulating noise in the perception system. To enhance the interpretability of the outputs and generate more balanced results in the proposed evaluation framework, we propose an intent prediction layer that can be attached to multi-modal motion prediction models. The effectiveness of this approach is assessed through a survey that explores different elements in the visualization of the multi-modal trajectories and intentions. The proposed approach and findings make a significant contribution to the development of trustworthy motion prediction systems for autonomous vehicles, advancing the field towards greater safety and reliability.


I. INTRODUCTION
The ability of human drivers to predict the motion of other road agents allows us to anticipate potentially dangerous situations and take preventive actions to minimise safety risks. It also allows humans to perform more efficient and comfortable maneuvers. It is therefore important that autonomous vehicles also have the capability to predict the motion of other road agents, so that they can apply predictive planning approaches and behave in a more human-like manner.
However, predicting future actions and motions of traffic participants is a very complex task, as the behaviour of road agents is influenced by many different variables and interactions [1], [2]. Furthermore, despite the fact that traffic environments are well structured (e.g. street layout, traffic rules), the number of possible future trajectories consistent with an agent's past trajectory can be considerable, whether for pedestrians, cyclists or vehicles. That is, the problem is multi-modal in nature.
In order to handle this complexity, most of the computational approaches proposed to address multi-modal motion prediction rely on very complex machine learning models that are far from interpretable. These models are not at human scale and suffer from opacity (i.e., they are black-box models). Besides, there is no consensus on the most important metrics that should be used to evaluate their performance. Different benchmarks propose different metrics and, in most cases, focus mainly on accuracy, omitting other relevant aspects such as robustness, diversity, or compliance with traffic rules.
Furthermore, in recent years it has been increasingly accepted that the design of complex learning-based systems must follow certain rules to ensure compliance, not only with traditional safety requirements, but also with general ethical grounds. This approach has recently been referred to as Trustworthy AI. It is a concept that encompasses multiple ethical principles, requirements, and criteria to guarantee that AI systems are designed following a human-centered approach and committed to social good [3]. The development of human-centric AI is now a common trend worldwide. For example, at EU level, the High Level Expert Group on AI (AI HLEG) appointed by the European Commission (EC) defined the main horizontal requirements [4] and criteria [5] to develop trustworthy AI systems, including elements such as human oversight, robustness and safety, privacy and data governance, transparency, fairness, well-being, and accountability. In April 2021, the EC presented the Proposal for a regulation laying down harmonized rules on AI (the AI Act [6]), which imposes a set of requirements for AI systems used in high-risk scenarios. Among other things, the AI Act states that the relevant accuracy metrics shall be fit for purpose, and that technical measures shall be put in place to facilitate the interpretation of the outputs of AI systems. At the US level, albeit with a different focus ("algorithms" and "automated decision systems" instead of "AI systems", and "critical decisions" instead of "high-risk scenarios"), the Algorithmic Accountability Act [7] was introduced in the US Senate and the House of Representatives in February 2022; it also imposes specific requirements on impact assessment, documentation, and performance evaluation of automated critical decision systems.

Authors' preprint, 2023. arXiv:2210.16144v2 [cs.RO], 5 Aug 2023.

A. Multi-modal motion prediction
between agents as edges [20]. There are several architectural design decisions that must be made in order to effectively represent the input data, model the interaction and finally represent the output trajectory distributions.
1) Representation of high-definition maps: Methods for motion prediction need to effectively represent both geometric information (static scene elements) and traffic agents (dynamic scene elements). The standard raster representation encodes the world as a stack of bird's eye view (BEV) images, also called high-definition (HD) maps [12], [13], [21], [22]. This approach is straightforward, as all the different types of input information (e.g., road configuration, state history of agents, spatial relationships) are unified in a multi-channel image, allowing the use of a Convolutional Neural Network (CNN). However, this approach is limited by the narrow receptive field of standard CNNs. Hence, rasterized representations have difficulty modeling long-distance interactions.
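As a toy illustration of the raster encoding described above (the channel layout, resolution, and function name are our own simplifications; real HD-map rasters encode many more semantic layers):

```python
import numpy as np

def rasterize_scene(road_mask, agent_histories, H=200, W=200):
    """Build a multi-channel BEV raster: one channel for the road layout
    plus one occupancy channel per history timestep (a simplified sketch).

    road_mask: (H, W) binary drivable-area mask.
    agent_histories: list of (T, 2) integer pixel trajectories (x, y).
    Returns a (1 + T, H, W) tensor suitable as CNN input.
    """
    T = len(agent_histories[0])
    channels = [road_mask.astype(np.float32)]
    for t in range(T):                      # one occupancy channel per timestep
        occ = np.zeros((H, W), np.float32)
        for traj in agent_histories:
            x, y = traj[t]
            if 0 <= y < H and 0 <= x < W:
                occ[y, x] = 1.0
        channels.append(occ)
    return np.stack(channels)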
2) Interaction modelling: There are two types of interactions that need to be modelled. First, the encoding of temporal sequential data, which is typically accomplished via Recurrent Neural Networks (RNNs), such as the Gated Recurrent Unit (GRU) [28] or the Long Short-Term Memory (LSTM) [29], or via temporal convolutions. Second, the interactions between the relevant agents and the environment, which are modelled via an attention mechanism [30]. Interaction modelling is closely related to the method used for scene encoding.
Based on the polyline representation, VectorNet [10], [18], [19] utilises only self-attention modules to directly learn the interactions between all the sub-graphs in the environment. Further, goal-oriented lane attention [31] emphasizes the relationship between agents and lanes. The split-joint attention mechanism, used by LaneGCN [26], captures the complex topology of lane graphs and long range dependencies. Deo et al. [11] propose the Prediction via Graph-based Policy (PGP) model, where interactions are modelled via lane-graph traversals. This approach combines discrete policy roll-outs with a lane-graph subset decoder, conditioning each prediction on the driver's goals. Finally, the network predicts trajectories by selectively attending to node encodings along paths traversed by the policy and a sampled latent variable.
3) Multi-modal output: To account for environmental uncertainty, models predicting future trajectories should represent multi-modal paths [21], [31], [32]. Multi-modality can either be modelled implicitly via latent variables, or addressed explicitly by generating multiple trajectory proposals. The first approach uses Gaussian Mixture Models (GMMs) [33] or Mixture Density Networks (MDNs) [34] to generate distributions over possible trajectories, and employs Variational Autoencoders (VAEs), more recently their conditional variant (CVAEs) [35], or GANs [33], [36], [37] to sample various future modes from latent variables. The main drawback of this approach is that the obtained predictions cannot be interpreted unambiguously, which lowers the understanding of the model's predictions.
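The latent-variable route can be illustrated with a toy GMM over goal points; the parameterization below is purely illustrative and not that of any cited model:

```python
import numpy as np

def sample_gmm_endpoints(weights, means, covs, n, rng=None):
    """Sample n candidate goal points from a 2D Gaussian mixture,
    illustrating the latent-variable view of multi-modality.

    weights: (K,) mixture weights summing to 1; means: K x (2,);
    covs: K x (2, 2); returns (n, 2) sampled BEV points.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Latent variable: which mode each sample comes from.
    comps = rng.choice(len(weights), size=n, p=weights)
    return np.stack([rng.multivariate_normal(means[c], covs[c])
                     for c in comps])
```

Each sample first draws a discrete mode (the latent variable), then a continuous endpoint from that mode's Gaussian.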
On the other hand, proposal-based methods design various possible proposals and thus separately solve the tasks of intention prediction and motion prediction. Some methods [10], [19] sample proposal points around the lane centerline to capture detailed information. The proposed sampling is based on carefully developed rules; the restriction on sharing proposals between different agents is one of them. In contrast, [23] uses lane segments from the map as proposals, which can clearly describe the fine-grained intentions of agents and be shared globally.
From the predicted paths, the most reliable trajectories are selected. Here, a variation of the Non-maximum Suppression (NMS) algorithm is widely used. The traditional version of this method is employed in TNT [19], in which each trajectory is sorted according to the method's scoring process. Then, the distance between different trajectories is calculated and a diverse set of trajectories with high scores is selected. Similar to TNT, LaneRCNN [27] treats a lane segment as an anchor and outputs each anchor's probability, using NMS to remove duplicate goals that are too close. This solution has its drawbacks, as a fixed threshold does not allow the model to maintain the balance between accuracy and multi-modality of the output. In contrast, DenseTNT [10] is an anchor-free goal-based model, which generates a set of predicted goals without relying on heuristic anchors.

B. Datasets
Experiments presented in subsequent sections are conducted on two publicly available trajectory forecasting benchmarks: the Argoverse v1.1 Motion Forecasting Dataset [8] and nuScenes [9]. These two datasets are the most widely used in the motion prediction task. The datasets differ from each other in many parameters, e.g., number of scenes, total driving time, number of objects, interaction complexity, or prediction time horizon. The prediction time horizon, in particular, plays a crucial role: long-term forecasting is inherently more difficult, and multi-modal forecasting appears to be more successful in this case. Correctly determining the driver's intentions and the future trajectory of the vehicle is essential for safe planning. This strongly depends on the overall quality of the dataset on which the algorithms were trained, so it is important to select appropriate data for future applications.
Argoverse [8], released in 2019 by Argo AI, is composed of 323,557 real-world driving sequences. Driving scenarios were collected in two American cities: Miami and Pittsburgh. Each example consists of 2 seconds of historical state and 3 seconds of future state, sampled at a frequency of 10 Hz. The entire collection of scenarios totals 320 hours. In each scene, one actor of interest is specified whose future movement is to be predicted. For training and validation, the focal agent's history, the location histories of nearby (social) actors, and HD map features are also provided. The semantic HD map provided is simplistic and consists of lane-based polylines.
nuScenes [9], released in 2020 by Motional, is a large-scale data collection for multi-agent trajectory forecasting with 1,000 scenes recorded in Boston and Singapore, where right-hand and left-hand traffic rules apply respectively. Each scene is annotated at a frequency of 2 Hz and is 20 s long. It contains up to 23 semantic object classes, including vehicles, bicycles and pedestrians as possible tracked agents, as well as HD semantic maps with 11 annotated layers. For this dataset, each agent has 2 seconds of observed trajectory and the prediction horizon is set to 6 seconds.
Overall, Argoverse, which tends to represent short, independent episodes, is more than fifty times larger in terms of total time than nuScenes. Despite the impressive size of the Argoverse dataset, previous works have observed that the recorded trajectories mostly represent straight-line trajectories within a full 5-second window [24], [38]. This makes the dataset much less challenging and diverse than nuScenes. The datasets also differ in the schematic representation of the HD maps. Argoverse presents a much simpler representation, while nuScenes is more comprehensive but complex. We simplified nuScenes representations to focus on the most representative information for the conducted study.

III. EVALUATION FRAMEWORK
In this section, we review the evaluation procedures of multi-modal prediction approaches. The presented metrics are differentiated between measures of precision, diversity, and admissibility (see Table I). Diversity refers to the degree of coverage of the output distribution: we seek diversity across distribution modes rather than multiple candidates representing a single intention (mode). It is important, however, to evaluate the admissibility of such predictions: they need to comply with the traffic rules.
Accuracy evaluation:
• Best-of-K Average Displacement Error (minADE): The minimum point-wise L2 distance to the ground-truth trajectory over all predicted trajectories.
• Scene minADE: Used when considering the prediction of multiple agents simultaneously. It measures the joint minimum L2 distance between predictions and ground truth over all the agents in the scene.
• Best-of-K Final Displacement Error (minFDE): The lowest L2 distance between the K predicted endpoints and the ground-truth endpoint at the prediction horizon, over all predicted trajectories.
• Scene minFDE: To some extent similar to scene minADE, it measures the joint minimum FDE over the whole scene.
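For concreteness, the Best-of-K displacement metrics above can be sketched in a few lines of NumPy (array shapes and the function name are our own conventions):

```python
import numpy as np

def min_ade_fde(preds, gt):
    """Best-of-K displacement errors.

    preds: (K, T, 2) array of K predicted trajectories over T timesteps.
    gt:    (T, 2) ground-truth trajectory.
    Returns (minADE, minFDE).
    """
    # Pointwise L2 distance of every mode to the ground truth: (K, T)
    dists = np.linalg.norm(preds - gt[None], axis=-1)
    ade = dists.mean(axis=1)        # average displacement per mode: (K,)
    fde = dists[:, -1]              # endpoint displacement per mode: (K,)
    return ade.min(), fde.min()
```

Both metrics keep only the single best of the K modes, which is what makes them blind to the rest of the output distribution.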
• Best-of-K Miss Rate (MR): Percentage of predictions whose maximum pointwise L2 distance to the ground truth is greater than a threshold. The Argoverse and nuScenes benchmarks consider a threshold of 2.0 m. However, they define this metric in different ways: whereas the nuScenes benchmark uses the above definition, Argoverse computes MR as the number of scenarios where none of the forecasted trajectories are within 2 m of the ground-truth endpoint. Waymo [39] differentiates between lateral and longitudinal thresholds, whose scale depends on the future time horizon and the initial speed.
• Heading error: The difference in heading angle between the predicted trajectory and the ground-truth trajectory at the final point.
• Mean Average Precision (mAP) [39]: The area under the precision-recall curve, obtained by applying confidence-score thresholds across a validation set. It uses the same definition of a miss as MR, considering any missed prediction a false positive. Only one true positive is allowed for each prediction, assigned to the highest-confidence prediction. The final mAP is averaged over eight different semantic buckets or driving behaviors, i.e., straight, straight-left, straight-right, left, right, left u-turn, right u-turn, and stationary. This makes the evaluation more balanced, since some of these trajectory shapes are much more infrequent than others.
• Soft Mean Average Precision (Soft mAP) [39]: Similar to mAP, but additional matching predictions, other than the highest-confidence one, are not penalized.

Multi-modal prediction models are usually benchmarked using Best-of-K metrics. These metrics, although useful for deterministic regressors, only take into account a single output out of an arbitrary number K, representing a limited part of the model output distribution. Hence, they are not able to compare the distributions produced by multi-modal models, neglecting the assessment of variance and multi-modality.
Furthermore, these metrics only assess the quality of the underlying marginal distribution per agent. Best-of-K ADE takes the trajectory sample that is closest to the ground truth of each agent in an independent manner. In this way, we are not measuring scene consistency in the predictions. It would be possible to have a low minADE by predicting high-entropy distributions that are not consistent at the scene level. In [40], the authors propose scene-level sample metrics to assess how well the output modes capture the joint distribution over future motions.
Probabilistic evaluation: In order to make a fair assessment of the probabilistic capabilities of a model, we need to measure its ability to capture the underlying uncertainty distribution of future motions. In the following, we list some of the metrics used in the literature [8] to evaluate probabilistic prediction models. Let p be the probability of the best forecasted trajectory:
• Probabilistic Miss Rate (p-MR): Instead of counting a miss as 1, it assigns 1 − p whenever the maximum pointwise L2 distance between the prediction and the ground truth is greater than the threshold.
• Brier minimum Final Displacement Error (brier-minFDE) [41]: Similar to minFDE, but we add (1.0 − p)^2 to the endpoint L2 distance.
• Brier minimum Average Displacement Error (brier-minADE) [41]: Similar to minADE, but we add (1.0 − p)^2 to the average L2 distance.
• Negative Log Likelihood (NLL) [42]: The average negative log likelihood of the ground-truth trajectory, as determined by a kernel density estimate over output samples at the same prediction timestep.

It is important to note that different benchmarks consider a different number of modes to evaluate prediction accuracy. For instance, Argoverse and Waymo consider the 6 most likely trajectories, whereas nuScenes takes 5 and 10 modes. For probability-based metrics in Argoverse, the 6 most likely trajectories are taken and their probabilities are normalized before computing the metrics.
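For instance, brier-minFDE can be computed as follows (a minimal sketch; shapes and names are ours):

```python
import numpy as np

def brier_min_fde(preds, probs, gt):
    """brier-minFDE: endpoint L2 error of the best mode plus (1 - p)^2,
    where p is the probability the model assigned to that best mode.

    preds: (K, T, 2) predicted trajectories; probs: (K,) summing to 1;
    gt: (T, 2) ground truth.
    """
    fde = np.linalg.norm(preds[:, -1] - gt[-1], axis=-1)   # (K,)
    best = fde.argmin()
    return fde[best] + (1.0 - probs[best]) ** 2
```

A perfectly placed endpoint still incurs a penalty of (1 − p)^2 if the model was not confident in that mode, which is what makes the metric probability-aware.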
There is also a difference in the prediction horizon considered. Both Argoverse and nuScenes provide 2 s of trajectory history. However, the Argoverse prediction time horizon is 3 s, whereas nuScenes considers 6 s. Waymo [39] provides tracks for the past 1 second and considers three different evaluation times: 3, 5, and 8 seconds into the future.
When it comes to sorting their leaderboards, there is no consensus on the most important metric. nuScenes ranks the different approaches according to minADE_5 (K=5). Argoverse sorts them according to brier-minFDE at K=6, assuming a uniform distribution for approaches that do not provide a probability associated with the predictions; in previous competitions, MR and minFDE at K=6 were used. The Waymo leaderboard ranking is built based on the average Soft mAP across evaluation times, while MR is used as a secondary metric.
Diversity evaluation. Evaluating the diversity of multiple predictions is required to properly assess multi-modal predictive capabilities. However, this evaluation is far less explored in the literature, resulting in models that, although obtaining good results in terms of accuracy, suffer from mode collapse and do not show real multi-modality. In the following, we list metrics used to evaluate the diversity of the predictions:
• Lateral diversity metrics: The authors of [11] report the average number of different lanes reached as a measure of lateral diversity, as well as the variance of the final heading for the different output modes. [38] defines minLaneFDE, which captures both the quantity and quality of diversity of multiple outputs based on the centerlines of reference lanes: it computes the minimum value among the L2 distances between the centerlines of possible lane candidates and each predicted mode.
• Longitudinal diversity metrics: Deo et al. [11] report the variance of the average speeds and accelerations for the different output modes.
• Ratio of avgFDE to minFDE (RF): Proposed in [43], it measures the spread of the predictions in Euclidean distance. A large average L2 error implies that the predictions are spread out, while a small minimum L2 error implies that at least one of the predictions has high precision. The authors of [43] follow this intuition and propose computing the ratio of avgFDE to minFDE to capture diversity in the output. However, this metric can fail to distinguish between longitudinal and lateral diversity in some scenarios. In addition, it may favor models that are worse in terms of accuracy and admissibility, when most modes are failing modes.

A straightforward way to assess lateral diversity is to measure the variance of the final heading for the K possible outputs. When lane information is available, the average number of final lanes reached is a good measure of mode diversity. On the other hand, longitudinal diversity can be measured by the variance of the final speed and acceleration for the K outputs.
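The RF ratio described above admits a very short sketch (a small epsilon guards the division; names are ours):

```python
import numpy as np

def rf_ratio(preds, gt, eps=1e-9):
    """RF: ratio of avgFDE to minFDE as a spread (diversity) proxy [43].

    preds: (K, T, 2) predicted trajectories; gt: (T, 2) ground truth.
    """
    fde = np.linalg.norm(preds[:, -1] - gt[-1], axis=-1)   # (K,)
    return float(fde.mean() / max(fde.min(), eps))         # avgFDE / minFDE
```

A value near 1 means all endpoints land at a similar distance from the ground truth; large values indicate spread-out modes, though, as noted above, the ratio cannot tell lateral from longitudinal spread.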
Admissibility evaluation. Finally, assessing the compliance of the predictions with the driving scene is essential to evaluate the quality of the output and to ensure safe motion planning in complex driving scenarios. In the following, we list the metrics used to evaluate the admissibility of the predictions:
• Off-road rate: Computes the fraction of predicted trajectories that are off-road, outside the drivable area.
• Drivable Area Occupancy (DAO) [43]: Measures the proportion of pixels occupied by the predicted trajectories within the drivable area. This metric can be used together with RF to obtain a more compliant measure of diversity.
• Drivable Area Compliance (DAC) [8]: Measures extreme off-road predictions that are not admissible. If a model produces K modes for future trajectories and M of those leave the drivable area at any point, the DAC for that model is (K − M)/K.
• Scene consistency rate (SCR) [40]: Measures the percentage of predicted samples that collide (overlap) in the scene, in order to evaluate social consistency. A collision is detected by comparing the IOU between the future BEV bounding boxes of each pair of agents in the scene against a small IOU threshold.
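As an example, DAC can be sketched given a drivable-area membership test (the map lookup is abstracted behind a hypothetical callable):

```python
def drivable_area_compliance(preds, in_drivable):
    """DAC: fraction of the K modes that never leave the drivable area.

    preds: iterable of K trajectories, each a sequence of (x, y) points.
    in_drivable: hypothetical callable mapping a point to True if it lies
                 inside the drivable area (e.g. a rasterized-map lookup).
    """
    K = len(preds)
    # M = number of modes that go off the drivable area at any point.
    off = sum(1 for traj in preds
              if any(not in_drivable(p) for p in traj))
    return (K - off) / K
```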
• Overlap Rate [39]: Similar to the previous metric, it takes the highest-confidence prediction from each agent and computes the total number of overlaps divided by the total number of agents. A single overlap is counted if such a trajectory overlaps at any prediction time step with that of any other agent.

Off-road rate, DAO, and DAC are useful for measuring the proportion of modes that go off-road. The last two metrics evaluate the ability of the model to capture social interactions. However, to the best of our knowledge, there are no metrics that assess compliance with traffic rules. To this end, we propose to measure the ratio of modes going in the oncoming traffic direction, i.e., the number of predicted modes driving against the direction of traffic divided by the total number of modes K.

Autonomous driving is a high-stakes and safety-critical application. As such, it is of utmost importance to thoroughly evaluate each of its intermediate systems, including the trajectory prediction stage, which is an essential input for safe and efficient planning. A comprehensive evaluation and interpretation of the performance of the prediction model is needed, in terms not only of precision but also of diversity and admissibility. Simultaneous examination of all these dimensions provides a holistic evaluation framework for the assessment of multi-modal motion prediction. In practice, the choice of which metrics to evaluate will depend on the particular needs of the system and the specific application context. Our framework serves as a guide to help the user make an informed decision about which metrics are most relevant to their needs, and provides a basis for comparison with other models.
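The proposed oncoming-traffic ratio could be implemented along these lines, assuming the heading of each mode's closest lane is available (the 90-degree opposition threshold is our assumption):

```python
import numpy as np

def oncoming_rate(pred_headings, lane_headings, thresh_deg=90.0):
    """Ratio of modes whose final heading opposes the traffic direction
    of their closest lane (a sketch of the proposed metric).

    pred_headings, lane_headings: (K,) arrays of angles in radians.
    """
    # Wrapped angular difference in (-pi, pi], taken in absolute value.
    diff = np.abs(np.angle(np.exp(1j * (pred_headings - lane_headings))))
    return float(np.mean(diff > np.deg2rad(thresh_deg)))
```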

IV. ROBUSTNESS ANALYSIS
The motion prediction task usually assumes perfect perception, i.e., that the ground-truth past trajectories of all actors are given and that the map information is available. However, in practice, self-driving vehicle perception systems have noise that translates into false negatives, false positives, incomplete agent tracks, or ID switches. We therefore conduct a robustness study in which perception system noise is simulated.
We perform an ablation for PGP [11], simulating an object detector with 80%, 90%, and 95% recall, the latter being a realistic setup under good conditions. This ablation is done independently for dynamic object detection and lane detection.
Each agent or lane segment is masked following a Bernoulli distribution with a probability of 20%, 10%, and 5%, respectively. Table II shows the results for both lanes and dynamic agents in terms of minADE_5 and MR_5 (the metrics used for ranking the nuScenes benchmark) and Behavioral Cloning (BC).¹ We found the PGP model to be quite robust to failures in the detection of dynamic agents. However, when we introduce noise in the perception of lanes, performance drops drastically, especially in the BC metric, given the importance of the lanes for this metric. This is probably because the model largely exploits lane information as a strong inductive bias, containing both the direction of traffic flow and the legal paths for each agent.
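The Bernoulli masking used in this ablation can be sketched as follows (a simplified stand-in for the actual experimental code; names are ours):

```python
import numpy as np

def mask_elements(elements, drop_prob, rng=None):
    """Simulate detector misses: drop each agent (or lane segment)
    independently with probability drop_prob, i.e. keep it with the
    simulated recall probability 1 - drop_prob."""
    if rng is None:
        rng = np.random.default_rng()
    keep = rng.random(len(elements)) >= drop_prob   # Bernoulli keep mask
    return [e for e, k in zip(elements, keep) if k]
```

Running the same evaluation pipeline on the masked inputs then quantifies the model's sensitivity to each perception failure mode.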
We perform a second experiment in which we analyze the effect of not detecting some frames of the agents that interact with the focal agent; results are shown in the second-to-last row. We consider an interaction if an agent's trajectory intersects a 20 m radius around the target agent. Each frame has a 50% probability of being detected. Results do not show a noticeable drop in performance in this case either. Finally, we simulate that no dynamic agent is detected; results are shown in the last row. Given these results, we can state that the model relies heavily on lane information as well as on the focal agent's past trajectory.
In order to evaluate whether these results generalize to a different scenario and model, we perform the same experiments for DenseTNT on the Argoverse dataset with K=12. Table III shows the results of the robustness analysis for DenseTNT. As in the previous scenario, failing to perceive the lanes has a more detrimental effect than failing to perceive the dynamic agents in terms of accuracy. Again, masking all agents in the scene decreases the performance, but it still remains at a reasonable level. Noise in the perception of lanes reduces diversity, while masking dynamic agents seems to increase diversity (see the second-to-last row in Table III). The reason for this behaviour is that the fewer lanes are perceived, the fewer plausible trajectories there are. Not perceiving certain agents, however, could cause the AV to head towards an occupied space, increasing the diversity of the prediction at the cost of raising the risk of a collision. Admissibility decreases when lanes are masked for the same reason, while not perceiving agents in the scene has no effect on this aspect, as expected.
The behavior observed in this analysis can have several explanations. First, the historical information of the focal agent in terms of position, velocity and acceleration is already very informative when inferring the motion of the vehicle. Second, most scenarios lack interactions that drastically impact the future trajectory of the focal agent. Even though nuScenes is one of the most complex trajectory forecasting datasets and includes several highly interactive scenarios, this is still insufficient. This is one of the main weaknesses of current motion prediction benchmarks. In the most safety-critical situations, other road users should play a crucial role. However, this is not reflected in the current benchmarks, which should cover a wide range of edge cases. These rare occurrences are easily missed and thus are often missing in datasets. Humans are naturally proficient at dealing with these extreme cases, but this is not true for autonomous systems, so they need to be handled carefully. Another conclusion we can draw from this analysis is the importance of dynamic object detection. Object detectors are usually trained on single images. However, this can be sub-optimal. Humans exhibit an understanding of the dynamics of scenes: if objects move against a static background, we can detect them quite easily, despite darkness, rain or other occlusions. A dynamic object detection system would prevent missing objects in specific frames, thus avoiding carrying this error into the next stages of the pipeline.

¹ Robustness Analysis code repository.

V. INTENTION PREDICTION
Understanding the intentions of the surrounding road agents is most relevant to mid- and long-term prediction and decision making. In order to drive through dynamically changing traffic scenarios, a multi-modal intention prediction module is necessary to adapt to the different scenarios. Intention prediction differs from motion or trajectory prediction in that it corresponds to discrete high-level behaviors, semantically different from each other, which we can consider modes of the future trajectory distribution. Many trajectory prediction models are not inherently multi-modal and suffer from mode collapse. We wish to disambiguate the output and disentangle these modes into clear high-level intentions. In addition, this may contribute towards the interpretability of the overall system, since it is more aligned with how humans think while driving.
Another potentially important advantage of intention prediction is that it is much less sensitive to the actions of the AV, compared to the more detailed task of trajectory prediction in which the precise future trajectory of the agent will depend on the actions of the AV if they are "interacting".This implies a greater potential for the intention prediction task to be dealt with in an open-loop manner.
Most studies in the field of motion prediction work on trajectory prediction and only a few on intention prediction [44], [45], framing the task as a classification problem. However, these rely on predefined trajectories obtained by handcrafted principles, failing to capture comprehensive representations of the future distribution. These methods also lack a consistent evaluation, which makes them suboptimal and leaves them lagging behind state-of-the-art regression and generative models. In this section, we explore a new formulation of the intention prediction framework, using a simple post-hoc approach that can be added on top of different state-of-the-art multi-modal motion prediction methods.
We extend the DenseTNT model [10] to instead perform intent prediction². We observed that the K potential output goals for the focal agent are often not truly distinct: oftentimes, they are right next to each other or almost overlapping. The reason for this is most likely uncertainty in the motion profile. It would be more convenient and intuitive if each of the K locations were different, disentangling the longitudinal uncertainty from the intention. Another problem we detected in this model is that the predicted goals are often not admissible or rule-abiding, falling off the road or on lanes going in the opposite direction. It would also be desirable to have a probability associated with each mode, to facilitate the subsequent decision making.
A simple approach to solve this is to cluster the different outputs of DenseTNT into intentions. As a first step, we trained DenseTNT to output 12 goals instead of the default 6, optimizing the miss rate metric instead of FDE, since we believed these two changes would lead to better coverage of the output distribution. Indeed, this showed higher diversity and a reduction of failure modes. The model outputs a set of goals G = {g_1, ..., g_K}, where each goal g_i ∈ R^2 is a point in bird's-eye view. In order to capture different intentions, we cluster the goals, forming a set of clusters C = {c_1, ..., c_N}. Each cluster c_n has three components: π_n ∈ [0, 1], with the π_n summing to 1, is the probability of a goal ending up in cluster c_n; μ_n ∈ R^2 is the mean position of the cluster; and Σ_n ∈ R^(2×2) is the covariance of the cluster. Note that we expect the covariance to be high along the lane, corresponding to the variability in speed between agents, and low orthogonally to the lane. We propose a straightforward and intuitive approach to find C.

Cluster creation: Clustering is an NP-hard problem and commonly used algorithms, such as K-means clustering or expectation maximization, are computationally expensive and in some cases prone to degenerate solutions. In our setting, however, we already have a good partitioning of the bird's-eye-view plane available: namely, the lane segments in the HD map. We therefore propose to create clusters based on the lane segments. In the following, we describe the heuristic used to compute the clusters. For each predicted goal g_i:
1) Obtain all lane segments L_i = {l_1, ..., l_m} that intersect a radius r = 20 m around the agent, based on the Manhattan distance. The bounding boxes of small point clouds (lane centerline waypoints) are precomputed in the map. If no lanes are found, we double the search radius r.
2) Estimate the confidence w_l of each lane l based on the distance d_l to its closest waypoint. Keep those lanes whose closest waypoint lies within a threshold radius of 2.5 m.
3) Compute the angle θ_l between the agent's heading for mode k and the direction of lane l. Discard those lanes whose direction differs from the agent's heading by more than 45 degrees. This ensures that we cluster only those goals that follow the current lane direction.

Formally, let q_kj correspond to the probability that goal g_k belongs to cluster c_j. We intentionally let the assignment be soft; this makes it possible for a single goal to be assigned to multiple clusters, which is sensible under uncertainty. Taking the retrieved lanes L_k for each goal g_k, we compute q_kj as follows:
1) If two lanes in L_k neither merge nor is one the successor of the other, then they belong to different clusters.
2) The probability q_kj that lane l belongs to cluster c_j is computed based on the distance d_l to its closest waypoint and the angle θ_l.
This heuristic is computationally efficient and guarantees that different intentions, such as staying in the lane versus cutting out of the lane, are represented by different clusters. We repeat the process for each goal g_k, grouping the lanes that belong to the same cluster. We then compute the cluster mean μ_j and covariance Σ_j from the goals assigned to each cluster. In order to compute the cluster probability p_j, we exploit the heatmap produced by DenseTNT: taking the score s_k of each goal g_k, p_j is obtained by normalising the summed scores of the goals assigned to cluster j.

Cluster visualization: For the survey study, we visualize the clusters using a hard clustering, assigning each goal to its most probable cluster.
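The lane-based clustering heuristic can be sketched as follows. This is a minimal sketch, not the authors' implementation: the data structures, the soft-assignment weight `w`, and the score-based normalisation are our illustrative assumptions, and the radius-doubling fallback is omitted for brevity.

```python
import numpy as np

def cluster_goals(goals, scores, lane_waypoints, headings,
                  r=20.0, d_thresh=2.5, max_angle=np.pi / 4):
    """Sketch of lane-based intention clustering.

    goals:          (K, 2) predicted goal points in bird's-eye view
    scores:         (K,)   DenseTNT heatmap score of each goal (assumed given)
    lane_waypoints: dict {lane_id: (N, 2) centerline waypoints}
    headings:       (K,)   agent heading (rad) for each mode
    """
    clusters = {}  # lane_id -> list of (goal, soft-assignment weight)
    for k, g in enumerate(goals):
        for lane_id, wps in lane_waypoints.items():
            # 1) radius filter: Manhattan distance to the closest waypoint
            if np.abs(wps - g).sum(axis=1).min() > r:
                continue
            # 2) confidence filter: closest waypoint within 2.5 m (Euclidean)
            d = np.linalg.norm(wps - g, axis=1)
            i = d.argmin()
            if d[i] > d_thresh:
                continue
            # 3) heading filter: discard lanes deviating more than 45 degrees
            lane_dir = wps[min(i + 1, len(wps) - 1)] - wps[max(i - 1, 0)]
            lane_yaw = np.arctan2(lane_dir[1], lane_dir[0])
            dyaw = np.abs(np.arctan2(np.sin(headings[k] - lane_yaw),
                                     np.cos(headings[k] - lane_yaw)))
            if dyaw > max_angle:
                continue
            # soft-assignment weight (assumed form): closer and better
            # aligned lanes, and higher-scoring goals, get more weight
            w = scores[k] / (1.0 + d[i]) * np.cos(dyaw)
            clusters.setdefault(lane_id, []).append((g, w))

    # per-cluster statistics: probability, mean position, covariance
    total = sum(w for members in clusters.values() for _, w in members)
    out = {}
    for lane_id, members in clusters.items():
        pts = np.array([g for g, _ in members])
        ws = np.array([w for _, w in members])
        p = ws.sum() / total                    # normalised heatmap mass
        mu = (pts * ws[:, None]).sum(axis=0) / ws.sum()
        if len(pts) > 1:
            cov = np.cov(pts.T, aweights=ws)
        else:  # single-goal fallback: fixed prior (2 m along, 0.05 m across)
            cov = np.diag([2.0, 0.05]) ** 2
        out[lane_id] = (p, mu, cov)
    return out
```

Merging clusters whose lanes merge (step 1 of the soft assignment) would require lane-graph successor queries and is left out of this sketch.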
In Figure 2, we showcase different scenarios. The AV is represented in red and the focal agent, whose trajectory we want to predict, in green. Other agents are represented with blue dots. The ground-truth trajectory is shown in green.
In the first row, the outputs of the DenseTNT model are visualized in orange. The final predicted goals are represented by orange stars. The arrows show the direction of the lanes.
The clustered output is shown in the second row. The probabilities of the output trajectories are mapped to the color bar shown on the right. The averages of the intention clusters, whose probabilities follow the same mapping, are represented by coloured circles. The uncertainty in the motion profile for each intention is represented by the confidence ellipse of its covariance, visualized as a shaded contour. As expected, the covariance is high along the lane direction and low orthogonally to it, showing higher uncertainty in velocity. For cases where only one goal is assigned to a cluster, we apply a fixed covariance based on prior knowledge: 2 m along the lane and 0.05 m orthogonal to it (left-most lane cluster in Figure 2). In the first column, three intentions are detected, one for each lane, the second being the most probable. In the second column, the model outputs two intentions: follow the lane and change lane. The goal that goes off the road is discarded and not considered for the clustering. In the third case, we can observe how the method eliminates spurious goals that violate traffic rules by targeting lanes going in the opposite direction. In the last scenario, the model outputs fall on both boundaries of the same lane; in this case, we output only one intention. Note that this is fundamentally different from a heatmap representation, where the output is a dense grid of probabilities assigning the likelihood of the agent being at each position on the map at a given time. While heatmaps can provide valuable information about the distribution of possible trajectories, they do not give a clear indication of the agent's underlying high-level intentions. This is a fundamental limitation when it is essential to disentangle different behavioral modes and accurately predict the agent's intended goal or destination. By contrast, our proposed clustering approach identifies the most probable high-level intentions of the agent, providing a more interpretable and actionable output.
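The shaded confidence ellipse of a cluster covariance can be derived in the standard way from its eigendecomposition; a minimal sketch (the choice of `n_std` and the exact drawing parametrisation are assumptions, not taken from the paper):

```python
import numpy as np

def confidence_ellipse(cov, n_std=2.0):
    """Width, height and rotation (deg) of the n-sigma ellipse
    for a 2x2 covariance matrix."""
    vals, vecs = np.linalg.eigh(cov)     # symmetric: real eigenpairs
    order = vals.argsort()[::-1]         # largest eigenvalue first
    vals, vecs = vals[order], vecs[:, order]
    # rotation of the major axis relative to the x-axis
    angle = np.degrees(np.arctan2(vecs[1, 0], vecs[0, 0]))
    width, height = 2 * n_std * np.sqrt(vals)
    return width, height, angle
```

With the fixed single-goal covariance diag(2^2, 0.05^2), this yields an ellipse elongated along the lane axis, matching the visualisation described above.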
Quantitative evaluation: This method not only disambiguates the multi-modal output, showing the possible intentions predicted by the model, but also improves the compliance of the output predictions. The clustering discards non-plausible goals that violate traffic rules by going off-road or in the direction of oncoming traffic.
To assess this claim, we report DAC and OTD for both DenseTNT and our method, which we call DenseTNT-Intent.
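For reference, DAC (Drivable Area Compliance, as defined in the Argoverse benchmark) measures the fraction of the K predicted trajectories that never leave the drivable area. A minimal sketch, where the map lookup `on_drivable_area` is an assumed helper rather than a real API:

```python
def dac(trajectories, on_drivable_area):
    """Drivable Area Compliance: fraction of predicted trajectories
    that stay entirely on the drivable area.

    trajectories:     iterable of trajectories, each a sequence of (x, y)
    on_drivable_area: callable (x, y) -> bool (assumed map query)
    """
    compliant = 0
    for traj in trajectories:
        # a trajectory is compliant only if every point is on drivable area
        if all(on_drivable_area(x, y) for x, y in traj):
            compliant += 1
    return compliant / len(trajectories)
```

A DAC of 1.0 means every predicted mode is admissible with respect to the map; the 2 pp improvement reported below corresponds to this quantity.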
We compute two metrics of diversity: p-RF and the variance of the final heading across the output modes, σ²_yaw, both for the whole output (12 modes) and for the 3 most probable outputs. We propose to modify RF and instead compute the ratio of the probabilistic versions of avgFDE and minFDE, p-RF. p-avgFDE is computed following the same heuristic as p-minFDE: for each sample with K trajectories, the probability penalty min(−log(p), −log(0.05)) is added to the endpoint error, and the result is averaged over the K trajectories. Finally, we also provide probabilistic precision metrics to compare both models in terms of accuracy. All metrics must be considered simultaneously in order to obtain a holistic interpretation. The results are shown in Table IV.
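One plausible reading of these probabilistic endpoint metrics and of p-RF can be sketched as follows; the exact per-sample aggregation for p-avgFDE is our assumption where the text leaves it implicit:

```python
import numpy as np

def p_fde_metrics(endpoints, probs, gt_endpoint, p_floor=0.05):
    """Sketch of p-minFDE, p-avgFDE and p-RF for one sample.

    endpoints:   (K, 2) predicted final positions
    probs:       (K,)   probability of each mode
    gt_endpoint: (2,)   ground-truth final position
    """
    fde = np.linalg.norm(endpoints - gt_endpoint, axis=1)
    # probability penalty, floored at -log(0.05) as in the benchmark metrics
    penalty = np.minimum(-np.log(probs), -np.log(p_floor))
    best = fde.argmin()                       # best forecasted trajectory
    p_min_fde = fde[best] + penalty[best]
    p_avg_fde = np.mean(fde + penalty)        # same heuristic, averaged over K
    p_rf = p_avg_fde / p_min_fde              # probabilistic ratio of FDEs
    return p_min_fde, p_avg_fde, p_rf
```

A p-RF close to 1 indicates that all modes collapse near the best one (low diversity), while larger values indicate more spread-out predictions.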
DenseTNT diversity metrics are better when evaluated on the whole output, i.e., 12 predictions. This is probably because clustering removes bad predictions, which lowers the variance of the yaw. Following the same intuition, avgFDE is higher for DenseTNT whereas minFDE is lower; therefore, the ratio RF is also higher for DenseTNT. However, RF only measures diversity without assessing the quality of the output. Moreover, 90.7% of the target agent's maneuvers go in a straight line. In most of these cases, all the uncertainty is in the motion profile, not in the intention, so there is only one output cluster. Figure 2(d) shows a clear example where the variance is much higher for DenseTNT before the clustering.
We evaluate this hypothesis in two additional scenarios. First, we evaluate the model keeping only the three most probable predictions (K=3). In this scenario, the clustered output shows higher diversity both in terms of σ²_yaw and p-RF. Secondly, we evaluate it on a subset of the validation set, considering only those scenes whose underlying distribution has more than one plausible mode. As expected, we obtain the best diversity results in this scenario. For the analysis of accuracy and admissibility, we focus on the whole dataset with K=12, since there are no major changes in the other dimensions.
The admissibility metrics show that clustering the output trajectories into high-level behaviors yields a more compliant prediction, with no clusters going in the direction of oncoming traffic and a DAC that is 2 pp higher. For DenseTNT, almost 7% of the predictions go in the opposite direction.
In order to obtain a holistic evaluation, we also explore how the clustering affects the accuracy results. When evaluating the probabilistic versions of the distance-based metrics, DenseTNT-Intent is superior in terms of p-minADE, p-avgFDE and p-MR. This is because it outputs fewer modes with a higher probability for the ground-truth intention. However, since DenseTNT outputs more trajectories with higher variability in the motion profile, it is more likely that one of them lies closer to the ground truth before clustering, hence the lower p-minFDE for DenseTNT.
These results suggest that DenseTNT-Intent achieves a more scene-compliant output, in the form of intention prediction, covering the modes of the output distribution while improving the overall quality of the predictions. This output can be used to improve the safety and reliability of autonomous vehicles by enabling better decision-making and more accurate prediction of the agents' future behavior.

VI. INTERPRETING MULTI-MODAL MOTION PREDICTION OUTPUTS
In this section, we test our hypothesis that appropriate visualization of the multi-modal trajectory representation increases the interpretability of the prediction system. To this end, we created a survey exploring how the visualization of multi-modal predictions, with their associated probabilities and different numbers of modes and prediction horizons, impacts the interpretability of the prediction system's output, and how this, consequently, might impact transparency and user trust in the overall system.

A. Methodology
We recruited 39 technical experts to participate in a twenty-minute survey with a within-subjects design. Twenty-three percent of the subjects are women, which reflects the inherent bias in the technical field. In this survey, technical interpretability was assessed through different visualization experiments on the information provided by the prediction systems.

(Fig. 3: A/B question pie charts.)

3. Prediction time horizon: 6 s versus 2 s. In this scenario, we study the importance of long-term over short-term prediction for the interpretability of the output. 84.6% of the participants preferred six seconds of prediction over two seconds in terms of interpretability of the output predictions. Most argue that 6 s provides more information for understanding the scenario.
In the Likert scale rating, we found a statistically significant difference of 1.5 points between the means of the two distributions, with a p-value of 10⁻¹⁰. This is consistent with our hypothesis that longer prediction horizons increase the interpretability of the prediction output and the reliability of the system.
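The inferential statistics reported here and in Table V are based on Welch's test, which does not assume equal variances across the two rating distributions. A minimal sketch of the statistic and the Welch-Satterthwaite degrees of freedom (the sample data are illustrative):

```python
from statistics import mean, variance
from math import sqrt

def welch_test(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples (e.g. Likert ratings of two conditions)."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)      # sample variances (ddof=1)
    se2_a, se2_b = va / na, vb / nb        # squared standard errors
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df
```

The p-value then follows from the t distribution with `df` degrees of freedom (e.g. via `scipy.stats.t.sf`).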
4. Camera views as additional visualization. In this scenario, we explored how adding the visualization of six cameras placed on the AV influences the interpretability of the traffic scene. 70% of participants believe the additional visualization helps them understand the scene, and 87.2% believe that it makes it easier to understand the behavior of the focal agent and its predicted trajectory.
5. Single-mode versus multi-mode predictions with DenseTNT on the Argoverse dataset. We repeat the first scenario in a different setup to evaluate the difference in scene representation between the Argoverse and nuScenes datasets, as well as the output representation of DenseTNT versus the PGP model. We found different results from those of the first scenario: in this case, 33.3% of participants chose single-mode predictions over multi-mode predictions in terms of interpretability. The biggest difference, however, is in the Likert scale rating, where both means are almost identical, with no statistical difference between the distributions. This notable difference is mainly due to the output representation of DenseTNT, which outputs 12 trajectories with no associated probabilities; participants stated that this makes the output difficult to interpret. This result is yet another justification for the need for a posterior clustering into intentions with associated probabilities, which leads us to the last scenario.
6. Trajectory prediction versus intention prediction. We test the hypothesis presented in Section V: intention prediction can contribute to the interpretability of the system, since it disambiguates the multi-modal prediction output and produces a clustered output that is more in line with the way human drivers think. 87.2% of participants find the clustered output with its associated probabilities more understandable when visualizing the future predicted behavior of the focal agent.
Most participants argue that knowing the high-level behavior is more useful than the concrete future trajectory, as it is more in line with the way humans reason. Others state that it is faster to interpret and adds information while removing the ambiguity of multi-modal trajectory prediction. In addition, they point out the importance of having uncertainties associated with the predictions, to allow the subsequent algorithm to adopt a probabilistic framework. However, 5 participants find the second option more confusing and difficult to understand.
Looking at the inferential statistics of the Likert scale question, there is a statistically significant difference of 0.9 points between the means of the two distributions, with a p-value of 3×10⁻⁶. This supports our hypothesis that intention prediction improves the understanding of the prediction system.

C. Discussion
This study comes with several limitations that should be considered. First, Likert scales are subject to distortion due to central tendency bias (i.e., avoidance of extreme categories) and acquiescence bias (i.e., agreeing with statements in the survey).
Second, the A/B-type questions leave no room for a neutral response when neither option is preferred. However, this could be expressed in the Likert scale and in the open-ended question of each scenario, and it has been taken into consideration in the analysis.
Finally, some subjects claimed to be unsure of the purpose of some questions and believe that some answers depend strongly on the particular task. For example, a developer may prefer more information at the cost of a more complex representation. Some questions also depend on the traffic scenario, speed, and complexity of the road, with the optimal number of modes or prediction horizon differing in each case.
Despite these limitations, we can draw some conclusions from this study. First, it supports our hypothesis that multi-modality is not only important for better decision making and planning a safer route, but also improves the understanding of prediction systems when properly visualized. A related insight is that the form of visualization is crucial, and it is important to clearly show the probabilities associated with each trajectory. Second, longer prediction horizons provide a better understanding of the whole traffic scene and make the overall system more interpretable. Finally, the need for intention prediction is supported by the survey findings: clustered outputs provide a more human-like representation of the predictions as high-level behaviours.

VII. CONCLUSIONS
In this work, we move towards the design of reliable motion prediction models based on evaluation, robustness, and interpretability of the outputs.
Our findings can be summarized as follows:
• We highlighted the main gaps and differences in current evaluation methodologies, especially the lack of assessment of diversity and of admissibility with respect to the traffic scene. We identified the aspects that are critical for the evaluation of multi-modal motion prediction and proposed a more comprehensive and holistic evaluation framework.
• In the robustness analysis, we showed that failure to perceive the road topology has a greater impact on system performance than failure to perceive other agents on the road, due to the significant inductive biases introduced by the lanes. This also highlights the need for more comprehensive datasets covering complex, highly interactive scenes and a wide range of edge cases.
• We provided DenseTNT-Intent outputs with high-level intentions that prove to be diverse, compliant, and accurate, improving the overall quality of the predictions.
• The results of the study suggest that this new representation improves the interpretability of the prediction output. Our first hypothesis, that multi-modality improves interpretability over single-mode predictions, is also supported. Finally, long-term predictions appear to provide a better understanding of the predicted traffic scene.
The proposed approach and findings make a significant contribution to the development of trustworthy motion prediction systems for autonomous vehicles. By comprehensively analyzing current evaluation metrics, identifying gaps, and proposing a new holistic evaluation framework, this work provides a valuable foundation for future research in the field. Additionally, the method for assessing spatial and temporal robustness, as well as the proposed intent prediction layer, demonstrate innovative solutions to the complex challenges of multi-modal motion prediction. Finally, the assessment of interpretability through a survey of different visualization techniques offers further insights for enhancing the performance and transparency of autonomous vehicle systems. Overall, this work represents a substantial step forward in the design of trustworthy artificial intelligence for safe and efficient autonomous driving.

Fig. 2: Qualitative evaluation: DenseTNT-Intent (second row) clusters the output of DenseTNT (first row) into high-level intentions, with a probability associated with each cluster following the colorbar on the right. The circles represent the cluster averages. Shaded contours represent the uncertainty in the motion profile. Spurious goals going off the road or in the direction of oncoming traffic are discarded.

TABLE I :
Reviewed metrics categorized in terms of Accuracy, Diversity, and Admissibility. p corresponds to the probability of the best forecasted trajectory. The symbol − refers to no units; # refers to a count.
p-minADE (↓): similar to minADE, adding min(−log(p), −log(0.05)) to the average L2 distance.
p-minFDE (↓, m): similar to minFDE, adding min(−log(p), −log(0.05)) to the endpoint L2 distance.
p-MR (↓, %): similar to Miss Rate, taking (1.0 − p) as the contribution, instead of 0.0, when the endpoint error of the best forecasted trajectory is less than 2.0 m.
Overlap (#): number of overlaps divided by the total number of agents. A single overlap is counted if a predicted trajectory overlaps with any other agent at any prediction time step.

TABLE II :
Robustness analysis of PGP on the nuScenes dataset. K=10 for all experiments.

TABLE III :
Robustness analysis of DenseTNT on the Argoverse dataset. K=12 for all experiments.

TABLE IV :
Quantitative results for DenseTNT (12 modes) and DenseTNT-Intent in terms of diversity, admissibility, and accuracy. Improvements are indicated by arrows. The first two rows show results for the whole validation set. The third and fourth rows consider the top 3 predictions. The last two rows describe the results for the subset of scenes with more than one plausible mode.

TABLE V :
Evaluation of survey results: descriptive and inferential statistics for each of the cases under study. k is the number of modes, with k > 1 denoting the multi-modal scenario. We provide mean, variance, and mode as descriptive statistics. For inferential statistics, the test statistic, p-value, and degrees of freedom based on Welch's test are shown. nS refers to nuScenes, AV refers to the Argoverse dataset.