Improving Vehicle Re-Identification using CNN Latent Spaces: Metrics Comparison and Track-to-track Extension

This paper addresses the problem of vehicle re-identification using distance comparison of images in CNN latent spaces. First, we study the impact of the distance metric, comparing the performance obtained with different metrics: the minimum Euclidean distance (MED), the minimum cosine distance (MCD), and the residual of the sparse coding reconstruction (RSCR). These metrics are applied using features extracted through five different CNN architectures, namely ResNet18, AlexNet, VGG16, InceptionV3 and DenseNet201. We use the specific vehicle re-identification dataset VeRi to fine-tune these CNNs and evaluate results. Overall, independently of the CNN used, MCD outperforms MED, which is commonly used in the literature. Secondly, the state-of-the-art image-to-track process (I2TP) is extended to a track-to-track process (T2TP) without using complementary metadata. Metrics are extended to measure the distance between tracks, enabling the evaluation of T2TP and its comparison with I2TP using the same CNN models. Results show that T2TP outperforms I2TP for MCD and RSCR. T2TP combining DenseNet201 and MCD-based metrics achieves the best performance, outperforming the state-of-the-art I2TP models that use complementary metadata. Finally, our experiments highlight two main results: i) the importance of the metric choice for vehicle re-identification, and ii) T2TP improves performance compared to I2TP, especially when coupled with MCD-based metrics.


Introduction
With the recent growth of Closed-circuit Television (CCTV) systems in big cities, object re-identification in video surveillance, such as vehicle and pedestrian re-identification, is a very active research field. In the last few years, major progress has been observed in the vehicle re-identification field thanks to recent advances in machine learning and deep learning. These advances are very promising for intelligent video-surveillance processing, intelligent transportation and future smart city systems.
Vehicle re-identification in video surveillance aims at identifying a query vehicle, filmed by one camera, among vehicles filmed by other cameras of a CCTV system. It relies on a comparison between a query vehicle and a database of known vehicles, to find the best matches. Commonly, the query is a single image and each vehicle of the database is represented by a set of images called a track, extracted from video segments recorded by CCTV cameras.
In the literature [1,2,3,4,5,6,7,8,9], vehicle re-identification is conducted as follows. First, query and probed vehicles are placed in a common space, by extracting features, representing the visual characteristics of the vehicle within one or several images, in order to share the same dimensions and be comparable to each other. Additionally, these features can be augmented using additional annotations (license plate, trend of the car, color of the car, etc.) and/or contextual metadata (camera location, time, road map, etc.). Second, using a metric measuring distance (or similarity) between these features, the probed vehicles are ranked with respect to the query vehicle, from the first candidate to the last.
Previous studies have focused on the problem of feature extraction. Feris et al. [1] originally proposed an attribute-based method for vehicle re-identification using several semantic attributes (such as the category of vehicle and color). Zapletal et al. [2] proposed to use color histograms and histograms of oriented gradients on transformed images (placing them in a common space) and a trained SVM classifier to perform vehicle re-identification. Liu et al. [3] were the first to evaluate and analyze the use of Convolutional Neural Networks (CNNs) for vehicle re-identification, extracting the Latent Representation (LR) of the vehicles within the latent space of CNNs. They also provided a specific large-scale dataset for this purpose: the VeRi dataset. They evaluated the vehicle re-identification performance of LRs extracted from several CNN architectures, and compared them to texture-based and color-based features. They showed that i) LRs of CNN architectures were particularly suitable for vehicle re-identification and ii) a linear combination of the three types of features performed better.
Later, they showed that adding contextual information (license plate and spatio-temporal metadata) improves performance [4,5]. In the same vein, Shen et al. [6] incorporated complex spatio-temporal information to improve the re-identification results. They used a combination of a siamese CNN and a Long Short-Term Memory (LSTM) model to compute a similarity score, used for vehicle re-identification. Instead of training a CNN to classify vehicles, Liu et al. [7] suggested directly learning a distance metric, using a triplet loss function to fine-tune a pre-trained CNN. They also provided another large dataset containing a high number of vehicles, called VehicleID. Liu et al. [8] introduced a CNN architecture that jointly learns LRs of the global appearance and of local regions of the car. Attribute features (colors, model) are additionally used to jointly train their deep model. Finally, they concatenated global LR, local LR and attribute features. They concluded that the more information is combined, the higher the re-identification performance. Recently, focusing on the development of more effective LRs of vehicles, Zhu et al. [9] fused quadruple directional deep features learned by using quadruple directional pooling layers, and were able to outperform most of the state-of-the-art methods without using extra vehicle information. In these studies, the matching process uses the Euclidean distance, or a similarity score derived from it, to measure the distance between the query and a probed vehicle, keeping only the nearest image of each track as a reference for the ranking process. However, the Euclidean distance has often been criticized for being poorly suited to high-dimensional spaces [10], such as those constructed by CNNs (which often produce features of dimension greater than 500).
To our knowledge, the impact of the metric choice on vehicle re-identification performance has not been addressed; this is the first issue addressed in this paper.
Furthermore, the systematic evaluation of distance metrics leads us to consider a more general framework than the commonly used image-to-track process (I2TP), which relies on image-to-track distance comparisons. Indeed, in the practice of vehicle re-identification, the query vehicle is selected directly on the video segment recorded by a camera of the CCTV system. This video segment provides a variety of valuable information that remains unused in I2TP. For instance, in the case of a moving car, the video segment may offer different visual cues for the same vehicle (angle of view, zoom, brightness/contrast changes, etc.). This additional knowledge about the visual aspect of the query vehicle may improve the re-identification. Moreover, using the whole video segment spares the user from selecting one specific query image without knowing the potential impact of that selection on re-identification performance.
The literature on vehicle detection and tracking is very rich, and numerous methods are available today to perform automatic vehicle detection and tracking in a given camera [11]. Therefore, assuming that the video segment selected by the user is processed by such algorithms, the query vehicle could be represented by a track, which would provide more information for the re-identification. So far, queries containing more than one image have not been used in vehicle re-identification. We address this issue by considering the track-to-track process (T2TP).
In this paper, we propose to i) evaluate the impact of the metric choice in re-identification and ii) extend vehicle re-identification to T2TP and assess its performance in comparison with I2TP. To this end, the experiments in this paper are conducted using the VeRi dataset. Indeed, unlike other large-scale datasets, VeRi contains image-based tracks of vehicles, allowing performance comparisons between I2TP and T2TP, as well as comparisons with state-of-the-art methods. Let us underline that this paper focuses on re-identification processes based on visual information only: no extra or contextual information is used in the studied processes. It is worth noting that the goal of this article is not to provide another re-identification system, but rather to evaluate the impact of the metric choice on re-identification performance, and the potential benefits of T2TP for state-of-the-art methods.
This paper is organized as follows. After introducing the mathematical notations in Section 2, we present the distance metrics that we compare in terms of re-identification performance, in Section 3. Section 4 presents the extension of the re-identification to T2TP. Then, Sections 5 and 6 respectively present the experiments conducted to evaluate the re-identification performance and the results obtained. Finally, in Sections 7 and 8 we discuss our results, give some perspectives, and conclude.

Vehicle re-identification
In this section, we present the problem of vehicle re-identification. First, we introduce the mathematical notations that cover state-of-the-art I2TP, and T2TP (the second being a generalization of the first). Then, we present the two-step method for vehicle re-identification considered in our experiments, namely the LR extraction and the matching and ranking process.

Notations and problem statement
Let us consider $\mathcal{C} = \{C_1, C_2, ..., C_{n_c}\}$, the set of $n_c$ cameras of a CCTV system, and $\mathcal{V} = \{V_1, V_2, ..., V_{n_v}\}$, the set of $n_v$ vehicles captured by the cameras in $\mathcal{C}$. Each vehicle of $\mathcal{V}$ is uniquely identified. We denote by $\mathcal{T} = \{T_1, T_2, ..., T_{n_t}\}$ the set of $n_t$ tracks captured by the cameras of $\mathcal{C}$ and stored in a database. A track $T_k$, captured by one camera of $\mathcal{C}$, is associated with one of the vehicles of $\mathcal{V}$, denoted $V_k$. Since a vehicle can be recorded by multiple cameras, two tracks $T_i$ and $T_l$ (with $l \neq i$) can be associated with the same vehicle, such that $V_i = V_l$. A track $T_i = \{I_{i,1}, I_{i,2}, ..., I_{i,N_i}\}$ is a set composed of $N_i$ images, all representing the same vehicle $V_i$. Each image $I_{i,j}$ of $T_i$ is cropped from a frame of the video segment in which it was recorded. Note that, in this paper, we do not consider the capture time of each image, so that the order of images in a track is not taken into account. Given a query track $T_q = \{I_{q,1}, I_{q,2}, ..., I_{q,N_q}\}$, representing the vehicle $V_q \in \mathcal{V}$, $V_q$ being unknown, the aim of vehicle re-identification is to find a track $T_r \in \mathcal{T}$ in which the vehicle $V_q$ appears. It is worth noting that, in the case of I2TP, the query track $T_q$ is composed of only one image $I_q$. Figure 2 shows a general overview of the vehicle re-identification process considered in this paper.
The first step consists of extracting features characterizing the vehicles in the track images. The feature extraction process is presented in Section 2.2. Using these features, the second step aims at ranking the different tracks of $\mathcal{T}$ based on their distance to the query. The matching and ranking process is presented in Section 2.3.
Figure 1: Extraction of the latent representation $L_k$ for a track $T_k$. Each image $I_{k,i} \in \mathbb{R}^{n \times m}$ of $T_k$ is transformed into a vector $L_{k,i} \in \mathbb{R}^f$ through the second-to-last layer of the CNN. The matrix $L_k$ is then constructed as the concatenation of the $N_k$ vectors $L_{k,i}$.

Latent representation extraction
The aim of feature extraction is to represent all the images of each track of $\mathcal{T}$ in one common space, in order to make them comparable. We use the latent spaces of CNNs as common space and, as features, the latent representation (LR) of each image in these latent spaces. The main idea is to use one of the last layers of a CNN as a vector of features, in order to represent the input image in the latent space of the network. Formally, we consider a function $\mathcal{N} : \mathbb{R}^{n \times m} \to \mathbb{R}^f$ that transforms an image $I_k \in \mathbb{R}^{n \times m}$ into a vector of features $L_k \in \mathbb{R}^f$, $n \times m$ being the size of the image and $f$ being the dimension of the latent space. To represent the LR of a whole track, we concatenate the LRs of its images to form a matrix. Thus, for a track $T_k$, we denote by
$$L_k = [L_{k,1}, L_{k,2}, ..., L_{k,N_k}] \in \mathbb{R}^{f \times N_k}$$
the matrix constructed as the concatenation of the LRs of the $N_k$ images of the track. Similarly, the LR of a query track $T_q$ is denoted $L_q \in \mathbb{R}^{f \times N_q}$. Note that, in the case of I2TP, the LR of the query image $I_q$ is a vector $L_q \in \mathbb{R}^f$. Figure 1 shows a graphical representation of the LR extraction for a track $T_k$.

Vehicle matching and ranking
Given a query track $T_q$, the aim of LR matching is to find the vehicle $V_r$ such that
$$r = \underset{r' \in \{1, 2, ..., n_t\}}{\arg\min} \; d(L_q, L_{r'}),$$
where $d$ is a distance function measuring how close a probed track $T_{r'}$ (represented by $L_{r'}$) is to the query track $T_q$ (represented by $L_q$).
In order to evaluate vehicle re-identification, the matching process is conducted as a ranking of the probed tracks, from nearest to farthest. This consists in ranking every track of $\mathcal{T}$ to construct an ordered set $\tilde{\mathcal{T}}_q = \{T_{q,1}, ..., T_{q,n_t}\}$, such that a track $T_{q,i}$ is the $i$-th nearest track to the query according to the distance function $d$, $T_{q,1}$ being the first match (i.e. the nearest) and $T_{q,n_t}$ being the last (i.e. the farthest).
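The ranking step can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function name `rank_tracks` and the toy `centroid_distance` are hypothetical, and the latter is only a placeholder for the distance function d (the metrics actually compared are defined in the following sections).

```python
import numpy as np

def rank_tracks(query_lr, gallery_lrs, dist):
    """Return gallery indices ordered from nearest to farthest track,
    together with the corresponding sorted distances."""
    distances = np.array([dist(query_lr, track_lr) for track_lr in gallery_lrs])
    order = np.argsort(distances, kind="stable")  # ties keep gallery order
    return order, distances[order]

def centroid_distance(query_lr, track_lr):
    # Placeholder for d (an assumption for this toy example only):
    # Euclidean distance between the query and the mean column of the track.
    return float(np.linalg.norm(query_lr - track_lr.mean(axis=1)))

# Toy data: f = 2, three probed tracks stored as (f, N_r) matrices.
query = np.array([1.0, 0.0])
gallery = [
    np.array([[0.0, 5.0], [3.0, 5.0]]),
    np.array([[1.0], [0.5]]),
    np.array([[4.0, 1.0], [0.0, 2.0]]),
]
order, sorted_distances = rank_tracks(query, gallery, centroid_distance)
print(order.tolist())  # indices of the tracks, nearest first
```

Any function taking the query LR and a track LR and returning a scalar can be passed as `dist`, which is what makes the metric comparison of this paper a drop-in change.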

Image-to-track distance metrics
In this section, we define the different distance metrics that we tested in order to compare their impact on vehicle re-identification. Referring to Figure 2, we consider here a query track containing one image, $T_q = \{I_q\}$ (represented by the vector $L_q \in \mathbb{R}^f$), and a probe track $T_r$ (represented by the matrix $L_r \in \mathbb{R}^{f \times N_r}$) taken from $\mathcal{T}$.

Minimum Euclidean distance
Euclidean distance has been widely used as a basic metric in many applications of content-based image retrieval [12,13]. In our context of vehicle re-identification, previous works only focused on the use of MED (or a variant) [1,2,3,4,5,6,7,8]. Therefore, we use MED as a baseline to evaluate the impact of the other metrics (defined below) on vehicle re-identification. We define the MED function as:
$$\mathrm{MED}(L_q, L_r) = \min_{i \in \{1, ..., N_r\}} \|L_q - L_{r,i}\|_2,$$
where $\|\cdot\|_2$ is the $L_2$ norm measuring the Euclidean distance between the vector $L_q$ and the $i$-th column $L_{r,i}$ of $L_r$.
Figure 2: Overview of the vehicle re-identification. Every vehicle image (query included) is represented by its LR within the latent space of the CNN. The LRs of all images of the same track are concatenated to build a matrix representing the LR of this track. Using a distance metric $d$, each track is ranked with respect to the query, from the closest (rank 1) to the farthest (rank $n_t$) track, producing an ordered set $\tilde{\mathcal{T}}_q$.

Minimum cosine distance
As a first alternative to MED, we propose to use the minimum cosine distance (MCD). Cosine distance is commonly used in data mining and machine learning [14], and is often referred to as one of the most suitable distance metrics in information retrieval. We compute the MCD as follows:
$$\mathrm{MCD}(L_q, L_r) = \min_{i \in \{1, ..., N_r\}} \left(1 - \frac{L_q^\top L_{r,i}}{\|L_q\|_2 \, \|L_{r,i}\|_2}\right),$$
where the term $\frac{L_q^\top L_{r,i}}{\|L_q\|_2 \|L_{r,i}\|_2}$ corresponds to the cosine similarity between $L_q$ and $L_{r,i}$. Note that, since we consider CNN architectures constructed with Rectified Linear Unit activation functions [15], all elements of $L_q$ and $L_{r,i}$ are non-negative. Therefore, MCD is bounded in $[0, 1]$ (0 when $L_q$ and $L_{r,i}$ point in the same direction, in particular when $L_q = L_{r,i}$, and 1 when $L_q$ and $L_{r,i}$ are orthogonal).
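To make the two image-to-track metrics concrete, here is a minimal NumPy sketch of MED and MCD for a query vector of shape (f,) and a probe matrix of shape (f, N_r). The function names `med` and `mcd` and the `eps` guard against zero-norm vectors are assumptions of this sketch, not part of the original method description.

```python
import numpy as np

def med(l_q, l_r):
    """Minimum Euclidean distance between the query vector l_q (f,)
    and the columns of the probe matrix l_r (f, N_r)."""
    return float(np.linalg.norm(l_r - l_q[:, None], axis=0).min())

def mcd(l_q, l_r, eps=1e-12):
    """Minimum cosine distance (1 - cosine similarity) between l_q
    and the columns of l_r. eps avoids a division by zero."""
    dots = l_q @ l_r                                        # (N_r,)
    norms = np.linalg.norm(l_q) * np.linalg.norm(l_r, axis=0)
    cosine_similarity = dots / np.maximum(norms, eps)
    return float((1.0 - cosine_similarity).min())

# Toy example with f = 3 and a probe track of two images.
l_q = np.array([1.0, 0.0, 0.0])
l_r = np.array([[2.0, 0.0],
                [0.0, 1.0],
                [0.0, 0.0]])  # columns: (2, 0, 0) and (0, 1, 0)

print(med(l_q, l_r))  # 1.0: the first column is at Euclidean distance 1
print(mcd(l_q, l_r))  # 0.0: the first column has the same direction as l_q
```

The toy example shows the practical difference between the two: the first probe image is a scaled copy of the query, which MCD treats as a perfect match while MED still reports a non-zero distance.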

Residual of the sparse coding reconstruction
Since Euclidean and cosine metrics are designed to measure the distance between signals of the same dimension (here $\mathbb{R}^f$), these metrics are computed for each column vector of $L_r$ (corresponding to an image-to-image comparison). The minimum distance is then selected as the reference. Therefore, among all the images contained in a track, only one image is ever used at decision time to measure the distance between $L_q$ and $L_r$. To make use of more information, we propose to use the residual of the sparse coding reconstruction (RSCR). Sparse representation has been widely studied in many applications of computer vision, such as image classification, detection and image retrieval [16,17].
We compute the RSCR as follows:
$$\mathrm{RSCR}(L_q, L_r) = \|L_q - L_r \Gamma_{q,r}\|_2, \tag{4}$$
where $\Gamma_{q,r} \in \mathbb{R}^{N_r}$ is a code that linearly combines the columns of the probe $L_r$ and is optimized to reconstruct the query $L_q$ as follows:
$$\Gamma_{q,r} = \underset{\Gamma \in \mathbb{R}^{N_r}}{\arg\min} \; \|L_q - L_r \Gamma\|_2^2 + \alpha \|\Gamma\|_1, \tag{5}$$
where $\|\cdot\|_1$ is the $L_1$ norm enforcing the sparsity of the code, controlled by the coefficient $\alpha \in [0, 1]$.

Extension to track-to-track re-identification
As an extension of I2TP, and referring to Figure 2, T2TP aims at measuring the distance between a probed track $T_r$ containing several images and the query track $T_q$. Here, the LRs of $T_r$ and $T_q$ are respectively represented by $L_r \in \mathbb{R}^{f \times N_r}$ and $L_q \in \mathbb{R}^{f \times N_q}$. Therefore, the main challenge with T2TP is to define metrics that are able to measure the distance between two tracks of different sizes.

MED and MCD for T2TP
We extend the MED and MCD metrics to T2TP as follows. First, considering a distance metric $d$ (e.g. MED or MCD), we construct a set of distances $D_{q,r} = \{d(L_{q,j}, L_r) \mid j \in \{1, ..., N_q\}\}$ containing the $N_q$ values of $d$ computed between each column vector $L_{q,j}$ of $L_q$ and $L_r$. Then, we compute the overall distance between $T_q$ and $T_r$ by defining an aggregation function $g : \mathbb{R}^{N_q} \to \mathbb{R}$ that aggregates the elements of $D_{q,r}$ into a scalar.
In our experiments, we used the following aggregation functions: minimum, mean and median. The minimum function consists of selecting the best image-to-image match between the query and the probed track, without taking into account the other images. Such a function is therefore expected to be more efficient when looking for two tracks containing images with very similar points of view. The median function also considers one image-to-image match, while promoting tracks for which at least half of the elements are similar to the query. In contrast, the mean function aggregates all elements of $D_{q,r}$, promoting tracks for which each image is similar to at least one image of the query, which can make it sensitive to queries with more variability. With $d = \mathrm{MED}$, we denote by minMED, meanMED and medMED the T2TP metrics using respectively the minimum, mean and median aggregation functions. Similarly, with $d = \mathrm{MCD}$, we denote the T2TP metrics by minMCD, meanMCD and medMCD. In addition, because some images of a track can be irrelevant for T2TP, we also consider truncated versions of the mean and median, computed using only the $N_q/2$ smallest distances within $D_{q,r}$. With $d = \mathrm{MED}$, these metrics are denoted mean50MED and med50MED. Similarly, with $d = \mathrm{MCD}$, these metrics are denoted mean50MCD and med50MCD.
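The construction of the per-image distances and their aggregation can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions: the function names are hypothetical, and for an odd number of query images the truncated variants keep the floor of N_q/2 smallest distances (at least one), a rounding choice that the text does not specify.

```python
import numpy as np

def per_image_distances(l_q, l_r, metric="euclidean", eps=1e-12):
    """Build D_{q,r}: for each query image (column of l_q), the minimum
    distance to any image (column) of the probed track l_r."""
    if metric == "euclidean":
        diff = l_q[:, :, None] - l_r[:, None, :]       # (f, N_q, N_r)
        pairwise = np.linalg.norm(diff, axis=0)        # (N_q, N_r)
    elif metric == "cosine":
        q_unit = l_q / np.maximum(np.linalg.norm(l_q, axis=0, keepdims=True), eps)
        r_unit = l_r / np.maximum(np.linalg.norm(l_r, axis=0, keepdims=True), eps)
        pairwise = 1.0 - q_unit.T @ r_unit             # (N_q, N_r)
    else:
        raise ValueError(f"unknown metric: {metric}")
    return pairwise.min(axis=1)

def aggregate(d_qr, how="min"):
    """Aggregate the per-image distances into one track-to-track distance."""
    d_sorted = np.sort(np.asarray(d_qr, dtype=float))
    # Assumption: keep floor(N_q / 2) smallest distances, and at least one.
    half = d_sorted[: max(1, d_sorted.size // 2)]
    if how == "min":
        return float(d_sorted[0])
    if how == "mean":
        return float(d_sorted.mean())
    if how == "med":
        return float(np.median(d_sorted))
    if how == "mean50":
        return float(half.mean())
    if how == "med50":
        return float(np.median(half))
    raise ValueError(f"unknown aggregation: {how}")

# Toy tracks with f = 2: query columns (0, 0) and (3, 4), probe columns (0, 1) and (3, 0).
l_q = np.array([[0.0, 3.0], [0.0, 4.0]])
l_r = np.array([[0.0, 3.0], [1.0, 0.0]])
d_qr = per_image_distances(l_q, l_r)   # one minimum distance per query image
print(d_qr, aggregate(d_qr, "min"), aggregate(d_qr, "mean"))
```

With `metric="euclidean"` the aggregations correspond to the MED-based family (minMED, meanMED, and so on), and with `metric="cosine"` to the MCD-based family.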

RSCR for T2TP
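Interestingly, since sparse coding is designed to reconstruct matrices, RSCR can easily be extended to track-based queries by rewriting equation (4) for a matrix $L_q$:
$$\mathrm{RSCR}(L_q, L_r) = \|L_q - L_r \Gamma_{q,r}\|_F, \tag{6}$$
where $\|\cdot\|_F$ denotes the Frobenius norm, and where the sparse code $\Gamma_{q,r} = [\Gamma_{q_1,r}, ..., \Gamma_{q_{N_q},r}] \in \mathbb{R}^{N_r \times N_q}$ is computed by iteratively solving equation (5) for each column $\Gamma_{q_i,r} \in \mathbb{R}^{N_r}$ of $\Gamma_{q,r}$, such that:
$$\Gamma_{q_i,r} = \underset{\Gamma \in \mathbb{R}^{N_r}}{\arg\min} \; \|L_{q,i} - L_r \Gamma\|_2^2 + \alpha \|\Gamma\|_1. \tag{7}$$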
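As an illustration of the residual computation, the sketch below solves the sparse coding problem with a plain iterative soft-thresholding scheme (ISTA) in NumPy. Note the swap: the experiments reported in this paper rely on the lasso-LARS solver, not on ISTA, and solvers scale the objective differently (here 0.5 times the squared reconstruction error plus alpha times the L1 norm), so values of alpha are not interchangeable between implementations. The function names, the default `alpha` and the iteration count are assumptions of this sketch.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_code(l_q, l_r, alpha=0.1, n_iter=500):
    """ISTA for min_G 0.5 * ||l_q - l_r @ G||_F^2 + alpha * ||G||_1.
    The problem is separable over the columns of l_q, so all columns
    of the code are updated at once."""
    l_q = l_q.reshape(l_r.shape[0], -1)          # (f, N_q); a vector query gives N_q = 1
    gamma = np.zeros((l_r.shape[1], l_q.shape[1]))
    lipschitz = np.linalg.norm(l_r, 2) ** 2      # squared largest singular value of l_r
    if lipschitz == 0.0:
        return gamma
    for _ in range(n_iter):
        gradient = l_r.T @ (l_r @ gamma - l_q)
        gamma = soft_threshold(gamma - gradient / lipschitz, alpha / lipschitz)
    return gamma

def rscr(l_q, l_r, alpha=0.1, n_iter=500):
    """Residual of the sparse reconstruction of the query from the probe track.
    For a single-image query the Frobenius norm reduces to the L2 norm."""
    gamma = sparse_code(l_q, l_r, alpha, n_iter)
    residual = l_q.reshape(l_r.shape[0], -1) - l_r @ gamma
    return float(np.linalg.norm(residual))

# Toy probe track (f = 3, N_r = 2) with orthonormal columns.
l_r = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
in_span = np.array([2.0, 0.0, 0.0])      # can be reconstructed from the probe
out_of_span = np.array([0.0, 0.0, 3.0])  # orthogonal to every probe image
track_query = np.stack([in_span, out_of_span], axis=1)  # (f, N_q = 2)

print(rscr(in_span, l_r), rscr(out_of_span, l_r), rscr(track_query, l_r))
```

A query that lies close to the span of the probe images yields a small residual, while a query that the probe cannot reconstruct keeps a large one; the same function handles image queries and track queries.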

Kernel distances
As a natural extension of distance measurements between two sets of vectors (i.e. LRs of tracks), we also propose to evaluate kernel distance metrics [18,19]. The kernel distance measures the global distance between two tracks according to a given similarity kernel function $k$. The kernel distance $D_k$ between $L_q$ and $L_r$ is defined as:
$$D_k(L_q, L_r) = \sqrt{\sum_{i=1}^{N_q}\sum_{j=1}^{N_q} k(L_{q,i}, L_{q,j}) + \sum_{i=1}^{N_r}\sum_{j=1}^{N_r} k(L_{r,i}, L_{r,j}) - 2\sum_{i=1}^{N_q}\sum_{j=1}^{N_r} k(L_{q,i}, L_{r,j})},$$
where $k$ is a positive definite kernel function, measuring the similarity between two vectors (here LRs), such that $k(L_x, L_x) = 1$ and $k(L_x, L_y)$ decreases when the distance between $L_x$ and $L_y$ increases.
In our experiments, we tested two kernels: the radial basis function (RBF), defined as $k(L_x, L_y) = e^{-\gamma \|L_x - L_y\|_2^2}$ (with $\gamma \in \mathbb{R}^+$ the spread parameter of the function), and the cosine similarity (CoS), defined in Section 3.2. We denote these kernel distances KRBF and KCOS, respectively.
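A possible NumPy sketch of a kernel distance between two tracks is given below. It assumes the standard unnormalised form (sum of within-track similarities minus twice the cross-track similarities, followed by a square root); whether the sums should additionally be normalised by the track sizes is not stated here, so this choice, like the function names, is an assumption of the sketch.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Pairwise RBF similarities between the columns of a (f, N_a) and b (f, N_b)."""
    squared = (np.sum(a ** 2, axis=0)[:, None]
               + np.sum(b ** 2, axis=0)[None, :]
               - 2.0 * a.T @ b)
    return np.exp(-gamma * np.maximum(squared, 0.0))

def cosine_kernel(a, b, eps=1e-12):
    """Pairwise cosine similarities between the columns of a and b."""
    a_unit = a / np.maximum(np.linalg.norm(a, axis=0, keepdims=True), eps)
    b_unit = b / np.maximum(np.linalg.norm(b, axis=0, keepdims=True), eps)
    return a_unit.T @ b_unit

def kernel_distance(l_q, l_r, kernel):
    """Kernel distance between two tracks given a pairwise kernel function."""
    squared = (kernel(l_q, l_q).sum() + kernel(l_r, l_r).sum()
               - 2.0 * kernel(l_q, l_r).sum())
    return float(np.sqrt(max(squared, 0.0)))   # clip tiny negative round-off

# Toy tracks with f = 2 and a single image each.
f = 2
track_a = np.array([[0.0], [0.0]])
track_b = np.array([[10.0], [0.0]])

def rbf(a, b):
    # gamma = 1 / f, the setting reported in the implementation details.
    return rbf_kernel(a, b, gamma=1.0 / f)

print(kernel_distance(track_a, track_a, rbf))  # identical tracks
print(kernel_distance(track_a, track_b, rbf))  # distant tracks
```

Passing `rbf` gives a KRBF-style distance and passing `cosine_kernel` a KCOS-style one; both compare all images of both tracks at once rather than selecting a best match.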

Experiments
We evaluated the impact of the distance metrics on I2TP and T2TP performances by running experiments on the large-scale benchmark dataset VeRi [3].
We conducted our experiments as follows. First, we used the training set of the VeRi dataset to fine-tune five well-known CNN architectures, in order to specialize them in the vehicle recognition task. We then used these fine-tuned CNNs to extract the LR of every image of the testing set. Second, we evaluated I2TP with respect to the distance metrics defined in Section 3. T2TP performance was also evaluated with the metrics defined in Section 4.

The VeRi dataset
The VeRi dataset is composed of 49357 images of 776 vehicles recorded by 20 cameras in a real-world traffic surveillance system. Every vehicle of the dataset has been recorded by several of the 20 cameras of the system, for a total of 6822 tracks of vehicles (each track is composed of a mean number of 6 images, varying from 3 to 14 images). The VeRi dataset is divided into two sets: a training set, composed of 37778 images representing 576 vehicles (5145 tracks), and a testing set, composed of 11579 images representing 200 vehicles (1677 tracks). Evaluation of I2TP is performed through 1677 query images, one pre-selected in each track of the testing set. Evaluation of T2TP is conducted using the 1677 tracks of the testing set. Since I2TP and T2TP both rely on the comparison of a query (either a single image from a track or the whole track, taken from the testing set) to all other tracks of the testing set, their performances remain comparable.
In order to comply with the input dimensions of the CNNs, every image of the VeRi dataset was resized to 224×224. The dimensions of the second-to-last layer of ResNet18, VGG16, AlexNet, InceptionV3 and DenseNet201 are respectively 512, 4096, 4096, 2048 and 1920.

Fine-tuning for vehicle recognition
To fine-tune the CNN models, we proceeded as follows. We replaced the last layer of each CNN architecture with a fully-connected layer of 576 neurons, and trained each network to classify the 576 vehicles of the VeRi training set. Back-propagation was performed using the cross-entropy loss function. Weight optimization was performed using classical stochastic gradient descent (learning rate set to 0.001, momentum set to 0.9). Each network was trained for 50 epochs.

Evaluation protocol
To evaluate the vehicle ranking, we use the Cumulative Matching Characteristic (CMC) curve which is widely used in object re-identification [3,4]. We reported the two measures rank1 and rank5 of the CMC curves, corresponding respectively to the precision at rank 1 and 5.
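As a small illustration of how rank1 and rank5 can be read off the ranked lists, the sketch below uses the usual CMC convention, namely the fraction of queries for which at least one correct track appears among the first k ranks. That convention, the function name and the 0/1 flag encoding are assumptions of this sketch.

```python
def cmc_rank_k(ranked_matches, k):
    """Fraction of queries with at least one correct track in the top k.

    ranked_matches: one list per query of 0/1 flags ordered from the
    nearest to the farthest track (1 = same vehicle as the query)."""
    hits = sum(1 for flags in ranked_matches if any(flags[:k]))
    return hits / len(ranked_matches)

# Toy ranked lists for four queries.
ranked_matches = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]
print(cmc_rank_k(ranked_matches, 1))  # 0.25
print(cmc_rank_k(ranked_matches, 5))  # 0.75
```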
Regarding the VeRi dataset, since several tracks correspond to each query, we also computed the mean average precision (mAP), which is classically used in vehicle re-identification evaluation. mAP takes recall and precision into account to evaluate the overall vehicle re-identification. Given a query $q$ and a resulting ranked set $\tilde{\mathcal{T}}_q$, the average precision (AP) is computed as
$$AP(q) = \frac{1}{N_{gt}} \sum_{i=1}^{n_t} \delta(T_{q,i}) \, \frac{\sum_{j=1}^{i} \delta(T_{q,j})}{i},$$
where $\delta(T_{q,i})$ is a function equal to 1 if the track $T_{q,i}$ represents the vehicle $V_q$, and 0 otherwise. $N_{gt}$ is the number of tracks representing the query vehicle $V_q$.
We computed mAP as the mean of the AP computed for every query:
$$mAP = \frac{1}{N_Q} \sum_{q=1}^{N_Q} AP(q),$$
with $N_Q$ being the number of queries performed with the dataset ($N_Q = 1677$ with the VeRi dataset).
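The AP and mAP computations can be sketched in plain Python. The sketch assumes that every ground-truth track of the query appears somewhere in its ranked list, so that the number of positive flags equals N_gt; the function names and the 0/1 flag encoding are illustrative.

```python
def average_precision(flags):
    """AP of one ranked list of 0/1 flags (1 = track of the query vehicle),
    ordered from the nearest to the farthest track."""
    n_gt = sum(flags)  # assumption: all ground-truth tracks are in the list
    if n_gt == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, flag in enumerate(flags, start=1):
        if flag:
            hits += 1
            total += hits / rank   # precision at this rank
    return total / n_gt

def mean_average_precision(all_flags):
    """Mean of the AP values over all queries."""
    return sum(average_precision(flags) for flags in all_flags) / len(all_flags)

# Two toy queries.
first = [1, 0, 1, 0]   # AP = (1/1 + 2/3) / 2
second = [0, 1]        # AP = (1/2) / 1
print(average_precision(first), average_precision(second))
print(mean_average_precision([first, second]))
```

Unlike rank1 and rank5, AP rewards retrieving every track of the query vehicle early in the list, which is why mAP can change even when the top-ranked match does not.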

Implementation details
CNN architecture construction and training were implemented using the PyTorch framework in Python [28].
Regarding the RSCR, we solved equations (5) and (7) using the lasso-LARS algorithm (Lasso model with an $L_1$ regularization term, fitted with Least Angle Regression) [29], with $\alpha = 1$. We computed the kernel distance KRBF with $\gamma = 1/f$, $f$ being the LR dimension of the considered CNN. Distance metric computations were implemented using the scikit-learn package in Python. Source codes for LR extraction (Section 2.2), distance metric computations (Sections 3 and 4) and vehicle ranking (Section 2.3) are available at <will be publicly available>.
Figure 3: Image-to-track mAP results depending on the CNN architecture and the distance metrics used. The higher, the better.
Table 1 reports the performance obtained with the metrics tested in I2TP (MED, MCD and RSCR), depending on the CNN (AlexNet, VGG16, ResNet18, DenseNet201 and InceptionV3). Figure 3 depicts the mAP results obtained.

Discussion and perspectives
From a general point of view, we can observe high variability of performance between CNNs. As expected, such results confirm the impact of the CNN architectures on the re-identification performance. This demonstrates the relevance of previous works focusing on the definition of specific CNN architectures and on the learning of efficient LR.
Besides, considering a given CNN architecture to produce LR, our results also show high variability of performance depending on the distance metric, showing that the choice of the metric for the matching process has a major impact on re-identification performance.

Impact of the metric on I2TP

Limitations of MED
Overall, there is a clear gain of performance from MED to MCD (mAP gain ranging from +2.02% to +5.79%). More precisely, we can observe a large difference of performance between MED and MCD/RSCR, especially when associated with AlexNet and VGG16. This could be related to the higher dimension of the LRs produced by these CNNs ($\mathbb{R}^{4096}$), potentially more affected by the curse of dimensionality [30] than those of the other CNNs ($\mathbb{R}^{512}$, $\mathbb{R}^{1920}$ and $\mathbb{R}^{2048}$). Therefore, besides the obvious differences of performance between CNN architectures, we argue that such a dimensionality-performance relationship could have limited MED-based results in the literature. For instance, with their RAM architecture, Liu et al. [8] concatenated vectors of features into a single vector of dimension greater than 6000. Thus, we think that the use of the MED metric during their matching process may have reduced the performance of their system, which could be improved with a more appropriate metric (e.g. MCD).

Performance of MCD
The cosine measure has been shown to be a powerful metric when dealing with high-dimensional features [31], in various applications [32,33]. In our I2TP experiments, the MCD metric clearly outperforms MED in terms of mAP, and remains similar regarding rank1 and rank5. This can be interpreted as MCD providing a better overall ranking of vehicles, improving the retrieval of the other correct tracks that are not in the first ranks, without impacting the retrieval of top-ranked vehicles. In addition, MCD adapts well to various feature dimensions (from $\mathbb{R}^{512}$ to $\mathbb{R}^{4096}$). Therefore, we think that cosine-based distances can be considered as an interesting, and easy to implement, alternative to MED.

Performance improvement with T2TP
From a general point of view, T2TP outperforms I2TP independently of the metric (with the exception of KRBF and KCOS, which are not computed with I2TP). The mAP gain is respectively +0.34%±2.63 for the MED-based metrics, +4.07%±0.85 for the MCD-based metrics, and +3.37%±3.11 for the RSCR. These results clearly illustrate the benefit of using track-based queries to help the re-identification process. Such a gain of performance was to be expected, since a track-based query (T2TP) contains more visual information than an image-based query (I2TP). Nevertheless, we can observe that the gain of performance is higher with MCD-based and RSCR metrics than with MED-based metrics (with the exception of DenseNet201 for RSCR). In addition, T2TP-specific metrics (KRBF and KCOS) performed poorly compared to the others, indicating that global track-to-track distance measurements, which take into account all the images of both tracks, seem to be less effective than more "selective" ones. Thus, the results show that a significant improvement of performance with T2TP can only be obtained when it is combined with a relevant and adapted metric.

Aggregation function
Results show that the extensions of I2TP metrics to T2TP (MED- and MCD-based metrics) seem more effective than T2TP-specific metrics (KRBF and KCOS). However, the generalization of MED and MCD to T2TP is not straightforward and, in the absence of a priori knowledge on the vehicle tracks, requires an arbitrary choice of aggregation function. In our experiments, the aggregation functions min and mean50 show the best overall performance. Like MED and MCD in I2TP, the min function consists in selecting the best image-to-image distance among all pairs of images, focusing the re-identification on the best possible match between the query and a probed vehicle. Therefore, the performance obtained with this metric depends on the existence of similar images between tracks of the same vehicle. Alternatively, the aggregation function mean50 has the advantage of aggregating the distances between query and probed track images, while discarding irrelevant images contained in the query track. Such an aggregation function is thus expected to be more robust when tracks of the same vehicle share few similar images. Nevertheless, since the VeRi dataset mainly contains tracks with similar images, such effects are hard to evaluate. Further experiments including more diversity in tracks of vehicles are thus needed. For instance, the CompCars [34] and the Tocada [35] datasets provide tracks of vehicles containing different points of view (e.g. a track containing images of the vehicle in front and side views). Although these datasets are not designed to assess re-identification performance as VeRi is, they could be used to evaluate the effect of using more diverse images within tracks (more viewpoints of the vehicles, lack of similar images, etc.), and hence the benefit of T2TP.
Note: Values are in percentages. The higher, the better.

Advantages of RSCR
Despite the relatively poor results obtained with RSCR (compared to the better-performing MCD-based metrics), we think that sparse coding reconstruction remains an interesting method to explore in the context of LR-based re-identification. First, RSCR has the advantage of being directly usable for both I2TP and T2TP, without having to define any arbitrary aggregation function (as with MED- and MCD-based metrics), or to perform a global comparison between tracks (as with kernel distances). Second, unlike other distance metrics, RSCR is based on linear combinations (the sparse coding reconstruction) of LRs, which are expected to induce complex semantic operations between the visual cues present in the images. Mikolov et al. [36] in the domain of word representation and Radford et al. [37] in synthetic image generation showed that simple arithmetic operations between objects in the latent spaces of DNNs can correspond to complex transformations between semantic concepts. In our context of vehicle re-identification, the linear combination performed with RSCR can be viewed as a combination of the various existing points of view of a given vehicle, which could potentially produce LRs corresponding to unseen points of view of the vehicle. Hence, in contrast to other metrics, RSCR could be more robust to the absence of similar images between tracks. In addition, the sparsity constraint restricts this linear combination to the most useful LRs, avoiding the use of irrelevant images (e.g. images of a vehicle in back view to retrieve a vehicle seen in front view, noisy images, etc.) and/or redundant information (e.g. a stationary vehicle) in the reconstruction.
Future work will focus on evaluating the advantages of using RSCR, and more generally metrics based on linear combinations of LRs, in the context of vehicle re-identification.

Comparison with the state-of-the-art methods
Table 3 presents our best results (I2TP and T2TP) along with the vehicle re-identification performance reported on the VeRi dataset in the literature.
First, using only visual information (LR), the method combining DenseNet201 and MCD (in I2TP) outperforms FACT and nuFACT [5], which use a combination of the visual aspect and contextual information. The method DenseNet201+MCD also outperforms the state-of-the-art RAM "baseline" [8], which only uses the global visual aspect of vehicles (as in our approach). These first results highlight the importance of the metric in the re-identification process, indicating that MCD is a more relevant metric than MED in LR-based vehicle re-identification.
Second, the method combining DenseNet201 and mean50MCD in T2TP outperforms the state-of-the-art RAM and QD_DLF methods [8,9] in terms of mAP ([+1.35%, +1.7%]) and rank5 ([+2.56%, +3.02%]). Considering the performance improvement obtained with only global visual information from vehicle images (no local features, no metadata/contextual information) and the very simple learning procedure used in our experiments (fine-tuning of standard CNN architectures), we argue that a relevant metric (MCD), combined with the use of more visual cues of the query vehicle (T2TP), could easily improve the performance of state-of-the-art methods that are specifically designed for vehicle re-identification.
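To make the compared quantities concrete, the following minimal sketch computes image-to-track distances with MED and MCD and extends them to a track-to-track distance through an aggregation function. The function names are hypothetical, and the plain mean used as the default aggregator is only a placeholder assumption: it is not the mean50 variant reported above. The two-dimensional vectors stand in for real LRs.

```python
import math

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    # One minus the cosine similarity; assumes non-zero vectors.
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def image_to_track(query, track, distance):
    """I2TP: minimal distance between one query LR and the LRs of a track.

    Gives MED with `euclidean_distance` and MCD with `cosine_distance`.
    """
    return min(distance(query, lr) for lr in track)

def mean(values):
    return sum(values) / len(values)

def track_to_track(query_track, gallery_track, distance, aggregate=mean):
    """T2TP: aggregate the image-to-track distances of every query image.

    `aggregate` is a placeholder (assumption); any function mapping a list
    of distances to a single score can be plugged in.
    """
    per_image = [image_to_track(q, gallery_track, distance) for q in query_track]
    return aggregate(per_image)

# Toy example in which the two metrics rank the gallery tracks differently.
query = [1.0, 0.0]
track_a = [[3.0, 0.0]]  # same direction as the query, larger magnitude
track_b = [[1.0, 1.0]]  # closer in Euclidean terms, different direction

med_a = image_to_track(query, track_a, euclidean_distance)
med_b = image_to_track(query, track_b, euclidean_distance)
mcd_a = image_to_track(query, track_a, cosine_distance)
mcd_b = image_to_track(query, track_b, cosine_distance)

# Track-to-track distance for a two-image query track.
t2t_mcd_a = track_to_track([[1.0, 0.0], [2.0, 0.0]], track_a, cosine_distance)
```

Here MED ranks `track_b` first while MCD ranks `track_a` first, because the cosine distance ignores the magnitude of the feature vectors. This only illustrates that the two metrics can disagree; it does not by itself explain the gains reported in Table 3.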

Limitation of visual-only based re-identification
As stated and studied in [4,5,6], the qualitative examples presented in Figure 5 confirm that visual-only methods remain limited in their capacity to distinguish visually similar vehicles. For example, the model was not able to discriminate between two similar yellow trucks carrying rocks and sand, respectively. This is possibly due to the use of global visual-only features, which limits the detection of details. To overcome this limitation, the use of region-based features, as in [8], could allow the detection of small details that differ between two similar vehicles, and increase the re-identification performance. In addition, visual-only methods struggle to discriminate between two similar cars of the same color and model (see the black car example in Figure 5). In such cases, the use of contextual metadata, such as spatio-temporal information and/or the licence plate, as in [6] and [5], is required to better discriminate between similar vehicles.

Conclusion
Recent studies on vehicle re-identification have focused on the extraction of latent representations (LRs) of vehicles, i.e. feature vectors extracted from the latent space of convolutional neural networks (CNNs), to discriminate vehicles by their visual appearance in order to retrieve a given vehicle. These previous works performed re-identification by comparing the LRs of vehicles using metrics based on the Euclidean distance (or a variant), which is known to be poorly suited to high-dimensional spaces (such as CNN latent spaces). They focused on re-identification in an image-to-track process (I2TP), using only one image of a query vehicle to retrieve a track (a set of images) representing this vehicle.
In this paper, we first studied the impact of the metric used for vehicle re-identification by comparing the performance obtained with different metrics, restricting ourselves to visual-only re-identification processes (no extra or contextual information). We tested metrics based on the minimal Euclidean distance (MED), the minimal cosine distance (MCD), and the residual of the sparse coding reconstruction (RSCR). We applied these metrics to features extracted through five different CNN architectures (namely ResNet18, AlexNet, VGG16, InceptionV3 and DenseNet201). We used the dedicated vehicle re-identification dataset VeRi to fine-tune these CNNs and to evaluate the results.
Results show a major impact of the metric on the re-identification performance. Overall, independently of the CNN used, MCD outperforms MED (mAP: [+2.02%, +5.79%]). This result is of great importance since the literature commonly uses MED (or a variant).
In a second part, we investigated extending the state-of-the-art I2TP to a track-to-track process (T2TP). Indeed, in real applications, users are faced with video segments (vehicle tracks) rather than single vehicle images. T2TP grounds the re-identification on the available visual data (the vehicle track) and enhances the process without using additional metadata (contextual features, spatio-temporal information, etc.). We extended the metrics to measure the distance between tracks, allowing T2TP to be evaluated and compared with I2TP (using the same five CNN models).
To conclude, our experiments highlight the importance of the metric choice in vehicle re-identification. In addition, T2TP improves the vehicle re-identification performance (compared to I2TP), especially when coupled with MCD-based metrics.
More experiments are needed to evaluate the T2TP gain: i) strengthening the T2TP results through the use of contextual metadata, ii) exploring more metrics based on linear combinations of LRs, and iii) evaluating the impact of track diversity by using other datasets.
As the practice of vehicle re-identification tends to favour queries based on tracks rather than images, we argue for considering T2TP (in addition to, or in place of, I2TP) in future vehicle re-identification work.