Parking recommender system privacy preservation through anonymization and differential privacy

Recent advancements in the Internet of Things (IoT) have enabled the development of smart parking systems that use the services of a third-party parking recommender system to provide personalized parking spot recommendations to users based on their past experience. However, the indiscriminate sharing of users' data with an untrusted (or semitrusted) parking recommender system may breach their privacy, because users' behavior and mobility patterns can be inferred by analyzing their past history. Therefore, in this article, we present two solutions that preserve the privacy of users in parking recommender systems while analyzing their past parking history, using the k-anonymity (anonymization) and differential privacy (perturbation) techniques. Specifically, given an original parking database containing users' parking information, the k-anonymity mechanism constructs an anonymized database, while differential privacy perturbs the query response using the Laplace mechanism, making the users indistinguishable in both approaches and hence preserving their privacy. Experimental results on a data set constructed from real parking measurements evaluate the trade-off between privacy and utility, thereby enabling users to receive parking spot recommendations while preserving their privacy.

hence it is better to exploit the services of third-party recommender systems dedicated to this purpose. This implementation is currently less widely adopted; however, it is gaining attention with the horizontal and vertical emergence of IoT and smart city applications, 2 as well as with interoperability in IoT, which interconnects various applications/deployments and hence also interconnects third-party recommender systems with the smart parking system. For instance, the recent EU-KR H2020 WISE-IoT project 3 enabled interoperability between two IoT platforms, FIWARE and oneM2M, which are widely used in Europe and South Korea, respectively. It demonstrated such interoperability through a smart parking use case by adopting the second type of implementation. In this demonstration, a smart parking application operates both in Europe and South Korea. When in Europe, it connects to the FIWARE platform and recommender system to obtain recommendations of parking spots; when in South Korea, it connects to the oneM2M infrastructure and recommender system instead. 4 In this study, our focus is on the second type of implementation. Both types of implementation consider the smart parking application, which is responsible for receiving user requests and maintaining a parking database, to be trustworthy. However, the second type of implementation has an additional third-party parking recommender system. Since little is known about the third-party parking recommender system, its trustworthiness cannot be established, and it could be trusted, semitrusted, or untrusted. The parking database contains the user ID and user's current location (obtained from the user's request), the parking spot (obtained from the recommender system), the user rating (obtained from the user after completing the parking), and the current timestamp (the time of the user's request).
This information is stored in the parking database in order to provide personalized recommendations to users based on their past parking behavior and experience.
We focus on preserving privacy within the parking database, whose record of users' parking history could be used to infer users' behavior and mobility patterns. We assume that when the application sends the user's current location to the parking recommender to obtain a parking spot, it sends a perturbed user location obtained by applying differential privacy (eg, geo-indistinguishability 5 ); hence, the parking recommender does not get the actual location of the user, and privacy is already preserved in the case of the user's request. To preserve the privacy of statistical databases, there is an emerging interest in the k-anonymity and differential privacy techniques, which preserve privacy through anonymization and perturbation, respectively. 6,7 k-anonymity 6 is the earliest work on privacy preservation; it anonymizes a data set in such a way that, with respect to the set of quasi-identifier attributes (ie, attributes that can identify individuals when combined together), each record (or row) is indistinguishable from at least k − 1 other records. Differential privacy, instead, operates on the principle of data perturbation by adding noise to the query result. 7 Therefore, the parking recommender system is neither able to differentiate among multiple records (in k-anonymity) nor able to find the actual query result (in differential privacy), making the users unidentifiable and indistinguishable in both cases. k-anonymity and differential privacy are formally defined and discussed in detail in Sections 4.1.4 and 4.2, respectively.
Our main contribution in this article is to preserve the privacy of users in the parking recommender system while analyzing their past parking history, using the k-anonymity (anonymization) and differential privacy (perturbation) techniques. Specifically, given an original parking database containing users' parking information, the k-anonymity mechanism constructs an anonymized database, while differential privacy perturbs the query response using the Laplace mechanism, making the users indistinguishable in both approaches and hence preserving their privacy. To evaluate the performance, we performed experiments on a data set constructed from real parking measurements that evaluate the trade-off between privacy and utility, thereby enabling users to receive parking spot recommendations while preserving their privacy. To the best of our knowledge, these two privacy preservation techniques have not been applied and evaluated before from the perspective of preserving the privacy of users in the parking database. 8,a This article is organized as follows. Section 2 presents the related work. Section 3 presents the system and adversary models. Section 4 describes the privacy preservation techniques of k-anonymity and differential privacy. Section 5 describes the experiments for the evaluation of k-anonymity and differential privacy. Finally, Section 6 concludes the article.

RELATED WORK
For privacy preservation in current smart parking systems, the major focus of existing works is on protecting real-time user location and navigation information using cryptography, pseudonymity, encryption, and consortium blockchains. The protection of the historical parking database, which is the focus of our study, has not been investigated much.
a Reference 8 is the PhD thesis of Y.S., where some of this work is presented. See reference list for the link.
For instance, Ni et al 9,10 preserve the privacy of parking navigation using Bloom filters, enabling a user (or vehicle) to receive the navigation results even if the user has moved out of the range of the queried roadside unit. They preserve privacy using pseudonymity, in which the users query, in an anonymous manner, the cloud server that handles the information on available parking spots.
Chatzigiannakis et al 11 preserve the privacy of a smart parking system by using a public key cryptography scheme, elliptic curve cryptography, which is suitable for resource-constrained devices and is platform independent. The authors used zero-knowledge proofs, which avoid the exchange of confidential information, hence achieving privacy. The authors evaluated the performance by studying the execution time and system overhead. However, they did not evaluate the privacy and utility of their proposed system.
Huang et al 12 worked on an automated valet parking system for which a parking reservation is a prerequisite in order to achieve automated parking. The authors worked on preserving the private information of drivers (eg, identity and locations) that is revealed by the reservation requests, by removing the user identity and making the requests anonymous. However, making the users anonymous causes a security problem, for example, the double-reservation attack. The authors address this security issue by allowing each anonymous user to possess only one reservation token that can be used to reserve one available parking spot. In this way, the authors claimed to preserve the privacy of the user's identity and location, as well as to avoid the double-reservation attack, using zero-knowledge proofs and proxy re-signatures. However, the authors mainly preserve privacy by using pseudonymity (making the user anonymous) and did not evaluate the privacy and utility of their system.
Lu et al 13,14 designed a smart parking system for large parking lots using vehicular communications that offers real-time navigation and anti-theft protection. The authors preserve the privacy of users by keeping their identities secret, that is, by using pseudonymity. However, only protecting the explicit identifier is not sufficient, because an adversary could still identify the users uniquely through linking and disclosure attacks.
Yan et al 15 designed a privacy preserving parking system that relies on wireless network and sensor communications and allows users to reserve parking spots. The authors protect privacy by using an encryption technique.
Alqazzaz et al 16 proposed a privacy preserving and secure smart parking framework based on a publish/subscribe mechanism. It provides two functions. First, it offers parking services, for example, parking availability, navigation, and parking reservation. Second, it offers security on the application and network layers, as well as privacy preservation. The authors protect privacy using an encryption technique, which is basically a security mechanism.
Garra and Mart 17 implemented an anonymous e-coin system that protects the privacy of a parking system and offers pay-by-phone without disclosing the parking start and end times.
Hu et al 18 proposed a blockchain-based parking system using smart contracts that preserves privacy through a consortium blockchain in which the transactions are controlled by the legitimate nodes and are not disclosed to external entities.
Besides privacy in parking systems, much work has been done on privacy in mobility data. Although mobility data are mostly about trajectories and are not in the scope of parking data, we discuss some works on privacy in mobility data for readers who are interested in considering privacy in mobility data together with parking systems. Nowadays, due to various location-based services, users' mobility data are recorded and sensed continuously, which raises serious privacy concerns. Mobility data can disclose users' behaviors, routines, and mobility patterns. Giannotti et al 19 worked on mobility data and its privacy preservation by first providing the basic concepts of data privacy and then discussing privacy in the data analysis of offline mobility data. They then presented how privacy in data mining can be achieved by design. Monreale et al 20 proposed a system for preserving the privacy of trajectory mobility data using k-anonymity and generalization. Shao et al 21 proposed two approaches for privacy preservation using differential privacy while publishing trajectory data of ships, as well as a comparative investigation of these two approaches. Torra 22 wrote a book on data privacy covering the basics, developments, and the Big Data challenge. This book covered the perspectives of statistics and machine learning, classified privacy preservation methods, privacy risk disclosure measures and methods, masking methods, and finally the information loss in privacy and utility. D'Acquisto et al 23 presented an overview of achieving privacy preservation by design in Big Data. This work was done in the framework of the European Union Agency for Network and Information Security (ENISA). Pratesi et al 24 proposed a framework for assessing privacy risk vs utility in data sharing systems. This framework allows assessing the guarantee of data quality, as well as an empirical study of the privacy risk for the users in the data.
Agrawal 25 highlighted the need of privacy preservation and data ownership in data mining, Hippocratic databases, and sovereign information sharing.
Additionally, we have recently published a work that analyzed various machine learning (ML) and deep learning (DL) models for the prediction of the availability of parking spots. 26 However, our current work differs from that previous work in that the previous work is mainly about the prediction of the availability of parking spots and does not preserve the privacy of parking data. The current article, on the other hand, is mainly about preserving the privacy of users' parking data. Moreover, in this article, we apply two privacy preservation techniques, k-anonymity and differential privacy, for preserving the users' privacy in the parking database, whereas our previously published work 26 performed a comparative analysis of four ML/DL models, namely, a multilayer perceptron (MLP) neural network, K-nearest neighbors (KNN), decision tree and random forest, and an ensemble learning approach (voting classifier), for the prediction of parking spot availability in the next 30 minutes.
The abovementioned works on privacy preservation for smart parking, first, focus on real-time user location and navigation information. Second, they preserve privacy using pseudonymity, cryptography, and encryption techniques, which are prone to privacy leakage through linking and disclosure attacks, as shown in the literature. We, on the other hand, preserve privacy within the historical parking database using two well-known privacy preservation techniques, k-anonymity and differential privacy, which are widely used in the literature. Although they have been applied to preserve privacy in several contexts, for example, the smart grid and the Internet of Things, our novelty lies in applying k-anonymity and differential privacy to preserve the users' privacy in the parking database; to the best of our knowledge, these privacy preservation techniques have not been applied in this context before.

System model
Our system model is comprised of two entities: an internal smart parking system, which is a trustworthy entity, and an external third-party parking recommender system, which is a semitrusted or untrusted entity. The smart parking system is comprised of users, a smart parking application front-end, a service logic, a users' parking database, an anonymized database (for privacy through k-anonymity), and a perturbation mechanism (for privacy through differential privacy). The parking recommender system is a third-party recommender system that uses various metrics (such as parking and traffic information, and sensor quality) to provide recommendations. The system architecture is presented in Figure 1. A user, registered on a smart parking application, makes a request for a nearby parking spot comprised of his user id (eg, registration number) and current location. The smart parking application is a trusted entity that receives requests from users, forwards each user's request to the service logic entity, obtains recommended parking spots from the parking recommender and provides them to the user, and collects ratings from users after they have completed their parking and forwards them to the service logic entity. The service logic entity is also a trusted entity, and it maintains a users' parking database comprised of user id, current location, parking spot, rating, and timestamp attributes. The parking recommender is either an untrusted or a semitrusted entity that receives users' current locations, analyzes their past parking history, and provides parking spot recommendations. To protect the privacy of users, the parking recommender should not be able to uniquely identify the users from the users' parking database. We achieve this by using two well-known privacy preservation techniques: k-anonymity and differential privacy.
[Figure 1: System architecture of the privacy preserving parking system]
[Table 1: An example snapshot of data]
In k-anonymity, the service logic entity generates an anonymized version of the users parking database and releases it to the parking recommender for analysis. In differential privacy, the parking recommender makes numeric queries to the service logic entity, but instead of receiving the actual responses, the service logic entity sends the perturbed responses to the parking recommender with noise added by the Laplace mechanism.

Adversary model
The primary adversary b in our system is an untrusted (or semitrusted) parking recommender system that needs access to the historical parking database to provide personalized and efficient parking spot recommendations. This system is susceptible to a disclosure attack, in which the adversary (ie, the parking recommender) recognizes the behavior and mobility patterns of the users by observing the historical parking database. By analyzing the users' past history in the parking database, the adversary can track their behavior and mobility patterns and uniquely identify them, which could lead to the discovery of the users' private information. For example, from the past parking history, it can be determined when a user is at work, when he returns home, and other personal information, for example, when and which hospitals or clinics he visits. Therefore, the parking database must preserve the privacy of the users such that an adversary is not able to uniquely identify a user. We assume that the adversary is curious but not malicious.

PRIVACY PRESERVATION
In our considered scenario, a user, registered on a smart parking application, makes a request comprised of the user ID (eg, registration id) and the user's current location. On each user request, the smart parking application (trusted entity) performs two functions. First, it forwards the request to the third-party parking recommender system (semitrusted or untrusted entity), obtains the recommended parking spot, sends it back to the user, and collects the rating from the user after the parking is completed. Second, it maintains a parking database that contains the user ID and user's current location (obtained from the user's request), the parking spot (obtained from the recommender system), the user rating (obtained from the user after completing the parking), and the current timestamp (the time of the user's request). A sample database is presented in Table 1. This database needs to be shared with the parking recommender system for personalized recommendations of parking spots based on the user's past experience. For instance, by tracking the user's parking behavior and ratings, it is possible to recommend those parking spots with which the users have had a good experience (eg, frequently used and highly rated). However, the parking recommender system could identify an individual and infer the user's routine and mobility patterns by analyzing the user's location and parking behavior; therefore, the indiscriminate sharing of user data with the parking recommender system violates the privacy of the users. The parking recommender system can easily identify a user uniquely and trace his habits, behaviors, and mobility patterns by analyzing the parking database. For example, as presented in Table 1, even if we remove the user ID (the unique identifier), the recommender system could easily guess the routine of user 1, as he leaves daily in the morning from the same place (his home) between 8:30 a.m. and 9:00 a.m. only on weekdays (for work) and parks in the same area (his workplace), as parking spots 3601, 3602, and 3603 are very close to each other. Hence, the parking recommender system could exploit this routine for malicious activity, for example, to plan a theft at the user's home in his absence. Here, the most important parameter for recommendation services is the timestamp, because the timestamp helps the recommender system learn about the user's behavior and habits so that it can recommend to users the personalized parking spots that they liked and used in the past. The timestamp is also the most critical parameter for privacy leakage, because an adversary uses the timestamp for a disclosure attack. For example, using the timestamp, an adversary tries to find correlations in users' parking patterns and infers the users' routines and habits.
b We use parking recommender system and adversary interchangeably.
Therefore, the user's request and the database contain user private information, and sharing them in their current form with the recommender system seriously violates the privacy of users. Hence, there is a need to preserve the privacy of users. One solution is that the parking application does not share such a historical database and the recommender system recommends parking spots based only on real-time information. However, this would result in a lack of personalized and efficient parking spot recommendations. Another solution is that the application shares the historical database after removing the user ID; however, the parking recommender system can still easily identify an individual by analyzing the other quasi-identifier attributes (eg, user current location, parking spot, and timestamp), as discussed above. Hence, the preferred solution is to apply privacy preservation techniques 27 so that the parking recommender system is not able to identify the private information of individuals. We preserve privacy using two techniques: one uses noninteractive data publishing through k-anonymity, 6 and the other uses interactive data publishing through differential privacy. 7,28 These are discussed in the subsequent sections.

Privacy preservation through anonymization
There are three widely used anonymization techniques for privacy preservation: k-anonymity, ℓ-diversity, and t-closeness. An anonymization technique preserves privacy by anonymizing the data and is applied on microdata. Microdata are raw data that contain the information of the users, comprised of multiple attributes (or columns). 29 The attributes in microdata are categorized into three types: (i) explicit identifiers, which can identify a user uniquely, for example, the user id; (ii) quasi-identifiers, which can identify a user when they are combined together, for example, the user location, parking spot, and timestamp; and (iii) sensitive attributes, which must be protected. We do not have sensitive attributes in our system, but two examples are disease and salary. 30 The first step in anonymization is to remove the explicit identifiers.

k-anonymity
k-anonymity 6 is the earliest work on privacy preservation; it anonymizes a data set in such a way that, with respect to the set of quasi-identifier attributes, each record (or row) is indistinguishable from at least k − 1 other records. It achieves anonymization using generalization and suppression. The main purpose of k-anonymity is to counter linking attacks, in which an adversary tries to uniquely identify a user by linking the quasi-identifier attributes (such as birthdate, zip code, and gender) with external data. k-anonymity is suitable for noninteractive data publishing when there is no sensitive attribute or the distribution of the sensitive attribute is sparse. In this approach, the data publisher (ie, curator) does not want to get involved in answering all the queries and, instead, releases an anonymized data set that will be queried by the recommender systems. k-anonymity is formally discussed in Section 4.1.4, together with the reasoning why we used it instead of the other anonymization techniques (ie, ℓ-diversity and t-closeness).

ℓ-Diversity
k-Anonymity protects against linking attacks (ie, privacy against identifying the records); however, it is susceptible to two other types of attacks: the homogeneity attack and the background knowledge attack. In the homogeneity attack, if all the sensitive attribute values are the same in a group of k records, the value of the sensitive attribute can be identified by an adversary. In the background knowledge attack, an adversary uses background knowledge to identify the individuals. To address this limitation of k-anonymity, Machanavajjhala et al 31 extended k-anonymity by proposing ℓ-diversity, which requires each group of records to have at least ℓ diverse values for the sensitive attribute. ℓ-diversity is also suitable for noninteractive data publishing, when the data publisher wants to release an anonymized data set and does not want to get involved in answering each query. However, unlike k-anonymity, ℓ-diversity is used when each group of records in the anonymized data set should have at least ℓ diverse values for the sensitive attribute. It is formally defined as follows: An equivalence group fulfills ℓ-diversity if it has at least ℓ well-represented values for the sensitive attribute. A data set whose equivalence groups are all ℓ-diverse is said to be an ℓ-diverse data set.
In brief, ℓ-diversity ensures intragroup heterogeneity of the sensitive attribute through at least ℓ different values. If ℓ = k, ℓ-diversity automatically satisfies k-anonymity.
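Although our parking data set has no sensitive attribute, the ℓ-diversity condition can be illustrated with a short sketch (the attribute names and values below are hypothetical, chosen only for illustration):

```python
from collections import defaultdict

def satisfies_l_diversity(records, quasi_identifiers, sensitive, l):
    """Return True if every equivalence group (records sharing the same
    quasi-identifier values) has at least l distinct sensitive values."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[a] for a in quasi_identifiers)
        groups[key].add(r[sensitive])
    return all(len(values) >= l for values in groups.values())

# Toy anonymized records with a hypothetical sensitive attribute
records = [
    {"zip": "390**", "disease": "flu"},
    {"zip": "390**", "disease": "cold"},
    {"zip": "391**", "disease": "flu"},
    {"zip": "391**", "disease": "flu"},
]
# The "391**" group holds only one distinct sensitive value, so l = 2 fails
print(satisfies_l_diversity(records, ["zip"], "disease", 2))  # False
```

The "390**" group is 2-diverse, but the homogeneous "391**" group is exactly the situation a homogeneity attack exploits.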

t-Closeness
Although ℓ-diversity was proposed to solve the limitations of k-anonymity, Li et al 32 proved that ℓ-diversity does not completely counter the homogeneity attack. They used two types of attacks, the skewness attack and the similarity attack, to demonstrate the limitations of ℓ-diversity. In the skewness attack, the anonymized data set has a skewed distribution of the sensitive attribute in the equivalence groups, and ℓ-diversity fails to prevent the attack because the distribution of the sensitive attribute in a group differs from that in the whole data set. In the similarity attack, the anonymized data set has distinct values of the sensitive attribute in the equivalence groups, but they are semantically similar. ℓ-diversity also fails to prevent this attack, because an adversary can estimate the value of a sensitive attribute by linking it to another sensitive attribute. These limitations of ℓ-diversity were overcome by Li et al 32 by proposing t-closeness. t-closeness is also suitable for noninteractive data publishing and is used when the data set that needs to be anonymized has sensitive attributes. It is suitable when the sensitive attribute has a skewed distribution or distinct values in the equivalence groups of the anonymized data set. t-closeness is formally defined as follows: An equivalence group fulfills t-closeness if the distance between the distribution of a sensitive attribute in this group and that in the whole data set is no more than a threshold t. A data set fulfills t-closeness if all its equivalence groups fulfill t-closeness.

Privacy preservation of parking data through k-anonymity
Although the quasi-identifier attributes in our scenario (eg, user current location, parking spot, and timestamp) cannot, by their nature, be used to uniquely identify users by linking to external data, they can be combined together to track a user's behavior and mobility pattern (ie, a disclosure attack). Therefore, we apply k-anonymity to obtain indistinguishability among multiple users, thus preventing an adversary from identifying a user uniquely. We do not use ℓ-diversity and t-closeness because they are used when the distribution of the sensitive attribute is homogeneous or skewed, respectively; since we do not have a sensitive attribute in our parking data set, k-anonymity is the most suitable candidate in our scenario. k-anonymity is formally defined as follows: Let D(A1, … , An) be a data set, where n is the number of attributes, let QI_D be the quasi-identifier attributes associated with this data set, and let D[A1] be the values of attribute A1 in data set D. Then the data set D satisfies k-anonymity if each sequence of values of the quasi-identifier attributes in data set D (ie, in D[QI_D]) appears at least k times. 6 The higher the value of k, the stronger the privacy. However, a trade-off exists between privacy and utility: the stronger the privacy (eg, a higher value of k), the lower the utility. Hence, a balance between privacy and utility is required.
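To make the definition concrete, the following sketch (attribute names and values are our own illustrative assumptions, not the actual data set schema) checks whether a table satisfies k-anonymity by counting how often each quasi-identifier value combination appears:

```python
from collections import Counter

def satisfies_k_anonymity(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k records of the data set."""
    counts = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Toy parking records (hypothetical values)
records = [
    {"lat": 43.39, "lon": -3.88, "spot": 3601, "rating": 5},
    {"lat": 43.39, "lon": -3.88, "spot": 3601, "rating": 4},
    {"lat": 43.39, "lon": -3.88, "spot": 3602, "rating": 3},
]
qia = ["lat", "lon", "spot"]
# The combination with spot 3602 appears only once, so k = 2 fails
print(satisfies_k_anonymity(records, qia, 2))  # False
```

Generalizing the spot ids (eg, 3601 and 3602 both to 36**) would merge the two combinations and restore 2-anonymity, which is exactly what the anonymization algorithm does.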
k-anonymity achieves anonymization using generalization and suppression. 6 In this study, we consider single-dimensional global recoding (ie, mapping a value to the same level of generalization in all the records, for each attribute individually). In the anonymization process, removing the explicit identifiers is the first step; hence, we first remove the user ID and then apply anonymization on the quasi-identifier attributes of user location (ie, latitude and longitude), parking spot, and timestamp while constructing the anonymized data set. The implementation and experimentation details are provided in Section 5.2.
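As an illustration of single-dimensional global recoding, one generalization level per attribute could look as follows (a sketch under our own assumptions about the attribute formats; this is not the exact generalization hierarchy used in the experiments):

```python
def generalize_record(record, coord_digits=2, spot_digits=1):
    """Apply one level of generalization to each quasi-identifier:
    round coordinates, drop trailing digits of the spot id, and
    truncate the timestamp ('YYYY-MM-DD HH:MM') to the hour."""
    return {
        "lat": round(record["lat"], coord_digits),
        "lon": round(record["lon"], coord_digits),
        "spot": record["spot"] // 10 ** spot_digits * 10 ** spot_digits,
        "timestamp": record["timestamp"][:13] + ":00",
    }

r = {"lat": 43.3905, "lon": -3.8896, "spot": 3601, "timestamp": "2019-08-01 08:37"}
print(generalize_record(r))
# {'lat': 43.39, 'lon': -3.89, 'spot': 3600, 'timestamp': '2019-08-01 08:00'}
```

Because the same mapping is applied to every record (global recoding), records that differed only in the truncated digits fall into the same equivalence group.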

Privacy preservation through differential privacy

Dwork 7 coined the term differential privacy, with the definition that the outcome of a differentially private mechanism is not significantly affected by adding or removing a single record in the data set. This mechanism can protect the privacy of users while sharing a database with an untrusted recommender system by perturbing the data. Differential privacy thus overcomes the limitations of k-anonymity, specifically the curse of dimensionality. 33 We adopt the interactive differentially private data publishing approach for numeric queries by adding noise created by the Laplace mechanism, thereby answering each numeric query f as it reaches the smart parking system (ie, the curator) without revealing any individual record. 34 We next define differential privacy and some important notations.
Definition 2 (ε-differential privacy). A randomized process X adheres to ε-differential privacy if it fulfills the following condition for (i) every pair of adjacent data sets D1 and D2 that differ in only one element and (ii) all outputs S ∈ range(X), where range(X) is the range of outputs of the process X. 35 Formally,

Pr[X(D1) ∈ S] ≤ e^ε · Pr[X(D2) ∈ S],    (1)

where X(D1) and X(D2) are the randomized process applied to data sets D1 and D2, and ε is a privacy parameter, known as the privacy budget. The smaller the value of ε, the stronger the privacy.
Definition 3 (Sensitivity). The sensitivity defines the amount of perturbation required. For a query function f(.) on a given data set, the sensitivity Δf is defined over all pairs of adjacent data sets D1 and D2 as:

Δf = max ||f(D1) − f(D2)||1.    (2)

Definition 4 (Laplace mechanism). Differential privacy uses the Laplace mechanism to perturb the results of numeric queries. It adds to the query result Laplace noise sampled from the Laplace distribution centered at 0 with scale b. The Laplace noise is represented by Lap(b); the higher the value of b, the higher the noise. The probability density function (pdf) of the Laplace distribution is given as Lap(x) = (1/2b) e^(−|x|/b). The Laplace mechanism for differential privacy is formally defined as follows: Given a function f : D → R, the randomized process X adheres to ε-differential privacy if:

X(D) = f(D) + Lap(Δf/ε).    (3)

Equation (3) shows that the amount of noise depends on the privacy budget ε and the sensitivity Δf: a lower privacy budget ε and a higher sensitivity Δf generate a higher amount of noise.
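A minimal sketch of the Laplace mechanism for a numeric query, using numpy's Laplace sampler (the query value, sensitivity, and ε below are our own illustrative choices):

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Perturb a numeric query answer with Laplace noise of scale
    b = sensitivity / epsilon, satisfying epsilon-differential privacy."""
    rng = rng if rng is not None else np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A count query has sensitivity 1: adding or removing one record
# changes the count by at most 1.
true_count = 42
noisy = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
print(noisy)  # the true count plus Laplace noise of scale 2.0
```

With ε = 0.5 the noise scale is 2.0; halving ε doubles the scale, trading utility for stronger privacy.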
The parking recommender makes a numeric query f to analyze the parking history, for example, the rating of a selected parking spot for users from a given location, because it may be that users of a certain location did not like certain parking spots for various reasons, for example, being too far away, crowded, or on a narrow or poorly maintained road. A sample query is: f: How many users from a specific location (user current location, eg, 43.3905, −3.8896) gave a rating (eg, 5 stars) for a specific parking spot (eg, 3601) within a specific time window (eg, 2019-08-01 08:00 to 2019-08-02 08:00)?
This type of query is used with different parameters to evaluate differential privacy in the next section.
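The sample counting query, perturbed with the Laplace mechanism, might look as follows; the record schema and field names are illustrative assumptions, not the paper's exact database layout:

```python
import numpy as np

def count_query(db, location, parking_id, rating, t_start, t_end):
    """f: how many users from `location` gave `rating` to `parking_id`
    within [t_start, t_end)?  Field names are assumptions for this sketch."""
    return sum(1 for r in db
               if r["location"] == location
               and r["parking_id"] == parking_id
               and r["rating"] == rating
               and t_start <= r["timestamp"] < t_end)

def dp_count(db, epsilon, rng=None, **query):
    """Perturb the count with Laplace noise; a counting query has sensitivity 1."""
    rng = np.random.default_rng() if rng is None else rng
    return count_query(db, **query) + rng.laplace(scale=1.0 / epsilon)

# Tiny illustrative database (timestamps as hour offsets for brevity).
db = [{"location": (43.3905, -3.8896), "parking_id": 3601, "rating": 5, "timestamp": 8},
      {"location": (43.3905, -3.8896), "parking_id": 3601, "rating": 5, "timestamp": 30},
      {"location": (43.3905, -3.8896), "parking_id": 3601, "rating": 4, "timestamp": 9}]
q = dict(location=(43.3905, -3.8896), parking_id=3601, rating=5, t_start=0, t_end=24)
```

Here the true count is 1 (the second matching record falls outside the window), and `dp_count` returns that count plus Laplace noise scaled by 1/ε.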

Experimental setup
We used a real parking data set from Santander, Spain, comprising the occupancy times of parking spots for the month of December 2017. Real locations within Santander were then used to generate a synthetic parking data set by randomly assigning user locations and ratings to each record of the real parking occupancy data set, in order to evaluate the privacy preservation of the k-anonymity and differential privacy techniques. Hence, although our data set is synthetic, it is generated from a real parking occupancy data set and real locations, and therefore it closely reflects a real data set.
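A minimal sketch of this construction, assuming the real occupancy data set is available as (parking id, timestamp) pairs; the field names and random-assignment details are our assumptions:

```python
import random

def build_synthetic_db(occupancy_records, locations, seed=0):
    """Attach a randomly chosen user location and a random 1-5 star rating
    to every real occupancy record.  In the paper, `locations` would be the
    500 real Santander coordinates; any list of (lat, lon) pairs works here."""
    rng = random.Random(seed)
    db = []
    for user_id, (parking_id, timestamp) in enumerate(occupancy_records):
        lat, lon = rng.choice(locations)
        db.append({"user_id": user_id, "user_lat": lat, "user_lon": lon,
                   "parking_id": parking_id, "timestamp": timestamp,
                   "rating": rng.randint(1, 5)})
    return db

occupancy = [(3601, "2017-12-01 08:15"), (3602, "2017-12-01 09:40")]
locations = [(43.3905, -3.8896), (43.4623, -3.8100)]
synthetic = build_synthetic_db(occupancy, locations)
```

Because the randomness touches only the user-side attributes, the occupancy structure of the real measurements is preserved in the synthetic data set.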

Evaluation of k-anonymity
We used the four attributes (user latitude, user longitude, timestamp, and parking id) of the parking data set presented in Table 2 as quasi-identifier attributes (QIA) and evaluated k-anonymity using different values of k from 2 to 750. We analyzed the performance of k-anonymity by studying different QIA sizes individually to obtain a complete and detailed analysis. In Section 5.2.2, we analyze QIA size = 1 by selecting the user latitude as QIA. In Section 5.2.3, we analyze QIA size = 2 by selecting the user latitude and user longitude as QIA. In Section 5.2.4, we analyze QIA size = 3 by selecting the user latitude, user longitude, and parking id as QIA. In Section 5.2.5, we analyze another case of QIA size = 3 by selecting the user latitude, user longitude, and timestamp as QIA. In Section 5.2.6, we analyze QIA size = 4 by selecting all the attributes (user latitude, user longitude, parking id, and timestamp) as QIA. Finally, in Section 5.2.7, we present all the QIA sizes discussed above together to observe the consolidated effect.
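For intuition, one plausible generalization hierarchy for a single latitude QIA is to progressively drop decimal digits until every generalized value appears at least k times; the actual hierarchies used by the anonymization tool may differ:

```python
from collections import Counter

def generalize_lat(lat, level):
    """Level 0 keeps 4 decimal places; each extra level drops one digit.
    An illustrative hierarchy, not necessarily the one the tool uses."""
    return round(lat, max(4 - level, 0))

def is_k_anonymous(lats, level, k):
    """Check that every generalized value appears at least k times."""
    counts = Counter(generalize_lat(v, level) for v in lats)
    return all(c >= k for c in counts.values())

lats = [43.3905, 43.3907, 43.3911, 43.4623]
```

At level 0 every latitude is unique, so no k ≥ 2 is satisfied; coarsening to whole degrees merges all four records into one indistinguishable group.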

Performance metrics
We evaluate the performance of k-anonymity in terms of privacy and utility using six widely adopted metrics:

1. Average group size is the average size of the anonymized blocks/groups generated by the anonymization technique. It is used to measure the privacy and utility of the anonymization algorithm and has been widely used in the literature. 31,32 If the group sizes are smaller, an adversary/analyst is able to infer more information, which enhances the utility but weakens the privacy, because smaller groups make it easier to uniquely identify the users. Conversely, larger group sizes strengthen the privacy, because identifying the users becomes difficult, but they reduce the utility. So, a smaller average group size is favorable for utility while a larger average group size is favorable for privacy.

2. Total number of groups is the number of groups generated by the anonymization algorithm. A higher number of groups implies smaller group sizes, which enhances the utility but weakens the privacy. Conversely, a lower number of groups makes the privacy stronger but reduces the utility. So, a higher number of groups is favorable for utility while a lower number of groups (or a larger group size) is favorable for privacy.

3. Generalization height is the height of an anonymized database, that is, the number of generalization levels applied. It has been widely used in the literature for measuring the privacy and utility of anonymization techniques. 29,31,36 With a lower generalization height, the values of the records are closer to their actual values, which enhances the utility but weakens the privacy because an adversary/analyst could uniquely identify the users. Conversely, with a higher generalization height, the values of the records are in a much more generalized form, making it difficult for an adversary/analyst to infer useful information from the anonymized data set, which results in stronger privacy but lower utility. So, a lower generalization height is favorable for utility while a higher generalization height is favorable for privacy.

4. Number of suppressed records is the number of records suppressed during privacy preservation because they could not fit into any anonymized block/group (ie, they do not fulfill the requirement of k). Suppressed records enhance the privacy: if they were not suppressed, an adversary could identify the corresponding users because the indistinguishability requirement of k is not fulfilled for them. However, they reduce the utility because suppression shrinks the data set, leaving less useful information to infer. So, a lower number of suppressed records is favorable for utility.

5. Discernibility cost measures the indistinguishability of the records from each other by penalizing each record based on how distinguishable it is. Each unsuppressed record in a group of size j incurs a cost j, while each suppressed record incurs a cost |D|, that is, the size of the original data set D. This metric is used to measure the utility and privacy of an anonymization algorithm and has been widely adopted in the literature. 31,32,37 A lower discernibility cost is favorable for utility while a higher discernibility cost is favorable for privacy.

6. Execution time is the time required to generate the anonymized database from the original database. 36 A lower execution time is favorable.
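Metrics 1, 2, 4, and 5 can be computed directly from the sizes of the anonymized groups; a minimal sketch (our own helper, not the paper's code):

```python
def anonymity_metrics(group_sizes, n_suppressed, db_size):
    """Each unsuppressed record in a group of size j costs j, so a group
    contributes j*j to the discernibility cost; each suppressed record
    costs |D| (the original data set size)."""
    return {
        "avg_group_size": sum(group_sizes) / len(group_sizes) if group_sizes else 0.0,
        "num_groups": len(group_sizes),
        "num_suppressed": n_suppressed,
        "discernibility_cost": sum(j * j for j in group_sizes)
                               + n_suppressed * db_size,
    }

m = anonymity_metrics(group_sizes=[2, 3], n_suppressed=1, db_size=6)
```

For the toy partition above, the discernibility cost is 2·2 + 3·3 + 1·6 = 19, showing how both larger groups and suppression inflate the cost.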

Analysis of one quasi-identifier attribute
In this section, we analyze the performance of k-anonymity when one attribute is selected as QIA, that is, QIA = 1 (user latitude). Figure 2A presents the average group size generated by the anonymization algorithm from k = 2 to k = 750. For k = 2 and k = 10, the average group size is around 30. This is because no anonymization is required: there is a total of 500 locations (ie, num_loc = 500) randomly assigned to the parking data set D of size |D| = 15 306, so each user latitude appears around 30 times on average (ie, avg_appearance = |D|/num_loc ≈ 30), which already fulfills the requirement of k = 2 and k = 10 by default. When k = 25, the average group size is around 200. Although it may seem that no anonymization of the user latitude should be needed at k = 25 either, because the expected repetition of each user latitude is 30 (ie, k = 25 < avg_appearance = 30), that value is only an average, and since the user locations are assigned randomly to the parking data set, the actual repetitions of each user latitude vary from 17 to 71. Hence, anonymization must be performed at k = 25, and it produces an average group size of around 200 records. From k = 50 to k = 150, the average group size is around 500, and from k = 200 to k = 750 it is around 1000. This result shows that the average group size increases with increasing k: the higher the value of k, the larger the group size, and hence the lower the utility.

Figure 2B presents the total number of groups generated by the anonymization algorithm from k = 2 to k = 750. For k = 2 and k = 10, the total number of groups is very high, around 500, because no anonymization is required (as shown in Figure 2A). The total number of groups keeps decreasing from k = 25 to k = 750 because, for each increasing value of k, the anonymization algorithm must maintain indistinguishable groups of records that fulfill the requirement of k, which produces larger groups and hence fewer of them. The higher the value of k, the lower the total number of groups, and hence the lower the utility.

Figure 2C presents the generalization height applied by the anonymization algorithm from k = 2 to k = 750. For k = 2 and k = 10, the generalization height is zero because no anonymization is required (as discussed above). The generalization height for k = 25 is 5 because the anonymization algorithm is able to generate the anonymized parking database at this height. The generalization height from k = 50 to k = 150 is the same, 6: as analyzed in Figure 2A, the average group sizes are the same over this range, so they are achieved by applying the same generalization height. Similarly, the generalization height from k = 200 to k = 750 is also the same, 7. The higher the value of k, the greater the generalization height, and hence the lower the utility.

Figure 2D presents the number of records suppressed by the anonymization algorithm to generate an anonymized parking database from k = 2 to k = 750. Records are suppressed only at k = 100, 150 and at k = 650, 700, 750. To maximize the utility, the anonymization algorithm applies as low a generalization height as possible: before applying a new generalization level, it first checks the number of records that are not k-anonymous (ie, N_non_anon). If N_non_anon > k, it applies another level of generalization; otherwise, if N_non_anon < k, it suppresses these N_non_anon records to maximize the utility. This is why the non-k-anonymous records at k = 100, 150 and k = 650, 700, 750 are suppressed. This result shows that, on the one hand, suppressed records reduce the utility by shrinking the data set; on the other hand, they actually enhance the utility by avoiding another level of generalization, because generalization affects the whole data set and may reduce the utility drastically by making all records more generalized and harder to analyze, compared with the suppression of a small number of records (ie, fewer than k).

Figure 2E presents the discernibility cost from k = 2 to k = 750. For k = 2 and k = 10, the discernibility costs are very low because no anonymization is required (as discussed in the description of Figure 2A). At k = 25, the discernibility cost is around 5 × 10^6. An important point to note is that, from k = 50 to k = 150, the discernibility cost for k = 100, 150 is higher than that for k = 50, 75, even though they all apply the same generalization height and have the same average group size and total number of groups; they differ in the number of suppressed tuples (as shown in Figure 2D), which explains the higher discernibility cost at k = 100, 150. A similar explanation applies to the higher discernibility cost at k = 650, 700, 750 compared with k = 200-600. This is a very significant utility metric: the discernibility cost increases with increasing k because of larger group sizes and more suppressed records. The higher the value of k, the higher the discernibility cost, and hence the lower the utility.

Finally, Figure 2F presents the execution time from k = 2 to k = 750. The execution time for k = 2 and k = 10 is negligible because no anonymization is required (as discussed above). From k = 25 to k = 750, the execution time is almost constant because most of it is consumed in generalizing the records, and there is not much difference in the generalization heights over this range (as presented in Figure 2C).
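The generalize-or-suppress rule discussed for Figure 2D can be sketched as follows. This is a simplified single-attribute version (the real algorithm works over full QIA hierarchies), and the handling of the boundary case N_non_anon = k is our assumption:

```python
from collections import Counter

def generalize_or_suppress(values, hierarchy, k):
    """At each generalization level, records whose generalized value appears
    fewer than k times are not k-anonymous.  If fewer than k such records
    remain, suppress them; otherwise generalize one level further.
    `hierarchy` is a list of functions, one per level (level 0 = identity)."""
    for level in range(len(hierarchy)):
        gvals = [hierarchy[level](v) for v in values]
        counts = Counter(gvals)
        non_anon = [v for v in gvals if counts[v] < k]
        if len(non_anon) < k:
            kept = [v for v in gvals if counts[v] >= k]
            return kept, len(non_anon), level
    # No level satisfies k: drop everything (as happens for the largest
    # values of k in the timestamp experiments).
    return [], len(values), len(hierarchy)

levels = [lambda v: v, lambda v: round(v, 1)]
values = [1.11, 1.12, 1.13, 2.51]
kept, n_suppressed, level = generalize_or_suppress(values, levels, k=3)
```

At level 0 all four values are unique (4 ≥ k non-anonymous records, so the algorithm generalizes); at level 1 three records share the value 1.1 and only one record (2.5) is non-anonymous, so it is suppressed rather than forcing another generalization level.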

Analysis of two quasi-identifier attributes
In this section, we analyze the performance of k-anonymity when two attributes are selected as QIA, that is, QIA = 2 (user latitude, user longitude). Figure 3A presents the average group size generated by the anonymization algorithm from k = 2 to k = 750 when QIA size = 2 (user latitude, user longitude). For k = 2 and k = 10, the average group size is around 30. Since a location is a combination of latitude and longitude, the reasoning given for Figure 2A in Section 5.2.2 applies here as well: no anonymization is required because the original parking data set already fulfills the requirement of k = 2 and k = 10 by default; in other words, the average appearance of each location (avg_appearance = 30) is greater than k = 2 and k = 10. As k increases from 25 to 150, the average group size keeps increasing to fulfill the indistinguishability requirement of k. From k = 200 to k = 600, however, the average group size is similar, around 1000, because the minimum group size at k = 200 is 600; therefore, a higher level of anonymization (or generalization) is only required above k = 600. Finally, the average group sizes from k = 650 to k = 750 are much higher, around 1700-1800. The higher the value of k, the larger the group size, and hence the lower the utility.

Figure 3B presents the total number of groups generated by the anonymization algorithm from k = 2 to k = 750 when QIA size = 2 (user latitude, user longitude). For k = 2 and k = 10, the total number of groups is very high, around 500, because no anonymization is required (as discussed before). The total number of groups keeps decreasing from k = 25 to k = 750 because, for each increasing value of k, the anonymization algorithm must maintain indistinguishable groups of records that fulfill the requirement of k, which produces larger groups and hence fewer of them. This result closely resembles the total number of groups when QIA size = 1 (user latitude) in Figure 2B because a location comprises a latitude-longitude pair: whether we use one part of the location (eg, latitude) or both parts (eg, latitude and longitude), we observe a similar trend. The higher the value of k, the lower the total number of groups, and hence the lower the utility.

Figure 3C presents the generalization height applied by the anonymization algorithm from k = 2 to k = 750 when QIA size = 2 (user latitude, user longitude). For k = 2 and k = 10, the generalization height is zero because no anonymization is required (as discussed above). The generalization height for k = 25 is around 17, the height at which the anonymization algorithm is able to generate the anonymized parking database when anonymizing the two QIA of user latitude and user longitude. The generalization height keeps increasing from k = 50 to k = 150 but remains the same from k = 200 to k = 600. The reason is the same as explained for the average group size, that is, the minimum group size at k = 200 is 600, hence

FIGURE 3 Performance evaluation of k-anonymity when QIA = 2 (user latitude, user longitude)

no more generalization (or anonymization) is required until k = 600. Finally, the generalization height from k = 650 to k = 750 is around 28 and is the highest. This result shows that the generalization height increases with increasing k: the higher the value of k, the greater the generalization height, and hence the lower the utility.

Figure 3D presents the number of records suppressed by the anonymization algorithm to generate an anonymized parking database from k = 2 to k = 750 when QIA size = 2 (user latitude, user longitude). Records are suppressed at k = 25, 50, 100, 150, 600, 700, 750. To maximize the utility, the anonymization algorithm applies as low a generalization height as possible: before applying a new generalization level, it first checks the number of records that are not k-anonymous (ie, N_non_anon). If N_non_anon > k, it applies another level of generalization; otherwise, if N_non_anon < k, it suppresses these N_non_anon records to maximize the utility. This is why the non-k-anonymous records at k = 25, 50, 100, 150, 600, 700, 750 are suppressed. As before, suppression reduces the utility by shrinking the data set but also enhances it by avoiding another level of generalization, which would affect the whole data set and could reduce the utility drastically by making all records more generalized and harder to analyze, compared with the suppression of a small number of records (ie, fewer than k).

Figure 3E presents the discernibility cost from k = 2 to k = 750 when QIA size = 2 (user latitude, user longitude). For k = 2 and k = 10, the discernibility cost is very low and almost negligible because no anonymization is required (as discussed before).
The discernibility cost keeps increasing from k = 25 to k = 150 because of the varying group sizes. From k = 200 to k = 550, the discernibility cost is the same because the group sizes are similar. Although the average group size at k = 600 is also the same (Figure 3A), it incurs a higher discernibility cost because of record suppression. Finally, the discernibility cost increases at k = 650 and stays constant from k = 700 to k = 750. This result shows that the discernibility cost increases with increasing k because of larger group sizes and more suppressed records; when the group sizes and numbers of suppressed records are similar, the discernibility cost is also similar (eg, from k = 200 to k = 550). The higher the value of k, the higher the discernibility cost, and hence the lower the utility.

Finally, Figure 3F presents the execution time from k = 2 to k = 750 when QIA size = 2 (user latitude, user longitude). The execution time for k = 2 and k = 10 is negligible because no anonymization is required (as discussed above). From k = 25 to k = 750, the execution time is almost constant because most of it is consumed in generalizing the records, and there is not much difference in the generalization heights over this range (as presented in Figure 3C).

Analysis of three quasi-identifier attributes (case 1)
We have two cases when three attributes are selected as QIA. In both cases, the first two QIA are user latitude and user longitude. In the first case, the parking id is selected as the third QIA, while in the second case, the timestamp is selected as the third QIA. In this section, we consider the first case and analyze the performance of k-anonymity when QIA = 3 (user latitude, user longitude, parking id).

Figure 4A presents the average group size generated by the anonymization algorithm from k = 2 to k = 750 when QIA size = 3 (user latitude, user longitude, parking id). The average group sizes from k = 2 to k = 50 are very small compared with the others because, due to the repeated locations (user latitude and longitude) and parking spots, the anonymization algorithm is able to form smaller groups. From k = 75 to k = 750, the average group size keeps increasing: since the anonymization algorithm has to make at least k records indistinguishable, it ends up making bigger groups. However, at k = 150, 250, 300, 600, the average group sizes are smaller than those of their neighbors, because at these values of k the algorithm is able to enhance the utility by applying a slightly lower generalization height (discussed next in Figure 4C) at the cost of suppressing records (discussed next in Figure 4D). Overall, the higher the value of k, the larger the group size, and hence the lower the utility.

Figure 4B presents the total number of groups generated by the anonymization algorithm from k = 2 to k = 750 when QIA size = 3 (user latitude, user longitude, parking id). The total number of groups is very high for k = 2, 10, 25, 50, at around 400, 150, 60, and 60, respectively. At k = 75, the total number of groups drops drastically to seven, and from k = 75 to k = 750 it stays very low, between 3 and 10. The pattern is straightforward: for lower values of k, the algorithm only has to ensure low indistinguishability of records, so it can form smaller groups, resulting in a higher number of groups; as k grows, it has to ensure high indistinguishability, resulting in larger groups and fewer of them. The higher the value of k, the lower the total number of groups, and hence the lower the utility.

Figure 4C presents the generalization height applied by the anonymization algorithm from k = 2 to k = 750 when QIA size = 3 (user latitude, user longitude, parking id). The generalization heights from k = 2 to k = 50 are much lower than the others because, due to the repeated locations (user latitude and longitude) and parking spots, the algorithm achieves the required indistinguishability of records at lower generalization heights. From k = 75 to k = 750, the generalization heights keep increasing because fulfilling the indistinguishability requirement for higher values of k demands higher generalization heights. However, at k = 150, 250, 300, 600, the generalization heights are slightly lower than those of their neighbors: at these values of k, the algorithm is able to enhance the utility by applying a slightly lower generalization height at the cost of suppressing the fewer-than-k nonanonymized records (ie, N_non_anon < k, discussed next in Figure 4D). Overall, the higher the value of k, the greater the generalization height, and hence the lower the utility.

Figure 4D presents the number of records suppressed by the anonymization algorithm to generate an anonymized parking database from k = 2 to k = 750 when QIA size = 3 (user latitude, user longitude, parking id). Records are suppressed to maximize the utility by applying as low a generalization height as possible. The number of suppressed records is highest at k = 600, around 200 more than its neighbors (eg, k = 400-750), because there the algorithm is able to apply one generalization level less than its neighbors, thereby enhancing the utility.

Figure 4E presents the discernibility cost from k = 2 to k = 750 when QIA size = 3 (user latitude, user longitude, parking id). The discernibility costs from k = 2 to k = 50 are comparatively low because the algorithm is able to form clusters of smaller groups due to the repeated locations (user latitude and longitude) and parking spots. From k = 75 to k = 750, the discernibility cost keeps increasing because the algorithm ends up making bigger groups and suppressing more records. However, at k = 150, 250, 300, 600, the discernibility costs are comparatively lower than in their neighborhood because at these values of k the algorithm generates smaller groups by applying a slightly lower generalization height, and smaller group sizes result in a lower discernibility cost. Overall, the higher the value of k, the higher the discernibility cost, and hence the lower the utility.

Finally, Figure 4F presents the execution time from k = 2 to k = 750 when QIA size = 3 (user latitude, user longitude, parking id). For all values of k, the execution time is almost the same because most of it is consumed in generalizing the records, and there is not much difference in the generalization heights.

Analysis of three quasi-identifier attributes (case 2)
In this section, we consider the second case and analyze the performance of k-anonymity when QIA = 3 (user latitude, user longitude, timestamp).

Figure 5A presents the average group size for the second case generated by the anonymization algorithm from k = 2 to k = 750 when QIA size = 3 (user latitude, user longitude, timestamp). For k = 2 and k = 10, the average group size is similar, around 90, because the algorithm is able to form small groups to fulfill the indistinguishability requirement for low values of k. The average group size keeps increasing with k because, for higher values of k, the algorithm has to maintain groups of size at least k. Note also that from k = 450 to k = 750 the average group size is zero: at this point (ie, when k ≥ 450), the anonymization algorithm is unable to produce an anonymized data set from the original data set. This result shows that the average group size increases with increasing k and that anonymization is not possible for k ≥ 450. The higher the value of k, the larger the group size, and hence the lower the utility.

Figure 5B presents the total number of groups generated by the anonymization algorithm from k = 2 to k = 750 when QIA size = 3 (user latitude, user longitude, timestamp). The total number of groups is very high for k = 2 and k = 10, around 160, because the algorithm only has to maintain very small groups of indistinguishable records (of sizes 2 and 10), so it creates many small groups. From k = 25 to k = 400, the total number of groups drops drastically because, as k grows, the algorithm has to ensure higher indistinguishability of records, resulting in larger groups and fewer of them. Finally, from k = 450 to k = 750, the total number of groups is zero because the algorithm is unable to produce an anonymized parking data set that satisfies the requirement of k = 450 to k = 750. The higher the value of k, the lower the total number of groups, and hence the lower the utility.

Figure 5C presents the generalization height applied by the anonymization algorithm from k = 2 to k = 750 when QIA size = 3 (user latitude, user longitude, timestamp). The generalization height for k = 2 to k = 50 is almost the same, around 45, because the algorithm is able to produce an anonymized parking data set at this height. From k = 75 to k = 450, the generalization height is around 60 and is constant because this is the highest generalization height possible; at this point, the algorithm has to suppress records (presented next in Figure 5D) because no higher generalization is available.

Figure 5D presents the number of records suppressed by the anonymization algorithm to generate an anonymized parking data set from k = 2 to k = 750 when QIA size = 3 (user latitude, user longitude, timestamp). Records are suppressed to maximize the utility by applying as low a generalization height as possible. The number of suppressed records from k = 2 to k = 50 is the same because the algorithm is able to produce the anonymized parking data set at the same generalization height. From k = 75 onwards, however, the number of suppressed records keeps increasing because the algorithm has already applied the maximum possible generalization (as discussed for the previous figure), so suppressing records is the only way left to achieve k-anonymity. Note that from k = 450 to k = 750 the number of suppressed records is 15 306, equal to the size of our original data set: the algorithm could not find an anonymization fulfilling the requirement of k = 450 to k = 750, so it dropped all the records.

Figure 5E presents the discernibility cost from k = 2 to k = 750 when QIA size = 3 (user latitude, user longitude, timestamp). The discernibility cost depends on the number of groups, the group sizes, and the number of suppressed records. For k = 2 to k = 50, the discernibility cost is very low and constant (resulting in a higher utility) because of the same number of suppressed records and a similar ratio of the number of groups to the group sizes. The discernibility cost keeps increasing from k = 75 to k = 450 because, as described for the number of suppressed records, the algorithm has already applied the maximum available generalization and therefore suppresses records (the only possible solution), incurring a higher discernibility cost (and lower utility). The discernibility cost from k = 450 to k = 750 is the highest possible (and the utility the worst) because the algorithm is unable to produce an anonymized data set and suppresses all the records. Overall, the higher the value of k, the higher the discernibility cost, and hence the lower the utility.

Finally, Figure 5F presents the execution time from k = 2 to k = 750 when QIA size = 3 (user latitude, user longitude, timestamp). For all values of k, the execution time is almost the same because most of it is consumed in generalizing the records, and there is not much difference in the generalization heights.

Analysis of four quasi-identifier attributes
In this section, we analyze the performance of k-anonymity when all the four attributes are selected as QIA, that is, QIA = 4 (user latitude, user longitude, timestamp, and parking id). Figure 6A presents the average groups size generated by the anonymization algorithm from k = 2 to k = 750 when QIA size = 4 (user latitude, user longitude, timestamp, parking id). The average groups size in this figure is very similar to the average groups size in Figure 5A (the second case of QIA=3). This is because the most heterogeneous attribute is timestamp having 6242 distinct values, while the parking id attribute is not much heterogeneous as it has 265 distinct values that is much less diverse than the timestampattribute. This is why, we learned in this result that if timestamp and parking id attributes are both selected as QIA, then parking id attribute does not have much significance. In other words, we can say that when we select timestamp as QIA in anonymization, it also covers parking id attribute by default. To summarize the result in Figure 6B, the average groups size is very small and same at k = 2 and k = 10; however, it keeps increasing from k = 25 to k = 400 because of fulfilling the requirement of higher indistinguishability of records to satisfy higher values of k. For k ≥ 400, the average groups size is zero because no anonymization exists at these points. Overall, it shows that the average groups size increases with the increasing values of k and the anonymization is not possible beyond k ≥ 450. Higher is the value of k, higher is the group size, and hence lower is the utility. Figure 6B presents the total number of groups generated by the anonymization algorithm from k = 2 to k = 750 when QIA size = 4 (user latitude, user longitude, timestamp, parking id). The total number of groups in this figure is very similar to the total number of groups in Figure 5B (the second case of QIA=3). 
The reason is the same as described above, that is, the timestamp attribute is much more diverse than the parking id attribute and hence already covers it in anonymization. To summarize Figure 6B, the total number of groups is very high at k = 2 and k = 10 because of the lower record-indistinguishability requirement. The total number of groups then keeps decreasing from k = 25 to k = 400 in order to fulfill the higher indistinguishability requirement. From k = 450 to k = 750, there are no groups because anonymization is not possible. Overall, the total number of groups decreases with increasing values of k. The higher the value of k, the smaller the total number of groups, and hence the lower the utility. Figure 6C presents the generalization height applied by the anonymization algorithm from k = 2 to k = 750 when QIA size = 4 (user latitude, user longitude, timestamp, parking id). The trend in this figure is similar to that in Figure 5C (the second case of QIA = 3), but the values differ. The trend is similar for the reason discussed above, that is, the timestamp attribute is much more diverse than the parking id attribute and hence already covers it in anonymization. The generalization heights differ because the parking id attribute still has to be generalized to make the records indistinguishable; otherwise, it would not be possible to generate an anonymized data set. To summarize Figure 6C, the generalization height from k = 2 to k = 50 is almost the same because the anonymization algorithm is able to produce an anonymized parking data set at this generalization height. However, from k = 75 to k = 450, the generalization height is higher and constant because it is the highest generalization height possible.
At this point, the anonymization algorithm has to suppress records (presented next in Figure 6D) because no higher generalization is available. Figure 6D presents the number of records suppressed by the anonymization algorithm to generate an anonymized parking data set from k = 2 to k = 750 when QIA size = 4 (user latitude, user longitude, timestamp, parking id). The total number of suppressed records in this figure is very similar to that in Figure 5D (the second case of QIA = 3). The reason is the same as described above, that is, the timestamp attribute is much more diverse than the parking id attribute and hence already covers it in anonymization. To summarize, the number of suppressed records from k = 2 to k = 50 is the same because the anonymization algorithm is able to produce an anonymized parking data set at the same generalization height. However, the number of suppressed records keeps increasing from k = 75 because, at this point, the anonymization algorithm has already applied the maximum possible generalizations, and suppressing records is the only remaining way to achieve k-anonymity. From k = 450 to k = 750, the number of suppressed records is 15 306, which equals the size of our original data set: the anonymization algorithm could not find an anonymization fulfilling the requirement of k = 450 to k = 750 and therefore dropped all the records. Figure 6E presents the discernibility cost from k = 2 to k = 750 when QIA size = 4 (user latitude, user longitude, timestamp, parking id). Similar to previous results, the discernibility cost in this figure is very similar to that in Figure 5E (the second case of QIA = 3). The same explanation applies here as well.
To summarize, for k = 2 to k = 50, the discernibility cost is very low and constant (resulting in higher utility) because of the similar number of suppressed records and the similar ratio of the number of groups to group sizes. However, the discernibility cost keeps increasing from k = 75 to k = 450 because of the phenomenon described for the suppressed records, that is, the algorithm has already applied the maximum available generalizations and must suppress records (the only remaining option), thereby incurring a higher discernibility cost (and lower utility). The higher the value of k, the higher the discernibility cost, and hence the lower the utility.
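The discernibility metric used above can be stated compactly: each record in an equivalence class of size |E| is charged |E| (so a class contributes |E|²), and each suppressed record is charged the full table size |D|. A minimal sketch under that standard formulation (the numbers are illustrative, not our measured values):

```python
def discernibility_cost(group_sizes, suppressed, total):
    """Discernibility metric: every record in an equivalence class of
    size s costs s (i.e., s^2 per class), and every suppressed record
    costs the full table size |D| -- the worst indistinguishability."""
    return sum(s * s for s in group_sizes) + suppressed * total

# Illustrative example: three groups of 5 records each, plus 10
# suppressed records, in a 25-record table.
print(discernibility_cost([5, 5, 5], 10, 25))  # -> 325 (75 + 250)
```

Because suppressed records pay the maximum penalty |D| each, the cost jumps sharply once the algorithm runs out of generalizations and starts suppressing, exactly the behavior seen from k = 75 onward.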
Finally, Figure 6F presents the execution time from k = 2 to k = 750 when QIA size = 4 (user latitude, user longitude, timestamp, parking id). For all values of k, the execution time is almost identical, at around 2000 seconds, because most of it is spent generalizing the records; since the generalization heights differ little, the execution times are similar.

Consolidated analysis
In this section, we present the consolidated results of all the previous analyses of k-anonymity with varying QIA sizes (ie, QIA = 1, 2, 3 [case 1 and case 2], and 4) in order to provide a complete and consolidated view. Figure 7A presents the average group size generated by the anonymization algorithm from k = 2 to k = 750 for all the QIA sizes presented before. The result shows that the average group size increases with higher values of k. However, the first case of QIA = 3 (user latitude, user longitude, parking id) behaves surprisingly: it produces much larger average group sizes than the other QIA sizes (ie, QIA = 1, 2, 3 [case 2], and 4). This is because of nonsuppression of records, that is, QIA = 3 (case 1) does not suppress records and hence yields larger group sizes, while QIA = 3 (case 2) and QIA = 4 suppress records, resulting in smaller group sizes. The main insight is that group sizes increase with increasing values of k, as well as with increasing QIA sizes, up to the point where no suppression is made. The higher the value of k, the larger the group size, and hence the lower the utility. Figure 7B presents the total number of groups generated by the anonymization algorithm from k = 2 to k = 750 for all the QIA sizes presented before. There are two insights gained from this result. First, the total number of groups decreases with increasing values of k. Second, it also decreases with increasing QIA sizes. The higher the value of k and the QIA size, the smaller the number of groups, and hence the lower the utility. Figure 7C presents the generalization heights applied by the anonymization algorithm from k = 2 to k = 750 for all the QIA sizes presented before. There are three insights gained from this result. First, the generalization height increases with higher values of k.
Second, the generalization height also increases with higher QIA sizes. Third, the generalization height increases until k = 75; after k = 75, it stays the same because no higher generalization is available. The higher the value of k and the QIA size, the higher the generalization height, and hence the lower the utility. Figure 7D presents the number of records suppressed by the anonymization algorithm from k = 2 to k = 750 for all the QIA sizes presented before. There are four insights gained from this result. First, the number of suppressed records increases with higher values of k. Second, it also increases with higher QIA sizes. Third, the number of suppressed records for QIA = 3 (case 2) and QIA = 4 increases until k = 400; from k ≥ 450, it equals the size of the original data set, that is, no anonymization is made. Fourth, the number of suppressed records for QIA = 3 (case 2) and QIA = 4 is the same, meaning that they exhibit the same behavior. The higher the value of k and the QIA size, the larger the number of suppressed records, and hence the lower the utility. Figure 7E presents the discernibility cost incurred by the anonymization algorithm from k = 2 to k = 750 for all the QIA sizes presented before. There are four insights gained from this result. First, the discernibility cost increases with higher values of k. Second, it also increases with higher QIA sizes. Third, the discernibility cost for QIA = 3 (case 2) and QIA = 4 increases until k = 400; from k ≥ 450, the discernibility cost is equal and is the highest possible because all the records in the data set are suppressed. Fourth, the discernibility cost for QIA = 3 (case 2) and QIA = 4 is the same, and they exhibit the same behavior. The higher the value of k and the QIA size, the higher the discernibility cost, and hence the lower the utility.
Finally, Figure 7F presents the execution time of the anonymization algorithm from k = 2 to k = 750 for all the QIA sizes presented before. There are two insights gained from this result. First, the execution time increases with larger QIA sizes. Second, for each QIA size, the execution time is almost constant across all values of k. This means that the execution time depends mainly on the size of the QIA or, in other words, on the level of generalization that needs to be applied to construct an anonymized parking data set.

Evaluation of differential privacy
We evaluated differential privacy for the numeric query type, as discussed in Section 4.2, by generating 1000 random queries. For each query, the user's current location, timestamp, parking spot, and rating are randomly selected from the parking data set. Subsequently, a time range is randomly selected between 1 and 30 days, and for each location, we consider nearby locations within a 5 km radius. We evaluate differential privacy using different values of the privacy budget from ε = 0.1 to ε = 1.0. The sensitivity Δf defines the number of records affected by the addition or removal of a user. In our parking data set, a user may appear multiple times, but we do not know the exact number of appearances; therefore, we analyze the effect of sensitivity values Δf = 1 to 5, that is, the addition or removal of a user in the data set affects one to five records, respectively.
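The perturbation step can be sketched as follows: a numeric query answer is released with Laplace noise of scale Δf/ε, which is the standard ε-differentially-private Laplace mechanism. This is a minimal illustration, not the exact experimental implementation; the Laplace sample is drawn via the inverse-CDF transform:

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) via the inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, sensitivity, epsilon, rng=random):
    """Release a numeric query result with epsilon-differential privacy
    by adding Laplace noise of scale sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# e.g. a count query over the parking database returning 120 matching
# records, perturbed with sensitivity 1 and privacy budget 0.1
random.seed(42)
print(private_count(120, 1, 0.1))
```

A smaller ε (or larger Δf) increases the noise scale, which is the privacy/utility trade-off evaluated below.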

Performance metrics
We evaluate the accuracy and privacy of differential privacy using two widely adopted metrics: • Mean absolute error (MAE) measures the average magnitude of errors. It is the average, over the number of queries, of the absolute differences between the actual query result and the noisy result, and it reflects both privacy and utility. If the MAE is high, the difference between the actual and noisy query results is large, which strengthens privacy but reduces utility; if the MAE is low, the difference is small, which improves utility but weakens privacy. MAE has been widely adopted in the literature for evaluating differential privacy. [38][39][40] It is defined as:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|r_{a,i} - r_{n,i}\right|,$$

where N is the total number of queries, r_{a,i} is the actual response of query i, and r_{n,i} is the noisy response of query i.
• Root mean square error (RMSE) is a quadratic scoring function that also measures the average magnitude of errors. It is the square root of the average of the squared differences between the actual query result and the noisy result, and it likewise reflects both privacy and utility. Similar to MAE, if the RMSE is high, the difference between the actual and noisy query results is large, which strengthens privacy but reduces utility; if the RMSE is low, the difference is small, which improves utility but weakens privacy. It has been widely adopted in the literature for evaluating differential privacy. [38][39][40][41] It is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(r_{a,i} - r_{n,i}\right)^{2}},$$

where N is the total number of queries, r_{a,i} is the actual response of query i, and r_{n,i} is the noisy response of query i.
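The two definitions above translate directly into code. A minimal sketch with three hypothetical query responses (the values are illustrative only):

```python
import math

def mae(actual, noisy):
    """Mean absolute error between actual and noisy query responses."""
    return sum(abs(a - n) for a, n in zip(actual, noisy)) / len(actual)

def rmse(actual, noisy):
    """Root mean square error; penalizes large errors more than MAE."""
    return math.sqrt(sum((a - n) ** 2 for a, n in zip(actual, noisy)) / len(actual))

# Three hypothetical actual query responses and their noisy counterparts.
actual = [10.0, 20.0, 30.0]
noisy = [12.0, 18.0, 33.0]
print(round(mae(actual, noisy), 3))   # -> 2.333
print(round(rmse(actual, noisy), 3))  # -> 2.38
```

Because RMSE squares each difference before averaging, it is always at least as large as MAE and reacts more strongly to occasional large noise values.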

Analysis of individual sensitivities
In this section, we analyze the performance of differential privacy in terms of accuracy and privacy using MAE and RMSE for privacy budget ε = 0.1 to ε = 1.0, analyzing each sensitivity Δf individually. Figure 8 presents the MAE and RMSE for privacy budget ε = 0.1 to ε = 1.0 when sensitivity Δf = 1 (ie, the addition or removal of a user affects one record in the parking data set). It shows that when ε = 0.1, the MAE and RMSE are very high, that is, 10 and 14, respectively, because ε = 0.1 guarantees the strongest privacy, however at the cost of the worst utility. As ε keeps increasing, the MAE and RMSE drop drastically, and at ε = 1.0, both are close to zero. This means we have the highest utility at this point, but little privacy, because the noisy results are very similar to the actual results. Overall, the MAE and RMSE decrease with increasing values of the privacy budget ε. The higher the privacy budget ε, the lower the privacy but the higher the utility; inversely, the lower the privacy budget ε, the stronger the privacy but the lower the utility. Figure 9 presents the MAE and RMSE for privacy budget ε = 0.1 to ε = 1.0 when sensitivity Δf = 2 (ie, the addition or removal of a user affects two records in the parking data set). The trend is very similar to Figure 8 for Δf = 1; however, at ε = 0.1, the MAE and RMSE are almost double. This is because, when Δf = 2, the addition or removal of a user affects two records; therefore, the MAE and RMSE are twice (ie, 2×) those at Δf = 1. However, the MAE and RMSE at Δf = 2 keep decreasing with increasing values of the privacy budget ε, and at ε = 1.0, they are almost the same as those at Δf = 1.
Similarly, Figures 10, 11, and 12 present the MAE and RMSE for privacy budget ε = 0.1 to ε = 1.0 when sensitivity Δf = 3, Δf = 4, and Δf = 5 (ie, the addition or removal of a user affects three, four, and five records in the parking data set), respectively. The trends are similar to those discussed before. Initially, at ε = 0.1, the MAE and RMSE are three, four, and five times (3×, 4×, and 5×) those at Δf = 1 for Δf = 3, 4, and 5, respectively. However, as ε increases toward 1.0, the MAE and RMSE approach zero. Overall, the MAE and RMSE are very high at low values of the privacy budget ε and decrease as ε increases.
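These multiplicative trends follow directly from the Laplace distribution: for noise drawn from Laplace(0, b) with b = Δf/ε, the expected absolute error is b and the root mean square error is b·√2, so both scale linearly with Δf and inversely with ε. A quick analytical check against the orders of magnitude reported for Figure 8 (this is a closed-form sketch, not the experimental code):

```python
import math

def expected_mae(delta_f, eps):
    """E|X| for X ~ Laplace(0, b) equals the scale b = delta_f / eps."""
    return delta_f / eps

def expected_rmse(delta_f, eps):
    """sqrt(E[X^2]) for Laplace noise is b * sqrt(2)."""
    return (delta_f / eps) * math.sqrt(2)

print(expected_mae(1, 0.1))             # -> 10.0, matching MAE ~ 10 in Figure 8
print(round(expected_rmse(1, 0.1), 2))  # -> 14.14, matching RMSE ~ 14
print(expected_mae(2, 0.1))             # roughly double, as seen for delta_f = 2
```

The same closed forms explain why all sensitivities converge near zero error as ε approaches 1.0.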

Consolidated results
In this section, we present the consolidated results of all the previous analyses of differential privacy with all previously discussed sensitivity values (ie, Δf = 1, 2, 3, 4, 5) in order to provide a complete and consolidated view. Figure 13 presents the consolidated MAE and RMSE for all sensitivity values Δf = 1, 2, 3, 4, 5 and for privacy budget ε = 0.1 to ε = 1.0. These results provide two insights. First, at ε = 0.1, the MAE and RMSE are very high for all sensitivity values Δf (providing very strong privacy but no utility); however, as the privacy budget ε approaches 1.0, the MAE and RMSE converge and approach zero (providing very high utility but no privacy). Second, as the sensitivity Δf increases, the MAE and RMSE grow many-fold, but they converge toward the values of the other sensitivities as the privacy budget ε gets higher. Overall, the MAE and RMSE decrease with increasing values of the privacy budget ε. The higher the privacy budget ε, the lower the privacy but the higher the utility; inversely, the lower the privacy budget ε, the stronger the privacy but the lower the utility.

Summary
This section summarizes the findings of the experiments on privacy preservation through k-anonymity and differential privacy. We found that k-anonymity is suitable for smaller values of k and lower QIA sizes; when k is much higher, the utility is very low. Specifically, when k ≥ 450 and QIA = 3 (user latitude, user longitude, timestamp) or QIA = 4 (user latitude, user longitude, timestamp, parking id), k-anonymity is unable to generate an anonymized parking data set because the requirement of k cannot be fulfilled. Also, the behavior of QIA = 3 (user latitude, user longitude, timestamp) and QIA = 4 is very similar because the timestamp attribute is much more diverse than the parking id attribute and therefore covers parking id in anonymization by default. For differential privacy, we found that when the privacy budget is very low (eg, ε = 0.1), the privacy is very strong but the utility is very poor. As the privacy budget gets higher, the utility starts improving, however at the cost of weakening the privacy. Additionally, the sensitivity Δf also affects privacy and utility: the higher the sensitivity Δf, the stronger the privacy but the lower the utility. Moreover, we considered one parking data set in our experiments, presented in Table 1. However, the experiments can be reproduced with other data sets, because other parking data sets will have similar attributes; the attributes we considered are those required by almost all smart parking systems. Therefore, we believe the evaluation can be reproduced with other parking data sets without difficulty.

CONCLUSION
In this article, we preserve the privacy of users while sharing their historical parking information (which contains their private behavior and mobility patterns) with a semitrusted or untrusted third-party parking recommender system through two well-known privacy preservation techniques of anonymization and perturbation: k-anonymity and differential privacy. The proposed implementations preserve the privacy of users while they receive parking spot recommendations based on their past parking experience. We discuss the system and adversary models and the applicability of k-anonymity and differential privacy to the parking data set. Extensive experimental results evaluated the impact on utility and privacy of both privacy preservation techniques on our parking data set. As for future work, since we preserved the privacy of users against the parking recommender by anonymizing and perturbing the parking database, opting for privacy loses the data of individual users and the correlation between records, making it no longer possible to provide personalized recommendations with respect to an individual user's habits and preferences. This ultimately affects the quality of recommendations; hence, there is a need to study the impact of privacy preservation on recommendation services. Additionally, blockchain also provides security and privacy through its inherent features, such as strong cryptographic algorithms and hashing techniques; another future work is therefore to explore the use of blockchain as a tool for security and privacy in smart parking systems. Furthermore, it would be interesting to evaluate other data sets and observe their impact.

PEER REVIEW INFORMATION
Engineering Reports thanks the anonymous reviewers for their contribution to the peer review of this work.
Roberto Minerva holds a PhD in Computer Science and Telecommunications from Telecom SudParis, France. He was the Chairman of the IEEE IoT Initiative, an effort to nurture a technical community and to foster research in IoT. Roberto has been for several years in TIMLab with responsibilities on service architectures. Currently, he is involved in activities on SDN/NFV (technical leader of the SoftFIRE H2020 project), 5G, Big Data, and architectures for IoT. Now he is a research engineer in Telecom SudParis working on IoT software architecture and the digitalization of businesses in several industries. He is author of several papers published in international conferences, books, and magazines.