A wide range of applications such as StumbleUpon*, iGoogle* web page recommendations, and Netflix* movie recommendations suffer from the dilemma of requiring users to disclose sensitive profile information in order to benefit from personalized services. Today, users have no option but to trust the service provider with their profile data in return for personalization. One approach to preserve privacy is to build user profiles locally (i.e., on end user devices), communicate their high-level semantic categories to the service provider, and then rely on fine-grained content-based local filtering of received recommendations [7, 10, 19]. On the other hand, to benefit from the collaborative filtering (CF) approach [11, 15] (i.e., recommendations derived from like-minded users), the user profile must be exchanged with the content provider or with other users in a peer-to-peer fashion in order to be able to identify those similar users, and to compute top-rated items of the like-minded community.
Our previous work  introduces high-level design principles of a distributed privacy-preserving personalization (P3) approach in which the user profile is built locally, and then used in a lightweight and privacy-preserving way to obtain both content-based and collaborative-filtering recommendations. One of its cornerstones is the use of locality-sensitive hashing (LSH) techniques as a means of identifying similar users without explicitly comparing their profiles. Furthermore, in  we investigated the impact of the user-profiling approach (item-based versus semantic profile representations) on the resulting quality of clustering and recommendations when LSH is used for privacy rather than for its conventional purpose of scalability. In this paper, we analyze the behavior of our privacy-preserving recommender system as a function of varying LSH parameter configurations, given a fixed user profile representation. To that end, we use the MovieLens dataset of movie ratings , and empirically show that there exist LSH configurations that minimize the cost of the procured privacy in terms of recommendation quality. We identify the major trends emerging from the choice of LSH parameters, and examine the tradeoff between the quality of personalization and the degree of privacy protection.
LSH techniques are conventionally used for scalably finding approximate nearest neighbors in high dimensional spaces [2, 8]. An appropriate choice of LSH parameters, as well as the definition and tuning of different types of hash functions, have been extensively studied in the literature [6, 14, 17]. This has been done from the perspective of the accuracy of retrieved nearest neighbors combined with the search complexity or the required storage capacity. LSH has also been used to develop scalable but privacy-agnostic recommendation systems such as the hybrid CF/content-based querying system Yoda , or the MapReduce-based realization of the Google News* personalization system . Neither of these previous works dealt with privacy protection requirements. Moreover, to our knowledge, LSH techniques have not been examined from the perspective of the impact that their parameters have on the quality of recommendations.
The rest of this paper is structured as follows. We begin by describing the key concepts of the P3 approach, and defining the LSH parameters to be considered in the sequel. Next, we analyze the impact of these parameters on the quality of recommendations and the cluster size distribution and present the methodology and the results of quantitative evaluations built upon a well-known recall measure. We conclude with perspectives on future work.
Panel 1. Abbreviations, Acronyms, and Terms
AUC—Area under the curve
DHT—Distributed hash table
Tor—The Onion Router
URL—Uniform resource locator
The P3 Approach
The P3 approach relies on the two main concepts illustrated in Figure 1. First, it adapts the LSH technique for scalable nearest neighbor search [2, 8] so as to assign end users with similar profiles to the same cluster (i.e., interest group) without requiring the local profiles to be revealed to any central or trusted entity. Second, the P3 approach relies on a distributed pool of non-colluding group-wise aggregator nodes and anonymous communication channels. Each node acts on behalf of an interest group, and is in charge of anonymously aggregating the group's behavior, generating recommendations, and anonymously delivering them to the group members.
LSH-Based Approach to Formation of Interest Groups and Its Parameters
The user profile can be built by collecting and analyzing local traces of the user's online activities, e.g., as described in [10, 19]. We assume that the resulting profile is represented as a list of <item, value> pairs, where items can refer to consumed item references or their aggregated semantic characteristics (e.g., tags or categories), and values are the corresponding degrees of interest such as an explicit user-given rating or a derived implicit interest value. So, the user profile can be seen as a point in the domain of items. The user assignment to an interest group is then achieved by locally computing an LSH code for the user's profile.
An LSH scheme is defined as a probability distribution over a family F of hash functions such that two similar objects, x and y, hash to the same value with high probability: P[h(x) = h(y)] = sim(x, y), where h is drawn from F and sim(·, ·) is the underlying similarity measure.
A popular measure used in CF recommender systems is the cosine similarity, for which the random hyperplane-based LSH scheme has been proposed : hr(x) = 1 if r · x ≥ 0, and hr(x) = 0 otherwise,
where each coordinate of the random vector r is drawn independently from a Gaussian distribution. The outputs of several randomly-selected hash functions are then concatenated in order to amplify the discriminating power of hash buckets: similar points are expected to collide on all hash values. Such a sequence of hash values, g(x) = [h1(x), …, hk(x)] can be used as a label (or key) of the cluster to which the hashed objects belong. In the case of cosine similarity, the cluster key is obtained by concatenating the sign bits of dot products between the user profile x and each of the k random vectors defining the hash functions.
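As a minimal illustration of the scheme described above, the following sketch computes a cluster key from the sign bits of dot products between a profile vector and k random Gaussian hyperplanes; the function name and the packing of the bits into an integer key are our own illustrative choices, not part of the P3 specification.

```python
import numpy as np

def cluster_key(profile, hyperplanes):
    """Random hyperplane LSH: concatenate the sign bits of the dot
    products between the user profile and k random Gaussian vectors,
    and pack them into a single integer cluster key."""
    bits = (hyperplanes @ profile >= 0).astype(int)  # k sign bits
    return int("".join(map(str, bits)), 2)           # key in [0, 2^k)

# k = 5 hash functions over a d = 100 dimensional item domain;
# each coordinate of each random vector is drawn from N(0, 1).
rng = np.random.default_rng(7)
k, d = 5, 100
hyperplanes = rng.standard_normal((k, d))

profile = rng.random(d)  # a toy user profile as a dense vector of interest values
key = cluster_key(profile, hyperplanes)
```

Note that the key is invariant to positive rescaling of the profile, as one would expect from a scheme tied to cosine similarity.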
Here, the cluster key size, k, is the first configuration parameter of the algorithm. It is easy to see that increasing the value of k reduces the false positives (i.e., non-similar users assigned to the same cluster), as the cluster members need to agree on more hash values and the probability of such an agreement/collision increases with the similarity. For a given dataset, this results in a larger number of clusters (bounded by 2^k), generally with a higher intra-cluster similarity.
Furthermore, in order to reduce the false negatives (similar users not getting into the same bucket), one can compute multiple LSH sequences gj(x) = [h1,j(x), h2,j(x), …, hk,j(x)], 1 ≤ j ≤ L, and subscribe each user to L clusters. This defines the second parameter of the algorithm, the number of clustering iterations, L.
Last, but not least, note that all that this procedure requires to be shared among clients is a set of random seeds allowing the generation of identical hash functions; all the profiles themselves can be processed locally.
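The seed-sharing idea can be sketched as follows (a sketch only; the seed list and function layout are illustrative assumptions): each client regenerates the same L sets of hyperplanes from the shared seeds and computes its L cluster keys locally.

```python
import numpy as np

def cluster_keys(profile, seeds, k):
    """Compute one cluster key per clustering iteration j = 1..L.
    Clients sharing the same seeds regenerate identical hyperplanes
    locally, so no profile ever leaves the device."""
    d = len(profile)
    keys = []
    for seed in seeds:                       # one shared seed per iteration
        rng = np.random.default_rng(seed)
        hyperplanes = rng.standard_normal((k, d))
        bits = (hyperplanes @ profile >= 0).astype(int)
        keys.append(int("".join(map(str, bits)), 2))
    return keys

# Two clients holding identical profiles derive identical keys from
# the same shared seeds, without exchanging any profile data.
seeds = [101, 202, 303]                      # L = 3 clustering iterations
profile = np.random.default_rng(0).random(50)
keys = cluster_keys(profile, seeds, k=5)
```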
Profile Slicing and Interest-Group Aggregation
Although the method described above performs cluster assignments locally, one still needs to discover popular items among similar users in order to provide CF recommendations. Sharing user profiles in a peer-to-peer fashion or with an aggregator node would allow the holder of that information to infer the identity of the user (and his sensitive data) by using auxiliary information or external knowledge. Our approach therefore consists of slicing each profile into smaller segments (slices) composed of one or several <item, value> pairs. The slices of a given profile are associated with the cluster key computed from the entire profile before being uploaded anonymously and independently to the respective aggregator node. This ensures that the aggregator node cannot observe the entire profile of any group member, which prevents such identity inference. As this paper focuses on the impact of LSH parameters, we assume, without loss of generality, that slices are composed of a single item, which also minimizes the probability of inferring a user's identity. Note that it is possible to hide the client's Internet Protocol (IP) address in communications with aggregator nodes by using an anonymizing network such as The Onion Router (Tor ). In addition, in a distributed setting, an elegant way to define an addressing scheme that identifies the destination aggregator node for an interest group is to use distributed hash tables (DHTs)  so that the cluster identifiers can serve as DHT lookup keys.
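The slicing step itself is simple; the following sketch (function name and data layout are hypothetical) splits a profile into single-item slices, each tagged only with the cluster key, so the aggregator can group slices without linking them to a user.

```python
def slice_profile(profile, cluster_key):
    """Split a local profile into single-item slices, each tagged with
    the cluster key computed from the entire profile. Each slice is
    then uploaded anonymously and independently of the others."""
    return [(cluster_key, item, value) for item, value in profile.items()]

# Each tuple can be sent over an anonymous channel (e.g., Tor) to the
# aggregator node addressed by the cluster key (e.g., via a DHT lookup).
slices = slice_profile({"movie_42": 5, "movie_7": 3}, cluster_key=0b01011)
```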
Next, each aggregator node computes group recommendations in terms of top N items representing the primary interests of the group. This is achieved by aggregating the consumption elements of the group members, <item, value> pairs, where the aggregated value (rating) for each item, <item, value*>, corresponds to its average rating within the group. At the final stage of the personalization process, each client anonymously retrieves the recommendation lists from its L interest groups, and keeps the one produced by its “best fit” cluster determined by the user-to-group profile similarity.
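Under the single-item-slice assumption, the group-wise aggregation reduces to averaging the uploaded <item, value> pairs per item and ranking by average rating; the sketch below is an illustrative reading, not the authors' implementation.

```python
from collections import defaultdict

def top_n(slices, n):
    """Aggregate the <item, value> pairs of one interest group into a
    top-n recommendation list ranked by the average rating per item."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for item, value in slices:
        sums[item] += value
        counts[item] += 1
    averages = {item: sums[item] / counts[item] for item in sums}
    return sorted(averages, key=averages.get, reverse=True)[:n]
```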
The design choices made in P3  procure several important privacy-preserving features. First, the profiling and the interest group assignment are done locally without communicating the profiles outside the user's device. Then, the unlinkability of profile data uploaded to the group aggregators is achieved by profile slicing and by using anonymous communication channels. As a consequence, the interest group members become indistinguishable within the aggregator node. Finally, the presence of a distributed pool of non-colluding physical nodes reduces the risk of a single attacker simultaneously accessing a large number of interest groups.
Impact Analysis of LSH Parameters
In this section, we analyze the effect of the LSH parameters k and L on the quality of recommendations and on user privacy. We demonstrate the benefits of P3 personalization compared to a non-personalized best seller (BS) list of recommendations, which can be computed in a fully privacy-preserving way, as well as the cost of P3 privacy compared to a non-privacy-preserving, but fully personalized, neighborhood-based CF algorithm .
We first present the experimental dataset and our evaluation methodology. Then we interpret the evaluation results on recommendation quality as well as the impact of the LSH parameters on privacy. Finally, we discuss the tradeoff between personalization and privacy.
We have based our evaluations on the MovieLens dataset , which contains 1,000,209 ratings provided by 6,040 users on 3,883 movies. Each user has rated at least 20 movies on a 1 to 5 star scale (5 being the highest level of preference). In order to perform evaluations, ratings are split into training and test sets by applying a typical random partitioning into 80 percent and 20 percent of data, respectively. Entries in the training set correspond to local profiles and are used for LSH clustering and group-wise aggregation. The entries in the test set are further filtered to contain only five star ratings, which express a high level of user preference. We refer to these entries as the positive test set. There are on average eight positive items per user in the positive test set. They are used to validate the list of generated movie recommendations.
To evaluate the usefulness of a recommender system, we measure the quality of top N recommendations as suggested in [12, 18]. Let Ru(N) designate the top N items recommended to the user u, and TPu be the user's positive test set. The recall and the precision of the top N recommendations are then defined as Recallu(N) = |Ru(N) ∩ TPu| / |TPu| and Precisionu(N) = |Ru(N) ∩ TPu| / N.
It is easy to see that the top N recall defined above is proportional to the precision, i.e., the recommender system with larger recall also has better precision on fixed data and fixed N. For that reason, we build our evaluations only on the recall measure. The global recall of the recommender system, according to the top N recommendations, is evaluated by Recall(N), the average Recallu(N) over all tested users. Instead of explicitly referring to a ranking position, N can be normalized within the interval [0, 1] and can represent a fraction of items relative to their total number. By abuse of notation, when there is no ambiguity, we use N to refer to either of the two entities.
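These measures translate directly into code; the sketch below assumes set semantics for the recommendation and positive test lists.

```python
def recall_at_n(recommended, positives, n):
    """Recall_u(N): share of the user's positive test items that
    appear among the first n recommendations."""
    hits = len(set(recommended[:n]) & set(positives))
    return hits / len(positives)

def precision_at_n(recommended, positives, n):
    """Precision_u(N): share of the first n recommendations that are
    positive test items; equals recall * |TP_u| / n."""
    hits = len(set(recommended[:n]) & set(positives))
    return hits / n
```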
The Recall(N) measures the proportion of positive items among the first N recommendations, but does not reflect the order of recommendations, i.e., whether the positive items occupy the top positions within the first N recommendations. Given that the recall will increase rapidly with N if the positive items are in the top positions, the quality of top N recommendations can be captured by the truncated area under the curve (AUC) of Recall(n): AUC(N) = (1/N) Σn=1..N Recall(n).
An ideal recommender system achieves Recallu(N) = N/|TPu| when N < |TPu| and Recallu(N) = 1 otherwise.
Furthermore, we need to measure the loss of recommendation quality induced by the privacy-preserving mechanism in comparison with a reference recommender system without privacy. Let AUCREF(N) be the quality measurement of such a reference recommender. Then, for a given parameter configuration S, we define a utility function UTYS(N) as the ratio UTYS(N) = AUCS(N) / AUCREF(N).
Note that the same function can be used for comparing S with any reference recommender system, e.g., the BS approach.
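Under our reading of the truncated AUC as the average of Recall(n) over the first N ranking positions (a different normalization would only rescale the ratio), the two measures can be sketched as:

```python
def truncated_auc(recall_curve, n):
    """Truncated area under Recall(n), approximated here as the mean
    of the recall values over the first n ranking positions."""
    return sum(recall_curve[:n]) / n

def utility(auc_s, auc_ref):
    """UTY_S(N): quality of configuration S relative to a reference
    recommender (e.g., CF-REF or the best seller list)."""
    return auc_s / auc_ref
```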
Finally, we need to contrast the utility of a privacy-preserving system with the degree of achieved privacy. As motivated in , the cluster size can be used as an indicator of user privacy; larger clusters exponentially increase the complexity of recomposing the profile slices into original profiles. We then define the privacy degree PCYS provided by the parameter configuration S as the proportion of the “safe” users assigned to clusters with an above-threshold size: PCYS(w) = |{u ∈ U : |Cl(u)| ≥ w for all l = 1, …, L}| / |U|,
where w is the cluster size threshold, U is the set of all users, and Cl(u) denotes the cluster to which the user u belongs in the clustering iteration l.
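A sketch of the privacy degree computation follows; the data layout (one list of L cluster keys per user, and precomputed per-iteration cluster sizes) is an illustrative assumption.

```python
def privacy_degree(assignments, cluster_sizes, w):
    """PCY(w): proportion of users whose cluster has at least w
    members in every one of the L clustering iterations.
    assignments[u] -> list of L cluster keys for user u;
    cluster_sizes[l][key] -> size of that cluster in iteration l."""
    safe = sum(
        1
        for keys in assignments.values()
        if all(cluster_sizes[l][key] >= w for l, key in enumerate(keys))
    )
    return safe / len(assignments)
```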
Evaluation of Recommendations
We evaluate the P3 recommendation quality for different configurations of LSH parameters (k × L) and compare it with a conventional neighborhood-based CF algorithm (CF-REF) as a reference recommender system providing a high quality of personalization and no privacy protection . This algorithm operates on the exact user neighborhood by considering all the pair-wise cosine similarities between users, and builds individual recommendations for each user based on the ratings provided by his nearest neighbors. The higher the similarity of a neighbor, the higher the impact of its ratings on the final recommendations. Obviously, this algorithm is not privacy-preserving as it explicitly uses the knowledge of entire user profiles to compute the exact pair-wise similarities. In contrast, our LSH-based approach gathers approximate nearest neighbors into clusters/groups and generates the same recommendations for all group members based on the average group ratings, without being able to take into account the individual pair-wise similarities (because of profile slicing and the indistinguishability property). It is therefore reasonable to expect some deterioration in recommendation quality in return for the achieved privacy.
Figure 2 depicts the recall curves for top N recommendations produced by some LSH configurations, the reference CF approach (CF-REF), the best seller (BS) list of most popular (frequent) items, and a generator of random recommendations. Figure 2a spans the entire interval of (normalized) N values and shows the asymptotic behaviors of the recall functions, while Figure 2b zooms into the first 100 values which are of more practical significance (the end user will hardly consult the tail end of a longer recommendations list). The CF-REF and the BS curves respectively represent the upper bound (no privacy) and the lower bound (no personalization) for a reasonable privacy-preserving personalization solution.
It is noteworthy that the configuration (7 × 1) does not reach a recall of 1 even at high values of N (Figure 2a), and at some point falls below the lower bound BS. This can be explained by a high number of clusters (128 = 2^7) which, for the given dataset, results in a small average cluster size (47.2 = 6040/128). The low quality of recommendations suggests that this size is too small to capture the whole useful neighborhood of users. Moreover, as the P3 recommendations are built only from items available from the consumption/ratings of cluster members, small clusters will in general have a smaller total number of items in their recommendations, and therefore, if some positive items are missing, the recall will never reach 1. This does not happen with techniques that do not rely on a clustering approach (CF-REF, BS, or random recommender) since they operate on the entire set of items and will, at least in the long run, find all the positive items. Having said this, only the small values of N should be considered from a practicality perspective, and in that sense even the configuration (7 × 1) achieves better quality than the BS approach.
The two remaining LSH configurations (5 × 1) and (5 × 11) shown in Figure 2 are well positioned between the upper and the lower bounds. Note that the configuration with a higher number of iterations achieves significantly better performance.
We next test the pertinence of the latter observation by evaluating the utility function UTYk×L(N = 100) over a mesh grid of LSH configurations (k = 1..9, L = 1..15), as shown in Figure 3. We observe two general trends. First, for any value of k, increasing the number of clustering iterations L improves the system performance (utility). This is an expected result as every additional clustering iteration provides a chance for the randomized clustering algorithm to assign a given user to a better cluster of similar users; the pertinence of a cluster is estimated by the cosine similarity between the user profile and the cluster centroid. This improvement seems, however, to stagnate at higher values of L, as the chances to continue discovering better clusters decrease. Moreover, as we will see below, the increase of L can compromise privacy.
The second observation from Figure 3 is related to the existence of a cluster key size k = 5 (or a small surrounding interval of values 4 ≤ k ≤ 6), which provides the best performance for any fixed L. This indicates the most suitable average cluster size, 188.8 = 6040/2^5 (or a range of average cluster sizes [188.8/2, 188.8*2]), allowing the capture of useful user neighborhoods for the generation of group recommendations. The smaller clusters tend to reduce the captured neighborhood and miss useful items in the recommendations, while larger clusters assemble more distant profiles and then, not being able to differentiate among these profiles, generate averaged ratings, which reduce the accuracy of recommendations for each individual user.
The Impact on Privacy
Figure 4 illustrates the effect of LSH parameters on the privacy degree PCYk×L(w) for two cluster size thresholds, w = 50 and w = 100. Let us consider Figure 4a (w = 50). It shows that the number of safe users highly depends on the key size parameter k, which determines the number of clusters and thus the number of users per cluster. For small values of k, k ≤ 4, the theoretical maximum number of clusters is also small (2 to 16) and we find more than 50 users in all the clusters, i.e., all the users are considered safe. When the number of clusters increases (k ≥ 5), some clusters have less than 50 (i.e., non-safe) users.
To examine the impact of the number of clustering iterations, L, let us consider the proportion of safe users in the first clustering iteration as the empirical probability p that a randomly selected user is safe in a single clustering iteration. Then, over L independent clustering iterations, the probability of a given user remaining safe in all of them decreases to p^L. This phenomenon can be observed in Figure 4a, especially for k > 5, where the initial proportion of non-safe users is high enough. The same trends appear, but starting at lower k values, when the cluster size threshold is doubled to w = 100 (see Figure 4b).
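The exponential decay described above can be seen with a toy value of p:

```python
# If a user is safe with empirical probability p in a single clustering
# iteration, remaining safe across L independent iterations happens
# with probability p**L, which decays exponentially in L.
p = 0.9
decay = {L: p ** L for L in (1, 5, 15)}
```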
Discussion on Privacy Versus Personalization Tradeoff
By contrasting the trends observed in recommendation quality (Figure 3) and degree of privacy (Figure 4), we can examine the well-known tradeoff between data utility and privacy, as depicted in Figure 5. In fact, the choice of LSH parameters impacts the utility function and the privacy indicator differently: to improve the privacy degree PCYk×L, one needs to decrease both parameters k and L, while from the perspective of recommendation quality one needs to increase L and maintain a stable k that maximizes the utility UTYk×L. First, let us consider the privacy degree indicator defined by the cluster size threshold w = 50 (Figure 5a). The LSH configuration (5 × 15) that maximizes utility slightly compromises the degree of privacy (PCY > 0.9). At the same time, the configurations that are 100 percent safe from the privacy perspective pose a slight limitation on recommendation performance, e.g., (4 × 15). If we put a stronger requirement on the cluster size, w = 100 (Figure 5b), then recommendation quality needs to be ceded further in order to achieve a decent percentage of safe users; e.g., now the configuration (5 × 15) provides an unacceptably low degree of privacy (PCY = 0.3), and the configuration (4 × 15) can represent a compromise (PCY > 0.9), while to achieve the highest degree of privacy (PCY = 1), one needs to use a smaller cluster key size, k = 3. Note that while making concessions on recommendation quality, we need to check that it remains superior to the quality of BS recommendations, i.e., that the system still provides a personalization gain. Because of space limitations we did not report the evaluations of the utility UTYk×L(N = 100) using the BS approach as a reference. In fact, the performance of our system in terms of the truncated area under the curve AUC(N = 100) proved to be 1.25 to 1.69 times superior to the BS approach over the entire mesh grid of examined configurations (k = 1..9, L = 1..15).
More specifically, the value of AUC3 × 15(N = 100) for the configuration (3 × 15), which achieves a 100 percent degree of privacy, is about 1.64 times larger than the corresponding value of AUCBS(N = 100) of the BS list recommender.
Conclusion and Future Work
In this paper, we described how the locality-sensitive hashing (LSH) technique for scalably finding nearest neighbors can be adapted to discover similar users in the P3 privacy-preserving personalization framework. Some of the privacy-preserving features attained by P3 affect data utility (i.e., the quality of the generated recommendations), and thus induce a so-called privacy cost. Namely, by using a distributed LSH-based approach for clustering, we gather approximate nearest neighbors, including some clustering errors that cannot be eliminated because of the profile indistinguishability property within the clusters. We examined the impact of varying the LSH parameters and observed decent recommendation quality despite this privacy cost. In addition, given that the cluster key size parameter directly influences the average size of the formed clusters, which can be considered an indicator of the degree of privacy, our evaluations exhibited the well-known tradeoff between data utility and privacy.
Our evaluations were carried out on a public dataset of movie ratings, which allowed us to elucidate major trends emerging from the choice of LSH parameters. Future work will target an extension of the evaluation methodology that allows inferring an operational configuration for a given large dataset by analyzing its representative samples. Namely, we will investigate how the LSH configurations, which achieve an acceptable tradeoff of privacy versus personalization for a sample dataset, can be extrapolated to the entire population.
(Manuscript approved October 2013)
Google News and iGoogle are trademarks of Google Inc.
Netflix is a registered trademark of Netflix, Inc.
StumbleUpon is a registered trademark of StumbleUpon, Inc.
ARMEN AGHASARYAN is a researcher in the Enabling Computing Technologies (ECT) Research Domain at Alcatel-Lucent Bell Labs in Villarceaux, France. He received an M.Sc. degree in control system engineering from the Yerevan Polytechnic Institute, Armenia, and an M.Sc. degree in industrial engineering from the American University of Armenia in Yerevan. He received his Ph.D. degree in signal processing and telecommunications from the University of Rennes, France. Before joining Alcatel, he spent two years with the France Telecom research department in Lannion, France. Dr. Aghasaryan has worked intensively in the area of network management and elaborating new probabilistic model-based techniques for distributed fault diagnosis and alarm correlation. His current interests include highly-scalable distributed recommender systems and privacy protection technologies.
MAKRAM BOUZID is a researcher within the Enabling Computing Technologies Research Domain (ECT) at Alcatel-Lucent Bell Labs in Nozay, France. He received a Ph.D. in computer sciences from Henri Poincaré—Nancy 1 University, France. He also obtained an engineering diploma in the same field from the National School for Computer Studies (ENSI) in Manouba, Tunisia. His research activities are in the areas of artificial intelligence and the application of AI technologies for the development of distributed and intelligent applications for mobile and non-mobile users. Dr. Bouzid has worked on agent and multi-agent design and simulations, as well as on agent-based services (generic services, service composition, and service aggregation). His current activities are focused on personalization (machine learning, reasoning, user profiling, recommendation systems) and on privacy-preserving personalization technologies.
DIMITRE KOSTADINOV is a research engineer at Alcatel-Lucent Bell Labs in Nozay, France. Prior to joining Bell Labs, he received his Ph.D. degree in computer science from the University of Versailles, France, in the area of personalized access to distributed databases. His domains of interest lie in personalized access to content, privacy protection, user profiling, context-awareness, pervasive and distributed environments, and evaluation of quality and performance of personalized systems. Dr. Kostadinov has authored more than 15 research papers and has been granted over 15 patents. He was one of the organizers of PRSAT 2010: International Workshop on Practical Use of Recommender Systems, Algorithms, & Technology (held in conjunction with ACM Recommender Systems 2010 in Barcelona, Spain). He also received the Best Paper Award at the 2010 User Modeling, Adaptation, and Personalization conference (UMAP 2010).
ANIMESH NANDI is a researcher with the Enabling Computing Technologies (ECT) Research Domain at Alcatel-Lucent Bell Labs in Bangalore, India. He received his Ph.D. degree from Rice University, Houston, Texas (with significant time spent at Max Planck Institute for Software Systems, Germany), a Masters degree from Rice University, and a B.Tech. from Indian Institute of Technology (IIT), Kharagpur, India. All of his above degrees are in Computer Science. Dr. Nandi's research interests are largely in the area of Internet-scale networked and distributed systems. He has designed and built several novel distributed middleware architectures for content distribution networks, communication systems, and big data systems. His current interests lie at the intersection of distributed systems, distributed datamining, and data privacy.