An efficient secure k nearest neighbor classification protocol with high‐dimensional features

The k nearest neighbor (kNN) classification algorithm is a prediction model widely used in real‐life applications such as healthcare, finance, computer vision, personalized recommendation, and precision marketing. The era of data explosion has brought a significant increase in feature dimension, which also raises privacy concerns over the available samples and unlabeled data in machine learning applications. In this paper, we present a secure, low communication overhead kNN classification protocol that is able to deal with high‐dimensional features given in real numbers. First, to deal with feature values given in real numbers, we develop a specific data conversion algorithm, which is used in the chosen fully homomorphic scheme. This conversion algorithm is generic and applicable to other algorithms that need to handle real numbers under a fully homomorphic scheme. Second, we present a privacy‐preserving Euclidean distance protocol (PPEDP), which computes the Euclidean distance between two points given in real numbers in a high‐dimensional space. Then, based on the novel PPEDP and oblivious transfer, we propose a new classification approach, the efficient secure kNN classification protocol (ESkNN), with low communication overhead, which is appropriate for a sample set with high‐dimensional features and real‐number feature values. Moreover, we implement ESkNN in C++. Experimental results show that ESkNN is several orders of magnitude faster than existing works and scales up to 18 000 feature dimensions in a memory‐limited environment.

SecureML addresses privacy-preserving machine learning for linear regression (LR) and neural network training using the stochastic gradient descent method. Using a novel strategy, SecureML is able to work with real numbers and has been tested on a data set with up to 5000 features. More recently, Cock et al. 9 propose a gradient descent-based algorithm that trains an LR model in a secure way. The novelty of their algorithm is a protocol that computes the activation function without requiring secure comparison protocols or Yao's garbled circuits. The algorithm was tested on data sets with up to 17 814 features. Hong 11 presents privacy-preserving collaborative machine learning using TensorFlow, which performs well on data sets with up to 16 384 features.
Most of the earlier works on secure kNN classification adopt the linear scan strategy. [21][22][23][24][25][26][27][28][29][30][31] Chen et al. 32 present a secure kNN classification protocol using a clustering strategy. To achieve a privacy-preserving protocol, their solution uses additively homomorphic encryption, distributed oblivious RAM, and garbled circuits. They tested their protocol on a database consisting of 10 million entries, but with only 128 features. Xu et al. 33 introduce a privacy-preserving protocol based on the kd-tree data structure. This protocol combines oblivious RAM and SMC, and it does not handle high-dimensional features well because the kd-tree structure is only effective for low-dimensional data.
Many works use the Paillier scheme, 21,[24][25][26] the ElGamal scheme, 22 the BFV homomorphic scheme, 32 or oblivious transfer (OT)-based multiplication 27 to compute the distance between the query and the samples in a secure manner. For the top-k selection, either garbled circuits [28][29][30]32 or the BGV homomorphic scheme 31,34 is used. To be specific, Samanthula et al. 21 propose a secure kNN classifier over encrypted data in the cloud using the Paillier scheme. Rong et al. 22 present a secure outsourced kNN classifier in multiple cloud environments based on the proxy re-encryption (PRE) technique, where different data owners outsource their private data to different cloud service providers. They use a two-server model under the assumption that the two servers in every cloud environment do not collude with each other. Shaul et al. 34 propose a secure k-ish nearest neighbor classifier, where the output label is computed from the κ nearest neighbors for κ ≈ k with some probability. They implement their classifier using the BGV homomorphic scheme. Experiments show that the accuracy of their classifier is 98%.
While most of the previous works consider the security under the honest-but-curious model, Schoppmann et al. 28 propose a privacy-preserving kNN (PPkNN) classifier based on the cosine similarity between documents represented by their TF-IDF features under the malicious model.
These existing works all aim to make kNN classification more secure. However, they mostly focus on scaling up the number of samples and give little consideration to supporting data given in real numbers.

| Motivations and contributions
In the age of big data, secure kNN classification needs to be applied to large data sets. In fact, with the development of artificial intelligence (AI), the internet of things (IoT), genomics, and genetics, the dimension of features is increasing dramatically. Existing secure kNN classification protocols mainly focus on scaling up in the horizontal direction, that is, the number of samples. Thus, many applications that deal with high-dimensional features have to adopt insecure solutions instead. Another common issue for the existing protocols is the inability to deal with real-number computation, which is often the form the data takes in real-life applications. As this computation is prohibitively expensive under homomorphic encryption or SMC techniques such as garbled circuits, most of the existing privacy-preserving machine learning protocols deal only with integers. This makes these protocols inapplicable to many real-life applications.
Recall that most of the existing secure kNN protocols adopt the linear scan strategy, where the distance between the query and every sample in the training set needs to be computed in a privacy-preserving manner. Meanwhile, a large amount of information about the relationship between the query and every sample needs to be transferred from one party to another. With the development of cloud computing, computing power can easily be increased; in contrast, communication infrastructure is expensive. Thus, another unsolved common issue of the existing secure kNN protocols is the reduction of communication overhead.
In this paper, we present a secure, low communication overhead kNN classification protocol that is able to deal with high-dimensional features given in real numbers. The main contributions of our work are as follows: (1) We present a new data conversion algorithm that works with real numbers and can be used with the BGV homomorphic scheme. 23 This conversion algorithm is generic and applicable to other algorithms based on the BGV homomorphic scheme that also need to handle real numbers. (2) We propose a privacy-preserving Euclidean distance protocol (PPEDP) based on the BGV homomorphic scheme. By using the novel data conversion algorithm, PPEDP computes the Euclidean distance between two points given in real numbers in a high-dimensional space. (3) We propose a new classification approach, the efficient secure kNN (ESkNN) classification protocol, which is built on PPEDP and OT. Likewise, our ESkNN can deal with real numbers. We build our protocols using HElib 35 and the EMP toolkit, 36 which enables our building block PPEDP to be reused in other applications. Our experimental results show that ESkNN performs efficiently when applied to a sample set with high-dimensional features. The feature dimension scales up to 18 000 in a memory-limited environment. Moreover, the communication overhead is low with carefully selected parameters. ESkNN achieves speedups of up to 402× and 17× over the existing protocols PPkNN 21 and OCkNN, 22 respectively. Our ESkNN, along with PPEDP, is secure in the honest-but-curious model. We provide the detailed security definitions and proofs for our protocols in the appendix.
Our protocols fall into the category of SMC, where we use the traditional mode as shown and explained in Figure 1. In our protocols, there are two involved parties, Alice and Bob. Assuming that Alice has a private sample data set, ESkNN allows the two parties to classify a private query input by Bob using Alice's data set in a way that Bob gets the resulting label without learning anything about Alice's data set. Furthermore, Alice will not learn anything about Bob's input or the query result.

FIGURE 1 The mode of the efficient secure kNN classification protocol. PPEDP, privacy-preserving Euclidean distance protocol; OT, oblivious transfer

| Outline
The remainder of this paper is set out as follows. We introduce the preliminaries in Section 2. Section 3 briefly introduces the data conversion algorithm and then presents the building block, PPEDP. Section 4 proposes the ESkNN classification protocol with its performance analysis. After that, we describe the implementation and experiments in Section 5. Finally, we present our concluding remarks in Section 6.

| Notation
The notations we use in this paper are as follows. σ represents the number of digits in the fraction part of the real numbers used in our protocols. Within the kNN algorithm, m is the number of samples, while n represents the feature dimension. In the BGV homomorphic scheme, τ is the slot length, λ is the security level, and p is the plaintext prime modulus. ρ is the bucket length, which will be introduced and used in our ESkNN. For a real number r, ⌈r⌉ represents the smallest integer no smaller than r. δ = ⌈n/τ⌉ represents the number of ciphertexts corresponding to every sample. ε = ⌈δ/ρ⌉ represents the number of buckets corresponding to every sample.

| kNN classification algorithm
The kNN classification algorithm is one of the best-known algorithms for multivariate classification in the field of machine learning. Euclidean distance is a commonly used metric in kNN. More specifically, for two points X = (x_1, x_2, …, x_n) and Y = (y_1, y_2, …, y_n), the Euclidean distance is e(X, Y) = √(Σ_{i=1}^{n} (x_i − y_i)²).
In the kNN classification algorithm, the training data set is S = {(S_i, y_i)}, where each S_i = (s_{i,1}, s_{i,2}, …, s_{i,n}) is an n-dimensional feature vector and y_i ∈ {1, 2, …, t} is the label of S_i. In practice, s_{i,j} is always a real number. The kNN classification algorithm classifies a query X = [x_1, x_2, …, x_n] to a label y based on the labels of its k nearest neighbors. We describe the kNN classification algorithm in Algorithm 1.

| Fully homomorphic encryption
In this paper, we use the BGV fully homomorphic encryption (FHE) scheme to compute the Euclidean distance between two points in a privacy-preserving manner. Assuming that λ is the security level, τ is the slot length, and p is the plaintext prime modulus, the BGV homomorphic scheme consists of the following algorithms. FHE.KeyGen(1^λ) → (pk, sk): given the security parameter λ, this algorithm generates the key pair (pk, sk) used for encryption and decryption. FHE.Enc(pk, m) → c encrypts a plaintext m under the public key pk, and FHE.Dec(sk, c) → m decrypts a ciphertext c under the secret key sk. The scheme additionally supports homomorphic addition and multiplication on ciphertexts.

| Oblivious transfer
OT is a fundamental cryptographic primitive that is widely used in fields such as SMC. In 1-out-of-2 OT, denoted OT₂¹, the sender S inputs a pair (x_0, x_1), where x_0 is a random number and x_1 is a meaningful value, while the receiver R has a selection bit b ∈ {0, 1}. R obtains x_b without revealing any information about b to the sender S, and R cannot obtain any information other than x_b.
In this paper, we use a more common variant called m × OT₂¹. Consider the scenario in which S inputs m pairs (x_0, x_1) and R inputs m selection bits; each of the m transfers proceeds in a secure manner as in OT₂¹. For simplicity, we denote m × OT₂¹ by OT from here on.

| Security definition
In this paper, we use the honest-but-curious model, which is widely used in SMC. Additionally, we follow the simulation-based security definition from the SMC literature. 37 Definition 1. Private computation under the honest-but-curious model.
During the execution of protocol π, the information obtained by participants P_1 and P_2 is recorded as their views: view_i^π(x_1, x_2) = (x_i, r_i, m_i^1, …, m_i^t), where r_i represents the random number generated by P_i, and m_i^j represents the jth message received by P_i. After the protocol, the output of P_i is recorded as output_i^π(x_1, x_2). Protocol π computes a deterministic function f = (f_1, f_2) privately under the honest-but-curious model if and only if there exist two probabilistic polynomial-time algorithms S_1 and S_2 such that {S_1(x_1, f_1(x_1, x_2))} ≡ {view_1^π(x_1, x_2)} and {S_2(x_2, f_2(x_1, x_2))} ≡ {view_2^π(x_1, x_2)}.

| PRIVACY-PRESERVING EUCLIDEAN DISTANCE PROTOCOL
In this section, we present our PPEDP. First, we propose a data conversion algorithm which is used in the BGV homomorphic scheme. Then, we present our PPEDP and analyze the security and efficiency.

| Data conversion algorithm
To use the BGV homomorphic scheme on a real number, we need to convert it to an integer in advance. Assume a real number x whose decimal representation has σ digits in the fraction part.
Then x is converted into an integer y by y = (x × 10^σ) mod p, where p is the plaintext prime modulus in the BGV homomorphic scheme. When we decrypt a ciphertext y, the plaintext D(y) needs to be converted back to its true value x according to the operations performed on the ciphertext.
If no operations were performed on the ciphertext y, we compute the true plaintext by x = 10^{−σ} × (D(y) mod p). If y was obtained by m multiplications on other ciphertexts, we have x = 10^{−(m+1)σ} × (D(y) mod p). If y was obtained by m additions or subtractions on other ciphertexts, we have x = 10^{−σ} × (D(y) mod p).
We now describe the setting of PPEDP. Alice has a point x = (x_1, x_2, …, x_n) in an n-dimensional space, while Bob has a point y = (y_1, y_2, …, y_n). Alice and Bob want to calculate the Euclidean distance between their points without revealing their point information. After the calculation, only Bob gets the resulting Euclidean distance.
In practice, the values in an n-dimensional space are always real numbers. Before the protocol, Alice and Bob first need to convert their inputs to integers using our data conversion algorithm as described above, so that the feature values can be processed using the BGV homomorphic scheme. For simplicity, we assume that every value in x and y has been multiplied by 10^σ before the protocol. Our PPEDP is described in Protocol 1.

| Correctness and efficiency
Theorem 1. Assuming every element of the input can be represented in μ digits after data conversion, our PPEDP computes the Euclidean distance between two points correctly when μ < lg p/(ρ + 1).
Proof. To compute the Euclidean distance correctly, the logical plaintext related to every ciphertext needs to be less than p, where p is the plaintext prime modulus in the BGV homomorphic scheme. We observe that the largest plaintext is decrypted from FHE.Dec(sk, C_i) in step 5. Assume α′_i is one of the logical plaintexts related to FHE.Dec(sk, C_i). To make it easier to understand, we consider the logical operations in the clear. α′_i is obtained by δ additions on products of two elements. So, we need 10^{μ(ρ+1)} < p, that is, μ < lg p/(ρ + 1). ∎
As in many previous works, we mainly focus on the efficiency of the processing stage. In our PPEDP, Alice and Bob transmit in total ε = ⌈n/(τ × ρ)⌉ ciphertexts and m real numbers. The computation overhead is δ = ⌈n/τ⌉ encryptions, δ homomorphic additions, and ε decryptions.

| ESkNN CLASSIFICATION PROTOCOL
In this section, we propose our ESkNN using PPEDP and OT. We also analyze the performance of our new protocol theoretically.

| ESkNN
Our ESkNN consists of three phases, that is, the preprocessing phase, the Euclidean distance computation phase and the label computation phase.
2) Second, Alice encrypts S using the public key pk as follows.
Then, Alice discloses pk and E(S). Once Alice has disclosed pk and E(S), everyone can obtain them without any authentication. 5) At last, Bob gets pk and E(S).

| Euclidean distance computation phase
In the linear scan kNN algorithm, the k nearest neighbors are chosen by the top-k minimum Euclidean distances between every sample and the query. In ESkNN, we do not compute the true Euclidean distances e(S_i, X) = √(Σ_{j=1}^{n} (s_{i,j} − x_j)²). 2) Bob encrypts R using the public key pk as follows.
3) Bob encrypts X using the public key pk.

| Performance analysis
Similar to PPEDP, our ESkNN performs the kNN prediction correctly when μ < lg p/(ρ + 1). In the processing phase, besides the data transmitted in the OT protocol, Alice and Bob transmit m × ε = m × ⌈n/(τ × ρ)⌉ ciphertexts and m real numbers. The computation overhead is m × δ = m × ⌈n/τ⌉ encryptions, m × δ homomorphic additions, m × ε decryptions, and the computation overhead of the OT protocol.

| IMPLEMENTATION AND PERFORMANCE RESULTS
We implement a PPkNN prediction system based on our protocols and show the experimental results in this section. We evaluate only the performance of our ESkNN because (1) a lot of previous work has been done on the performance evaluation of PPkNN prediction, and (2) PPEDP is realized as part of the ESkNN system.
The system is implemented in C++. We use the fully homomorphic cryptosystem HElib 35 to handle encryption, decryption, and homomorphic operations. OT is implemented using the EMP toolkit. 36 The experiments are performed on a JD Cloud c.n2.xlarge compute-optimized standard instance with a four-core Intel Xeon 2.4 GHz processor and 8 GB RAM. Each of the parties ran on a CPU core. We collected 10 runs for each test and report the average. We report experimental results only for the processing phase of our ESkNN because the preprocessing phase needs to be executed only once for different queries. In our experiments, we use synthetic data sets to test our implementation. In the synthetic data sets, the samples and the queries are generated randomly. Each feature value is a two-decimal number between 0.00 and 0.99 after normalization.

| Results analysis
We start with the experimental results for our ESkNN in different settings. We initialized the BGV homomorphic scheme with the following parameters: (1) the cyclotomic polynomial Φ_ω(x) is set with ω = 32767, (2) the plaintext prime modulus p is set to 999007, (3) the number of bits of the modulus chain is set to 590. With the parameters above, the resulting security level λ is 99, while the slot length τ is 180. According to Theorem 1, the maximum bucket length ρ is 100. We choose these parameters for three reasons. (1) The security level is similar to that of the works we compare against. (2) To the extent allowed by the hardware, we want the plaintext space to be as large as possible. (3) To reduce bandwidth, the expected slot length τ is 180, which is an empirical value.
First, to examine how the processing phase scales, we run experiments on data sets with the dimension n of the feature vector increasing from 3600 to 18 000 and the bucket length ρ increasing from 20 to 100. As shown in Figure 2, when fixing m = 5, k = 3, the running time scales almost linearly with the dimension of the feature vector, while the bucket length has less effect. Meanwhile, when fixing the dimension of the feature vector, the communication size mainly depends on the bucket length. When fixing the bucket length, the relationship between the communication size and the dimension of the feature vector can be expressed as a step function. In particular, fixing k = 3, ρ = 100, it takes 646.2 s with a communication size of 408 MB to classify with the kNN model in a privacy-preserving manner on five samples with 18 000 features each. Obviously, in this scenario, to make full use of resources, the bucket length should be set to 100. In fact, the bucket length is set to a constant when using our implementation in practice. Usually, to make full use of the communication resources, the bucket length should be set to the maximum effective value according to Theorem 1.
Next, we run experiments on data sets with the sample size m increasing from 20 to 60 and the number k of nearest neighbors increasing from 3 to 11. Currently, we could not run our protocol for larger sample sizes due to the large memory requirements, but with more memory our protocol is expected to run with larger parameters. As shown in Figure 3, with the optimized bucket length and a fixed data set size, that is, m × n, the overhead of our protocol is much lower when the sample data set has many more features than samples, that is, n ≫ m. In particular, we analyze this situation by fixing the data set size to 18 000 and the bucket length to 100. When m = 50, n = 360, our protocol takes 270.8 s with a communication size of 817 MB. Meanwhile, it takes 127.3 s with a communication size of 81 MB when fixing m = 5, n = 3600. In conclusion, with fixed k, ρ, τ, we find that for our ESkNN, (1) the computation overhead, evaluated by the running time, scales almost linearly with the dimension of the feature vector and the sample size; (2) the communication size grows as a step function of the dimension of the feature vector, while it grows linearly with the sample size; and (3) our protocol is more suitable for the scenario where the sample data set has many more features than samples. These observations agree with the asymptotic complexity derived above.

| Comparison with prior works
As surveyed in the related works, PPkNN classification protocols based on the same distance metric are also considered by PPkNN 21 and outsourced collaborative kNN (OCkNN). 22 For comparison, we use the results reported in Reference 22 directly, as their security levels are nearly equal to ours and their running environment is more powerful than ours. PPkNN and OCkNN are tested on 4898 samples with 12 features each. Our protocol is tested on 33 samples with 1800 features each. The data set sizes are thus nearly equal; in fact, our data set size is slightly larger than theirs.
The running time and communication size of both PPkNN and OCkNN increase with the growth of k, while our overhead is independent of k. As shown in Figure 4, compared with PPkNN and OCkNN, the computation overhead, evaluated by the running time, is improved significantly in ESkNN. For example, when k = 25, our protocol has a 402× speedup over PPkNN and a 17× speedup over OCkNN.
Our communication size is significantly smaller than PPkNN's. Compared with OCkNN, our communication overhead is lower than OCkNN's when k > 15.
In addition, our ESkNN can work with high dimensional features given in real numbers while PPkNN and OCkNN mainly focus on low dimensional features represented in integers.

| CONCLUSION
In this paper, we presented PPEDP, which enables us to compute the Euclidean distance between two points given in real numbers in a high-dimensional space without privacy leakage. Based on PPEDP and OT, we designed an efficient secure kNN classification protocol, ESkNN. Experimental results showed that our ESkNN deals with real numbers, performs well in the scenario where the sample data set has many more features than samples, and runs faster than existing works.
We note that our protocol has the following limitations. First, our data conversion algorithm is highly dependent on the operations performed on the ciphertext. Our future work will look to design a data conversion algorithm that works in a black-box manner with the fully homomorphic scheme. Second, with fixed BGV homomorphic parameters, the overhead of our protocol depends on the dimension of the feature vector, the sample size, the slot length, and the bucket length. The overhead of our protocol increases sharply when the sample data set has many more samples than features. At last, our protocol can only deal with plaintexts that have two digits in our hardware-limited experimental environment. We would like to design a protocol practical for plaintexts with more digits in the future.

Theorem. In the honest-but-curious model, assuming that the underlying fully homomorphic encryption (FHE) scheme is secure, our privacy-preserving Euclidean distance protocol (PPEDP) calculates the Euclidean distance securely.
Proof. First, we prove the security of PPEDP in case Alice is corrupted. In this section, we use protocol π to represent PPEDP. During the execution of π, the view of Alice is view_1^π(x, y) = {x, f_1(x, y), (pk, sk), E(X), C, α, w, v}, where f_1(x, y) = ⊥. Now, we construct a simulator S_1 that receives Alice's input and output, that is, (x, f_1(x, y)) = (x, ⊥). S_1 simulates the execution of protocol π as follows. 1) S_1 sets x′ = x. 2) S_1 chooses y′ = {y′_1, …, y′_n}, where every y′_i is a random number. 3) S_1 calls the key generation algorithm of the BGV homomorphic scheme to generate the key pair (pk′, sk′). 4) S_1 encrypts x′ using the public key pk′ and gets the ciphertext E(x′). 5) S_1 encrypts y′ using the public key pk′ and gets E(y′). 6) S_1 performs multiplications and additions on ciphertexts to obtain a ciphertext set C′.
7) S_1 decrypts C′ using the private key sk′ to obtain v′.
Assuming the underlying homomorphic encryption scheme is secure in the honest-but-curious model, we have (pk′, sk′) ~ (pk, sk). As r′_i and r_i are chosen randomly, we have r′_i ≡ r_i and α′ ≡ α. As w′ = w and α′ ≡ α, we have v′ ≡ v. Now, we can conclude that {S_1(x, f_1(x, y))} ≡ {view_1^π(x, y)}. Next, we prove the security of PPEDP in case Bob is corrupted by constructing a simulator S_2. 1) S_2 sets y′ = y. 2) S_2 chooses x′ = {x′_1, …, x′_n}, where every x′_i is a random number. 3) S_2 calls the key generation algorithm of the BGV homomorphic scheme to generate the key pair (pk′, sk′). 4) S_2 encrypts x′ using the public key pk′ and gets the ciphertext E(x′). 5) S_2 encrypts y′ using the public key pk′ and gets E(y′). 6) S_2 performs multiplications and additions on ciphertexts to obtain C′.
Because S_2 simulates π in case Bob is corrupted, the view of Bob is view_2^π(x, y) = {y, f_2(x, y), pk, E(x), E(y), C, v}. Assuming the underlying homomorphic encryption scheme is secure in the honest-but-curious model, we get E(x′) ≡ E(x), E(y′) ≡ E(y), and C′ ≡ C. As r′_i and r_i follow the same probability distribution, we get r′ ≡ r. As v′ and v are masked by r′ and r, respectively, we have v′ ≡ v. Therefore {S_2(y, f_2(x, y))} ≡ {view_2^π(x, y)}. ∎
The security proof of ESkNN proceeds similarly: the simulator fixes the corrupted party's input, chooses a random number for every element of the other party's input, and 3) S_1 generates the key pair (pk′, sk′) of the BGV homomorphic scheme.