HDKV: supporting efficient high-dimensional similarity search in key-value stores


Correspondence to: Jizhong Han, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China.

E-mail: hanjizhong@iie.ac.cn


Key-value stores are widely used for large-scale data management in cloud environments. However, they natively support only key-based queries and offer no efficient solution for value-based queries, so handling high-dimensional data in key-value stores remains a major challenge. State-of-the-art solutions address this issue with value-based tree-structured indexes. These methods suffer from the curse of dimensionality and cannot achieve satisfactory performance. They also cause serious load imbalance among servers, which dramatically degrades system scalability.

Meanwhile, similarity search in high-dimensional data spaces is becoming increasingly popular in today's cloud applications. Because efficient algorithms for value-based queries are lacking, users must wait a long time for results. To address this issue, we propose a novel approach called high-dimensional similarity query in key-value stores (HDKV), which generates similarity results quickly while maintaining good database scalability. In HDKV, a strict order-preserving hash function is designed to map nearby objects in the high-dimensional space onto adjacent keys of a continuous linear space in the key-value store. With this strategy, many expensive random accesses are replaced with more efficient scan accesses. An experimental evaluation on a real-world data set shows that, compared with state-of-the-art methods, HDKV dramatically reduces search time with little impact on accuracy. Copyright © 2012 John Wiley & Sons, Ltd.
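To make the idea of replacing random accesses with a key-range scan concrete, the following is a minimal sketch and not the HDKV hash function itself, which the abstract does not specify. It assumes a simpler stand-in mapping: each object's key is its Euclidean distance to a fixed pivot point. By the triangle inequality, every object within radius r of a query has a key within r of the query's key, so a single contiguous scan yields a candidate set that is then refined by exact distance. The names `SortedKVStore`, `build_store`, and `range_search` are hypothetical, and the in-memory sorted list only imitates a key-value store that keeps keys in sorted order.

```python
import bisect
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class SortedKVStore:
    """Toy stand-in (assumption) for a key-value store with sorted keys."""

    def __init__(self):
        self._keys = []    # sorted list of (key, object_id)
        self._values = {}  # object_id -> vector

    def put(self, key, object_id, vector):
        bisect.insort(self._keys, (key, object_id))
        self._values[object_id] = vector

    def scan(self, lo, hi):
        """Yield (object_id, vector) for all keys in [lo, hi], in key order."""
        # (lo,) sorts before any (lo, object_id), so this finds the first key >= lo.
        i = bisect.bisect_left(self._keys, (lo,))
        while i < len(self._keys) and self._keys[i][0] <= hi:
            object_id = self._keys[i][1]
            yield object_id, self._values[object_id]
            i += 1


def build_store(pivot, items):
    """Index each vector under key = distance to the pivot (stand-in mapping)."""
    store = SortedKVStore()
    for object_id, vector in items.items():
        store.put(dist(pivot, vector), object_id, vector)
    return store


def range_search(store, pivot, query, radius):
    """Return (distance, object_id) pairs within `radius` of `query`, nearest first."""
    center = dist(pivot, query)
    hits = []
    # One contiguous scan replaces per-object random lookups.
    for object_id, vector in store.scan(center - radius, center + radius):
        d = dist(query, vector)
        if d <= radius:  # refine: drop candidates that only share a similar key
            hits.append((d, object_id))
    return sorted(hits)
```

In this sketch the scan can return false candidates (objects at a similar distance from the pivot but far from the query), which is why the exact-distance refinement step is needed; it never misses a true neighbor within the radius.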