Efficient and exact duplicate detection on cloud

Authors

  • Chuitian Rong,

    1. Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China
    2. School of Information, Renmin University of China, Beijing, China
  • Wei Lu,

    1. Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China
    2. School of Information, Renmin University of China, Beijing, China
  • Xiaoyong Du,

    Corresponding author
    1. School of Information, Renmin University of China, Beijing, China
    2. Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China
  • Xiao Zhang

    1. Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China
    2. School of Information, Renmin University of China, Beijing, China

  • The short version of the paper was published in WAIM 2011 conference [1].

Correspondence to: Xiaoyong Du, School of Information, Renmin University of China, Beijing, China.

E-mail: duyong@ruc.edu.cn

SUMMARY

With the recent proliferation of social networks, mobile applications, and online services increasing the rate of data gathering, finding near-duplicate records efficiently has become a challenging problem. Related work on this problem has mainly aimed to propose efficient approaches on a single machine. However, when processing large-scale datasets, the performance of duplicate identification remains far from satisfactory. In this paper, we address the problem of duplicate detection using MapReduce. We argue that the performance of duplicate detection with MapReduce depends mainly on the number of candidate record pairs and the intermediate result size, which determine the shuffle cost among the nodes of the cluster. We propose a new signature scheme with new pruning strategies to minimize the number of candidate pairs and the intermediate result size. The proposed solution is exact: it guarantees that no duplicate record pair is lost. Experimental results on both real and synthetic datasets demonstrate that our signature-based method is efficient and scalable. Copyright © 2012 John Wiley & Sons, Ltd.
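The abstract does not spell out the paper's signature scheme, so the sketch below illustrates the general idea with a standard one, prefix filtering: each record emits only a small set of signature tokens, the shuffle groups record identifiers by token, and only records sharing a signature token become candidate pairs, which are then verified exactly so that no true duplicate pair is lost. The map/shuffle/reduce phases are simulated in plain Python; the function names, the alphabetical stand-in for the global token ordering, and the toy records are illustrative assumptions, not the authors' method.

```python
import math
from collections import defaultdict
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def prefix_signatures(record, order, threshold):
    # Prefix filtering: sort a record's distinct tokens by a global
    # ordering and keep only the first |r| - ceil(t*|r|) + 1 of them.
    # Any two records with Jaccard >= t must share a prefix token.
    tokens = sorted(set(record), key=order.index)
    k = len(tokens) - math.ceil(threshold * len(tokens)) + 1
    return tokens[:k]

def detect_duplicates(records, threshold):
    # "Map": emit (signature token, record id); the shuffle groups ids
    # by token, so fewer signatures mean less intermediate data.
    order = sorted({t for r in records for t in r})  # stand-in ordering
    buckets = defaultdict(set)
    for rid, rec in enumerate(records):
        for sig in prefix_signatures(rec, order, threshold):
            buckets[sig].add(rid)
    # "Reduce": records in the same bucket form candidate pairs.
    candidates = set()
    for ids in buckets.values():
        candidates |= set(combinations(sorted(ids), 2))
    # Verify each candidate exactly, so no duplicate pair is lost.
    return {(i, j) for i, j in candidates
            if jaccard(records[i], records[j]) >= threshold}

records = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "y", "z"]]
print(detect_duplicates(records, 0.6))  # {(0, 1)}
```

Here records 0 and 1 share the signature tokens "a" and "b" and thus become a candidate pair (Jaccard 3/5 = 0.6), while record 2 never meets them, so no pair involving it is ever shuffled or verified. Pruning strategies in this spirit shrink both the candidate set and the intermediate key-value volume that dominates the shuffle cost.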
