As the recent proliferation of social networks, mobile applications, and online services increased the rate of data gathering, to find near-duplicate records efficiently has become a challenging issue. Related works on this problem mainly aim to propose efficient approaches on a single machine. However, when processing large-scale dataset, the performance to identify duplicates is still far from satisfactory. In this paper, we try to handle the problem of duplicate detection applying MapReduce. We argue that the performance of utilizing MapReduce to detect duplicates mainly depends on the number of candidate record pairs and intermediate result size, which is related to the shuffle cost among different nodes in cluster. In this paper, we proposed a new signature scheme with new pruning strategies to minimize the number of candidate pairs and intermediate result size. The proposed solution is an exact one, which assures none duplicate record pair can be lost. The experimental results over both real and synthetic datasets demonstrate that our proposed signature-based method is efficient and scalable. Copyright © 2012 John Wiley & Sons, Ltd.
If you can't find a tool you're looking for, please click the link at the top of the page to "Go to old article view". Alternatively, view our Knowledge Base articles for additional help. Your feedback is important to us, so please let us know if you have comments or ideas for improvement.