A communication efficient probabilistic algorithm for mining frequent itemsets from a peer-to-peer network

Authors


Abstract

Data intensive large-scale distributed systems like peer-to-peer (P2P) networks are becoming increasingly popular where centralization of data is impossible for mining and analysis. Unfortunately, most of the existing data mining algorithms work only when data can be accessed in its entirety. Finding all the network-wide frequent itemsets is computationally difficult and usually has large communication overhead in such environment. This paper focuses on developing a communication efficient algorithm for discovering frequent itemsets from a P2P network. A sampling-based approach is adopted to find approximate solution instead of an exact solution with probabilistic guarantee. The benefit of approximation technique is reflected in the low communication overhead in discovering majority of frequent itemsets with probabilistic guarantee. The main principal followed by the algorithm assumes that an independent and identically distributed (iid) sample of the entire data is available at one location to generate a set of candidate itemsets. Collecting iid sample from a P2P network is a challenging problem because of varying degrees of connectivity and sizes of data shared. The paper first addresses this issue and shows how an iid sample of nodes and data can be collected from a P2P network using random walk. It applies the proposed sampling technique to identify most of the frequent itemsets from a P2P network. Theoretical analysis shows how to decide about optimum sample size and minimize communication to compute the results. Experimental results show that the proposed algorithm discovers all of the network-wide frequent itemsets using communication that scales sublinearly with network and datasize. © 2009 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 2: 48-69, 2009

Ancillary