登入
選單
返回
Google圖書搜尋
Efficient Parallel Data Mining for Association Rules
Thomas J. Watson IBM Research Center
Jong-Soo Park
Ming-Syan Chen
Philip S. Yu
出版
IBM Thomas J. Watson Research Division
, 1995
URL
http://books.google.com.hk/books?id=xJ2_GwAACAAJ&hl=&source=gbs_api
註釋
Abstract: "In this paper, we explore the problem of parallel data mining for association rules. It is noted that database mining in general requires progressive knowledge collection and analysis based on a very large transaction database. When the transaction database is partitioned across a large number of nodes in a parallel database environment, the amount of inter-node data transmission required for reaching global decisions can be prohibitively large, thus significantly compromising the benefit achievable from parallelization. To address this issue, we develop in this paper an algorithm, called PDM, to conduct parallel data mining for association rules. Consider a transaction as a collection of items, and a large itemset is a set of items such that the number of transactions containing it exceeds a pre-specified threshold. PDM is so designed that the global set of large itemsets can be identified efficiently and the amount of inter-node data exchange required is minimized. Specifically, with a given database partition, each processing node will collect (count) information on each itemset from its local database efficiently via a hashing method. (Since only seeing a portion of the database, each node is in general not able to identify large itemsets by itself.) The information discovered by each node is next shared with other nodes via some communication schemes. Then, PDM employs a technique, called clue- and-poll, to address the uncertainty due to the partial knowledge collected at each node by judiciously selecting a small fraction of the itemsets for the exchange of count information among nodes, thus reducing the communication cost. The global set of large itemsets can hence be determined based on the aggregate count of itemsets. Two communication schemes are investigated: one is based on all-to-all broadcast and the other is based on global receiving and polling. An extensive performance study is conducted to evaluate PDM via simulations. It is experimentally shown that PDM not only attains very good parallelization efficiencies, but also provides robust performance for various input patterns."