An Efficient Spark-Based Hybrid Frequent Itemset Mining Algorithm for Big Data
Abstract
1. Introduction
2. Related Work
2.1. Horizontal Layout-Based Algorithms
2.2. Vertical Layout-Based Algorithms
2.3. Tree-Based Algorithms
3. Preliminaries
3.1. Definition (Pattern)
3.2. Definition (Frequency of a Pattern)
3.3. Definition (Itemset)
3.4. Definition (Frequent Itemset)
3.5. Definition (Tidset)
3.6. Definition (Diffset)
4. SHFIM Proposed Framework
4.1. Singletons Extraction Phase
Algorithm 1. Discover frequent patterns using SHFIM
Input: D: dataset of transactions; min_sup: minimum support threshold.
Output: frequent itemsets: list of frequent itemsets.
dataRDD ← read data from HDFS
singletons ← getFirstFrequentItemset(dataRDD, min_sup)
if singletons = ∅ then
    system_exit()
end if
singletonsList ← broadcast(singletons)
secondFrequentTidItemsetRDD ← findPairsBloomFilter(dataRDD, singletonsList, min_sup)
hasCoverage ← false
k ← 2
kFrequentItemsetRDD ← secondFrequentTidItemsetRDD
while hasCoverage = false do
    candidateItemsets ← generatekCandidates(kFrequentItemsetRDD)
    if candidateItemsets = ∅ then
        hasCoverage ← true
    else
        k ← k + 1
        candidateItemsetsBC ← broadcast(candidateItemsets)
        assignedCandidatesToItemsetRDD ← assignCandidateToItemset(candidateItemsetsBC, kFrequentItemsetRDD)
        if k = 3 then
            kFrequentItemsetRDD ← getDiffsetFromTid(assignedCandidatesToItemsetRDD, min_sup)
        else
            kFrequentItemsetRDD ← getDiffsetFromDiff(assignedCandidatesToItemsetRDD, min_sup)
        end if
    end if
end while
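
The control flow of Algorithm 1 can be sketched on a single machine as follows. This is a minimal illustration under stated assumptions, not the authors' Spark implementation: plain Python dictionaries stand in for RDDs and broadcast variables, an ordinary set lookup stands in for the Bloom-filter step, and itemsets that share a common prefix are joined directly instead of broadcasting a separate candidate list. The function names (`frequent_singletons`, `frequent_pairs_with_tidsets`, `extend`, `shfim_sketch`) are hypothetical. The support arithmetic uses the standard diffset identities of dEclat [18]: for itemsets PX and PY with a shared prefix P, d(PXY) = t(PX) − t(PY) when starting from tidsets, d(PXY) = d(PY) − d(PX) when starting from diffsets, and σ(PXY) = σ(PX) − |d(PXY)| in both cases.

```python
from collections import Counter, defaultdict
from itertools import combinations


def frequent_singletons(transactions, min_sup):
    """Phase 1: count each item once per transaction, keep the frequent ones."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c for item, c in counts.items() if c >= min_sup}


def frequent_pairs_with_tidsets(transactions, singletons, min_sup):
    """Phase 2: build tidsets for pairs of frequent items (vertical layout).
    Assumption: a plain set lookup stands in for the Bloom filter."""
    tidsets = defaultdict(set)
    for tid, t in enumerate(transactions, start=1):
        kept = sorted(set(t) & singletons)
        for pair in combinations(kept, 2):
            tidsets[pair].add(tid)
    return {p: tids for p, tids in tidsets.items() if len(tids) >= min_sup}


def extend(level, min_sup, from_tidsets):
    """Phase 3: join itemsets PX and PY that share the prefix P.
    `level` maps itemset -> (support, tidset or diffset)."""
    groups = defaultdict(list)
    for itemset in sorted(level):
        groups[itemset[:-1]].append(itemset)
    next_level = {}
    for members in groups.values():
        for px, py in combinations(members, 2):
            sup_px, set_px = level[px]
            _, set_py = level[py]
            # k = 3: d(PXY) = t(PX) - t(PY); k > 3: d(PXY) = d(PY) - d(PX)
            diff = set_px - set_py if from_tidsets else set_py - set_px
            support = sup_px - len(diff)
            if support >= min_sup:
                next_level[px + (py[-1],)] = (support, diff)
    return next_level


def shfim_sketch(transactions, min_sup):
    """Return {itemset tuple: support} for every frequent itemset."""
    singles = frequent_singletons(transactions, min_sup)
    if not singles:
        return {}
    result = {(item,): sup for item, sup in singles.items()}
    pairs = frequent_pairs_with_tidsets(transactions, set(singles), min_sup)
    level = {p: (len(tids), tids) for p, tids in pairs.items()}
    k = 2  # the current level holds frequent k-itemsets
    while level:
        result.update({itemset: sup for itemset, (sup, _) in level.items()})
        level = extend(level, min_sup, from_tidsets=(k == 2))
        k += 1
    return result


# The four example transactions used in this article's TID/Items table.
example = [
    ["A", "B", "C", "D", "F"],
    ["A", "C", "D", "F"],
    ["B", "C", "D", "F"],
    ["A", "C", "D"],
]
frequent = shfim_sketch(example, min_sup=3)
for itemset in sorted(frequent, key=lambda s: (len(s), s)):
    print(itemset, frequent[itemset])
```

With a minimum support of 3 on those four transactions, the sketch keeps the items A, C, D, and F, and the largest frequent itemsets it reports are {A, C, D} and {C, D, F}, each with support 3. Switching from tidsets to diffsets after the pair level mirrors the `k = 3` branch of Algorithm 1.
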
4.2. Frequent 2-Itemsets in a Vertical Layout Phase
4.3. Frequent k-Itemset Extraction Phase
5. Performance Evaluation
5.1. Dataset
5.2. Experiment and Result
5.2.1. T10I4D100K Dataset
5.2.2. Retail Dataset
5.2.3. Mushroom Dataset
5.2.4. Chess Dataset
6. Discussion
7. Time Complexity
8. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Jiawei, H.; Kamber, M. Data Mining Concepts and Techniques, 550. Available online: https://www.researchgate.net/publication/235902451_Data_Mining_Concept_and_Techniques (accessed on 13 December 2021).
2. Apiletti, D.; Baralis, E.; Cerquitelli, T.; Garza, P.; Pulvirenti, F.; Venturini, L. Frequent Itemsets Mining for Big Data: A Comparative Analysis. Big Data Res. 2017, 9, 67–83.
3. Big Data Tutorial|All You Need to Know about Big Data|Edureka. Available online: https://www.edureka.co/blog/big-data-tutorial (accessed on 4 January 2022).
4. Landset, S.; Khoshgoftaar, T.M.; Richter, A.N.; Hasanin, T. A survey of open source tools for machine learning with big data in the Hadoop ecosystem. J. Big Data 2015, 2, 24.
5. Tai, D.D.; Huynh, V.N. K-PbC: An Improved Cluster Center Initialization for Categorical Data Clustering. Applied Intelligence 2020, 50, 2610–2632.
6. Naulaerts, S.; Meysman, P.; Bittremieux, W.; Vu, T.N.; Berghe, W.V.; Goethals, B.; Laukens, K. A primer to frequent itemset mining for bioinformatics. Brief. Bioinform. 2015, 16, 216–231.
7. Ilayaraja, M.; Meyyappan, T. Efficient Data Mining Method to Predict the Risk of Heart Diseases through Frequent Itemsets. Procedia Comput. Sci. 2015, 70, 586–592.
8. Loshin, D. Knowledge Discovery and Data Mining for Predictive Analytics. Bus. Intell. 2013, 271–286.
9. Luna, J.M.; Fournier-Viger, P.; Ventura, S. Frequent itemset mining: A 25 years review. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2019, 9, e1329.
10. Apiletti, D.; Baralis, E.; Cerquitelli, T.; Chiusano, S.; Grimaudo, L. SeaRum: A Cloud-Based Service for Association Rule Mining. In Proceedings of the 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, Washington, DC, USA, 16–18 July 2013; pp. 1283–1290.
11. Gao, C.; Tung, A.K.H.; Xu, X.; Pan, F.; Yang, J. FARMER: Finding interesting rule groups in microarray datasets. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France, 13–18 June 2004; p. 143.
12. Tania, C.; Di Corso, E. Characterizing Thermal Energy Consumption through Exploratory Data Mining Algorithms. 2016. Available online: https://iris.polito.it/handle/11583/2639284 (accessed on 9 January 2022).
13. Antonie, M.; Zaiane, O.R.; Coman, A. Application of Data Mining Techniques for Medical Image Classification. In Proceedings of the Second International Conference on Multimedia Data Mining, San Francisco, CA, USA, 26 August 2001.
14. Rakesh, A.; Srikant, R. Fast Algorithms for Mining Association Rules. Available online: https://dl.acm.org/doi/10.5555/645920.672836 (accessed on 9 January 2022).
15. Apriori Algorithm—GeeksforGeeks. Available online: https://www.geeksforgeeks.org/apriori-algorithm/ (accessed on 4 January 2022).
16. Zaki, M. Scalable algorithms for association mining. IEEE Trans. Knowl. Data Eng. 2000, 12, 372–390.
17. ML|ECLAT Algorithm—GeeksforGeeks. Available online: https://www.geeksforgeeks.org/ml-eclat-algorithm/ (accessed on 4 January 2022).
18. Zaki, M.J.; Gouda, K. Fast vertical mining using diffsets. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining—KDD ’03, Washington, DC, USA, 24–27 August 2003; pp. 326–335.
19. Rao, T.R.; Mitra, P.; Bhatt, R.; Goswami, A. The big data system, components, tools, and technologies: A survey. Knowl. Inf. Syst. 2019, 60, 1165–1245.
20. Big Data Analysis Using Apache Hadoop. Available online: https://www.researchgate.net/publication/261309523_Big_data_analysis_using_Apache_Hadoop (accessed on 29 November 2020).
21. Apache Hadoop. Available online: http://hadoop.apache.org/ (accessed on 28 November 2020).
22. Weets, J.-F.; Kakhani, M.K.; Kumar, A. Limitations and challenges of HDFS and MapReduce. In Proceedings of the 2015 International Conference on Green Computing and Internet of Things (ICGCIoT), NW Washington, DC, USA, 8–15 October 2015; IEEE: Manhattan, NY, USA, 2015; pp. 545–549.
23. Frequent Pattern Mining—RDD-Based API—Spark 2.2.0 Documentation. Available online: https://spark.apache.org/docs/2.2.0/mllib-frequent-pattern-mining.html (accessed on 23 December 2020).
24. Salloum, S.; Dautov, R.; Chen, X.; Peng, P.X.; Huang, J.Z. Big data analytics on Apache Spark. Int. J. Data Sci. Anal. 2016, 1, 145–164.
25. Frequent Pattern Mining—Spark 3.0.1 Documentation. Available online: https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html (accessed on 22 December 2020).
26. Cai, B.Z.; Zhu, X.; Zheng, Y.; Liu, D.; Xu, L. A Caching-Based Parallel FP-Growth in Apache Spark; Springer International Publishing: Cham, Switzerland, 2018.
27. BloomFilter (Spark 2.1.0 JavaDoc). Available online: https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/util/sketch/BloomFilter.html (accessed on 27 May 2021).
28. Raj, S.; Ramesh, D.; Sreenu, M.; Sethi, K.K. EAFIM: Efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data. Knowl. Inf. Syst. 2020, 62, 3565–3583.
29. Rathee, S.; Kashyap, A. Adaptive-Miner: An efficient distributed association rule mining algorithm on Spark. J. Big Data 2018, 5, 6.
30. Sethi, K.K.; Ramesh, D. HFIM: A Spark-based hybrid frequent itemset mining algorithm for big data processing. J. Supercomput. 2017, 73, 3652–3668.
31. Zhang, F.; Liu, M.; Gui, F.; Shen, W.; Shami, A.; Ma, Y. A distributed frequent itemset mining algorithm using Spark for Big Data analytics. Clust. Comput. 2015, 18, 1493–1501.
32. Li, H.; Wang, Y.; Zhang, D.; Zhang, M.; Chang, E.Y. RecSys ’08. In Proceedings of the 2008 ACM Conference on Recommender Systems, Lausanne, Switzerland, 23–25 October 2008; pp. 107–114.
33. Rathee, S.; Kaul, M.; Kashyap, A. R-Apriori. In Proceedings of the 8th Workshop on Ph.D. Workshop in Information and Knowledge Management; ACM Press: New York, NY, USA, 2015; pp. 27–34.
34. Qiu, H.; Gu, R.; Yuan, C.; Huang, Y. YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark. In Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, Phoenix, AZ, USA, 19–23 May 2014; pp. 1664–1671.
35. Huang, P.-Y.; Cheng, W.-S.; Chen, J.-C.; Chung, W.-Y.; Chen, Y.-L.; Lin, K.W. A Distributed Method for Fast Mining Frequent Patterns From Big Data. IEEE Access 2021, 9, 135144–135159.
36. Singh, P.; Singh, S.; Mishra, P.K.; Garg, R. RDD-Eclat: Approaches to Parallelize Eclat Algorithm on Spark RDD Framework. In Lecture Notes on Data Engineering and Communications Technologies; Springer: Berlin/Heidelberg, Germany, 2020; Volume 44, pp. 755–768.
37. Leung, C.K.; Zhang, H.; Souza, J.; Lee, W. Scalable Vertical Mining for Big Data Analytics of Frequent Itemsets. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer International Publishing: Cham, Switzerland, 2018; Volume 11029.
38. Liu, J.; Wu, Y.; Zhou, Q.; Fung, B.C.M.; Chen, F.; Yu, B. Parallel Eclat for Opportunistic Mining of Frequent Itemsets. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); International Conference on Database and Expert Systems Applications: Valencia, Spain, 2015; Volume 9261, pp. 401–415.
39. Moens, S.; Aksehirli, E.; Goethals, B. Frequent Itemset Mining for Big Data. 2013 IEEE Int. Conf. Big Data 2013, 111–118.
40. Ragaventhiran, J.; Kavithadevi, M. Map-optimize-reduce: CAN tree assisted FP-growth algorithm for clusters based FP mining on Hadoop. Futur. Gener. Comput. Syst. 2020, 103, 111–122.
41. Shi, X.; Chen, S.; Yang, H. DFPS: Distributed FP-growth algorithm based on Spark. In Proceedings of the 2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, China, 25–26 March 2017; IEEE: Manhattan, NY, USA, 2017; pp. 1725–1731.
42. Han, J.; Pei, J.; Yin, Y. Mining frequent patterns without candidate generation. ACM SIGMOD Rec. 2000, 29, 1–12.
43. Frequent Pattern Mining. Freq. Pattern Min. 2014, 9783319078212, 1–471.
44. Frequent Itemset Mining Dataset Repository. Available online: http://fimi.uantwerpen.be/data/ (accessed on 12 December 2020).
FIM Algorithm | Advantages | Disadvantages | Framework | Base Algorithm |
---|---|---|---|---|
BigFIM [39] | | | MapReduce | Apriori/Eclat |
Dist-Eclat [39] | | | MapReduce | Eclat |
YAFIM [34] | | | Spark | Apriori |
R-Apriori [33] | | | Spark | Apriori |
DFIMA [31] | | | Spark | Apriori |
PECLAT [38] | | | MapReduce | Eclat/dEclat |
HFIM [30] | | | Spark | Apriori |
DFPS [41] | | | Spark | FP-Growth |
Adaptive-Miner [29] | | | Spark | Apriori |
SVT [37] | | | Spark | Eclat/dEclat |
EAFIM [28] | | | Spark | Apriori |
RDD-Eclat [36] | | | Spark | Eclat |
Map-Optimize-Reduce [40] | | | MapReduce | FP-Growth |
Notation | Description |
---|---|
P | A pattern in D |
D | A dataset of transactions |
T | A transaction in D |
n | Number of items in D |
i | An item |
R | A set of records in D |
m | Number of records in D |
r | A record in D |
I | An itemset in D |
k | Number of items in I; also the iteration number |
σ | Support |
min_sup | Minimum support threshold |
t | Tidset of an itemset |
d | Diffset of an itemset |
fk | Frequent k-itemsets |
N | Number of itemsets among the frequent k-itemsets |
C | Candidate k-itemsets |
O | Big O notation |
kn | Maximum number of iterations |
TID | Items |
---|---|
1 | A, B, C, D, F |
2 | A, C, D, F |
3 | B, C, D, F |
4 | A, C, D |
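
The tidset and diffset definitions (Sections 3.5 and 3.6) can be checked directly against the four example transactions above. The following is a small sketch under the usual dEclat convention [18], in which the diffset of an itemset PX relative to its prefix P is d(PX) = t(P) − t(PX), so that σ(PX) = σ(P) − |d(PX)|. The helper names `tidset`, `diffset`, and `support_via_diffset` are hypothetical and introduced only for illustration.

```python
# Example transactions from the table above (TID -> items).
transactions = {
    1: {"A", "B", "C", "D", "F"},
    2: {"A", "C", "D", "F"},
    3: {"B", "C", "D", "F"},
    4: {"A", "C", "D"},
}


def tidset(itemset, db):
    """TIDs of the transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    return {tid for tid, items in db.items() if wanted <= items}


def diffset(prefix, item, db):
    """TIDs that contain `prefix` but not `prefix` extended with `item`."""
    return tidset(prefix, db) - tidset(set(prefix) | {item}, db)


def support_via_diffset(prefix, item, db):
    """Support of prefix + item computed as sigma(prefix) - |diffset|."""
    return len(tidset(prefix, db)) - len(diffset(prefix, item, db))


print("t(A)  =", sorted(tidset(["A"], transactions)))
print("t(AB) =", sorted(tidset(["A", "B"], transactions)))
print("d(AB) =", sorted(diffset(["A"], "B", transactions)))
print("support(AB) =", support_via_diffset(["A"], "B", transactions))
print("d(AC) =", sorted(diffset(["A"], "C", transactions)))
print("support(AC) =", support_via_diffset(["A"], "C", transactions))
```

For this data, t(A) = {1, 2, 4} and d(AB) = {2, 4}, so AB has support 3 − 2 = 1, whereas d(AC) is empty and AC keeps the full support of A. The second case shows why diffsets tend to be compact when items co-occur often: the more transactions the extension shares with its prefix, the fewer TIDs need to be stored.
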
Dataset | No. of Transactions | No. of Distinct Items | Density | Type |
---|---|---|---|---|
Mushroom (http://fimi.uantwerpen.be/data/mushroom.dat) (accessed on 1 January 2022) | 8124 | 119 | Dense | Real-life |
Chess (http://fimi.uantwerpen.be/data/chess.dat) (accessed on 1 January 2022) | 3196 | 75 | Dense | Synthetic |
T10I4D100K (http://fimi.uantwerpen.be/data/T10I4D100K.dat) (accessed on 1 January 2022) | 100,000 | 870 | Sparse | Real-life |
Retail (http://fimi.uantwerpen.be/data/retail.dat) (accessed on 1 January 2022) | 87,988 | 16,470 | Sparse | Real-life |
Dataset/Algorithm | Min_Sup | SHFIM | ECLAT | DECLAT |
---|---|---|---|---|
Mushroom | 90–30% | 39 | 81 | 49 |
Chess | 90–85% | 51 | 236 | 43 |
T10I4D100K | 7–0.6% | 7 | 121 | 260 |
Retail | 30–0.9% | 5 | 18 | 105 |
Dataset | T10I4D100K | T10I4D100K | Retail | Retail | Mushroom | Mushroom | Chess | Chess |
---|---|---|---|---|---|---|---|---|
Sparsity | Sparse | Sparse | Sparse | Sparse | Dense | Dense | Dense | Dense |
No. of Transactions | 100,000 | 100,000 | 87,988 | 87,988 | 8124 | 8124 | 3196 | 3196 |
Min_sup (%) | 0.6% | 0.7% | 0.4% | 0.5% | 30% | 40% | 85% | 90% |
Min_sup (transactions) | 600 | 700 | 352 | 440 | 2437 | 3250 | 2717 | 2876 |
No. of Frequent Itemsets | 772 | 603 | 831 | 580 | 2735 | 565 | 2669 | 622 |
Tidset size per frequent itemset | 600–100,000 | 700–100,000 | 352–87,988 | 440–87,988 | 2437–8124 | 3250–8124 | 2717–3196 | 2876–3196 |
Diffset size per frequent itemset | At least 2 to 3 times smaller than the tidset size (all columns) | | | | | | | |
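
The absolute minimum-support values in the table above follow from the percentage thresholds and the transaction counts. The short sketch below reproduces them; rounding to the nearest integer is an assumption inferred from the tabulated numbers (for example, 30% of 8124 is 2437.2, listed as 2437, while 85% of 3196 is 2716.6, listed as 2717), and the function name `absolute_min_sup` is hypothetical.

```python
def absolute_min_sup(num_transactions, min_sup_percent):
    """Convert a percentage threshold into a transaction count.
    Assumption: round to the nearest integer, which matches the table."""
    return round(num_transactions * min_sup_percent / 100)


# (number of transactions, min_sup in %, value listed in the table)
rows = [
    (100_000, 0.6, 600),
    (100_000, 0.7, 700),
    (87_988, 0.4, 352),
    (87_988, 0.5, 440),
    (8_124, 30, 2437),
    (8_124, 40, 3250),
    (3_196, 85, 2717),
    (3_196, 90, 2876),
]

for n, pct, listed in rows:
    computed = absolute_min_sup(n, pct)
    print(f"{n:>7} transactions at {pct}% -> {computed} (table: {listed})")
```

The same counts give the lower bound of each tidset-size range in the table, since a frequent itemset must appear in at least that many transactions.
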
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Al-Bana, M.R.; Farhan, M.S.; Othman, N.A. An Efficient Spark-Based Hybrid Frequent Itemset Mining Algorithm for Big Data. Data 2022, 7, 11. https://doi.org/10.3390/data7010011