What Is (Not) Big Data Based on Its 7Vs Challenges: A Survey
Abstract
1. Introduction
2. Data Mining versus KDD versus Big Data
2.1. Data Mining
- Apply clustering to partition data into groups of similar elements.
- Search for an explanatory or predictive pattern for a target attribute in terms of other attributes.
- Search for frequent patterns and sub-patterns.
- Search for trends, deviations and interesting correlations between attributes.
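As a rough illustration of the third task in the list above (searching for frequent patterns and sub-patterns), the following minimal sketch counts how often each item and item pair occurs in a set of transactions. It is a brute-force count, not Apriori or any method named by the survey, and the function name `frequent_itemsets`, the `min_support` parameter, and the toy `baskets` data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return itemsets (sorted tuples) found in at least min_support transactions.

    Brute-force counting over every combination up to max_size items;
    this is an illustrative assumption, not an optimised algorithm.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # ignore duplicates inside one transaction
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical toy data: four shopping baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]

patterns = frequent_itemsets(baskets, min_support=2)
for itemset, support in sorted(patterns.items()):
    print(itemset, support)
```

Raising `min_support` prunes the rarer sub-patterns first, which is the usual trade-off between the number of patterns reported and how representative each one is.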
2.1.1. Data Analysis versus Data Mining
2.1.2. Patterns in Data Mining
2.1.3. Classification in the Methods of Data Mining
2.1.4. Data Mining Applications
2.2. Knowledge Discovery in Databases
2.2.1. The Term KDD
2.2.2. Phases of KDD
- Data pre-processing: this phase has three steps [6].
  - Dimensionality reduction through feature selection and sampling of the data that is useful for the intended purpose, which reduces the number of variables to be considered [26].
  - Data cleaning to remove noise caused by differing data types, extreme values, and missing values that arise from default or non-compulsory fields [7].
- Choose the appropriate Data Mining task or method, such as classification, regression, clustering or summarisation [6].
- Choose the Data Mining algorithm by selecting the specific method to be used for pattern searching [6]. A data mining algorithm is a set of heuristic calculations and rules that allow a model to be created from data [31]. Examples include artificial neural networks, support vector machines, Bayesian networks, decision trees, and various clustering or regression algorithms. This phase is difficult because different algorithms can perform the same task, yet each may give a different output [31].
- Apply the chosen Data Mining algorithm [6].
- Evaluate and interpret the extracted patterns. This may require iterating over the previous phases again. In addition, this phase may involve visualising the extracted patterns or data [26].
- Apply the pattern found to another dataset for use and testing, and/or document the pattern [6].
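The phases listed above can be sketched end to end on a toy dataset. This is only an illustrative sketch under stated assumptions: dropping rows with missing values stands in for cleaning, dropping constant columns stands in for dimensionality reduction, and a nearest-centroid classifier is chosen arbitrarily as the algorithm. None of these choices, nor the helper names and data below, come from the survey.

```python
from statistics import mean, pvariance


def clean(rows):
    """Cleaning: drop rows whose feature vector contains a missing value (None)."""
    return [(x, y) for x, y in rows if None not in x]


def select_features(rows, min_variance=1e-9):
    """Dimensionality reduction: keep column indices whose variance is non-trivial."""
    n_cols = len(rows[0][0])
    return [j for j in range(n_cols)
            if pvariance([x[j] for x, _ in rows]) > min_variance]


def project(x, keep):
    """Restrict a feature vector to the selected columns."""
    return [x[j] for j in keep]


def fit_centroids(rows, keep):
    """Apply the chosen algorithm: one mean vector (centroid) per class label."""
    by_label = {}
    for x, y in rows:
        by_label.setdefault(y, []).append(project(x, keep))
    return {y: [mean(col) for col in zip(*xs)] for y, xs in by_label.items()}


def predict(centroids, x, keep):
    """Assign the label of the closest centroid (squared Euclidean distance)."""
    p = project(x, keep)
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(p, centroids[y])))


def accuracy(centroids, rows, keep):
    """Evaluation: fraction of held-out rows that are labelled correctly."""
    hits = sum(predict(centroids, x, keep) == y for x, y in rows)
    return hits / len(rows)


# Hypothetical toy data: (features, label). Column 1 is constant, one row is incomplete.
train = [
    ([1.0, 5.0, 0.2], "a"),
    ([1.2, 5.0, 0.1], "a"),
    ([None, 5.0, 0.3], "a"),
    ([3.9, 5.0, 2.1], "b"),
    ([4.1, 5.0, 1.9], "b"),
]
held_out = [([1.1, 5.0, 0.0], "a"), ([4.0, 5.0, 2.0], "b")]

cleaned = clean(train)                 # pre-processing: cleaning
keep = select_features(cleaned)        # pre-processing: dimensionality reduction
model = fit_centroids(cleaned, keep)   # apply the chosen algorithm
score = accuracy(model, held_out, keep)  # evaluate on another dataset
print(keep, score)
```

If the score were poor, the evaluation step would send the analyst back to an earlier phase, for example to a different cleaning rule or a different algorithm, which is the iteration the phases describe.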
2.2.3. KDD Applications
2.3. Big Data
2.3.1. Definitions of Companies and Academics
2.3.2. Big Data Applications
3. Literature Review
3.1. Methodology
- IEEE: IEEE Xplore (https://ieeexplore.ieee.org/Xplore/home.jsp, accessed on 29 November 2022).
- SD: ScienceDirect–Elsevier (http://www.elsevier.com, accessed on 29 November 2022).
- Wiley (https://onlinelibrary.wiley.com/, accessed on 29 November 2022).
3.2. Articles by Database and Type
- 3V: 2013 to 2022: 27 Conferences, 3 Magazines, 1 Journal.
  - 2 conferences have been removed because they are about other fields.
- 4V: 2014 to 2022: 25 Conferences, 1 Magazine, 1 Book, 2 Journals.
  - 3 conferences have been removed because they are about other fields, and 1 more for being an editorial.
- 5V: 2014 to 2022: 18 Conferences, 3 Journals.
  - 3 conferences, 2 journals, and 1 magazine have been removed because they are about other fields.
- 6V: 2020: 1 Magazine.
  - 1 conference talks about 10 ‘Vs’.
- 7V: 2014 to 2021: 3 Conferences.
- 8V: 2019: 1 Conference.
  - 3 have been removed because they are about other fields.
- 9V: 0.
- 10V: 2021: 1 Conference.
  - It has appeared in the 6 ‘Vs’ search.
- 3V: 2018 to 2019: 3 Research articles, 1 Book chapter.
- 4V: 2016 to 2019: 1 Review article, 5 Research articles.
- 5V: 2016 to 2022: 1 Review article, 1 Research article.
- 6V, 7V, 8V, and 9V: 0.
- 3V: 2017 to 2022: 2 Books, 1 Journal.
- 4V: 2019: 1 Journal.
  - 1 has been removed because it is about other fields.
- 5V: 2017: 1 Journal.
  - 3 have been removed: 1 is about Big Data but not about the ‘Vs’, and 2 are about other fields.
- 6V: 2022: 1 Book.
  - 1 has been removed because it is about Big Data but not about the ‘Vs’.
- 7V: 0.
  - 2 have been removed because they are Issues and not about Big Data.
- 8V: 0.
  - 1 has been removed because it is an Issue and not about Big Data.
- 9V: 0.
  - 1 has been removed because it is about Big Data but not about the ‘Vs’.
3.3. Results and Discussion
4. Challenges According to the ‘7Vs’ of Big Data
4.1. Volume
4.2. Velocity: Reading and Processing
4.3. Variety
4.4. Veracity: Origin, Veracity and Validity
4.5. Variability: Structure, Time Access and Format
4.6. Value
4.7. Visualisation
5. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- McAfee, A.; Brynjolfsson, E. Big data: The Management Revolution. Harv. Bus. Rev. 2012, 90, 60–68.
- Dijcks, J.-P. Oracle: Big Data for the Enterprise; Oracle: Redwood, CA, USA, 2013.
- Lavalle, S.; Lesser, E.; Shockley, R.; Hopkins, M.S.; Kruschwitz, N. Big Data, Analytics and the Path from Insights to Value. MIT Sloan Manag. Rev. 2011, 52, 21–31.
- Chen, H.; Chiang, R.H.L.; Storey, V.C. Business Intelligence and Analytics: From Big Data to Big Impact. MIS Q. 2012, 36, 1165–1188.
- Menzies, T.; Hu, Y. Data mining for very busy people. Computer 2003, 36, 22–29.
- Rokach, L.; Maimon, O. Data Mining with Decision Trees: Theory and Applications; World Scientific Publishing Co. Pte Ltd.: Danvers, MA, USA, 2007; ISBN 9789812771711.
- Frawley, W.J.; Piatetsky-Shapiro, G.; Matheus, C.J. Knowledge Discovery in Databases: An Overview. AI Mag. 1992, 13, 57–70.
- Fan, W.; Bifet, A. Mining Big Data: Current Status, and Forecast to the Future. ACM SIGKDD Explor. Newsl. 2013, 14, 1.
- Letouzé, E. Big Data for Development: Challenges & Opportunities. 2012. Available online: https://unstats.un.org/unsd/trade/events/2014/beijing/documents/globalpulse/Big%20Data%20for%20Development%20-%20UN%20Global%20Pulse%20-%20June2012.pdf (accessed on 27 October 2022).
- Witten, I.H.; Frank, E.; Hall, M.A. Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed.; Morgan Kaufmann: Burlington, MA, USA, 2007; ISBN 9780123748560.
- Cloud Security Alliance. Top Ten Big Data Security and Privacy Challenges. 2012. Available online: https://downloads.cloudsecurityalliance.org/initiatives/bdwg/Big_Data_Top_Ten_v1.pdf (accessed on 27 October 2022).
- Nti, I.K.; Quarcoo, J.A.; Aning, J.; Fosu, G.K. A mini-review of machine learning in big data analytics: Applications, challenges, and prospects. Big Data Min. Anal. 2022, 5, 81–97.
- The Apache Software Foundation. Apache™ Hadoop®. Available online: http://hadoop.apache.org/ (accessed on 27 October 2022).
- Ahrens, J.; Hendrickson, B.; Long, G.; Miller, S.; Ross, R.; Williams, D. Data-Intensive Science in the US DOE: Case Studies and Future Challenges. Comput. Sci. Eng. 2011, 13, 14–24.
- Manyika, J.; Chui, M.; Brown, B.; Bughin, J.; Dobbs, R.; Roxburgh, C.; Byers, A.H. Big Data: The Next Frontier for Innovation, Competition, and Productivity; McKinsey Global Institute: Washington, DC, USA, 2011.
- Mervis, J. Agencies Rally to Tackle Big Data. Science 2012, 336, 22.
- Bello-Orgaz, G.; Jung, J.J.; Camacho, D. Social big data: Recent achievements and new challenges. Inf. Fusion 2016, 28, 45–59.
- Greiner, L. What is Data Analysis and Data Mining? Available online: https://www.dbta.com/Editorial/Trends-and-Applications/What-is-Data-Analysis-and-Data-Mining-73503.aspx (accessed on 27 October 2022).
- Friedman, J.H. Data Mining and Statistics: What’s the connection? Comput. Sci. Stat. 1998, 29, 3–9.
- Manaris, B. Natural Language Processing: A Human-Computer Interaction Perspective. In Advances in Computers; Elsevier: Amsterdam, The Netherlands, 1998; Volume 47, pp. 1–66. ISBN 9780120121472.
- Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. The KDD process for extracting useful knowledge from volumes of data. Commun. ACM 1996, 39, 27–34.
- Assunção, M.D.; Calheiros, R.N.; Bianchi, S.; Netto, M.A.S.; Buyya, R. Big Data computing and clouds: Trends and future directions. J. Parallel Distrib. Comput. 2015, 79–80, 3–15.
- Leskovec, J.; Rajaraman, A.; Ullman, J.D. Mining of Massive Datasets; Cambridge University Press: Cambridge, UK, 2014; ISBN 9781139058452.
- Labrinidis, A.; Jagadish, H.V. Challenges and opportunities with big data. Proc. VLDB Endow. 2012, 5, 2032–2033.
- Piatetsky-Shapiro, G. From Data Mining to Big Data and Beyond. Available online: https://www.kdnuggets.com/2012/04/from-data-mining-to-big-data-and-beyond.html (accessed on 27 October 2022).
- Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. From Data Mining to Knowledge Discovery in Databases. AI Mag. 1996, 17, 37–54.
- Ha, S.H.; Park, S.C. Application of data mining tools to hotel data mart on the Intranet for database marketing. Expert Syst. Appl. 1998, 15, 1–31.
- Buxton, B.; Hayward, V.; Pearson, I.; Kärkkäinen, L.; Greiner, H.; Dyson, E.; Ito, J.; Chung, A.; Kelly, K.; Schillace, S. Big data: The next Google. Nature 2008, 455, 8–9.
- NIST Big Data Public Working Group: Reference Architecture Subgroup. NIST Big Data Interoperability Framework: Volume 5, Architectures White Paper Survey; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2015; Volume 5.
- Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R. Advances in Knowledge Discovery and Data Mining; The MIT Press: Cambridge, MA, USA, 1996; ISBN 9780262560979.
- Data Mining Algorithms (Analysis Services—Data Mining). Available online: https://msdn.microsoft.com/en-us/library/ms175595.aspx (accessed on 27 October 2022).
- Hand, D.J. Discrimination and Classification; John Wiley and Sons Inc.: New York, NY, USA, 1981; Volume 1, ISBN 9780471280484.
- González García, C.; Núñez-Valdez, E.R.; García-Díaz, V.; Pelayo G-Bustelo, C.; Cueva Lovelle, J.M. A Review of Artificial Intelligence in the Internet of Things. Int. J. Interact. Multimed. Artif. Intell. 2019, 5, 9–20.
- Wang, M.; Sheng, L.; Zhou, D.; Chen, M. A Feature Weighted Mixed Naive Bayes Model for Monitoring Anomalies in the Fan System of a Thermal Power Plant. IEEE/CAA J. Autom. Sin. 2022, 9, 719–727.
- He, W.; He, Y.; Li, B.; Zhang, C. A Naive-Bayes-Based Fault Diagnosis Approach for Analog Circuit by Using Image-Oriented Feature Extraction and Selection Technique. IEEE Access 2020, 8, 5065–5079.
- Xue, Z.; Wei, J.; Guo, W. A Real-Time Naive Bayes Classifier Accelerator on FPGA. IEEE Access 2020, 8, 40755–40766.
- Sanchis, A.; Juan, A.; Vidal, E. A Word-Based Naïve Bayes Classifier for Confidence Estimation in Speech Recognition. IEEE Trans. Audio. Speech. Lang. Process. 2011, 20, 565–574.
- Shirakawa, M.; Nakayama, K.; Hara, T.; Nishio, S. Wikipedia-Based Semantic Similarity Measurements for Noisy Short Texts Using Extended Naive Bayes. IEEE Trans. Emerg. Top. Comput. 2015, 3, 205–219.
- Kustanto, N.S.; Nurma Yulita, I.; Sarathan, I. Sentiment Analysis of Indonesia’s National Health Insurance Mobile Application using Naïve Bayes Algorithm. In Proceedings of the 2021 International Conference on Artificial Intelligence and Big Data Analytics, Bandung, Indonesia, 27–29 October 2021; pp. 38–42.
- Castro, W.; De-la-Torre, M.; Avila-George, H.; Torres-Jimenez, J.; Guivin, A.; Acevedo-Juárez, B. Amazonian cacao-clone nibs discrimination using NIR spectroscopy coupled to naïve Bayes classifier and a new waveband selection approach. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2022, 270, 120815.
- Yoshikawa, H. Can naive Bayes classifier predict infection in a close contact of COVID-19? A comparative test for predictability of the predictive model and healthcare workers in Japan. J. Infect. Chemother. 2022, 28, 774–779.
- Bhatia, S.; Malhotra, J. Naïve Bayes Classifier for Predicting the Novel Coronavirus. In Proceedings of the 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), Tirunelveli, India, 4–6 February 2021; pp. 880–883.
- Shanbehzadeh, M.; Nopour, R.; Kazemi-Arpanahi, H. Using decision tree algorithms for estimating ICU admission of COVID-19 patients. Inform. Med. Unlocked 2022, 30, 100919.
- Ghane, M.; Ang, M.C.; Nilashi, M.; Sorooshian, S. Enhanced decision tree induction using evolutionary techniques for Parkinson’s disease classification. Biocybern. Biomed. Eng. 2022, 42, 902–920.
- Elhazmi, A.; Al-Omari, A.; Sallam, H.; Mufti, H.N.; Rabie, A.A.; Alshahrani, M.; Mady, A.; Alghamdi, A.; Altalaq, A.; Azzam, M.H.; et al. Machine learning decision tree algorithm role for predicting mortality in critically ill adult COVID-19 patients admitted to the ICU. J. Infect. Public Health 2022, 15, 826–834.
- Hiranuma, M.; Kobayashi, D.; Yokota, K.; Yamamoto, K. Chi-square automatic interaction detector decision tree analysis model: Predicting cefmetazole response in intra-abdominal infection. J. Infect. Chemother. 2023, 29, 7–14.
- Alex, S.; Dhanaraj, K.J.; Deepthi, P.P. Private and Energy-Efficient Decision Tree-Based Disease Detection for Resource-Constrained Medical Users in Mobile Healthcare Network. IEEE Access 2022, 10, 17098–17112.
- Wang, X.; Liu, F. Data-Driven Relay Selection for Physical-Layer Security: A Decision Tree Approach. IEEE Access 2020, 8, 12105–12116.
- Kuang, W.; Chan, Y.-L.; Tsang, S.-H.; Siu, W.-C. Machine Learning-Based Fast Intra Mode Decision for HEVC Screen Content Coding via Decision Trees. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 1481–1496.
- Chen, Y.; Mao, Q.; Wang, B.; Duan, P.; Zhang, B.; Hong, Z. Privacy-Preserving Multi-Class Support Vector Machine Model on Medical Diagnosis. IEEE J. Biomed. Health Inform. 2022, 26, 3342–3353.
- Lei, H.; Guoxing, Y.; Chao, H. A sparse algorithm for adaptive pruning least square support vector regression machine based on global representative point ranking. J. Syst. Eng. Electron. 2021, 32, 151–162.
- Astuti, S.D.; Tamimi, M.H.; Pradhana, A.A.S.; Alamsyah, K.A.; Purnobasuki, H.; Khasanah, M.; Susilo, Y.; Triyana, K.; Kashif, M.; Syahrom, A. Gas sensor array to classify the chicken meat with E. coli contaminant by using random forest and support vector machine. Biosens. Bioelectron. X 2021, 9, 100083.
- Pang, J.; Pu, X.; Li, C. A Hybrid Algorithm Incorporating Vector Quantization and One-Class Support Vector Machine for Industrial Anomaly Detection. IEEE Trans. Ind. Inform. 2022, 18, 8786–8796.
- Bernardini, M.; Romeo, L.; Misericordia, P.; Frontoni, E. Discovering the Type 2 Diabetes in Electronic Health Records Using the Sparse Balanced Support Vector Machine. IEEE J. Biomed. Health Inform. 2020, 24, 235–246.
- Ali Hammouri, Z.A.; Delgado, M.F.; Cernadas, E.; Barro, S. Fast SVC for large-scale classification problems. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 1.
- Azgomi, H.; Haredasht, F.R.; Safari Motlagh, M.R. Diagnosis of some apple fruit diseases by using image processing and artificial neural network. Food Control 2023, 145, 109484.
- Zhu, H.; Jiao, L.; Ma, W.; Liu, F.; Zhao, W. A Novel Neural Network for Remote Sensing Image Matching. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 2853–2865.
- Qin, C.; Schlemper, J.; Caballero, J.; Price, A.N.; Hajnal, J.V.; Rueckert, D. Convolutional Recurrent Neural Networks for Dynamic MR Image Reconstruction. IEEE Trans. Med. Imaging 2019, 38, 280–290.
- Wu, N.; Phang, J.; Park, J.; Shen, Y.; Huang, Z.; Zorin, M.; Jastrzebski, S.; Fevry, T.; Katsnelson, J.; Kim, E.; et al. Deep Neural Networks Improve Radiologists’ Performance in Breast Cancer Screening. IEEE Trans. Med. Imaging 2020, 39, 1184–1194.
- Dong, X.; Zhou, Y.; Wang, L.; Peng, J.; Lou, Y.; Fan, Y. Liver Cancer Detection Using Hybridized Fully Convolutional Neural Network Based on Deep Learning Framework. IEEE Access 2020, 8, 129889–129898.
- Ulloa-Cazarez, R.L.; Garcia-Diaz, N.; Soriano-Equigua, L. Multi-layer Adaptive Fuzzy Inference System for Predicting Student Performance in Online Higher Education. IEEE Lat. Am. Trans. 2021, 19, 98–106.
- Ibragimov, B.; Toesca, D.A.S.; Yuan, Y.; Koong, A.C.; Chang, D.T.; Xing, L. Neural Networks for Deep Radiotherapy Dose Analysis and Prediction of Liver SBRT Outcomes. IEEE J. Biomed. Health Inform. 2019, 23, 1821–1833.
- Haghighat, M.H.; Li, J. Intrusion detection system using voting-based neural network. Tsinghua Sci. Technol. 2021, 26, 484–495.
- Wisanwanichthan, T.; Thammawichai, M. A Double-Layered Hybrid Approach for Network Intrusion Detection System Using Combined Naive Bayes and SVM. IEEE Access 2021, 9, 138432–138450.
- Gu, J.; Lu, S. An effective intrusion detection approach using SVM with naïve Bayes feature embedding. Comput. Secur. 2021, 103, 102158.
- Li, M.; Vanberkel, P.; Zhong, X. Predicting ambulance offload delay using a hybrid decision tree model. Socioecon. Plann. Sci. 2022, 80, 101146.
- Feng, X.; Zhou, Y.; Hua, T.; Zou, Y.; Xiao, J. Contact temperature prediction of high voltage switchgear based on multiple linear regression model. In Proceedings of the 2017 32nd Youth Academic Annual Conference of Chinese Association of Automation (YAC), Hefei, China, 19–21 May 2017; pp. 277–280.
- Li, S.; Song, P.; Zhang, W. Transferable discriminant linear regression for cross-corpus speech emotion recognition. Appl. Acoust. 2022, 197, 108919.
- Huang, L.; Song, T.; Jiang, T. Linear regression combined KNN algorithm to identify latent defects for imbalance data of ICs. Microelectron. J. 2022, 131, 105641.
- Duan, J.; Chang, M.; Chen, X.; Wang, W.; Zuo, H.; Bai, Y.; Chen, B. A combined short-term wind speed forecasting model based on CNN–RNN and linear regression optimization considering error. Renew. Energy 2022, 200, 788–808.
- Abbas, S.A.; Aslam, A.; Rehman, A.U.; Abbasi, W.A.; Arif, S.; Kazmi, S.Z.H. K-Means and K-Medoids: Cluster Analysis on Birth Data Collected in City Muzaffarabad, Kashmir. IEEE Access 2020, 8, 151847–151855.
- Rong, Y.; Liu, Y. Staged text clustering algorithm based on K-means and hierarchical agglomeration clustering. In Proceedings of the 2020 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), Dalian, China, 27–29 June 2020; pp. 124–127.
- Jeong, W.; Yu, U. Effects of quadrilateral clustering on complex contagion. Chaos Solitons Fractals 2022, 165, 112784.
- Bhagat, H.V.; Singh, M. DPCF: A framework for imputing missing values and clustering data in drug discovery process. Chemom. Intell. Lab. Syst. 2022, 231, 104686.
- Tian, Y.; Zheng, R.; Liang, Z.; Li, S.; Wu, F.-X.; Li, M. A data-driven clustering recommendation method for single-cell RNA-sequencing data. Tsinghua Sci. Technol. 2021, 26, 772–789.
- Krishnaveni, A.S.; Madhavan, B.L.; Ratnam, M.V. Aerosol classification using fuzzy clustering over a tropical rural site. Atmos. Res. 2022, 282, 106518.
- Monshizadeh, M.; Khatri, V.; Kantola, R.; Yan, Z. A deep density based and self-determining clustering approach to label unknown traffic. J. Netw. Comput. Appl. 2022, 207, 103513.
- Xin, X.; Liu, K.; Loughney, S.; Wang, J.; Yang, Z. Maritime traffic clustering to capture high-risk multi-ship encounters in complex waters. Reliab. Eng. Syst. Saf. 2023, 230, 108936.
- Zhou, T.; Qiao, Y.; Salous, S.; Liu, L.; Tao, C. Machine Learning-Based Multipath Components Clustering and Cluster Characteristics Analysis in High-Speed Railway Scenarios. IEEE Trans. Antennas Propag. 2022, 70, 4027–4039.
- Feigin, Y.; Spitzer, H.; Giryes, R. Cluster with GANs. Comput. Vis. Image Underst. 2022, 225, 103571.
- Piatetsky-Shapiro, G. Knowledge Discovery in Real Databases: A Report on the IJCAI-89 Workshop. AI Mag. 1990, 11, 68–70.
- Fayyad, U.; Haussler, D.; Stolorz, P. KDD for Science Data Analysis: Issues and Examples. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, USA, 2–4 August 1996; pp. 50–56.
- Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P. From data mining to knowledge discovery: An overview. In Advances in Knowledge Discovery and Data Mining; Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Eds.; Morgan Kaufmann: Menlo Park, CA, USA, 1996; pp. 1–34. ISBN 0-262-56097-6.
- Microsoft. Data Mining. 2006. Available online: https://msdn.microsoft.com/en-us/library/aa227240(v=vs.60).aspx (accessed on 27 October 2022).
- Microsoft. Discretization Methods (Data Mining). Available online: https://msdn.microsoft.com/en-us/library/ms174512.aspx (accessed on 27 October 2022).
- Fayyad, U.M.; Irani, K.B. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI-93), Chambéry, France, 28 August–3 September 1993; pp. 1022–1027.
- Vučetić, M.; Hudec, M.; Božilović, B. Fuzzy functional dependencies and linguistic interpretations employed in knowledge discovery tasks from relational databases. Eng. Appl. Artif. Intell. 2020, 88, 103395.
- de Oliveira, E.F.; de Lima Tostes, M.E.; de Freitas, C.A.O.; Leite, J.C. Voltage THD Analysis Using Knowledge Discovery in Databases with a Decision Tree Classifier. IEEE Access 2018, 6, 1177–1188.
- Chen, Z.; Zhu, S.; Niu, Q.; Zuo, T. Knowledge Discovery and Recommendation with Linear Mixed Model. IEEE Access 2020, 8, 38304–38317.
- Molina-Coronado, B.; Mori, U.; Mendiburu, A.; Miguel-Alonso, J. Survey of Network Intrusion Detection Methods from the Perspective of the Knowledge Discovery in Databases Process. IEEE Trans. Netw. Serv. Manag. 2020, 17, 2451–2479.
- Sanchez Sanchez, P.A.; Cano Zuluaga, J.; Garcia Herazo, D.; Pinzon Baldion, A.F.; Rodriguez Mercado, G.; Garcia Gonzalez, J.R.; Perez Coronell, L.H. Knowledge Discovery in Musical Databases for Moods Detection. IEEE Lat. Am. Trans. 2019, 17, 2061–2068.
- Kamm, S.; Jazdi, N.; Weyrich, M. Knowledge Discovery in Heterogeneous and Unstructured Data of Industry 4.0 Systems: Challenges and Approaches. Procedia CIRP 2021, 104, 975–980.
- González García, C.; García-Bustelo, C.P.; Espada, J.P.; Cueva-Fernandez, G. Midgar: Generation of heterogeneous objects interconnecting applications. A Domain Specific Language proposal for Internet of Things scenarios. Comput. Netw. 2014, 64, 143–158.
- Rosa, C.R.M.; Steiner, M.T.A.; Steiner Neto, P.J. Knowledge Discovery in Data Bases: A Case Study in a Private Institution of Higher Education. IEEE Lat. Am. Trans. 2018, 16, 2027–2032.
- Mashey, J.R. Big Data and the next wave of infraStress. In Computer Science Division Seminar; University of California: Berkeley, CA, USA, 1997.
- Weiss, S.M.; Indurkhya, N. Predictive Data Mining: A Practical Guide, 1st ed.; Morgan Kaufmann: San Francisco, CA, USA, 1997; ISBN 978-1558604032.
- Diebold, F. On the Origin(s) and Development of the Term Big Data; University of Pennsylvania: Philadelphia, PA, USA, 2012.
- Hey, T.; Tansley, S.; Tolle, K. The Fourth Paradigm: Data-Intensive Scientific Discovery; Microsoft Research: Redmond, WA, USA, 2009; ISBN 9780982544204.
- Philip Chen, C.L.; Zhang, C.Y. Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Inf. Sci. 2014, 275, 314–347.
- Wu, X.; Zhu, X.; Wu, G.-Q.; Ding, W. Data mining with big data. IEEE Trans. Knowl. Data Eng. 2014, 26, 97–107.
- Howie, T. The Big Bang: How the Big Data Explosion Is Changing the World. Available online: https://blogs.msdn.microsoft.com/microsoftenterpriseinsight/2013/04/15/the-big-bang-how-the-big-data-explosion-is-changing-the-world/ (accessed on 27 October 2022).
- NIST Big Data Public Working Group: Definitions and Taxonomies Subgroup. NIST Big Data Interoperability Framework: Volume 1, Definitions; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2015; Volume 1.
- Chen, M.; Mao, S.; Liu, Y. Big Data: A Survey. Mob. Netw. Appl. 2014, 19, 171–209.
- Dutcher, J. What Is Big Data? Available online: https://datascience.berkeley.edu/what-is-big-data/ (accessed on 25 May 2016).
- Ward, J.S.; Barker, A. Undefined By Data: A Survey of Big Data Definitions. arXiv 2013, arXiv:1309.5821.
- Intel IT Center. Big Data Analytics. Intel’s IT Manager Survey on How Organizations Are Using Big Data; Intel Corporation: Santa Clara, CA, USA, 2012.
- Pettey, C.; Goasduff, L. Gartner Says Solving “Big Data” Challenge Involves More Than Just Managing Volumes of Data. Available online: https://web.archive.org/web/20180924135856/https://www.gartner.com/newsroom/id/1731916 (accessed on 13 November 2018).
- Gartner Inc. IT Glossary: Big Data. Available online: https://www.gartner.com/en/information-technology/glossary/big-data (accessed on 27 October 2022).
- Gantz, B.J.; Reinsel, D. Extracting Value from Chaos. IDC 2011, 1142, 1–12.
- NIST Big Data Public Working Group: Technology Roadmap Subgroup. NIST Big Data Interoperability Framework: Volume 7, Standards Roadmap; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2015; Volume 7.
- Mohammadi, M.; Al-Fuqaha, A.; Sorour, S.; Guizani, M. Deep Learning for IoT Big Data and Streaming Analytics: A Survey. IEEE Commun. Surv. Tutor. 2018, 20, 2923–2960.
- Lin, R.; Ye, Z.; Wang, H.; Wu, B. Chronic Diseases and Health Monitoring Big Data: A Survey. IEEE Rev. Biomed. Eng. 2018, 11, 275–288.
- Manley, K.; Nyelele, C.; Egoh, B.N. A review of machine learning and big data applications in addressing ecosystem service research gaps. Ecosyst. Serv. 2022, 57, 101478.
- Nguyen, T.; Gosine, R.G.; Warrian, P. A Systematic Review of Big Data Analytics for Oil and Gas Industry 4.0. IEEE Access 2020, 8, 61183–61201.
- Rawat, D.B.; Doku, R.; Garuba, M. Cybersecurity in Big Data Era: From Securing Big Data to Data-Driven Security. IEEE Trans. Serv. Comput. 2021, 14, 2055–2072.
- Ma, S.; Ding, W.; Liu, Y.; Ren, S.; Yang, H. Digital twin and big data-driven sustainable smart manufacturing based on information management systems for energy-intensive industries. Appl. Energy 2022, 326, 119986.
- Jaber, M.M.; Ali, M.H.; Abd, S.K.; Jassim, M.M.; Alkhayyat, A.; Aziz, H.W.; Alkhuwaylidee, A.R. Predicting climate factors based on big data analytics based agricultural disaster management. Phys. Chem. Earth Parts A/B/C 2022, 128, 103243.
- Ang, K.L.-M.; Ge, F.L.; Seng, K.P. Big Educational Data & Analytics: Survey, Architecture and Challenges. IEEE Access 2020, 8, 116392–116414.
- Laney, D. 3D Data Management: Controlling Data Volume, Velocity, and Variety. META Gr. Res. Note 2001, 6, 70.
- Saggi, M.K.; Jain, S. A survey towards an integration of big data analytics to big insights for value-creation. Inf. Process. Manag. 2018, 54, 758–790.
- Goldston, D. Big data: Data wrangling. Nature 2008, 455, 15.
- Deepa, N.; Pham, Q.-V.; Nguyen, D.C.; Bhattacharya, S.; Prabadevi, B.; Gadekallu, T.R.; Maddikunta, P.K.R.; Fang, F.; Pathirana, P.N. A survey on blockchain for big data: Approaches, opportunities, and future directions. Futur. Gener. Comput. Syst. 2022, 131, 209–226.
- NIST Big Data Public Working Group: Security and Privacy Subgroup. NIST Big Data Interoperability Framework: Volume 4, Security and Privacy; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2015; Volume 4.
- IBM. Big data at the speed of business. Available online: https://web.archive.org/web/20161121123223/http://www-01.ibm.com/software/data/bigdata/ (accessed on 13 November 2022).
- Liu, Z.; Zhang, A. Sampling for Big Data Profiling: A Survey. IEEE Access 2020, 8, 72713–72726.
- Tripathi, M.K.; Kumar, R.; Tripathi, R. Big-data driven approaches in materials science: A survey. Mater. Today Proc. 2020, 26, 1245–1249.
- Syed, D.; Zainab, A.; Ghrayeb, A.; Refaat, S.S.; Abu-Rub, H.; Bouhali, O. Smart Grid Big Data Analytics: Survey of Technologies, Techniques, and Applications. IEEE Access 2021, 9, 59564–59585.
- Terzi, R.; Sagiroglu, S.; Demirezen, M.U. Big Data Perspective for Driver/Driving Behavior. IEEE Intell. Transp. Syst. Mag. 2020, 12, 20–35.
- Seddon, J.J.J.M.; Currie, W.L. A model for unpacking big data analytics in high-frequency trading. J. Bus. Res. 2017, 70, 300–307.
- Khan, M.A.; Uddin, M.F.; Gupta, N. Seven V’s of Big Data understanding Big Data to extract value. In Proceedings of the 2014 Zone 1 Conference of the American Society for Engineering Education, Bridgeport, CT, USA, 3–5 April 2014; pp. 1–5.
- Gupta, Y.K.; Kumari, S. A Study of Big Data Analytics using Apache Spark with Python and Scala. In Proceedings of the 2020 3rd International Conference on Intelligent Sustainable Systems (ICISS), Thoothukudi, India, 3–5 December 2020; pp. 471–478.
- Fatima Ezzahra, M.; Nadia, A.; Imane, H. Big Data Dependability Opportunities & Challenges. In Proceedings of the 2019 1st International Conference on Smart Systems and Data Science (ICSSD), Rabat, Morocco, 3–4 October 2019; pp. 1–4.
- Sivarajah, U.; Kamal, M.M.; Irani, Z.; Weerakkody, V. Critical analysis of Big Data challenges and analytical methods. J. Bus. Res. 2017, 70, 263–286.
- Hattawi, W.; Shaban, S.; Al Shawabkah, A.; Alzu’bi, S. Recent Quality Models in BigData Applications. In Proceedings of the 2021 International Conference on Information Technology (ICIT), Amman, Jordan, 14–15 July 2021; pp. 811–815.
- Bhardwaj, D.; Ormandjieva, O. Toward a Novel Measurement Framework for Big Data (MEGA). In Proceedings of the 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), Madrid, Spain, 12–16 July 2021; pp. 1579–1586.
- González García, C.; Meana-Llorián, D.; G-Bustelo, B.C.P.; Lovelle, J.M.C. A review about Smart Objects, Sensors, and Actuators. Int. J. Interact. Multimed. Artif. Intell. 2017, 4, 7–10.
- Bell, G.; Hey, T.; Szalay, A. Beyond the Data Deluge. Science 2009, 323, 1297–1298.
- Doctorow, C. Big data: Welcome to the petacentre. Nature 2008, 455, 16–21.
- Beaver, D.; Kumar, S.; Li, H.C.; Sobel, J.; Vajgel, P. Finding a needle in Haystack: Facebook’s photo storage. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI 10), Vancouver, BC, Canada, 4–6 October 2010; pp. 1–8.
- Trewe, M. How carriers gather, track and sell your private data. The American Genius. 2012. Available online: https://theamericangenius.com/tech-news/how-carriers-gather-track-and-sell-your-private-data/ (accessed on 27 October 2022).
- Sharp, A. Dispatch from the Denver debate. Available online: https://blog.twitter.com/2012/dispatch-from-the-denver-debate (accessed on 27 October 2022).
- Zapponi, C. GitHut. Available online: http://githut.info/ (accessed on 27 October 2022).
- Sawant, N.; Shah, H. Big Data Application Architecture Q&A: A Problem-Solution Approach; Apress: Berkeley, CA, USA, 2013; p. 172. ISBN 978-1430262923.
- World Data Group. The World Bank. Available online: http://data.worldbank.org/indicator/ (accessed on 27 October 2022).
- Twitter Inc. Twitter: Company. Available online: https://about.twitter.com/es/company (accessed on 27 October 2022).
- Michel, F. How Many Public Photos are Uploaded to Flickr Every Day, Month, Year? Available online: https://www.flickr.com/photos/franckmichel/6855169886/ (accessed on 27 October 2022).
- YouTube. YouTube: Statistics. Available online: https://www.youtube.com/yt/press/en/statistics.html (accessed on 9 June 2016).
- Savitz, E. Gartner: 10 Critical Tech Trends for The Next Five Years. Available online: http://www.forbes.com/sites/ericsavitz/2012/10/22/gartner-10-critical-tech-trends-for-the-next-five-years/ (accessed on 27 October 2022).
- Google. Google Photos: One Year, 200 Million Users, and a Whole Lot of Selfies. Available online: https://googleblog.blogspot.com.es/2016/05/google-photos-one-year-200-million.html (accessed on 27 October 2022).
- Facebook. Newsroom. Available online: https://web.archive.org/web/20160609081220/https://newsroom.fb.com/company-info/ (accessed on 27 October 2022).
- Cisco. Cisco Visual Networking Index: Forecast and Methodology. 2016. Available online: http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/complete-wh (accessed on 9 June 2016).
- Warner, J. GitHub Blog. Available online: https://github.blog/2018-11-08-100m-repos/ (accessed on 27 October 2022).
- Alvi, P.; Ali, K. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model. arXiv 2022, arXiv:2201.11990.
- Floridi, L.; Chiriatti, M. GPT-3: Its Nature, Scope, Limits, and Consequences. Minds Mach. 2020, 30, 681–694.
- Dewdney, P.E.; Hall, P.J.; Schilizzi, R.T.; Lazio, T.J.L.W. The Square Kilometre Array. Proc. IEEE 2009, 97, 1482–1496.
- Lazer, D.; Kennedy, R.; King, G.; Vespignani, A. The Parable of Google Flu: Traps in Big Data Analysis. Science 2014, 343, 1203–1205.
- Boyd, D.; Crawford, K. Critical Questions for Big Data: Provocations for a cultural, technological, and scholarly phenomenon. Inf. Commun. Soc. 2012, 15, 662–679.
- ACM SC08 International Conference for High Performance Computing, Austin, TX, USA, 15–21 November 2008. IEEE Computer Society: Austin, TX, USA. Available online: http://sc08.supercomputing.org/ (accessed on 27 October 2022).
- Astrophysical Research Consortium. The Sloan Digital Sky Survey SDSS. Available online: https://www.sdss.org/ (accessed on 27 October 2022).
- Gayo-Avello, D. No, you cannot predict elections with Twitter. IEEE Internet Comput. 2012, 16, 91–94.
- Thusoo, A.; Sarma, J.S.; Jain, N.; Shao, Z.; Chakka, P.; Anthony, S.; Liu, H.; Wyckoff, P.; Murthy, R. Hive—A warehousing solution over a map-reduce framework. Proc. VLDB Endow. 2009, 2, 1626–1629.
- Apache Software Foundation. Hue. Available online: http://gethue.com/ (accessed on 27 October 2022).

Reference | 1 Domain | 2 Selection/Data Collection | 3 Pre-Processing: Data Reduction | 3 Pre-Processing: Cleaning | 3 Pre-Processing: Transformation | Data Mining: 4 Method | Data Mining: 5 Algorithm | Data Mining: 6 Use | 7 Interpretation and Evaluation | 8 Another Dataset |
---|---|---|---|---|---|---|---|---|---|---|
[87] | ? | | | | | | | | | |
[88] | ? | X | X | X | X | X | X | X | | |
[89] | ? | | | | | | | | | |
[90] | ? | X | X | X | X | X | X | X | X | ? |
[91] | ? | X | X | X | X | X | | | | |
[92] | ? | X | X | X | X | X | | | | |
[94] | ? | X | X | | | | | | | |

Database\Vs | 3Vs | 4Vs | 5Vs | 6Vs | 7Vs | 8Vs | 9Vs | 10Vs | Total |
---|---|---|---|---|---|---|---|---|---|
IEEE | 31 | 29 | 21 | 1 | 3 | 1 | 0 | 1 | 87 |
SD | 4 | 6 | 2 | 0 | 0 | 0 | 0 | 0 | 12 |
Wiley | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 6 |
Total | 38 | 36 | 24 | 2 | 3 | 1 | 0 | 1 | 105 |

Type\Vs | 3Vs | 4Vs | 5Vs | 6Vs | 7Vs | 8Vs | 9Vs | 10Vs | Total |
---|---|---|---|---|---|---|---|---|---|
IEEE | |||||||||
Conferences | 27 | 25 | 18 | | 3 | 1 | | 1 | 75 |
Magazines | 3 | 1 | | 1 | | | | | 5 |
Articles | 1 | 2 | 3 | | | | | | 6 |
Books | | 1 | | | | | | | 1 |
SD | |||||||||
Articles | 3 | 6 | 2 | | | | | | 11 |
Books | 1 | | | | | | | | 1 |
Wiley | |||||||||
Articles | 1 | 1 | 1 | 3 | |||||
Books | 2 | 1 | 3 | ||||||
Total | 38 | 36 | 24 | 2 | 3 | 1 | 0 | 1 | 105 |

Authors\Vs | Volume | Velocity | Variety | Veracity | Variability | Value | Visibility | Vulnerability | Visualisation | Validity | Volatility | Viscosity | Virality | Vincularity | Valence | Vitality |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
[1,107,108,109,118,119] | X | X | X | |||||||||||||
[123] | X | X | X | X | ||||||||||||
[2] | X | X | X | X | ||||||||||||
[102] | X | X | X | X | ||||||||||||
[8] | X | X | X | X | X | |||||||||||
[22,125,126] | X | X | X | |||||||||||||
[127] | X | X | X | X | X | X | ||||||||||
[128] | X | X | X | X | X | X | ||||||||||
[129,134] | X | X | X | X | X | X | X | |||||||||
[130,131] | X | X | X | X | X | X | X | |||||||||
[128] | X | X | X | X | X | X | X | |||||||||
[132] | X | X | X | X | X | X | X | X | ||||||||
[135] | X | X | X | X | X | X | X | X | X | X |

Type | Quantity | Commentary/Year |
---|---|---|
Earth Satellites [7] | 1 terabyte (10^12) | In 1 day in 1990 |
Websites indexed by Google [8] | 1 million | 1998 |
Websites indexed by Google [8] | 1000 million | 2000 |
Computing [138] | 320 terabytes | 2 h of human genome study in 2008 |
Websites indexed by Google [8] | 1 trillion (10^12) | 2008 |
Astronomical or physical particle experiment [137] | 1 petabyte | In 1 year in 2009 |
Facebook [15] | 30 billion pieces of content shared each month | 2010 |
Photos per second on Facebook [139] | 1 million | 2010 |
Photos stored on Facebook [139] | 260 billion = 20 petabytes | 2010 |
Photos uploaded per week on Facebook [139] | 1 billion = 60 terabytes | 2010 |
Library of Congress of the United States of America [15] | 235 terabytes of data collected | April 2011 |
Hadron Collider at the discovery of the Higgs Boson [97] | 1 petabyte (10^15) | Per second in 2012 |
Human race [11] | 2.5 quintillion (10^18) bytes of data | Every day in 2012 |
Walmart user information every day [1] | 2.5 petabytes (10^15) | 2012 |
Multi-Media Messages (MMS) [140] | 28,000 per second | 2012 |
New data [1] | 2.5 exabytes | New data every day since 2012 and doubling every 40 months |
Electronic data [16] | 1.2 zettabytes | Every year in 2012 |
Twitter [141] | 10,300,000 tweets in 1 h 30 m | Presidential debate in 2012 |
GitHub [142] | 550,000 repositories | Q2 2012 |
Creators of social content [143] | 600 million (10^6) | 33% of Internet users in 2013 |
Other periodical publications [143] | 10,000 | Newspapers and others in 2013 |
Blogs [143] | 70 million (10^6) | 2013 |
Google queries per day [8] | More than 1000 million | 2013 |
Tweets per day [8] | +250 million | 2013 |
Facebook updates per day [8] | +800 million | 2013 |
YouTube views per day [8] | +4000 million | 2013 |
Jet engine [2] | 10 terabytes | 30 min in 2013 |
Internet [143] | 20 exabytes (10^18) of information | 2013 |
Internet [144] | 40.7% of the population used it in 2014 = 2954 million | 7,259,691,769 people in 2014 |
Web pages [143] | 1.5 trillion (10^12) | 2013 |
Twitter [145] | 310 million active monthly users | 2013 |
Twitter [145] | 500 million tweets per day | August 2013 |
Tweets [143] | 20 thousand million (10^9) | 50 million users in 2013 |
Tweets [145] | 143,199 per second | 3 August 2013 |
GitHub [142] | 1,300,000 repositories | Q4 2013 |
GitHub [142] | 2,200,000 repositories | Q4 2014 |
Sequencing of human gene [103] | 600 Gb | 2014 |
Flickr [146] | Almost 70 million public photos uploaded monthly | 2015 |
YouTube [147] | More than 1000 million users (10^9) = 1/3 of Internet users | 2015 |
YouTube [147] | +100 million hours of video views daily | 2015 |
Hospital data [103] | 167 Tb to 665 Tb | 2015 |
Emails [148] | 204 million | In 1 min in 2016 |
Pandora: hours of music heard [148] | 61,000 h | In 1 min in 2016 |
Flickr [148] | 3 million uploads | In 1 min in 2016 |
Flickr [148] | 20 million photos viewed | In 1 min in 2016 |
Google [148] | 2 million searches | In 1 min in 2016 |
Google Photos [149] | 200 million users | In its first year in 2016 |
Google Photos [149] | 1.6 billion (10^9) | In its first year in 2016 |
Google Photos [149] | 2 trillion (10^12) tags | In its first year in 2016 |
Google Photos [149] | 24 billion (10^9) selfies | In its first year in 2016 |
Facebook [150] | 1650 million (10^6) users | 31 March 2016 |
Annual Internet traffic [151] | 1 zettabyte (10^21) | 2016 |
Facebook [133] | +500 terabytes of data per day | 2017 |
GitHub [152] | 100,000,000 repositories | 2018 |
ELMo [153] | 94 million parameters | 2018 |
BERT-Large | 340 million parameters | 2018 |
GPT [154] | 110 million parameters | 2018 |
GPT-2 [153] | 1.5 billion parameters | 2019 |
Megatron-LM [153] | 8.3 billion parameters | 2019 |
T5 [153] | 11 billion parameters | 2019 |
Annual Internet traffic [151] | 2.3 zettabytes (10^21) | 2020 |
Square Kilometre Array [155] | 524 terabytes per second (estimated) | Will be produced in 2020 (postponed to 2027) |
Turing-NLG [153] | 17.2 billion parameters | 2020 |
GPT-3 [153] | 175 billion parameters | 2020 |
Daily generated data [12] | 56 zettabytes | 16 December 2020 |
Megatron-Turing [153] | 15 datasets with a total of 339 billion tokens | 2021 |
Megatron-Turing [153] | 530 billion parameters | 2021 |
Daily generated data [12] | Estimated 149 zettabytes | 2024 |
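
The quantities in the table above mix several byte units, so comparing rows means converting between decimal (SI) prefixes. The following is a minimal Python sketch of that arithmetic, assuming every unit in the table is decimal (1 terabyte = 10^12 bytes) rather than binary. The names `SI_EXPONENTS`, `to_bytes`, and `convert` are hypothetical helpers introduced here for illustration; they do not come from the survey.

```python
# Decimal (SI) byte prefixes expressed as powers of ten.
# Assumption: the table's units are decimal, not binary (2^40 and so on).
SI_EXPONENTS = {
    "kilobyte": 3,
    "megabyte": 6,
    "gigabyte": 9,
    "terabyte": 12,
    "petabyte": 15,
    "exabyte": 18,
    "zettabyte": 21,
}


def to_bytes(quantity, unit):
    """Return how many bytes `quantity` of `unit` represents."""
    return quantity * 10 ** SI_EXPONENTS[unit]


def convert(quantity, source_unit, target_unit):
    """Express `quantity` of `source_unit` in `target_unit`."""
    shift = SI_EXPONENTS[source_unit] - SI_EXPONENTS[target_unit]
    if shift >= 0:
        # Moving to a smaller unit: multiply by a power of ten.
        return quantity * 10 ** shift
    # Moving to a larger unit: divide by a power of ten.
    return quantity / 10 ** -shift


if __name__ == "__main__":
    # 20 petabytes of stored photos expressed in terabytes.
    print(convert(20, "petabyte", "terabyte"))
    # One zettabyte written out in bytes.
    print(to_bytes(1, "zettabyte"))
```

For example, `convert(20, "petabyte", "terabyte")` gives 20000, which puts the 20 petabytes of photos stored on Facebook on the same scale as the 60 terabytes uploaded per week.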

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
González García, C.; Álvarez-Fernández, E. What Is (Not) Big Data Based on Its 7Vs Challenges: A Survey. Big Data Cogn. Comput. 2022, 6, 158. https://doi.org/10.3390/bdcc6040158