Towards Unlocking the Hidden Potentials of the Data-Centric AI Paradigm in the Modern Era
Abstract
1. Introduction
- Systematic introduction of DC-AI: We systematically present what DC-AI entails and how it can be systematically applied in practice. We also discuss the role of DC-AI in advancing AI technology, which has not been comprehensively discussed in the literature [26].
- Generic architecture of DC-AI: We devise the general architecture of the DC-AI paradigm and highlight its key requirements that need to be ensured in each phase, as well as in the entire lifecycle of AI-based projects, whereas existing research only discusses some of the DC-AI techniques [27].
- Insight into when to amalgamate MC-AI with DC-AI: We pinpoint and describe five scenarios/situations when amalgamating DC-AI with MC-AI is necessary to solve longstanding technical, social, and industrial problems of conventional AI approaches, which have remained unexplored in the current literature [28].
- Case study along with the empirical results: We report a case study (or specific example) to highlight the application of DC-AI in some real-world scenarios and report empirical results by comparing DC-AI with MC-AI, whereas most of the existing papers theoretically discuss the co-design of these approaches [26,28].
- Holistic overview of DC-AI challenges and future prospects: We pinpoint the key challenges that are currently hindering DC-AI adoption worldwide and recommend promising avenues for future research that can assist in transforming AI from academic labs to the market.
- Next-generation computing for the DC-AI paradigm: We discuss the next-generation computing for the DC-AI paradigm along with the relevant technologies that can contribute to transitioning DC-AI from theory to practice, which has remained unexplored in the recent literature [29].
- A call for action to harness the potential of DC-AI: Through this paper, we aim to foster technical advancements in AI by utilizing DC-AI, and we hope to open up avenues for future development/research in this line of work. This is the first work that provides a broader understanding of DC-AI while keeping MC-AI in the loop.
2. Background and Related Work
3. Introduction of Model-Centric AI and Data-Centric AI
- Rigorously fine-tune the code/algorithm of the AI model.
- Obtain additional data for everything and take the average, or optimize the learning rate.
- Switch to an alternate AI model (CNN → LSTM or LSTM → RNN).
Quantitative Analysis/Comparison between MC-AI and DC-AI
4. General Architecture of the DC-AI
5. Insight into When to Amalgamate DC-AI with MC-AI
5.1. When Outcomes of MC-AI Alone Are Not Reliable
5.2. To Meet Certain Performance Targets
5.3. When Computing Overhead Is Unaffordable beyond a Certain Limit
5.4. Augmenting Lifetimes of AI Systems
5.5. Limited Availability of Representative Data
6. Case Study (Proof of Concept Example) to Evaluate the Effects of the Key Parameters of DC-AI on Performance
Enhancing the Learning Ability and Generalization Power of ML Algorithms in Safety-Critical Applications (e.g., Medical Scenario)
- In setting I, we fixed the parameters of the ML models, but varied the amount of test and training data. For the sake of simplicity, we name this setting as .
- In setting II, we fixed the amount of training and test data but varied the parameters of the ML models. For the sake of simplicity, we name this setting as .
- In setting III, we simultaneously changed both (the parameters and the amount of data) while computing the value. For the sake of simplicity, we name this setting as .
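The three settings above can be read as an experiment grid. The following is a minimal sketch of that grid and not the protocol used in the case study: the toy one-dimensional dataset with label noise, the hand-written k-nearest-neighbour classifier, and the names `make_data`, `knn_accuracy`, and `run_settings` are all assumptions introduced here so that the example is self-contained. The neighbourhood size `k` stands in for "the parameters of the ML models".

```python
import random

def make_data(n, noise=0.1, rng=None):
    """Toy binary task: label is 1 when x > 0.5, flipped with probability `noise`."""
    rng = rng or random.Random(0)
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y  # simulate a mislabeled record
        data.append((x, y))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(y for _, y in nearest)
    return int(2 * votes > len(nearest))

def knn_accuracy(train, test, k):
    """Share of test points whose predicted label matches the recorded label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

def run_settings(sizes=(40, 160, 640), ks=(1, 5, 15)):
    """Return one accuracy table per setting (I, II, III)."""
    rng = random.Random(42)  # fixed seed so repeated runs give the same tables

    def split(n):
        # Training set of size n and a test set a quarter of that size.
        return make_data(n, rng=rng), make_data(max(n // 4, 1), rng=rng)

    fixed_k, fixed_n = ks[1], sizes[1]

    # Setting I: model parameter fixed, amount of training/test data varied.
    setting_1 = {}
    for n in sizes:
        train, test = split(n)
        setting_1[n] = knn_accuracy(train, test, fixed_k)

    # Setting II: amount of data fixed, model parameter varied.
    train, test = split(fixed_n)
    setting_2 = {k: knn_accuracy(train, test, k) for k in ks}

    # Setting III: both the amount of data and the parameter varied.
    setting_3 = {}
    for n in sizes:
        train, test = split(n)
        for k in ks:
            setting_3[(n, k)] = knn_accuracy(train, test, k)

    return setting_1, setting_2, setting_3

if __name__ == "__main__":
    for name, table in zip(("I", "II", "III"), run_settings()):
        print(f"Setting {name}: {table}")
```

Keeping the three tables separate makes it easy to see whether a change in the measured value comes from the data axis, the parameter axis, or their combination.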
7. Challenges and Future Prospects
- There is a significant lack of technologies that can assist in realizing DC-AI. At present, there are only a few enabling technologies (synthetic data, transfer learning (TL), etc.) used in DC-AI. However, the potential synergies between DC-AI and other AI technologies, such as transfer learning and generative models, can expand the application horizon for DC-AI. For instance, the TL approach is handy in addressing data scarcity and availability issues, and it can foster data re-usability across identical domains/problems [68]. TL can assist in realizing the DC-AI concept in low-resource healthcare settings [68]. TL-like approaches are required to prepare data for diverse domains with minimal effort by transferring knowledge/information from one domain to another. On the other hand, generative models like GAN/TVAE are handy for generating new data (also known as synthetic data), which can be used to enlarge the data size, as well as the diversity [66]. The generative models can assist in solving imbalanced learning problems in machine learning (ML), and can increase prediction accuracy for minor classes [69,70]. In some cases, they contribute to reducing the bias in ML/AI models [71]. Furthermore, generative models assist in performing complex tasks (e.g., human analysis) with AI, which are not possible in conventional settings, owing to limited high-quality data [72]. However, there is still a serious lack of technologies that can pinpoint faults in the data, and that can raise alerts when some parts of the data are misleading or labeled incorrectly. Therefore, developing new DC-AI-enabling technologies or modifying the existing technologies to accomplish the DC-AI criteria is very challenging. Thus, the development of enabling technologies that can assist in achieving multiple aspects of DC-AI is an exciting area of future research.
- In some cases, the DC-AI approach is not required because the data are already sound, complete, up-to-date, and highly diverse. However, there is a lack of methodologies/procedures through which one can identify when the DC-AI requirements are inherently fulfilled. Furthermore, the decision about when to use DC-AI also needs rigorous criteria/methods, because MC-AI yields better results in some areas. For example, if there are only a few vulnerabilities in the data, the corresponding AI model can overcome them without a DC-AI approach by utilizing averaging or by employing an optimized learning rate. In such cases, one may not need the DC-AI approach. Hence, developing techniques that can determine whether DC-AI is imperative is a feasible topic for future research.
- It is very hard to decide how much quality enhancement is needed when it comes to different areas (e.g., predictive maintenance vs. human activity analysis). For example, an application for text analysis may need less pre-processing because the application needs to handle both noisy and well-written text. In contrast, an automated diagnosis system employed for cancer disease may need rigorous data quality enhancement to ensure accurate diagnoses. Hence, it is very tricky to assess what amount of data quality is reasonable in a DC-AI approach when it comes to diverse applications. Therefore, developing unified tools/techniques that can provide a reasonable analysis of the data is a pending issue, and requires further investigation from the AI community.
- DC-AI is often conflated with traditional pre-processing. Pre-processing is a common and valuable practice when building AI systems, but DC-AI is a complete discipline that includes pre-processing as only one of its aspects. Pre-processing is applied to data that have already been collected, whereas DC-AI follows a data-first strategy: what data to collect, from whom to collect them, how to collect them, how to improve them, what aspects to improve, etc. DC-AI also includes data versioning, completeness, timeliness, risk/error analysis, etc., which are not part of pre-processing at all. Because of this misconception, convincing AI developers to accept the new paradigm may be challenging. Hence, clarifying how DC-AI extends and combines with pre-processing is imperative to highlight the need for DC-AI in future endeavors.
- Another challenge is to find some potential/attractive use cases, like the detection of defects in steel, to pinpoint the efficacy/benefits of DC-AI (https://venturebeat.com/ai/why-data-remains-the-greatest-challenge-for-machine-learning-projects/, accessed on 5 March 2024). In the future, making the DC-AI approach a de facto standard for AI applications imposes various challenges, because many AI systems have already been developed and deployed in real-world settings. In some cases, obtaining more data to improve performance or fiddling with AI models is beneficial when it comes to applications like voice-activated assistants. In contrast, obtaining more data or improving models when all the data belong to one source (e.g., defect detection in a machine based on its operating sounds) may not be beneficial. Hence, it is challenging to differentiate applications that require DC-AI versus MC-AI. In the future, identifying relevant domains where DC-AI can bring more good than harm is an exciting research area.
- DC-AI is about debugging and compiling data, but in the absence of sophisticated tools, it is hard to identify the ambiguous parts of the data and fine-tune them accordingly. In addition, separating faulty and non-faulty parts of data that are stored in diverse formats (e.g., tables, graphs, trajectories, and images) is also very challenging. Therefore, the development of data debugging and compiling tools similar to those available for programming languages (e.g., Java, C/C++) is imperative in the coming years. Such tools could give valuable hints about the type/nature of problems/vulnerabilities in the underlying data so that they can be corrected as early as possible.
- Recently, the umbrella of DC-AI techniques has been expanding (or being amalgamated with traditional pre-processing methods), posing the challenge of which technique might yield promising results when it comes to different datasets/applications. For example, simple visualization may help to identify missing labels in 100 tuples. In contrast, one cannot use simple visualization when it comes to image data of different species, or a billion images for one application. Hence, it is difficult to select suitable DC-AI techniques for each respective AI application. In this regard, the selection of optimal DC-AI techniques that can yield reliable results, regardless of the domain, is an attractive area of future research.
- In the absence of successful implementations, it is challenging to determine the order in which DC-AI techniques should be applied to yield robust AI systems. For example, determining when to apply data labeling and data completeness analysis in a gaze estimation application is challenging. Similarly, it is hard to decide when not to use a certain technique (e.g., data availability). Moreover, the set of DC-AI techniques applied to tabular data may not yield feasible results on image data, and therefore, a set of unified DC-AI techniques is required for each data type. In our recent work, we devised a general system by discussing the order and types of DC-AI techniques to be employed in stroke prediction scenarios, which can be used as a reference for other datasets and applications [45]. Similarly, some potential DC-AI techniques that can be generically applied across the ML pipeline are given in Seedat et al. [49]. However, determining the optimal order and types of DC-AI techniques to apply for different datasets and applications remains a very complex problem that requires further investigation from the relevant communities. Ablation studies to determine appropriate combinations of DC-AI techniques for diverse domains are an exciting prospect to be explored in the DC-AI context.
- Just like conventional AI approaches, some pre-processing techniques (e.g., sampling) do not work well with some AI models. The same problem can happen with DC-AI techniques, and therefore, analyzing the suitability of DC-AI techniques with respect to data and AI models requires in-depth investigation. Generalizing DC-AI techniques to effectively work in related applications with slight modifications is also very challenging. Hence, conducting tests and identifying suitable DC-AI techniques for each AI model requires further investigation from the AI community.
- In conventional MC-AI, the data are evaluated just once (e.g., before feeding them into the AI model). By contrast, in DC-AI, the evaluation of the data is required at multiple stages in the AI system lifecycle. Hence, data quality analysis before building the model (e.g., during pre-processing) and after model training are different. In the pre-processing stage, data completeness/freshness is mandatory, but after training, balanced data utilization (whether the AI model utilizes all parts of the data equally or not) is imperative. To perform systematic analysis and quality assurance, sophisticated expertise may be required in each stage. However, there is a substantial lack of domain experts for each stage of the AI system lifecycle, which can hinder the applicability of DC-AI in practical scenarios. Identifying and documenting the list of requirements to be fulfilled concerning data in each stage is very challenging, and requires more investigation.
- There is a genuine lack of procedures and standards concerning data quality. For example, 10% of the data being good may be sufficient in some applications, whereas some applications may require 90% to yield consistent results. Considering the huge diversity in data styles and applications, developing appropriate standards and procedures for gauging data quality is challenging. In the future, it is vital to develop formal procedures and well-defined standards for gauging data quality in multiple AI applications.
- The key focus of DC-AI is to augment data quality. However, this is a very time-consuming, challenging, and laborious task, especially when the size of the data is very large. In addition, reducing the overall complexity of the data-quality-enhancement process is an interesting and urgent topic. At present, there is a lack of software or customized libraries to automatically perform most operations. Developing supportive libraries for the DC-AI paradigm is a vibrant avenue for research.
- Recently, there has been an increasing trend toward training complex AI models with as little data as possible to overcome computing overhead [17,29,73]. Similarly, data optimization techniques are needed for training complex AI models with the fewest (but complete) data. In this regard, developing data optimization and reduction techniques without compromising accuracy (or other objectives) by utilizing DC-AI concepts is a vibrant research area. Data quality is a key performance index for DC-AI, and therefore, the use of low-cost tools and techniques to improve data quality and out-of-distribution detection has become more urgent than ever. Hence, devising practical simulation/implementation tools that can enhance the quality of data through various operations (consistent labeling, outlier removal, etc.) at the least cost is needed.
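Several of the challenges above come down to measuring data quality and flagging suspicious records. As a rough illustration only, the sketch below computes three simple indicators on a small tabular dataset: completeness (the share of non-missing cells), conflicting labels (identical feature values with different labels), and numeric outliers by the 1.5 × IQR rule. The record layout, the function names, and the thresholds are assumptions made for this example; they are not taken from any DC-AI tool discussed in this paper.

```python
from collections import defaultdict
from statistics import quantiles

def completeness(rows, fields):
    """Fraction of cells that are not None across the given fields."""
    total = len(rows) * len(fields)
    filled = sum(row.get(f) is not None for row in rows for f in fields)
    return filled / total if total else 1.0

def conflicting_labels(rows, feature_fields, label_field):
    """Indices of rows whose features match another row that has a different label."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        key = tuple(row.get(f) for f in feature_fields)
        groups[key].append(i)
    flagged = []
    for idxs in groups.values():
        labels = {rows[i].get(label_field) for i in idxs}
        if len(labels) > 1:
            flagged.extend(idxs)
    return sorted(flagged)

def iqr_outliers(rows, field, factor=1.5):
    """Indices of rows whose value lies outside [Q1 - f*IQR, Q3 + f*IQR]."""
    values = [(i, r[field]) for i, r in enumerate(rows) if r.get(field) is not None]
    if len(values) < 4:
        return []  # too few points for a meaningful quartile estimate
    q1, _, q3 = quantiles([v for _, v in values], n=4)
    low, high = q1 - factor * (q3 - q1), q3 + factor * (q3 - q1)
    return [i for i, v in values if v < low or v > high]

def quality_report(rows, feature_fields, label_field, numeric_field):
    """Bundle the three indicators into one dictionary."""
    return {
        "completeness": completeness(rows, feature_fields + [label_field]),
        "conflicting_labels": conflicting_labels(rows, feature_fields, label_field),
        "outliers": iqr_outliers(rows, numeric_field),
    }

# Hypothetical records, loosely styled after a clinical table.
records = [
    {"age": 61, "glucose": 105.0, "label": 0},
    {"age": 58, "glucose": 110.0, "label": 0},
    {"age": 61, "glucose": 105.0, "label": 1},  # same features as row 0, different label
    {"age": 47, "glucose": None, "label": 0},   # missing value
    {"age": 70, "glucose": 98.0, "label": 1},
    {"age": 66, "glucose": 102.0, "label": 1},
    {"age": 52, "glucose": 640.0, "label": 0},  # implausible reading
]

if __name__ == "__main__":
    print(quality_report(records, ["age", "glucose"], "label", "glucose"))
```

On these records, the report flags rows 0 and 2 as carrying conflicting labels and row 6 as an outlier. Checks of this kind are cheap to run, but they only surface candidates for review; deciding what counts as acceptable quality still depends on the application, as noted above.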
8. Next-Generation Computing for DC-AI Paradigm
- In recent years, cloud-/fog-/edge-/serverless-based computing architectures have enabled the timely collection, processing, and analysis of big data by using AI models. In the future, DC-AI integration with these computing architectures is imperative for cost savings, improved data management, automation, robustness, and addressing verifiability-related issues [74]. DC-AI can enhance the efficacy of cloud-/fog-/edge-/serverless-based applications by ensuring the data used in them are of high quality, robust, dependable, and representative of the problem being solved. However, integrating the DC-AI paradigm with these architectures is not easy and raises many research challenges; for example, the fusion, alignment, and consistency checking of data stemming from multiple sources can be tricky and slow. Furthermore, data reliability, data representation, and the privacy and security of data can be the main barriers to integrating DC-AI with next-generation computing architectures. In addition, preparing high-quality data for each architecture depending upon the application requirement is also very challenging. In the serverless computing architecture, most of the operations are managed by third parties, and therefore, DC-AI integration can lead to transparency, bias, and sustainability issues [74]. In the next generation of computing, the involvement of domain experts with AI-based systems regardless of computing architectures will be mandatory, and therefore, it is very challenging to prepare domain experts for relevant architectures. Finally, upgrading the current AI-integrated systems that are mostly based on MC-AI is another big challenge and requires much effort from the AI community.
- Thus far, there has been limited synergy from merging the DC-AI approach with other AI/ML/DL technologies such as pre-trained models, TL, and generative models. In the future, more such synergies are needed to improve critical aspects of DC-AI (i.e., performance overhead, domain adaptation, data curation, and integration) [31]. Hence, exploring the possible synergies between DC-AI and more of the latest technologies to expand the horizon for DC-AI is a vibrant area of future research.
- With the advent of quantum computing (QC), a powerful computing paradigm, a drastic change in the performance of AI models is expected, and AI models may eventually work with zettabytes (ZB) of data [31]. QC exploits quantum phenomena such as entanglement and superposition and is expected to process huge volumes of data much faster than classical systems [74]. However, the QC paradigm can also face obstacles in effective data utilization because most of the real-world datasets are poisoned, biased, noisy, skewed, and/or incomplete. To this end, an amalgamation of DC-AI with QC is imperative to solve longstanding problems in the healthcare sector (e.g., genome analysis), drug development, secure cryptosystems, and sustainability problems. The joint use of these paradigms can contribute to developing reliable and robust AI systems. However, it is challenging to apply DC-AI and QC technologies to some complex problems such as climate change owing to higher uncertainties and constraints. The development of DC-AI-integrated QC-based libraries with a higher level of abstraction and flexibility is an important avenue for future research. Lastly, developing QC-based architectures for data distribution and governance in AI systems is also a promising research area.
- TinyML is an emerging paradigm that brings ML algorithms to ultra-low-power devices, such as microcontroller units (MCUs), to enhance service quality [75]. In DC-AI, data availability and accessibility are imperative, and TinyML places computation close to where the data are generated; therefore, this synergy is handy. To that end, exploring the role of DC-AI in enhancing the technical soundness and robustness of TinyML is a fascinating area of research amid the rapid rise in wearable devices around the globe.
- Recently, there has been an increasing focus on developing domain-specific and dedicated hardware accelerators to meet the growing demand for fast processing with the least energy consumption [76]. To this end, the analog in-memory computing (IMC) architecture has shown promise for future generations of AI, as well as for computer vision-related tasks. Similarly, developing hardware accelerators to improve the computing efficiency of DC-AI is also a promising area for future developments.
9. Concluding Remarks
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Ker, J.; Wang, L.; Rao, J.; Lim, T. Deep learning applications in medical image analysis. IEEE Access 2017, 6, 9375–9389.
- Fidon, L.; Aertsen, M.; Kofler, F.; Bink, A.; David, A.L.; Deprest, T.; Emam, D.; Guffens, F.; Jakab, A.; Kasprian, G.; et al. A Dempster-Shafer approach to trustworthy AI with application to fetal brain MRI segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 3784–3795.
- Shaker, A.M.; Maaz, M.; Rasheed, H.; Khan, S.; Yang, M.H.; Khan, F.S. UNETR++: Delving into efficient and accurate 3D medical image segmentation. IEEE Trans. Med Imaging 2024.
- Chen, J.; Yi, C.; Du, H.; Niyato, D.; Kang, J.; Cai, J.; Shen, X. A revolution of personalized healthcare: Enabling human digital twin with mobile AIGC. IEEE Netw. 2024.
- Liu, Y.; Chen, B.; Wang, S.; Lu, G.; Zhang, Z. Deep Fuzzy Multi-Teacher Distillation Network for Medical Visual Question Answering. IEEE Trans. Fuzzy Syst. 2024, 1–15.
- Li, G.; Xu, J.; Li, Z.; Chen, C.; Kan, Z. Sensing and navigation of wearable assistance cognitive systems for the visually impaired. IEEE Trans. Cogn. Dev. Syst. 2022, 15, 122–133.
- Kanthimathi, T.; Rathika, N.; Fathima, A.J.; Rajesh, K.; Srinivasan, S.; Thamizhamuthu, R. Robotic 3D Printing for Customized Industrial Components: IoT and AI-Enabled Innovation. In Proceedings of the 2024 14th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 18–19 January 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 509–513.
- Chang, M.; Chen, K.H.; Chen, Y.S.; Hsu, C.C.; Chu, C.C. Developments of AI-Assisted Fault Detection and Failure Mode Diagnosis for Operation and Maintenance of Photovoltaic Power Stations in Taiwan. IEEE Trans. Ind. Appl. 2024.
- Yuan, X.; Wang, Y.; Wang, C.; Ye, L.; Wang, K.; Wang, Y.; Yang, C.; Gui, W.; Shen, F. Variable Correlation Analysis-Based Convolutional Neural Network for Far Topological Feature Extraction and Industrial Predictive Modeling. IEEE Trans. Instrum. Meas. 2024, 73, 1–10.
- Justus, V.; Kanagachidambaresan, G. Machine learning based fault-oriented predictive maintenance in industry 4.0. Int. J. Syst. Assur. Eng. Manag. 2024, 15, 462–474.
- Li, L.; Ota, K.; Dong, M. Deep learning for smart industry: Efficient manufacture inspection system with fog computing. IEEE Trans. Ind. Inform. 2018, 14, 4665–4673.
- Li, D.; Zhang, Z.; Yu, K.; Huang, K.; Tan, T. ISEE: An intelligent scene exploration and evaluation platform for large-scale visual surveillance. IEEE Trans. Parallel Distrib. Syst. 2019, 30, 2743–2758.
- Yuan, X.; Xu, N.; Ye, L.; Wang, K.; Shen, F.; Wang, Y.; Yang, C.; Gui, W. Attention-Based Interval Aided Networks for Data Modeling of Heterogeneous Sampling Sequences with Missing Values in Process Industry. IEEE Trans. Ind. Inform. 2024, 20, 5253–5262.
- Fan, Y.; Pang, W.; Lu, S. HFPQ: Deep neural network compression by hardware-friendly pruning-quantization. Appl. Intell. 2021, 51, 7016–7028.
- Strickland, E. Andrew Ng, AI Minimalist: The Machine-Learning Pioneer Says Small is the New Big. IEEE Spectr. 2022, 59, 22–50.
- Hegde, C. Anomaly Detection in Time Series Data using Data-Centric AI. In Proceedings of the 2022 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), Bangalore, India, 8–10 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6.
- Motamedi, M.; Sakharnykh, N.; Kaldewey, T. A data-centric approach for training deep neural networks with less data. arXiv 2021, arXiv:2110.03613.
- Jakubik, J.; Vössing, M.; Kühl, N.; Walk, J.; Satzger, G. Data-centric Artificial Intelligence. arXiv 2022, arXiv:2212.11854.
- Picard, A.M.; Hervier, L.; Fel, T.; Vigouroux, D. Influenciæ: A Library for Tracing the Influence Back to the Data-Points; IRT Saint Exupéry: Toulouse, France, 2023.
- Chorev, S.; Tannor, P.; Israel, D.B.; Bressler, N.; Gabbay, I.; Hutnik, N.; Liberman, J.; Perlmutter, M.; Romanyshyn, Y.; Rokach, L. Deepchecks: A library for testing and validating machine learning models and data. J. Mach. Learn. Res. 2022, 23, 1–6.
- Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 130, 2337–2348.
- Ng, A. MLOps: From Model-Centric to Data-Centric AI. 2021. Available online: https://www.youtube.com/watch?v=06-AZXmwHjo (accessed on 15 March 2024).
- Liang, W.; Tadesse, G.A.; Ho, D.; Fei-Fei, L.; Zaharia, M.; Zhang, C.; Zou, J. Advances, challenges and opportunities in creating data for trustworthy AI. Nat. Mach. Intell. 2022, 4, 669–677.
- Miranda, L.J. Towards Data-Centric Machine Learning: A Short Review. 2021. Available online: https://ljvmiranda921.github.io/notebook/2021/07/30/data-centric-ml/ (accessed on 20 March 2024).
- Parashar, M.; DeBlanc-Knowles, T.; Gianchandani, E.; Parker, L.E. Strengthening and democratizing artificial intelligence research and development. Computer 2023, 56, 85–90.
- Zha, D.; Bhat, Z.P.; Lai, K.H.; Yang, F.; Hu, X. Data-centric AI: Perspectives and challenges. In Proceedings of the 2023 SIAM International Conference on Data Mining (SDM), St. Paul Twin Cities, MN, USA, 27–29 April 2023; SIAM: Philadelphia, PA, USA, 2023; pp. 945–948.
- Majeed, A.; Hwang, S.O. Data-Centric AI, Pre-Processing, and the Quest for Transformative AI Systems Development. Computer 2023, 56, 1–6.
- Hamid, O.H. From model-centric to data-centric AI: A paradigm shift or rather a complementary approach? In Proceedings of the 2022 8th International Conference on Information Technology Trends (ITT), Dubai, United Arab Emirates, 25–26 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 196–199.
- Kumar, S.; Datta, S.; Singh, V.; Singh, S.K.; Sharma, R. Opportunities and Challenges in Data-Centric AI. IEEE Access 2024.
- Polyzotis, N.; Zaharia, M. What can data-centric AI learn from data and ML engineering? arXiv 2021, arXiv:2112.06439.
- Whang, S.E.; Roh, Y.; Song, H.; Lee, J.G. Data collection and quality challenges in deep learning: A data-centric ai perspective. VLDB J. 2023, 32, 791–813.
- Aldoseri, A.; Al-Khalifa, K.N.; Hamouda, A.M. Re-thinking data strategy and integration for artificial intelligence: Concepts, opportunities, and challenges. Appl. Sci. 2023, 13, 7082.
- Clemente, F.; Ribeiro, G.M.; Quemy, A.; Santos, M.S.; Pereira, R.C.; Barros, A. ydata-profiling: Accelerating data-centric AI with high-quality data. Neurocomputing 2023, 554, 126585.
- Luley, P.P.; Deriu, J.M.; Yan, P.; Schatte, G.A.; Stadelmann, T. From concept to implementation: The data-centric development process for AI in industry. In Proceedings of the 2023 10th IEEE Swiss Conference on Data Science (SDS), Zurich, Switzerland, 22–23 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 73–76.
- Holstein, J. Bridging Domain Expertise and AI through Data Understanding. In Proceedings of the 29th International Conference on Intelligent User Interfaces, Greenville, SC, USA, 18–21 March 2024; pp. 163–165.
- Song, H.; Kim, M.; Lee, J.G. Toward robustness in multi-label classification: A data augmentation strategy against imbalance and noise. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 21592–21601.
- Zhu, W.; Wu, O.; Yang, N. IRDA: Implicit Data Augmentation for Deep Imbalanced Regression. Inf. Sci. 2024, 6, 120873.
- Mitchell, M.; Luccioni, A.S.; Lambert, N.; Gerchick, M.; McMillan-Major, A.; Ozoani, E.; Rajani, N.; Thrush, T.; Jernite, Y.; Kiela, D. Measuring data. arXiv 2022, arXiv:2212.05129.
- Bertucci, D.; Hamid, M.M.; Anand, Y.; Ruangrotsakun, A.; Tabatabai, D.; Perez, M.; Kahng, M. DendroMap: Visual exploration of large-scale image datasets for machine learning with treemaps. IEEE Trans. Vis. Comput. Graph. 2022, 29, 320–330.
- Johnson, N.; Cabrera, Á.A.; Plumb, G.; Talwalkar, A. Where does my model underperform? a human evaluation of slice discovery algorithms. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, Delft, The Netherlands, 6–9 November 2023; Volume 11, pp. 65–76.
- Hansen, L.; Seedat, N.; van der Schaar, M.; Petrovic, A. Reimagining synthetic tabular data generation through data-centric AI: A comprehensive benchmark. Adv. Neural Inf. Process. Syst. 2023, 36, 33781–33823.
- Anik, A.I.; Bunt, A. Data-centric explanations: Explaining training data of machine learning systems to promote transparency. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 8–13 May 2021; pp. 1–13.
- Pi, Y.; Shi, Y.; Du, S.; Huang, Y.; Wang, S. Unsupervised Projected Sample Selector for Active Learning. IEEE Trans. Big Data 2024, 1–14.
- Rausch, O.; Ben-Nun, T.; Dryden, N.; Ivanov, A.; Li, S.; Hoefler, T. A data-centric optimization framework for machine learning. In Proceedings of the 36th ACM International Conference on Supercomputing, Virtual, 27–30 June 2022; pp. 1–13.
- Majeed, A.; Hwang, S.O. A Data-Centric AI Paradigm for Socio-Industrial and Global Challenges. Electronics 2024, 13, 2156.
- Rahman, A. Data collection, wrangling, and pre-processing for AI assurance. In AI Assurance; Elsevier: Amsterdam, The Netherlands, 2023; pp. 321–338.
- Budach, L.; Feuerpfeil, M.; Ihde, N.; Nathansen, A.; Noack, N.; Patzlaff, H.; Harmouch, H.; Naumann, F. The Effects of Data Quality on Machine Learning Performance. arXiv 2022, arXiv:2207.14529.
- Majeed, A.; Hwang, S.O. Technical Analysis of Data-Centric and Model-Centric Artificial Intelligence. IT Prof. 2023, 25, 62–70.
- Seedat, N.; Imrie, F.; van der Schaar, M. DC-Check: A Data-Centric AI checklist to guide the development of reliable machine learning systems. arXiv 2022, arXiv:2211.05764.
- Li, P.; Rao, X.; Blase, J.; Zhang, Y.; Chu, X.; Zhang, C. CleanML: A study for evaluating the impact of data cleaning on ml classification tasks. In Proceedings of the 2021 IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 19–22 April 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 13–24.
- Azeroual, O. Data wrangling in database systems: Purging of dirty data. Data 2020, 5, 50.
- Maitra, C.; Seal, D.B.; De, R.K. NeuroDAVIS: A neural network model for data visualization. Neurocomputing 2024, 573, 127182.
- Gebru, T.; Morgenstern, J.; Vecchione, B.; Vaughan, J.W.; Wallach, H.; Daumé III, H.; Crawford, K. Datasheets for datasets. Commun. ACM 2021, 64, 86–92.
- Sloane, M. Here’s What’s Missing in the Quest to Make AI Fair; Nature: London, UK, 2022.
- Kumar, S.; Sharma, R.; Singh, V.; Tiwari, S.; Singh, S.K.; Datta, S. Potential Impact of Data-Centric AI on Society. IEEE Technol. Soc. Mag. 2023, 42, 98–107.
- Yoon, W.; Yoo, J.; Seo, S.; Sung, M.; Jeong, M.; Kim, G.; Kang, J. Data-Centric and Model-Centric Approaches for Biomedical Question Answering. In Proceedings of the International Conference of the Cross-Language Evaluation Forum for European Languages, Bologna, Italy, 5–8 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 204–216.
- Bossér, J.D.; Sörstadius, E.; Chehreghani, M.H. Model-centric and data-centric aspects of active learning for deep neural networks. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 15–18 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 5053–5062.
- Zhang, J.; Budhdeo, S.; William, W.; Cerrato, P.; Shuaib, H.; Sood, H.; Ashrafian, H.; Halamka, J.; Teo, J.T. Moving towards vertically integrated artificial intelligence development. NPJ Digit. Med. 2022, 5, 1–9.
- Rodríguez, A.; Kamarthi, H.; Prakash, B.A. Epidemic Forecasting with a Data-Centric Lens; Association for Computing Machinery: New York, NY, USA, 2022; pp. 4822–4823.
- Kuzdeuov, A.; Koishigarina, D.; Varol, H.A. AnyFace: A Data-Centric Approach For Input-Agnostic Face Detection. In Proceedings of the 2023 IEEE International Conference on Big Data and Smart Computing (BigComp), Jeju, Republic of Korea, 13–16 February 2023; pp. 211–218.
- DCServCG: A data-centric service code generation using deep learning. Eng. Appl. Artif. Intell. 2023, 123, 106304.
- Langer, A.; Mukherjee, A. Building Data-Centric Products. In Developing a Path to Data Dominance: Strategies for Digital Data-Centric Enterprises; Springer International Publishing: Berlin/Heidelberg, Germany, 2023; pp. 143–161.
- Baek, D.; Dasari, M.; Das, S.R.; Ryoo, J. DcSR: Practical Video Quality Enhancement Using Data-Centric Super Resolution; Association for Computing Machinery: New York, NY, USA, 2021; pp. 336–343.
- Stroke Prediction Dataset. Kaggle. Available online: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset (accessed on 7 June 2024).
- Sailasya, G.; Kumari, G.L.A. Analyzing the Performance of Stroke Prediction using ML Classification Algorithms. Int. J. Adv. Comput. Sci. Appl. 2021, 12.
- Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling tabular data using conditional GAN. Adv. Neural Inf. Process. Syst. 2019, 32.
- Dina, A.S.; Siddique, A.; Manivannan, D. Effect of balancing data using synthetic data on the performance of machine learning classifiers for intrusion detection in computer networks. IEEE Access 2022, 10, 96731–96747.
- Seedat, N.; Imrie, F.; van der Schaar, M. Navigating Data-Centric Artificial Intelligence with DC-Check: Advances, Challenges, and Opportunities. IEEE Trans. Artif. Intell. 2023, 1–15.
- Hussein, H.I.; Anwar, S.A. Synthetic data and reduction method to enhancing prediction in SVM to imbalanced data classification problem. In Proceedings of the AIP Conference Proceedings; AIP Publishing: Long Island, NY, USA, 2024; Volume 2750.
- Yun, J.; Lee, J.S. Learning from class-imbalanced data using misclassification-focusing generative adversarial networks. Expert Syst. Appl. 2024, 240, 122288.
- Juwara, L.; El-Hussuna, A.; El Emam, K. An evaluation of synthetic data augmentation for mitigating covariate bias in health data. Patterns 2024.
- Joshi, I.; Grimmer, M.; Rathgeb, C.; Busch, C.; Bremond, F.; Dantcheva, A. Synthetic data in human analysis: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 4957–4976.
- Touvron, H.; Bojanowski, P.; Caron, M.; Cord, M.; El-Nouby, A.; Grave, E.; Izacard, G.; Joulin, A.; Synnaeve, G.; Verbeek, J.; et al. ResMLP: Feedforward networks for image classification with data-efficient training. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5314–5321.
- Gill, S.S.; Xu, M.; Ottaviani, C.; Patros, P.; Bahsoon, R.; Shaghaghi, A.; Golec, M.; Stankovski, V.; Wu, H.; Abraham, A.; et al. AI for next generation computing: Emerging trends and future directions. Internet Things 2022, 19, 100514.
- Moin, A.; Challenger, M.; Badii, A.; Günnemann, S. Supporting AI Engineering on the IoT Edge through Model-Driven TinyML. In Proceedings of the 2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC), Los Alamitos, CA, USA, 27 June–1 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 884–893.
- Örnhag, M.V.; Güler, P.; Knyaginin, D.; Borg, M. Accelerating AI Using Next-Generation Hardware: Possibilities and Challenges With Analog In-Memory Computing. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 488–496.
Approach | Baseline accuracy (%) | Change (%) | New accuracy (%) |
---|---|---|---|
MC-AI approach | 76.20 | +0.00 | 76.20 |
DC-AI approach | 76.20 | +16.9 | 93.1 |
ML Algorithms | Results under Three Distinct Settings (Mostly MC-AI) | | |
---|---|---|---|
Decision tree | 0.7886∼0.8291 | 0.7986∼0.8123 | ∼83.31 |
SVM | 0.8186∼0.8491 | 0.8281∼0.8501 | ∼86.79 |
Random forest | 0.8286∼0.8798 | 0.8481∼0.8929 | ∼92.72 |
ML Algorithms | Results under Three Different Settings (Mostly DC-AI) | | |
---|---|---|---|
Decision tree | 0.8127∼0.8411 | 0.8103∼0.8513 | ∼89.11 |
SVM | 0.8386∼0.8701 | 0.9161∼0.9271 | ∼94.65 |
Random forest | 0.8802∼0.9198 | 0.8909∼0.9209 | ∼99.83 (≈100%) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Majeed, A.; Hwang, S.O. Towards Unlocking the Hidden Potentials of the Data-Centric AI Paradigm in the Modern Era. Appl. Syst. Innov. 2024, 7, 54. https://doi.org/10.3390/asi7040054