Review

A Data-Centric AI Paradigm for Socio-Industrial and Global Challenges

Department of Computer Engineering, Gachon University, Seongnam 13120, Republic of Korea
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(11), 2156; https://doi.org/10.3390/electronics13112156
Submission received: 1 May 2024 / Revised: 24 May 2024 / Accepted: 27 May 2024 / Published: 1 June 2024

Abstract

Due to huge investments by both the public and private sectors, artificial intelligence (AI) has made tremendous progress in solving multiple real-world problems such as disease diagnosis, chatbot misbehavior, and crime control. However, the large-scale development and widespread adoption of AI have been hindered by the model-centric mindset that only focuses on improving the code/architecture of AI models (e.g., tweaking the network architecture, shrinking model size, tuning hyper-parameters, etc.). Generally, AI encompasses a model (or code) that solves a given problem by extracting salient features from underlying data. However, when the AI model yields a low performance, developers iteratively improve the code/algorithm without paying due attention to other aspects such as data. This model-centric AI (MC-AI) approach is limited to only those few businesses/applications (language models, text analysis, etc.) where big data readily exists, and it cannot offer a feasible solution when good data are not available. However, in many real-world cases, giant datasets either do not exist or cannot be curated. Therefore, the AI community is searching for appropriate solutions to compensate for the lack of giant datasets without compromising model performance. In this context, we need a data-centric AI (DC-AI) approach in order to solve the problems faced by the conventional MC-AI approach, and to enhance the applicability of AI technology to domains where data are limited. From this perspective, we analyze and compare MC-AI and DC-AI, and highlight their working mechanisms. Then, we describe the crucial problems (social, performance, drift, affordance, etc.) of the conventional MC-AI approach, and identify opportunities to solve those crucial problems with DC-AI. We also provide details concerning the development of the DC-AI approach, and discuss many techniques that are vital in bringing DC-AI from theory to practice. 
Finally, we highlight enabling technologies that can contribute to realizing DC-AI, and discuss various noteworthy use cases where DC-AI is more suitable than MC-AI. Through this analysis, we intend to open up a new direction in AI technology to solve global problems (e.g., climate change, supply chain disruption) that are threatening human well-being around the globe.

1. Introduction

Artificial intelligence (AI) is one of the community-beneficial technologies with a wide range of applications in multiple sectors such as healthcare, predictive maintenance, smart cities, conversational applications, etc. AI has been rigorously upgraded from the perspectives of architecture (code, network, models, etc.), hyper-parameter optimization, model size reduction, and the amalgamation of AI techniques with other statistical methods. Usually, AI encompasses a model (or code) that solves a given problem by extracting predominant features (or patterns) from underlying data. In many practical applications, AI models yield deficient performance owing to multiple issues like low-quality or incomplete data, strict/loose learning rates, or other external conditions. There exist two prospects for augmenting the performance of AI: model-centric AI (MC-AI) and data-centric AI (DC-AI). In MC-AI, when the model yields substandard performance, developers iteratively improve the code/algorithm but rarely inspect the data (i.e., the type and amount of data are fixed). MC-AI yields marginal performance improvements, even with significant enhancements in the code, and has been regarded as computationally expensive and limited to those few businesses/applications where big data exists. However, in many real-world cases, giant datasets either do not exist or cannot be curated, owing to constrained budgets for data collection or a lack of expertise in handling datasets. Therefore, compensating for the lack of giant datasets without compromising model performance is challenging, and the AI community is demanding a feasible solution in this context. To this end, DC-AI can be a feasible approach that encompasses a series of data-tailored actions (e.g., data quality enhancement and debugging), and it can enhance AI model performance, which can contribute to developing high-quality AI systems for real-world problems. 
DC-AI is expected to change the horizon of AI research, which has been mainly based on MC-AI in the past three decades [1].
In particular, DC-AI can offer concrete solutions to data-related problems stemming from manufacturing and industrial applications with pertinent data-focused techniques and pipelines [2]. A wide range of applications can harness the potential of DC-AI and can benefit from it when data availability, collection, and/or quality are poor [3]. For example, autonomous vehicles rely on multiple sensors (e.g., cameras, lidars, radars, and GPS) to perceive their environment and make decisions. These sensors generate a massive amount of data that needs to be pre-processed to remove noise, correct errors, and ensure consistency across different sensors before it can be fused to create a comprehensive understanding of the vehicle’s operating environment, including obstacles, road conditions, and other vehicles. To this end, a data-centric approach can help identify and address data quality issues of both basic and advanced types such as missing or inconsistent data, low-resolution data, skewed data, low-fidelity data, etc. Good-quality data can augment the results/performance of most real-world, data-driven services such as healthcare and conversational assistants. However, there is a growing need for the fair and responsible use of data in the current AI and big data era (https://redasci.org/, accessed on 3 January 2024). To that end, DC-AI efforts are mandatory for data governance/use when it comes to decision-making for the well-being of the general public.
Although data have been one of the building blocks of AI systems from the beginning, developers often pay less attention to data when AI model performance is poor. The three common practices for fixing performance- or accuracy-related issues in AI systems are (i) fine-tuning the code of the model, (ii) obtaining more data, and (iii) applying an alternate model. However, the first strategy (also known as model-centric AI) yields marginal improvements in prediction/classification performance; the second strategy increases computing overhead (or entails a never-ending cycle of data collection); and the third strategy may increase the time to commercialization of AI products (i.e., from development to deployment). As a potential remedy for these problems, one needs to go back and scrutinize (i.e., debug) the whole process, even the stages where data are not being collected. However, tracing the whole process is costly, and may halt development for an indefinite period. In this context, a model-centric mindset alone may not work, and AI use in some sectors may diminish, or at least cannot yield the desired results. DC-AI can be a promising and attractive solution that fixes these problems with minimal overhead. Since the main focus of DC-AI is to improve data while reusing a pre-trained model, it can solve most problems stemming from data.
To the best of our knowledge, this new DC-AI paradigm has not been thoroughly investigated in conjunction with the MC-AI approach. Furthermore, the capabilities of DC-AI for solving the crucial problems of MC-AI have not been highlighted in the current literature. To cover this research gap, we provide insightful coverage of DC-AI, which is rapidly evolving and has become an emerging paradigm in the AI community. The major contributions of this work are as follows.
  • Insight into two mainstream approaches used for AI technology development: We identify two mainstream approaches (model-centric AI and data-centric AI) used in AI technology development that have pivotal importance in translating AI from the academic lab to the marketplace, and we discuss the integral relationships that are imperative in advancing AI from the perspective of solving real-world problems.
  • Pinpoint and categorize crucial problems of the MC-AI approach: We explore persistent and potential threats to AI systems (model/concept drift, societal issues, environmental concerns, performance quality) in real-life scenarios by using the MC-AI approach, and categorize them into six broad categories.
  • Highlight prospects of solving crucial problems of MC-AI with DC-AI: We explore and discuss the prospects of how DC-AI is invaluable in solving those crucial MC-AI problems when systematically applied in real settings. We perform a feasibility analysis of the envisioned concepts to emphasize the importance of DC-AI.
  • Practical knowledge about the DC-AI paradigm and associated topics: We uncover the workflow, potential benefits, and key components of DC-AI. We also provide details concerning the implementation of the DC-AI approach, and discuss many techniques/implementations that are vital in bringing DC-AI from theory to practice.
  • DC-AI use cases and enabling technologies: We present noteworthy use cases, and enabling technologies (frameworks, pipelines, etc.) that can play a vital role in realizing DC-AI in the coming years. As a result, AI technology can contribute more to value generation, real-world problem solutions, and large-scale adoption.
  • Challenges and future research/development avenues: We present challenges associated with the DC-AI approach that are currently hindering the large-scale developments in this paradigm. We also provide avenues for future research and developments. To the best of our knowledge, this is the first work presenting a concrete discussion on the DC-AI approach to solving many longstanding problems in the AI discipline and can provide a solid foundation for future research in this line of work.
The rest of the paper is structured as follows. Section 2 discusses the background and related work related to this study. Section 3 presents the basics of, and a comparison between, DC-AI and MC-AI approaches. Section 4 presents the crucial problems with the MC-AI approach. Section 5 presents the technical and detailed descriptions of crucial problems with the conventional MC-AI approach in real-world scenarios. Section 6 highlights the efficacy of DC-AI in solving crucial problems in MC-AI. Section 7 provides technical and detailed descriptions concerning the solution for the crucial problems of the MC-AI approach using the DC-AI approach. Section 8 presents enabling technologies and noteworthy use cases of the DC-AI approach. Section 9 discusses the main results obtained, limitations, main trends, challenges, and promising avenues for future research. We conclude this paper in Section 10.

2. Background and Related Work

In this section, we provide background about the subject matter covered in this paper and summarize related works concerning DC-AI and associated concepts.

2.1. Background

AI technology is making a huge impact in almost every field, owing to the rapid rise in the skilled AI workforce and significantly magnified investments by both the public and private sectors. Another key enabler of its profound success in most fields is the availability of diverse (e.g., tables, images, videos, audio, graphs, sensor readings, location traces, etc.) and abundant data [4]. The two key components of AI technology are code (C) and data (D), as given in Equation (1).
AI = D + C
MC-AI gives higher priority to C while DC-AI gives higher priority to D in the above equation. In the past, most AI researchers and practitioners have focused on advancing the code of AI models/algorithms. Consequently, a large number of new AI models were developed, leading to advancing the overall development status of this technology. Some of these developments focused on improving AI models’ applicability to new data modalities [5,6], and some focused on lowering the complications from the AI models by proposing various modifications such as quantization, weights pruning, and redundant layers removal [7,8]. Some researchers have developed large AI models by connecting millions of application programming interfaces (APIs) to enable rapid data access and to enhance the generalizability of AI models on unseen data [9]. In some cases, AI technologies were developed to curate data that are very close to real data to compensate for the lack of good data, or to train AI models with more data [10,11]. All these developments have mostly contributed to improving the accuracy of AI models, lowering the complications in terms of parameters/FLOPs (floating point operations), and extending the horizons of AI applications. Most of the above developments fall into the category of MC-AI, which has been rigorously investigated to advance the technical effectiveness of AI in solving many real-world problems.
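To make one of these model-centric techniques concrete, the sketch below illustrates magnitude-based weight pruning, where the smallest-magnitude weights are zeroed out to shrink the effective model size. The function name and selection scheme are illustrative assumptions only; production frameworks implement far more sophisticated pruning schedules and retraining loops.

```python
def prune_weights(weights, sparsity=0.5):
    """Zero out roughly the smallest-magnitude `sparsity` fraction of weights.

    Ties at the cutoff magnitude may cause slightly more weights to be pruned.
    """
    k = int(len(weights) * sparsity)  # number of weights to zero out
    if k == 0:
        return list(weights)
    cutoff = sorted(abs(w) for w in weights)[k - 1]  # k-th smallest magnitude
    return [0.0 if abs(w) <= cutoff else w for w in weights]

weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.03]
pruned = prune_weights(weights, sparsity=0.5)
print(pruned)  # → [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

The surviving large-magnitude weights carry most of the model's signal, which is why such pruning can reduce parameters/FLOPs with limited accuracy loss.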
The main focus of MC-AI is to advance the code of AI algorithms/models, rarely inspecting or improving the data. For example, a naive perceptron model that can work with text data was extended to complex neural networks to solve real-life problems in which the input can be multimedia (i.e., images or videos). Similarly, the complex neural network size was reduced by pruning redundant weights to enable AI models to work on resource-constrained devices like microcontroller units (MCUs). In some cases, the AI models developed for one particular application were applied to other related applications with minor modifications in the code/model. The essence of MC-AI is to advance the technical supremacy of AI by investigating new models, reducing model size, enhancing code, and improving network architectures. However, MC-AI pays less attention to data quality or consistency, leading to poor adoption in real-world cases when giant datasets are either unavailable or cannot be immediately curated. Due to this model-centric mindset, many AI models fail to yield the desired performance when data are of poor quality. In addition, improving the AI model alone can impact humans in various ways through negative consequences of AI, such as biased decisions, low accuracy in prediction/classification tasks, data/concept drift, higher CO2 emissions, technology fragmentation, and poor control. This MC-AI has given rise to unequal development (more focus on model development, but less focus on data quality/consistency) that can lead to poor adoption of AI in modern society. In Figure 1, we highlight six major developments (or research trajectories) in AI technology stemming from MC-AI. From the analysis presented in Figure 1, we can see that MC-AI gives preference to code/algorithms while ignoring the dataset, which is a vital component in AI technology advancement. 
Recently, MC-AI has made huge progress in many applications, but data quality seriously impacts this approach, and therefore, another feasible approach is imperative in rectifying AI technology development for value creation and wide adoption.
Andrew Ng coined the concept of DC-AI in 2021 in a live-stream session. According to Ng, the core idea of DC-AI is to pay ample attention to the data while benefiting from the pre-trained (or already developed) AI models/code as much as possible. Since the inception of DC-AI, many breakthroughs have been achieved that were unlikely to be achieved with MC-AI alone, and many are underway around the globe [12]. In recent years, the significance of data in AI systems has been substantially magnified, giving rise to the notion/concept of DC-AI. With the inception of DC-AI, the attention of AI researchers and practitioners has gradually turned from tweaking or improving model design alone to augmenting the quality of the data to build transformative AI systems.
The essence of DC-AI is to improve data with a set of sophisticated techniques, and this concept can be confused with many other established concepts in the AI field. However, it is quite different from existing established concepts such as pre-processing, data-driven modeling, and data-driven innovation. In our recent work [12], we differentiated pre-processing from DC-AI to prevent possible misconceptions in the AI community and to increase the work on this emerging paradigm. Data-driven modeling and innovation also differ from DC-AI, as they mainly focus on innovation discovery/execution by leveraging data and AI models for financial gains. DC-AI is all about engineering and monitoring data throughout the lifecycle of an AI project while resolving any potential problem that may arise in the lifecycle. It is also about the valuation of data against a set of questions or checklists, raising a flag when any criterion is not met. DC-AI can contribute to cost reduction and to navigating the potential drawbacks of conventional MC-AI. It also provides opportunities to train AI models with less but better data, contributing to cost/energy reduction and sustainability goals. The innovation of this work lies in (i) the comparison and introduction of two tracks (i.e., model-centric and data-centric) in AI development, (ii) the systematic introduction of different concepts associated with DC-AI, (iii) a discussion of the potential benefits of DC-AI, (iv) a discussion of various crucial problems in the conventional MC-AI approach, (v) exploring possibilities to solve longstanding crucial problems in the MC-AI approach through this brand-new paradigm, (vi) providing a list of techniques that can contribute to realizing DC-AI in practice, (vii) providing knowledge about enabling technologies for DC-AI, noteworthy use cases, and drawbacks, and (viii) opening a new research track in the AI field to make current developments more robust and dependable.
Lastly, through this article, we intend to highlight the possibilities of solving those problems (e.g., AI controllability, climate change, supply chain) that have a very big impact on our lives rather than accomplishing a few business motives with AI models.
Most real-world datasets are noisy, messy, sparse, scattered, and under-representative, leading to many developments that solely improve data quality and make data production-ready. Just like MC-AI, some developments are underway to seamlessly improve data quality in AI-based projects. We highlight two notable developments, (i) feature stores and (ii) data engineering pipelines, which have recently been suggested to improve data quality in real scenarios to build reliable AI systems [13,14]. Both DC-AI and MC-AI are imperative to advancing AI technology, which, in turn, can assist in developing community-beneficial systems. Although MC-AI yields better performance in some specific problems, such as natural language processing [15], it is mostly limited to a few businesses/applications where giant datasets readily exist. To resolve the issues of the MC-AI approach, DC-AI can be a workable approach that encompasses a series of data-tailored actions and can significantly enhance AI model performance. DC-AI is expected to change the MC-AI mindset, fostering AI technology development and advancement in the future [16]. Taking practical steps to adopt DC-AI will have a significant impact on solving longstanding problems in conventional AI, and can increase AI adoption in industrial settings. Furthermore, the research span of MC-AI is very wide (e.g., more than 30 years), whereas DC-AI has been investigated by only a few researchers in the past three years. Hence, more research and development are needed on the DC-AI paradigm in the modern data-driven era.

2.2. Related Work

In recent years, the DC-AI topic has been gaining compelling interest from researchers, and many empirical and theoretical studies have been published. However, most studies are theoretical, advocating the potential benefits of DC-AI or highlighting preliminary results of this paradigm on diverse problems. Jakubik et al. [17] highlighted the promises of DC-AI in the business and information systems engineering (BISE) domain. The authors focused on introducing DC-AI to the BISE community and summarized the potential benefits of this paradigm for that community. Clemente et al. [18] developed a tool named ydata-profiling to find different data quality issues in complex data. Tools like ydata-profiling are preliminary steps toward the implementation of DC-AI. In [19], the authors discuss the implementation of the DC-AI approach for industrial applications. The authors employed a collaboration strategy between ML engineers and domain experts to improve data quality for manufacturing and machining. In [20], the authors discussed practical ways to understand large and complex datasets by developing a system named VIS4ML. The proposed system assists domain experts in preparing sound data for ML projects. Angelakis et al. [21] explored the possibilities of applying the DC-AI approach to class-specific bias reduction problems by leveraging diverse DL models. Kumar et al. [22] discussed various impacts of this novel paradigm on our society. The authors highlighted the need to navigate the negative impacts of this technology. Zha et al. [23] discussed the three generic goals of DC-AI: data maintenance, training data development, and inference data development. Huyn et al. [24] proposed ways to figure out inconsistencies in training data, leading to the reliability enhancement of ML models. Ilager et al. [25] extended the DC-AI approach to edge devices.
The authors identified the challenges of Edge-AI and proposed efficient strategies for data processing, which are termed data-centric Edge-AI. Elhefnawy et al. [26] proposed the fusion methodology for heterogeneous data in industrial settings to build accurate DL models. The authors improved the results by 20% by employing DC-AI concepts on ultrasonic and pressure data.
Several studies have recently explored concepts related to DC-AI, such as the basic concept of DC-AI, the conceptual difference between MC-AI and DC-AI, the perspectives and challenges of DC-AI in modern times, and DC-AI use in some specific scenarios (e.g., anomaly detection, industry applications, time series analysis, entity linking, etc.) [23,27,28,29,30]. We affirm the contributions of the above studies, but systematic coverage of DC-AI together with MC-AI has not been thoroughly provided in them. To the best of our knowledge, the key problems with the MC-AI approach and their solutions through amalgamating the DC-AI paradigm with MC-AI have not been identified by any of the previous works. Furthermore, the hidden benefits of the DC-AI paradigm in terms of advancing AI technology by solving longstanding MC-AI problems remain unexplored in the current literature. Lastly, the important use cases where DC-AI is urgently required, and the enabling technologies that are helping to realize DC-AI, have not been covered in previous studies. Addressing these research gaps and providing systematic knowledge about this fledgling paradigm are the main motivations behind this research. Our work and analysis provide comprehensive knowledge about the DC-AI paradigm and can serve as a solid foundation for subsequent research in this line of work.

3. Working Mechanisms of Model-Centric AI and Data-Centric AI

3.1. Workflow and Comparisons between Model-Centric AI and Data-Centric AI

In this section, we explain the workings of DC-AI and MC-AI in real-life scenarios in a systematic way. The concept of DC-AI was coined by Andrew Ng in a live-stream session hosted on 24 March 2021 [31]. In contrast, MC-AI has been a commonly used practice (data → model → accuracy/other results) for the past 30 years. The former is recent, whereas the latter has been adopted in industry and academia for the past three decades. The workflow diagram of the model-centric AI that is mostly applied to any given real-world AI-related problem (e.g., human activity recognition, or fault diagnosis of bearings using signal data) is shown in Figure 2.
As shown in Figure 2, data are needed to train AI models. Hence, after the problem definition, relevant data are collected from people/environments depending upon the nature of the problem under investigation. After data collection and minimal pre-processing, the AI model is trained using the collected data. After some analysis and performance tests, the model is deployed in real-world settings, and performance is analyzed with new data. If the performance is not good, the model (code) is tuned to enhance performance. In these circumstances, the developer only focuses on the code, rather than the data, to augment performance. This is regarded as the main flaw in the MC-AI approach when it comes to most industrial sector problems in which benchmark datasets are not available.
In contrast, DC-AI, shown in Figure 3, has some similarity to MC-AI, but two fundamental differences exist (Steps 3 and 8). In Step 3, the data are screened from multiple aspects (completeness, accuracy, timeliness, relevance, outliers, alignment, missing values, labels, size, data-source analysis, annotations, data versioning, feature engineering, domain analysis, value formats, etc.) before being fed into the AI model. In Step 8, if there are performance-related issues, developers need to look into the data rather than the model alone. This approach has proven most effective in scenarios where obtaining more (or better quality) data is not possible [33]. DC-AI has shown promising results where the one-model-fits-all concept becomes invalid. In addition, DC-AI favors industrial settings, and therefore, it offers many promising applications in the years to come. Detailed comparisons of the MC-AI and DC-AI approaches are given in Table 1 and Table 2.
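To make Step 3 concrete, the following minimal sketch screens a small tabular dataset for three of the aspects listed above: missing values, unlabeled samples, and outliers. The function name, z-score threshold, and choice of checks are illustrative assumptions rather than a standard API; real DC-AI pipelines cover many more aspects and data modalities.

```python
from statistics import mean, stdev

def screen_dataset(rows, labels, z_threshold=2.0):
    """Return a list of data issues found before feeding the data to a model.

    `z_threshold` is an illustrative cutoff; production pipelines would use
    more robust outlier statistics and cover many more checks.
    """
    issues = []
    # Completeness check: flag any missing feature values (None)
    if any(v is None for row in rows for v in row):
        issues.append("missing values")
    # Label coverage check: every sample must have a label
    if len(labels) != len(rows) or any(l is None for l in labels):
        issues.append("unlabeled samples")
    # Outlier check: simple z-score screen on the first feature column
    col = [row[0] for row in rows if row[0] is not None]
    if len(col) > 1:
        m, s = mean(col), stdev(col)
        if s > 0 and any(abs(v - m) / s > z_threshold for v in col):
            issues.append("outliers")
    return issues

rows = [[1.0, 2.0], [1.1, 2.1], [0.9, None], [1.2, 2.0], [1.0, 2.2], [100.0, 2.1]]
labels = ["a", "b", "a", "b", "a", None]
print(screen_dataset(rows, labels))  # → ['missing values', 'unlabeled samples', 'outliers']
```

Each flagged issue corresponds to a data-tailored action (imputation, re-annotation, outlier review) taken before training, rather than a change to the model code.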

3.2. Implementation of DC-AI Approach

The DC-AI approach can be implemented in the following two ways:
  • By extending open-source implementations: In recent years, many end-to-end ML projects and pipelines have been developed and released under open-source licenses. These developments are the best starting point for implementing the latest techniques such as DC-AI. These open-source implementations provide the code, reports, and manuals for the development of the ML lifecycle, and therefore, one can adopt and extend these implementations for specific problems. The open-source implementations can run on any platform, and the code owners are constantly upgrading those implementations, saving time for ML developers who are working in similar domains. For example, MLFLOW [34] is an open-source platform that provides a complete implementation of the ML lifecycle, and its code can be accessed from GitHub (https://github.com/mlflow/mlflow/, accessed on 5 January 2024). Any organization or company that wishes to develop an ML-based project can acquire the code of MLFLOW and customize it according to its needs. By utilizing open-source code, time and development costs can be reduced significantly, and the software can be delivered to the market in a very short time. Similarly, there exist many open-source implementations of DC-AI such as dcbench (https://github.com/data-centric-ai/dcbench, accessed on 5 January 2024), Xel [35], auto-sklearn (https://github.com/automl/auto-sklearn, accessed on 5 January 2024), TSxtend [36], KNIME (https://www.knime.com/, accessed on 5 January 2024), and the MLFLOW enhancement in [37]. All these open-source tools, prototypes, and pipelines include basic functionalities (e.g., data pre-processing, feature selection, feature extraction, model selection, etc.) of DC-AI, and therefore, they can be extended to include the advanced functions (e.g., data augmentation, alignment, consistency, etc.) of DC-AI. To implement DC-AI, exploring open-source implementations is a feasible way to begin.
  • By implementing prototypes, pipelines, frameworks, etc., from scratch: Another, relatively tedious, way to implement DC-AI for a specific problem is to initiate development from scratch. In this method, small-scale prototypes with basic DC-AI functions such as outlier detection, missing value handling, duplicate removal, etc., can be developed first, and later extended to provide advanced functions (e.g., data augmentation, active learning, etc.). An example of such a method is the DataPerf tool [38], which was developed to improve data quality for a variety of ML applications. These kinds of implementations have a potential impact on ML applications and can enable AI adoption for new problems. However, developing software packages for AI from scratch is time-consuming, and implementing DC-AI this way can take much longer than the first method. Nevertheless, large-scale companies with a substantial number of technical employees can implement this method within a short period. Therefore, the choice of DC-AI implementation can be made depending on the specific problem, budget, development needs, commercialization plans, etc.
Recently, many well-reputed conferences such as Neural Information Processing Systems (NeurIPS) have been arranging special workshops on DC-AI (https://datacentricai.org/neurips21/, accessed on 7 January 2024), where industry leaders share ideas regarding the implementation of DC-AI. Furthermore, these venues accept only technical papers that provide implementations of tools and methodologies, algorithms for working with limited labeled data, algorithms for improving data quality and label efficiency, and responsible AI. Most papers also release their code as open source, which can serve as a stepping stone for implementing DC-AI techniques in specific problems with slight modifications. Recently, the International Conference on Machine Learning (ICML) also arranged a competition (https://icml.cc/Conferences/2022/ScheduleMultitrack?event=19951, accessed on 7 January 2024) on DC-AI to foster the implementation of DC-AI techniques. As a result, some practical tools were developed to improve data quality, leading to an overall enhancement in DC-AI [39]. Additionally, many initiatives (https://www.mgi.gov/, accessed on 7 January 2024) have recently been launched to create shared repositories of good data for training ML models. Furthermore, some projects such as AutoGluon (https://auto.gluon.ai/stable/index.html, accessed on 7 January 2024) have already made large amounts of code public, which can be customized depending upon the problem/task. All these activities are playing a central role in the implementation of DC-AI techniques across the globe.
The implementation of DC-AI systems is broadly similar to that of traditional systems. However, in DC-AI systems, there are many principles and rules concerning data that need to be followed throughout the lifecycle [40,41]. In a traditional AI system, ample attention is paid to the AI code/algorithms, and data are rarely inspected or improved. Figure 4 presents the implementation architecture of DC-AI in the medical domain along with the supportive strategies/techniques. Specifically, we demonstrate that a DC-AI system can be implemented to allow users to check the possibility of stroke based on demographics. The key steps from the DC-AI perspective are to include domain experts in the loop, curate data from offline and online sources, improve data quality with the help of multiple data engineering techniques, evaluate data multiple times, and continuously monitor the developed system. We present the technology stack of DC-AI and the actors involved in the development of such systems in Figure 4. The involvement of domain experts and the rigorous evaluation of data contribute to the development of high-quality AI systems.
In Figure 4, we provide the generic implementation architecture of DC-AI-based systems for medical tasks (e.g., stroke prediction from text data). In this implementation architecture, there are ten important steps: (i) problem definition, (ii) data collection, (iii) data analysis and basic tuning, (iv) checklist-based data evaluation, (v) suitable AI/ML model selection, (vi) AI/ML model training, (vii) model evaluation with high-quality test data as well as adversarial examples, (viii) model deployment, (ix) serving users, and (x) monitoring the deployed model. We also discuss the DC-AI implementation stack and practical techniques that can be applied in each step. It is worth noting that this system involves many data engineers and domain experts to curate sound, representative, and diverse datasets. In addition, an extensive checklist and DC-AI tools are used to curate datasets that are complete in most respects. The items in the checklist are generic and can be extended/shortened depending on the problem at hand. In some cases, generative tools such as generative adversarial networks (GANs) are also used to produce synthetic data that enhance the diversity of the training data. A few steps (e.g., steps (i) and (v)) can be performed manually in this system. Model testing is performed with normal samples as well as adversarial examples to enhance model robustness and resilience [42]. It is worth noting that some steps might be omitted, while some cases may require additional steps in the proposed system. After preparation, the data can be stored in structured (SQL) as well as unstructured databases (MongoDB, Elasticsearch, etc.). The number of techniques used in curating good-quality data can also vary depending on the data modalities used in the system. Lastly, pre-trained AI models with slight customization can also be employed to shorten the development duration of projects.
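The checklist-based data evaluation of step (iv) can be sketched as a set of automated checks over a tabular dataset. The check names, the 80% majority-class threshold, and the stroke-style fields below are illustrative assumptions for demonstration, not part of the architecture in Figure 4:

```python
# A minimal sketch of checklist-based data evaluation; checks and
# thresholds are illustrative and would be extended per problem.
def evaluate_checklist(rows, label_key, required_keys, max_majority=0.8):
    """Run simple data-quality checks and return {check_name: passed}."""
    report = {}
    # Completeness: every required field is present and non-null in every row.
    report["completeness"] = all(
        r.get(k) is not None for r in rows for k in required_keys
    )
    # Duplicates: no two rows share identical values for all required fields.
    fingerprints = [tuple(r.get(k) for k in required_keys) for r in rows]
    report["no_duplicates"] = len(set(fingerprints)) == len(fingerprints)
    # Class balance: the majority class must not dominate the labels.
    labels = [r.get(label_key) for r in rows]
    majority = max(labels.count(v) for v in set(labels)) / len(labels)
    report["class_balance"] = majority <= max_majority
    return report

rows = [
    {"age": 50, "bmi": 24.0, "stroke": 0},
    {"age": 61, "bmi": 30.1, "stroke": 1},
    {"age": 45, "bmi": 22.5, "stroke": 0},
    {"age": 70, "bmi": 28.3, "stroke": 1},
]
report = evaluate_checklist(rows, "stroke", ["age", "bmi", "stroke"])
```

A dataset would only proceed to model selection (step v) when all checks pass; failed checks would send the data back to the engineering loop.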

3.3. Supportive Techniques for DC-AI Development/Implementation

There exist many data engineering strategies/techniques that can support the development of DC-AI-based systems/frameworks in practice. However, ample attention is required to choose relevant strategies and their correct order of application for specific problems. For example, in anomaly detection problems, data cleaning, wrangling, data augmentation, and feature selection may be sufficient [43,44]. In contrast, in medical domains involving image/video data, more sophisticated data engineering techniques such as active learning, confident learning, and data augmentation are needed alongside the basic techniques [45]. In DC-AI, domain experts are engaged throughout the lifecycle of the project to spot and rectify data problems. The stack of DC-AI technology is constantly expanding, and many sophisticated techniques are emerging for different data modalities. Table 3 summarizes the important techniques (mainly data-related) that are vital when developing DC-AI-based systems. Further technical information about these techniques can be found in recently published articles at reputed venues [28,46,47,48,49,50,51,52,53,54,55,56,57]. In the coming years, many innovative techniques are expected in DC-AI to improve the quality of data enclosed in different modalities [58]. Moreover, many conferences/workshops under the theme of DC-AI are expected to be held, and therefore, many prototypes/frameworks/pipelines will likely be introduced. Consequently, DC-AI is expected to become one of the leading technologies of modern times.
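The core idea behind confident learning, one of the techniques mentioned above, can be sketched in a few lines: compute out-of-fold predicted probabilities and flag samples whose given label receives low probability. This is a simplified sketch assuming scikit-learn is available; the synthetic two-class data, the planted label error at index 3, and the 0.3 threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
# Two well-separated classes of synthetic 2-D points.
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(4.0, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
y_noisy = y.copy()
y_noisy[3] = 1  # plant a single label error for demonstration

# Out-of-fold predicted probabilities, as in confident learning.
proba = cross_val_predict(LogisticRegression(), X, y_noisy,
                          cv=5, method="predict_proba")
# Flag samples whose given label receives low out-of-fold probability.
given_label_proba = proba[np.arange(len(y_noisy)), y_noisy]
suspects = np.where(given_label_proba < 0.3)[0]
```

Flagged samples would then be routed to domain experts for re-labeling rather than silently dropped, in line with the human-in-the-loop principle of DC-AI.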

3.4. Practical Guidelines for DC-AI Paradigm

The core concept of DC-AI is to identify data-related problems that can undermine the learning abilities of the underlying AI models, and subsequently to improve the data with the help of different techniques. Figure 5 shows the six main principles of the DC-AI approach, as explained in [40]. Referring to Figure 5, DC-AI tends to curate data of supreme quality for AI models by leveraging data-tailored operations and fostering extensive collaboration between AI and domain experts. Furthermore, data consistency is regarded as a vital component in creating consistent views of the data. Moreover, the simultaneous improvement of both model and data is imperative to advance AI systems, leading to better performance in downstream AI tasks. In conclusion, DC-AI magnifies the data's role in the entire lifecycle of the AI/ML project and goes beyond conventional approaches like pre-processing and data wrangling.
Next, we describe some practical guidelines (e.g., best practices) for researchers and practitioners who intend to adopt this paradigm. In the planning phase of data collection, the best practice is to identify all required variables/features and determine the optimal data size. While collecting data, it is paramount to choose a diverse sample of people/devices and to identify potential sources of bias. The best practice is to avoid random data collection, as it can lead to imbalance in the collected data. Similarly, including fewer variables than required may lead to missed analytics/prediction opportunities. In the early stage of pre-processing, the best practice is to select relevant techniques for data quality enhancement from the available pool and to perform error analysis. In the later phase, the best practice is to augment the data if there are too few samples for some classes or the dataset is small. In data augmentation, the best practice is to add a limited number of high-quality samples generated with GANs/statistical methods so as to preserve the truthfulness of the real data. To fix data quality issues, the best practice is to harness the potential of automated learning algorithms such as isolation forest (outlier removal), confident learning (label error identification), and active learning (annotation and labeling). While preparing data, it is vital to involve different domain experts and ensure extensive collaboration between them to overcome inconsistencies in the final data. Laborious tasks such as annotation, segmentation, labeling, and noise removal can be delegated to automated techniques such as active learning. The best practice is to properly document the data's origins and their evolution over time in a 'datasheet for datasets' [60]. Before data finalization, the best practice is to ensure consensus between the involved experts, and measures like the kappa coefficient can be used to quantify it.
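Two of the best practices above, automated outlier removal with an isolation forest and measuring expert consensus with the kappa coefficient, can be sketched with scikit-learn (assumed available here). The synthetic data, the 3% contamination rate, and the two hypothetical expert label sets are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(42)
# 97 inliers near the origin plus 3 planted extreme outliers.
X = np.vstack([rng.normal(0, 1, (97, 2)),
               [[10, 10], [-9, 11], [12, -10]]])
iso = IsolationForest(contamination=0.03, random_state=0).fit(X)
is_outlier = iso.predict(X) == -1          # -1 marks predicted outliers
X_clean = X[~is_outlier]                   # data retained for training

# Inter-annotator agreement on 10 labels from two hypothetical experts;
# Cohen's kappa corrects raw agreement for chance agreement.
expert_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
expert_b = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
kappa = cohen_kappa_score(expert_a, expert_b)
```

A low kappa value would signal the need for another round of expert discussion before the labels are finalized.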
Before model training, the best practice is to use a data checklist to verify the order and correctness of the data engineering (or quality enhancement) methods employed to clean and prepare the data. Furthermore, the best practice is to analyze the data size and properly divide the data into training, validation, and test sets.
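The data division mentioned above can be sketched as a stratified 60/20/20 split using scikit-learn (the ratios and synthetic data are illustrative assumptions; the actual proportions would depend on the data size analysis):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))          # illustrative feature matrix
y = rng.integers(0, 2, size=200)       # illustrative binary labels

# First carve out a held-out test set, then split the remainder into
# training and validation data, stratifying on the label both times.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)
# Result: 60% training, 20% validation, 20% test data.
```

Stratification keeps the class proportions of each subset close to those of the full dataset, which matters for the imbalance issues discussed above.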
In model training, it is vital to ensure that all data points are used equally by the model. In this regard, the best practice is to assess the impact of each sample on model performance and identify ambiguous or under-performing sub-populations via libraries like Influenciae [61]. In some cases, it is best to sort the training data by complexity to ensure smooth learning; techniques like curriculum learning can be adopted to accomplish such data-sorting tasks. During training, the best practice is to present diverse examples to ensure comprehensive learning. In the testing phase, the best practice is to test the model with data that closely mimic real-world settings to enhance generalization to unseen data. Furthermore, the data and model should be simultaneously and iteratively upgraded to capture the relationship between input and output. The best practice is to employ multiple techniques rather than one and to document the impact of each. Throughout the entire DC-AI lifecycle, it is paramount to ensure proper data governance and to prevent social issues like privacy disclosure or personal data misuse.
Real-world examples of the successful implementation of DC-AI include 100% accuracy in anomaly detection scenarios with time series data [33], a 6% accuracy enhancement together with a 1.54× reduction in model size [2], accuracy ≥ 99% while processing noisy data in industrial applications [43], a 22% accuracy enhancement in defect detection scenarios [62], and medical image analysis with improved accuracy [63]. Some methods have harnessed the potential of both MC-AI and DC-AI to assess the quality of datasets [64]. Some developments have been made to enhance the robustness of multiple AI models by employing this new paradigm [65]. Lastly, some successful prototypes and small-scale tools like Influenciae, MLPerf, and dcbench have also been developed to realize this new paradigm [1]. In the coming years, more developments are expected in this paradigm to advance AI systems and extend their robustness/performance.

3.5. Datasets Used in DC-AI Research

Thus far, DC-AI concepts have been tested on data of different modalities including tables, graphs, time series, images, audio, videos, and hybrid data. Most of the datasets used in DC-AI research are available at public repositories such as UCI (https://archive.ics.uci.edu/, accessed on 20 May 2024), Kaggle (https://www.kaggle.com/, accessed on 20 May 2024), GitHub (https://github.com/, accessed on 20 May 2024), and KEEL (https://sci2s.ugr.es/keel/datasets.php, accessed on 20 May 2024). Some studies have used real-time data stemming from wearable devices or sensors and subsequently applied DC-AI concepts [26]. Lastly, the DC-AI concept has also been applied to datasets prepared by different academic institutions for research purposes. An example of such a dataset is CIFAR-10, which is hosted and maintained by the University of Toronto (https://www.cs.toronto.edu/~kriz/cifar.html, accessed on 20 May 2024) for academic purposes.

4. Crucial Problems of the Conventional MC-AI Approach in Real-World Scenarios

Recently, many negative consequences of AI systems have been observed by researchers when it comes to ethical/moral values and the security/privacy of personal data. In simple terms, we have started living with emerging technologies at the cost of our liberty, human rights, freedom, and moral values [66]. Soon, rectifying/controlling AI systems to lower the unintended consequences on humans will be challenging. In many sectors and countries, substantial efforts have been devoted to controlling the harm from technology [67]. Figure 6 shows the key components of AI quality (https://truera.com/ai-quality-education/, accessed on 22 March 2024) and the crucial problems of the MC-AI approach. These problems, along with their subtypes, have not been reported in the literature. The four key components that are vital in AI system quality are defined as follows:
  • Model performance (M_p) includes the key parameters/attributes concerning business value and risk.
  • Societal impacts (S_i) include the key parameters/attributes concerning social values and risks.
  • Data quality (D_q) includes the key parameters/attributes of the dataset used to train, run, and/or test AI models.
  • Operational complexity (O_c) includes the key parameters/attributes that help humans work productively with AI systems. Furthermore, it includes attributes that help AI systems work with other systems.
As noted above, one of the main components contributing to the quality of AI systems is the data. However, data were regarded as a one-time (or fixed) artifact in most MC-AI-based techniques, and therefore, many crucial problems can emerge in realistic environments [68]. In this work, we identify crucial problems in the MC-AI approach and classify them into six broad categories, as shown in Figure 6b. Concise descriptions of the crucial problems in MC-AI are given below; technical and detailed descriptions of these problems are given in Section 5.
  • Societal problems: When it comes to the question of whether AI is a socially beneficial technology or not, unfortunately, the answer is unclear, neither absolutely yes nor no. There can be various reasons, but the most important one is uncertainty about how AI works, how it generates prediction/classification results, how it uses data, and whether deploying an AI system is useful or not. Without answering such questions, the transition of AI from academic labs to the market is not possible, and research contributions remain limited. In this regard, we highlight seven societal problems in Figure 6b that limit the adoption of AI systems in real-world scenarios. Proper solutions to these problems can make AI a more socially beneficial technology from the perspective of its adoption/use.
  • Performance problems: AI models' performance differs markedly between academic labs and the market, especially in the medical domain [68]. The main reason behind this imbalanced performance is that special attention is paid solely to internal issues concerning AI code/algorithms. However, when it comes to the actual use of AI technology in real-world scenarios, factors such as unrepresentative and incomplete data, unpredictable conditions, a lack of human expertise, and the diverse characteristics of deployment environments can degrade the performance of even a very carefully built AI model. Hence, optimizing networks and improving only the internal parts of AI algorithms cannot yield reliable results in real-world scenarios. For the MC-AI approach, we identify five performance problems in Figure 6b that impact the adoption (or technological development) of AI systems in real-life scenarios. Effective solutions to these problems can improve the confidence stakeholders have in AI-powered systems.
  • Drift problems: In many real-world scenarios, a significant amount of time is spent taking an AI model from development to deployment, yet AI models do not yield the desired results under actual use. Furthermore, a common assumption while developing an AI model is that the future will resemble the past. For example, most AI models assume that training data are fixed and will not change much by the time of deployment/use. However, this is not the case in realistic scenarios. For example, most AI-based facial recognition systems that were trained and developed before SARS-CoV-2 yield infeasible results in the pandemic era [69]. The main reason for this problem is the drastic shift in the data (images now contain masks) and the environment. In addition, most developers test a model with the same data repeatedly, which can lead to poor performance in out-of-dataset scenarios. In AI systems, drift can occur due to changes in the external world, the application of a model to a new context, or training data that are limited/incomplete. This paper pinpoints three drift-related problems (see Figure 6b) with the MC-AI approach that require urgent solutions from the AI community to enhance the practicality of AI systems in real-world scenarios.
  • Sustainability problems: As pointed out by previous studies, AI can significantly hurt or help the environment [70]. On the one hand, the use of AI helps to augment the performance of companies through prediction and classification. On the other hand, the energy consumed in training large-scale AI models increases CO2 emissions, impacting environmental safety. Recently, AI has been rigorously investigated to help address the emerging challenges of climate change [71]. This paper pinpoints four sustainability-related problems of the MC-AI approach (see Figure 6b). For example, improving code and rebuilding models on large datasets require graphical processing units (GPUs) and high-performance centers (HPCs), leading to negative impacts on the environment (e.g., carbon emissions). In some cases, training a large AI model can create a CO2 footprint roughly equivalent to the lifetime emissions of five cars [72]. Considering these issues, fiddling with complex AI models and excessive rebuilding introduce environmental challenges. To that end, substantial efforts are required to address the sustainability problems of AI.
  • Affordance problems: AI technology is advancing with time and is influencing most aspects of our lives. However, when it comes to the understandability/affordability of AI systems, there is a significant gap between human wisdom and AI-based systems. For example, a human can easily recognize cats and dogs, even from blurred images. However, to perform the same task with an AI system, a voluminous amount of data and exhaustive training are needed. Furthermore, utilizing advanced technology such as AI in a small business can bring only financial loss/overhead [73]. This paper pinpoints four problems of the MC-AI approach concerning the affordances of AI in realistic scenarios (see Figure 6b). The proper solution to these problems is imperative as AI becomes increasingly integrated with many businesses.
  • Controllability problems: Controllability is a major issue and requires urgent attention from the research community to harness the benefits of AI while avoiding its pitfalls [74]. Recent research has highlighted that controlling AI may no longer be possible, which can pose serious threats to human safety and lifestyles [74]. On the other hand, simply empowering the user to control AI technology is not a good solution [75]. Proper control of AI technology is imperative in applications such as self-driving cars, voice assistants, drug discovery and repurposing, and robots. The current MC-AI approach puts little focus on controllability, and therefore, it can lead to financial loss and adverse social issues. For example, an AI-powered chatbot made racist comments about minorities and was removed from Facebook (https://www.theguardian.com/world/2021/jan/14/time-to-properly-socialise-hate-speech-ai-chatbot-pulled-from-facebook, accessed on 5 February 2024). This article identifies five types of controllability problems under the MC-AI approach (see Figure 6b). The controllability problems need an effective solution to lessen their consequences on human safety and power.

5. Technical and Detailed Descriptions of Crucial Problems with the Conventional MC-AI Approach in Real-World Scenarios

In this pioneering work, we identify crucial problems with MC-AI and classify them into six broad categories. These problems can constrain the development as well as the deployment of AI systems and can lead to misleading results from AI systems deployed in realistic scenarios. Solutions to these problems are imperative because many well-trained AI models are unlikely to be successful when implemented in actual fields (or in low-resource environments). Technical descriptions of these problems (and sub-problems) of MC-AI are given below.
  • Societal problems: These problems are related to the direct impact (e.g., value vs. risk) of AI technology on society. For example, an AI system deployed in XYZ hospitals as a helper to real clinicians can assist in correctly identifying dangerous diseases like cancer. The world cannot benefit much without pinpointing and solving the societal problems of AI. In this regard, we identify seven societal problems (fairness, trustworthiness, transparency, explainability, ethics, privacy and security, and accountability) that are limiting the adoption of AI systems in real-world scenarios. Proper solutions to these problems can make AI a more socially beneficial technology from the perspective of adoption/use. For instance, the results of an AI system can be highly biased when unrepresentative data are used to train it. If such a biased AI system is deployed in a realistic scenario, it cannot be acceptable to the general public and increases societal risk. In most existing developments, data labeling is done either manually or with the help of internal labelers, which can be slow and hard to audit. Consequently, incorrectly labeled data inadvertently propagate inconsistencies in the training process, leading to less trustworthy AI applications. Although some approaches exist, such as programmatic labeling, much work is still needed to make AI trustworthy. From the transparency perspective, most machine learning (ML) models are transparent (e.g., in their algorithmic aspects, mathematical functions, and the relationships between output and input), but the same is not true of deep learning (DL) models. Thus, one can observe the input supplied and the results achieved without understanding the process in between, and therefore, the black-box nature (i.e., opaqueness) of DL makes AI less transparent. Similarly, an AI system might be deployed in a hospital to determine whether a particular disease (e.g., COVID-19) is present or not from lung image data.
If such a system gives only a binary answer, or if a clinician cannot interpret its results, then AI is no longer a socially beneficial technology [76,77]. From an ethical perspective, AI cannot work without excessive human involvement, and therefore, in the absence of humans, it is not yet certain what policies and regulations can be employed to govern AI use in each sector. Furthermore, AI can violate human rights, liberties, and freedom. On top of that, AI cannot distinguish what constitutes discrimination, nor determine what is needed to overcome such problems [78]. Privacy and security are the main hurdles when it comes to the deployment of AI solutions involving personal data (e.g., electronic health records, facial images, demographics, etc.). In the recent past, the adoption of beneficial AI-powered community technologies was significantly low due to privacy concerns [79]. Preserving privacy from AI-based systems is extremely difficult because AI has sophisticated abilities such as capturing unintended features (e.g., background information in images), memorizing data and using it for predictions, and deriving privacy-sensitive knowledge from underlying data. From the accountability perspective, there are serious implications when an AI model, using a tremendous amount of data and a black-box model, can skew the results either way, and the unlawfulness-by-default concept may not hold [80]. For example, offensive content generated by AI cannot be held accountable, and the current literature does not provide a basis for such accountability. One concrete example of the low performance of the MC-AI approach in this category of problems was COVID-19 prediction from demographic data. At most sites/hospitals, the data were imbalanced between positive and negative patients, which can lead to higher misclassification of either positive or negative patients, owing to the poor generalizability of AI models.
Furthermore, there were big gaps between the infection-related predictions made by ML/DL methods and actual infections, due to the limited variables utilized and the overlooking of government policies enforced in each region. All these problems can bring more harm than good when attention is given solely to the architectural aspects of AI models (i.e., MC-AI) while ignoring other aspects (e.g., data).
  • Performance problems: These problems are related to the accuracy, computing overhead, scalability, etc., of AI technology when solving real-world problems. AI models' performance in academic labs and in the market is highly imbalanced. The main reason behind this unequal performance is that special attention is paid solely to improving code/algorithms. This work pinpoints five problems (accuracy, stability, conceptual soundness, robustness, and recoverability) from the perspective of AI performance when employing MC-AI. These problems seriously impact the adoption (or development) of AI systems in real-world scenarios. An effective solution to these problems can improve the confidence of stakeholders in AI-powered systems. Among these problems, accuracy is the main one when it comes to AI model quality. However, despite rigorous code improvements, accuracy improves only marginally in some cases. In Table 4, we highlight the potential problems with MC-AI from an accuracy enhancement perspective by taking a computer vision task (https://landing.ai/data-centric-ai, accessed on 21 May 2024) (steel inspection sheet) as a case study [81]. In this case study, good data are absent, and acquiring new data is difficult. As shown in Table 4, MC-AI contributes only marginally to accuracy enhancement, and the expected accuracy targets (≥90%) cannot be achieved. From these analyses, we found that MC-AI is not feasible in some cases, and its accuracy in realistic environments is below par.
    The AI models developed with the MC-AI mindset can have stability issues. For example, a conversational application generating multiple sentences from one sentence can produce related sentences but lose the underlying context. Furthermore, an image closely resembling a cat beside a dog can be interpreted as either two cats or two dogs. Similarly, a man with long hair can be classified as a woman if his silhouette resembles a female's. Considering these aspects, stability issues can occur with far-reaching implications for human beings when personal data are used in model training. From the conceptual soundness perspective, most models and their parameters lack empirical evidence and documentation. The design of AI models and the corresponding parameters are either domain-specific or application-specific. In addition, data bugs are fixed through models (e.g., by averaging, or by using an optimized learning rate). These heuristically determined parameter values and the lack of empirical evidence make AI systems less conceptually sound. Most current AI systems lack robustness and can be molded by active adversaries to their advantage. For instance, an AI-based traffic guidance system trained during the summer may yield inconsistent performance during the winter when there is heavy fog/smog. In addition, intrusions into AI systems and the protection of data from an adversarial lens are among the biggest problems with conventional MC-AI. Deployed AI systems often face technical failures due to surges in user requests, malicious attacks, and/or increases in service to consumers. Hence, recovery from failures or excess service loads is tricky, and re-training is often needed to make the system operational again. The above-cited performance-related problems cause unexpected behaviors in AI systems when used in realistic environments.
  • Drift problems: These problems are related to the reliability of the operational results of AI technology when solving real-world problems. Once a model is deployed, it experiences different conditions, some of which can degrade the model's performance. This phenomenon is called drift, whereby performance after deployment diverges substantially from performance at training time. This work pinpoints three kinds of drift problems in the conventional MC-AI approach (data drift, concept drift, and data scarcity). In data drift, the data distribution changes over time. In model/concept drift, the performance of an AI model declines due to drifts in deployment conditions, the characteristics of the dependent variable, changes in features or their values, and/or unpredictable circumstances. Furthermore, a model trained with limited or scarce data can lead to drift issues in practical scenarios. Recently, some approaches have been developed to address these types of drift in AI systems [82]. In MC-AI, not all aspects concerning data are taken into account, and therefore, the possibility of drift occurring is high. For example, data might not be engineered properly before being fed into the AI model, or data assessment might not be carried out in each phase of deployment. Furthermore, MC-AI only improves architectural aspects, which may lead to marginal improvements in some metrics (e.g., accuracy) at the expense of others. For example, MC-AI is concerned with reaching a certain level of accuracy (e.g., 90%) without analyzing whether the underlying data are timely and complete. Ignoring such important aspects of data quality leads to a higher probability of drift in real-world cases. MC-AI does not re-evaluate data during the lifecycle of AI systems. Hence, MC-AI is prone to model, data, and upstream drift in practical settings (e.g., a feature is no longer required, or a measurement unit has changed).
Although model re-training can help overcome such drift, the chance of fully overcoming it is low. In the worst cases, models are rebuilt from scratch, leading to wasted effort and money. Some real-life scenarios, such as electricity load, the stock market, and solar irradiance, are influenced by cyclic elements [83]. Therefore, some variables/features become obsolete and new ones emerge, necessitating fresh dataset curation or resampling of the existing data. To this end, fine-tuning code alone may not be sufficient, and the model cannot adapt to changing circumstances where cyclic elements occur frequently or the deployed logic changes often.
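A common first-line check for the data drift described above is a two-sample test comparing a feature's training-time distribution against its live distribution. The sketch below uses the Kolmogorov–Smirnov test from SciPy (assumed available); the synthetic feature, the planted mean shift, and the 0.01 significance threshold are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
train_feature = rng.normal(0.0, 1.0, 500)    # feature at training time
live_stable = rng.normal(0.0, 1.0, 500)      # production data, no drift
live_drifted = rng.normal(1.0, 1.0, 500)     # production data, mean shifted

# Kolmogorov-Smirnov two-sample test: a small p-value indicates that the
# two samples are unlikely to come from the same distribution.
res_ok = ks_2samp(train_feature, live_stable)
res_drift = ks_2samp(train_feature, live_drifted)

drift_detected = res_drift.pvalue < 0.01     # illustrative threshold
```

In a deployed DC-AI system, such a test would run periodically per feature, and a detection would trigger data re-curation or model re-training rather than silent continued operation.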
  • Sustainability problems: These problems are related to the effects of AI technology on environmental degradation, specifically from the perspective of CO2 emissions, pollution, and energy waste. These environmental hazards, in turn, affect human health and quality of life. This paper pinpoints four sustainability-related problems (CO2 emissions, human-made disasters, climate change, and material flows) of MC-AI that have not been identified in the latest research. For example, improving code and rebuilding models on large datasets require graphical processing units (GPUs) and high-performance centers (HPCs), leading to negative impacts on the environment (e.g., carbon emissions). In recent times, AI has been chasing human-level intelligence, and therefore, the parameter sizes of DL models have been increasing exponentially. For example, GPT-4 is reported to have about 1.76 trillion parameters, which can lead to higher energy costs as well as substantial CO2 emissions, impacting environmental health. In some cases, training a large AI model can create a CO2 footprint roughly equivalent to the lifetime emissions of five cars [72]. In some cases, the CO2 emissions from HPCs are roughly equal to the emissions of the aviation industry (https://www.cfr.org/blog/artificial-intelligences-environmental-costs-and-promise, accessed on 15 February 2024). Considering these issues, fiddling with complex AI models and excessive rebuilding introduce environmental challenges. To compensate, substantial efforts are required to address the sustainability problems of AI technology. In MC-AI, data heterogeneity is not properly modeled in the lifecycle, which can also lead to human-made disasters in industrial settings. For example, AI use can be remodeled to change the behavior of AI systems, and therefore, AI can become dangerous (https://www.cio.com/article/190888/5-famous-analytics-and-ai-disasters.html, accessed on 15 February 2024).
Although AI has contributed to addressing the major global challenge of climate change, due to a lack of complete and representative data, MC-AI alone cannot solve this longstanding problem. From the material flow perspective, it is very challenging to select appropriate models for a given task in industrial environments [84]. To that end, MC-AI cannot yield desirable results because trial-and-error procedures to optimize material flow adversely affect the environment. The current MC-AI approaches cannot make production environments more sustainable (e.g., by reducing the carbon footprint, ensuring efficient use of raw materials, and reducing energy waste), and therefore, their adoption undermines sustainability goals.
  • Affordance problems: These are related to the acceptance and adoption of AI technology by most businesses/environments, regardless of the nature or size of the business, in real-world scenarios. The solutions depend on whether AI technology brings more harm or good when adopted in a particular scenario (e.g., the retail industry), and on whether humans can conveniently interact with this technology to enhance business performance. This paper pinpoints four problems (a lack of domain experts, fragmented tools, limited knowledge of AI governance, and the autonomous nature of AI) with MC-AI concerning the affordances of AI in realistic scenarios. Proper solutions to these problems are imperative because AI is being increasingly integrated with many businesses and sectors around the world. MC-AI trains complex models on large-scale datasets, after which the model is tested with unknown data to verify its efficacy. However, few experts are familiar with the full lifecycle of AI systems, and most companies cannot invest in every phase of the system. Hence, AI systems developed under MC-AI are not widely accessible, which becomes a major bottleneck for small-scale businesses. Similarly, in most scenarios, only commercial tools are used to pre-process data, without considering the domain and target users. Therefore, these fragmented tools cannot easily be accessed by small businesses for cleaning/improving their data. Hence, the performance of MC-AI is low because it does not focus much on data quality. Furthermore, it is unclear to many individuals what AI can contribute to a business and how, and therefore, many organizations are reluctant to convert conventional approaches to AI-based systems. MC-AI further complicates the problem by working in a black-box manner, where only a few people can understand the workings of a CNN model [85]. MC-AI does not yield higher adoption or affordability owing to these inherent risks.
With time, AI applications are becoming more autonomous in many respects, which can bring inherent risks to the safety and security of nations. For example, AI makes decisions based on past data and learned experience. If we train a model on the wrong data, however, the AI system can exploit its autonomy to create unknown risks to humanity [86]. There is a possibility of fully autonomous AI systems being developed, which could lead to unintended consequences in safety-critical applications in the future [87]. All these challenges can seriously impact AI affordances in real-world scenarios, leading to poor adoption (or value generation). According to the National AI Research Resource Task Force (https://www.ai.gov/wp-content/uploads/2023/01/NAIRR-TF-Final-Report-2023.pdf, accessed on 21 May 2024), access to high-performance computing infrastructure and big data is mostly limited to high-resource organizations [88]. In contrast, low-resource organizations cannot harness the potential of AI, leading to biased and unfair AI models and applications [88]. One concrete example is the lack of federated resources, such as the National AI Research Resource (NAIRR), for democratizing AI research and development across countries.
  • Controllability problems: These problems relate to humans losing control of AI technology, leading to unintended consequences for human health and life. The controllability of AI is a major issue and requires urgent attention from the research community to harness the benefits of AI while avoiding its pitfalls. This article identifies five types of controllability problems (explicit, implicit, aligned, delegated, and hybrid controls) in MC-AI that can lead to negative consequences if due attention is not paid to them. The controllability problems need effective solutions to lower their consequences for human safety and power. We describe the controllability issues of AI technology using the example of an autonomous, AI-powered car on a busy highway. The five types of controls concerning AI technology are explained as follows. Explicit control is related to immediate action based on user commands. For example, in autonomous driving, the command "Stop the car!" can result in the car stopping in the middle of the road due to hard-coded logic, leading to catastrophic consequences on a busy highway. Similarly, a lack of controls in a chatbot can result in the generation of racist comments (https://www.theguardian.com/world/2021/jan/14/time-to-properly-socialise-hate-speech-ai-chatbot-pulled-from-facebook, accessed on 25 March 2024), leading to financial or trust losses. In contrast, implicit control would attempt to safely stop the car on the shoulder of the road, applying some common sense when the command is issued. In that case, the consequences are not catastrophic. However, augmenting such intelligence is challenging, and current MC-AI lacks these fundamental controls. Under aligned control, the car could stop close to a restroom, using cognitive capabilities to infer that the human is looking for one. In this case, the AI understands the intention behind the command and acts accordingly.
With delegated control, the AI-powered car does not wait for a command but stops close to a gym, assuming that the human wants a workout or some exercise. Similarly, hybrid control uses a mix of AI capabilities and the human brain to keep humans safe and happy. All these controls are imperative in AI-based systems. The conventional MC-AI approach lacks these fundamental controls, leading to poor adoption and use in practice.
The above-cited crucial problems of MC-AI can limit the adoption and acceptance of AI technology. Solutions to these problems can enhance the success rate of AI technology from a commercialization perspective, which currently stands at only 12% (https://fortune.com/2022/06/21/andrew-ng-data-centric-ai/, accessed on 25 March 2024). The above-cited crucial problems and sub-problems were identified in five ways: (i) analysis of SOTA developments in the AI field, (ii) detailed examination of the methodological and experimental contributions of AI-related research articles, (iii) analysis of some publicly available datasets, (iv) critiques published in review/survey papers, and (v) recently launched initiatives to either navigate the potential harms of AI technology or advance AI systems. In the next section, we introduce DC-AI and explore opportunities (i.e., dedicated properties) for solving these crucial MC-AI problems by using DC-AI.

6. Data-Centric AI: A Solution for the Crucial Problems of the Model-Centric AI Approach

In this section, we explain DC-AI and its key components, and we discuss DC-AI as a pertinent solution to longstanding problems of the conventional MC-AI approach.

6.1. The Data-Centric AI Approach

Data-centric AI is all about systematically engineering data to successfully build high-quality AI systems that can solve many real-world problems, specifically in industrial settings [89]. In most practical AI applications, it is more important to improve the data than the AI model (i.e., the network structure), because fine-tuning the AI model alone brings only marginal improvements in AI system performance [81]. The essence of DC-AI is to shift the focus from acquiring voluminous amounts of data every time to obtaining limited but good-quality data. Collecting a limited amount of data or enhancing data quality can augment the performance of AI systems in many real-world applications [90]. The basic key components of the DC-AI approach are given in our recent works [12,32]. Meticulously ensuring all these data-related components supports the strengths of the DC-AI approach. The three mainstream benefits (https://landing.ai/data-centric-ai/, accessed on 29 March 2024) of DC-AI are as follows:
  • Model building speed: 10× higher.
  • In certain cases, it can help to reduce the time from development to deployment by up to 65%, similar to quality assurance (QA) in software engineering.
  • Significant enhancement in accuracy in certain cases.
Apart from the benefits cited above, DC-AI can be widely applicable to any situation when getting more data is not possible, or good-quality data do not exist. A workflow diagram of the DC-AI approach is given in Figure 3. As shown in Figure 3, DC-AI puts more emphasis on data quality from the beginning of the AI project lifecycle, leading to the development of high-quality AI systems afterward. By employing rigorous checks and all steps of DC-AI, data can be systematically engineered, resolving many performance bottlenecks in MC-AI, either fully or partially.

6.2. The Introduction of DC-AI for Specific Problems and Major Benefits Compared with Traditional Methods

The DC-AI concept is useful for specific problems in which good data are either absent or cannot be curated owing to small data-collection budgets. DC-AI can also be useful when an organization lacks the technical expertise, computational infrastructure, and funding to develop data engineering pipelines. The DC-AI concept is also applicable to scenarios in which data are scattered, sparse, noisy, or incomplete, and may emerge from diverse/heterogeneous sources. In data-driven solutions/products, data can emerge from different sensors/actuators, where each device can have its own data modality, and creating a unified template for the data is tricky. In these circumstances, the DC-AI approach can be handy. For example, in a predictive maintenance scenario, the data coming from different sensors can be imbalanced because machinery develops faults only occasionally. Classifiers trained on such imbalanced datasets cannot be used in real-life scenarios owing to poor prediction/classification of minority classes. In this scenario, the data can also arrive in different modalities (images, tables, readings, etc.) from each sensor device, and aligning the data requires the sophisticated techniques of DC-AI. Similarly, many other problems require DC-AI approaches to make the best use of AI technology in real scenarios.
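The class imbalance described above can be illustrated with a minimal random-oversampling sketch in Python. All sensor readings, labels, and function names here are hypothetical; this is a crude stand-in for the more principled rebalancing techniques DC-AI advocates.

```python
import random
from collections import Counter

def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until all classes
    reach the size of the largest class (simple random oversampling)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_s, out_l = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - n):
            out_s.append(rng.choice(pool))
            out_l.append(cls)
    return out_s, out_l

# Simulated predictive-maintenance readings: faults are rare.
readings = [[0.1], [0.2], [0.15], [0.12], [0.9]]
labels = ["ok", "ok", "ok", "ok", "fault"]
bal_s, bal_l = oversample_minority(readings, labels)
print(Counter(bal_l))  # classes are now balanced
```

Duplicating minority samples is the simplest option; in practice, DC-AI pipelines would combine it with data-quality checks so that noisy fault records are not amplified along with genuine ones.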
The other specific problems that require DC-AI include epidemic handling systems (where working with heterogeneous data is inevitable to better control an epidemic), time series analysis (sophisticated expertise is needed to generate a uniform picture of the data and to combine data from different sources), equity and inclusion (impartial decisions are needed to prevent negative consequences of AI technology), human activity recognition (data originate from different sensors, and augmentation and filtering are needed to improve them), anomaly detection (domain knowledge and data engineering are required to separate normal events from abnormal ones), etc. Lastly, DC-AI can be introduced for specific problems when the traditional approaches mentioned below either do not yield satisfactory results or are prone to deficient performance owing to data-related problems:
  • Changing of AI models (e.g., LSTM → CNN)
  • Tuning of or adding hyper-parameters to an AI model
  • Use of supportive techniques (e.g., sampling, SMOTE, weighting, etc.)
  • Increasing the amount of data substantially (e.g., doubling the dataset)
  • Scenarios where the numbers of data sources and data modalities are large
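One of the supportive techniques listed above, SMOTE, can be sketched in simplified form: real SMOTE interpolates each minority sample toward its k-nearest neighbours, whereas this illustrative version picks random pairs. The feature values and counts below are made up.

```python
import random

def smote_like(minority, n_new, seed=0):
    """Generate synthetic minority samples by linear interpolation
    between random pairs (a simplified, SMOTE-style sketch)."""
    rng = random.Random(seed)
    synth = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)   # two distinct minority samples
        t = rng.random()                  # interpolation factor in [0, 1)
        synth.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synth

# Hypothetical 2-D feature vectors of the rare "fault" class.
faults = [[0.90, 1.2], [0.95, 1.1], [0.88, 1.3]]
new_points = smote_like(faults, 5)
print(len(new_points))  # 5
```

Because each synthetic point lies on a segment between two real minority samples, it stays inside the observed feature range, which is what makes this family of techniques safer than arbitrary noise injection.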
The essence of DC-AI is to improve data, give top priority to data, and optimize data without compromising performance. Therefore, it can be used for most problems that require AI-based solutions. Figure 7 demonstrates the benefits of DC-AI compared to traditional methods.
As shown in Figure 7, DC-AI can contribute to enhancing the adoption, development, use, and governance of AI technology, leading to benefiting humankind in various ways. Apart from the benefits listed in Figure 7, DC-AI ensures data alignment, consistency, freshness, and reliability, leading to the guarantee that final data are dependable, robust, and representative of the problem under investigation. Since DC-AI performs multiple checks on data, its use in safety-critical applications (e.g., medical diagnosis, drug discovery, medicine development, hazardous events prediction, etc.) poses fewer risks than traditional methods.

6.3. How DC-AI Can Solve Many Crucial Problems of MC-AI

In this section, we explore and present the opportunities for solving crucial problems in MC-AI with DC-AI. Specifically, we explore the components of DC-AI and map them to MC-AI in ways that can likely solve the longstanding problems in MC-AI. In the AI-driven era, the solution to these problems has become imperative, considering the technological advancements as well as the market penetration of AI in many sectors (healthcare, smart cities, etc.). A schematic of how problems in the MC-AI approach can be solved utilizing DC-AI is shown in Figure 8. Technical and detailed descriptions of the solutions to crucial problems in the MC-AI approach via the DC-AI approach are given in Section 7.
From the analysis in Figure 8, we can see that DC-AI offers valuable solutions to the problems with MC-AI because it entails debugging the data as well as the model itself. Soon, DC-AI will be extensively investigated to solve the problems with MC-AI in many domains. For example, during the COVID-19 pandemic, AI model results were less trustworthy because they were obtained mainly through the MC-AI approach, and when deployed, most techniques did not yield reliable results due to a lack of good-quality data [91]. COVID-19 is a perfect scenario in which the application of DC-AI is of paramount importance. For example, there was a huge imbalance in the test data (positive and negative cases); decisions were unfair for some minorities (and for identifying super-spreaders); the accuracy of most models was low; privacy and security issues were serious (most people around the world did not install and use digital tools); data flow was not transparent; pre-trained models were not effectively used because the data changed owing to the wearing of masks; a voluminous amount of data was unstructured; and conclusions/inference results varied greatly from region to region. To solve the above-cited pandemic-related problems, DC-AI can be a promising solution. Furthermore, solutions to the ethical and societal issues with AI are urgent because they hinder AI adoption in many practical scenarios around the globe [16,92]. Our analysis sheds light on ways to rectify most parts of AI technology development, which in turn can make AI technology more trustworthy and advanced than in the recent past.

6.4. Algorithm Framework of DC-AI for Practical Applications

The overall DC-AI algorithm framework, along with the necessary implementation steps, is demonstrated in Figure 9. It is worth noting that the workflow of the DC-AI algorithm framework can vary from problem to problem and from domain to domain. Therefore, in Figure 9 we present a generic framework of DC-AI that can be applied to most practical applications. In the beginning, industry and academia experts define the problem/task to be solved with the help of AI. Next, a suitable data collection method is identified depending on the problem. In some cases, data can be collected from relevant sources all at once; we refer to this situation as static in Figure 9. In contrast, some scenarios require continuous data collection with the help of sensors (or wearable devices); these are denoted as dynamic in Figure 9. After identifying the data collection method, appropriate data preparations are made for each scenario. In the static case, the file types used to store the data are determined. In contrast, the dynamic case may require parsers or polling schemes to fetch data from the relevant source/device. Subsequently, data engineering is performed on the collected data with the help of sophisticated techniques. The main purpose of data engineering is to improve the datasets systematically/algorithmically so that they yield stable results with the AI models. It is worth noting that the data engineering techniques applied here are more extensive than the pre-processing used in conventional ML pipelines, as listed in Table 3. Furthermore, data engineering techniques may vary depending on the data collection method as well as the application domain. After applying the data engineering techniques, the quality of the data is gauged against pre-determined questions/requirements. The quality checks are usually performed by data engineers and domain experts.
If a dataset clears all the quality-related checks, it is partitioned into training and testing data. In most real-world scenarios, the training dataset is 2/3 of the whole data, and the testing data are the remaining 1/3. After partitioning the data into two parts, a suitable AI model is chosen depending on the nature of the problem, and training is performed with the training data. Subsequently, the training process is analyzed to verify the list of requirements concerning it. If the training conditions/criteria are satisfied, the model is tested with the help of unseen data. Lastly, if a model yields satisfactory results, it is deployed to real-world settings, and its performance is continuously monitored. In some cases, the model is re-trained after a certain period to maintain stable performance. In this framework, if the model yields substandard performance, the data are significantly improved along with minor tweaks to the code of the AI model.
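The quality-gate and 2/3–1/3 partitioning steps of the framework above can be sketched in a few lines of Python. The records and the two checks are hypothetical placeholders for the pre-determined questions/requirements a real DC-AI team would define.

```python
import random

def quality_gate(dataset, checks):
    """Run named quality checks over a dataset; return names of failures."""
    return [name for name, check in checks if not check(dataset)]

def split_two_thirds(dataset, seed=0):
    """Shuffle and partition into ~2/3 training and ~1/3 testing data."""
    rng = random.Random(seed)
    shuffled = dataset[:]
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]

# Hypothetical records with a feature x and a binary label y.
data = [{"x": i, "y": i % 2} for i in range(30)]
checks = [
    ("no_missing", lambda d: all(r["x"] is not None for r in d)),
    ("both_classes", lambda d: len({r["y"] for r in d}) == 2),
]
failed = quality_gate(data, checks)
if not failed:  # only partition data that cleared every check
    train, test = split_two_thirds(data)
print(len(train), len(test))  # 20 10
```

Keeping the checks as named, data-level predicates mirrors the framework's intent: a dataset that fails a check goes back to data engineering rather than into model training.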
Seedat et al. [93] recently highlighted the techniques required at different stages of AI project development that are very relevant in the context of the DC-AI approach. The authors developed a checklist for DC-AI implementation at four stages (data preparation, training, testing, and deployment) of ML pipeline development, and designed a set of questions to ensure systematic data-centric curation of training data. The authors provided example tools that are currently available for each stage. We believe those techniques and questions are vital to implementing the DC-AI approach in any practical application. The considerations proposed for the different stages can serve as a baseline implementation of DC-AI and can be further customized to any real-world scenario (e.g., anomaly detection, medical analysis, image segmentation, etc.). We refer interested researchers and practitioners who aim to implement DC-AI in specific applications to [93] for more details. Data constitute the main element in AI, and the specific guidance provided in this subsection, in the form of an algorithmic framework and the study by Seedat et al., can contribute to the development of DC-AI-based practical applications.

7. Detailed Analysis of Potential Solutions to the MC-AI Approach’s Crucial Problems Using the Data-Centric AI Paradigm

7.1. Significance and Activities of Data-Centric AI

DC-AI is expected to bring a huge revolution to the AI domain through its data-first strategy, because many AI models trained on flawed data cannot yield trustworthy results in practice. Furthermore, in many businesses/applications, acquiring more data is not possible; DC-AI is therefore a practical approach involving less data but yielding better results. Furthermore, DC-AI can be superior to MC-AI in many ways when it comes to the development of sophisticated AI systems for production/industry. In Table 5, we highlight how DC-AI is a potential solution to MC-AI problems from an accuracy-enhancement perspective.
As shown in Table 5, DC-AI yields more promising results than MC-AI from the perspective of accuracy. In Table 5, the baseline approach refers to the conventional approach to AI model training (i.e., without improving data quality). DC-AI improves data quality by repairing problematic parts of the data. The new values denote the final accuracy achieved by improving data quality. These results demonstrate the advantages of DC-AI over MC-AI in many industrial settings. Considering these benefits, DC-AI is expected to be adopted on a wider scale to combat the problems of conventional MC-AI.
Lastly, DC-AI also emphasizes how data collection should be performed in the right way, leading to sufficient and AI-friendly data preparation. DC-AI enables error-free and relevant data collection from diverse providers that can contribute to effectively solving the problems at hand. To yield better performance with DC-AI, the dataset should be consistent, relevant, comprehensive, diversified, and uniform. Furthermore, customized data collection procedures can be adopted to tackle diverse data types (e.g., audio, video, text, etc.), and to prepare high-quality data for production-ready AI systems.
When constructing an entire AI system to solve some real-world problems, multiple DC-AI activities can be used at each stage. These activities iteratively improve data and other core aspects of the underlying AI model, leading to effectively solving the problem at hand [27,60,94,95,96,97]. In Figure 10, we identify and map various DC-AI activities that can be applied to different stages of AI-based systems. The contents enclosed in Figure 10 can pave the way for understanding key activities of DC-AI and their systematic mapping to different stages of an AI-based system. It is important to note that some of the activities may not be required in each project, and therefore, only relevant activities can be chosen depending on the problem.

7.2. Prospects of Data-Centric AI in Solving Crucial MC-AI Problems

Below, we describe key properties/components of the DC-AI that can be employed to solve the crucial problems in MC-AI:
  • Addressing societal problems: Fairness in AI decisions can be guaranteed by utilizing a diverse dataset (e.g., collected from diverse populations/groups), and ensuring diversity in the training data is one of the recommended practices in the DC-AI. To this end, DC-AI can likely assist in overcoming bias issues in MC-AI. Trustworthiness can be achieved by managing relevant data and controlling data-specific bias in the lifecycle of AI development, which is a core part of DC-AI. Transparency can be achieved by monitoring data flows and providing proper versions of training data. Ethics-related issues can be solved by using proper data representations and human involvement. Privacy and security issues with MC-AI can be solved by utilizing laws and regulations that are devised focusing on DC-AI. Accountability problems can be solved by ensuring a chain of custody with information, tracing data flows, and using responsible data science practices. Data-tailored actions such as ensuring data diversity, responsible data management, strict control of data, diligent data flow management, better representation of data, tracing of data flows, and responsible data use/governance practices come through DC-AI, and it is fair to say that DC-AI can play a significant role in addressing most societal problems of the conventional MC-AI.
  • Addressing performance problems: To solve the accuracy problem of MC-AI, various data-focused techniques of DC-AI are applied, such as ensuring data completeness and diversity, labeling, augmentation, discarding misleading examples, and identifying and substituting missing samples. These techniques rigorously improve data quality, leading to highly accurate AI systems. As shown in Table 5, DC-AI can significantly improve accuracy when solving industrial problems where data quality is poor or the amount of data is limited. The stability problem is solved by assuring the availability of relevant data in each phase of the lifecycle. Furthermore, training AI models with relevant and sound data contributes to designing stable AI models. All these techniques are part of DC-AI. Conceptual soundness can be assured by documenting all necessary aspects concerning both the data and the model. Furthermore, involving multiple experts to jointly perform labeling is another promising way to provide better conceptual soundness. Robustness can be provided by implementing relevant laws and regulations from the perspective of DC-AI. In most real-world applications, AI systems/methods need to show robustness in two settings: (i) known unknowns, and (ii) unknown unknowns [98]. In the former setting, the computer can explicitly reason about the uncertain aspects of real-world situations. In the latter setting, it cannot. In high-stakes or safety-critical applications (e.g., autonomous driving, stock exchanges, cancer diagnosis, etc.), AI methods need to be robust in both settings. Robustness is an important property of AI systems, required to yield reliable results and to increase confidence in this technology. AI developers around the world are focusing on improving the robustness of AI methods in different ways.
For example, in the Google speech engine, the word error rate was about 23% in 2013 and was subsequently lowered to 8% by 2015. Thomas [98] proposed eight ideas for achieving robustness in AI systems. In this work, we propose a solution for each of those ideas from the perspective of DC-AI to enhance robustness. There are four ideas under the known unknowns setting: robust optimization, regularization, risk-sensitive objectives, and robust inference. Below, we provide a DC-AI-based solution for each idea to address the robustness problems of MC-AI:
    Robust optimization is related to modeling uncertainties in data/models with the help of objective functions. In DC-AI, visibility into all parts of the training data is provided, and therefore an affordable uncertainty budget ϕ can be determined conveniently. Determining the appropriate value of ϕ, and solving the problem by choosing subsets of data denoted σ_i, where i ∈ {a, b, …, s}, can assist in mapping the trade-off between the objectives and the robustness of the solution. Since data quality is improved significantly at the data preparation stage, the optimization problem can be solved for small values of these parameters, leading to higher robustness than MC-AI, which solves similar problems with relatively large parameter values. DC-AI techniques such as active learning and confident learning can also assist in guaranteeing the robustness of solutions under different uncertainty budgets.
    Regularization is related to the ability of an AI algorithm/model to generalize well to new/unseen data. In simple classifiers, generalization is modeled with the help of a hypothesis whose generalization is measured through a loss function, as expressed below:
    C = L(ŷ, y)
    where ŷ is the predicted value and y is the true value. The hypothesis is expressed either in the form of conditional probabilities or additional parameters, like λ in the objective function. The optimal value of λ is determined through cross-validation or a representative subset of the data. In DC-AI, techniques such as core-set selection (i.e., reducing the data to a representative subset only) are applied to both the training and testing data, so regularization is accomplished effectively, leading to higher robustness. Furthermore, the diversity of the data is guaranteed at collection time, which prevents overfitting while ensuring balanced learning. Techniques like curriculum learning are also applied to enhance the learning ability of the model; consequently, the trained model remains applicable to simple as well as complex examples. The extensive use of advanced techniques such as feature stores also contributes to improving the generalization of AI models, leading to significantly enhanced robustness.
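Choosing λ by cross-validation, as mentioned above, can be sketched for a one-dimensional ridge model with a closed-form weight. The data points and candidate λ values are illustrative, not drawn from any cited experiment.

```python
def ridge_fit_1d(xs, ys, lam):
    """Closed-form ridge weight for the model y ≈ w*x (no intercept):
    w = Σxy / (Σx² + λ)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_mse(xs, ys, lam, k=3):
    """k-fold cross-validation MSE for a given λ: train on k-1 folds,
    score on the held-out fold, average."""
    n = len(xs)
    fold = n // k
    total = 0.0
    for i in range(k):
        lo, hi = i * fold, (i + 1) * fold
        tr_x, tr_y = xs[:lo] + xs[hi:], ys[:lo] + ys[hi:]
        w = ridge_fit_1d(tr_x, tr_y, lam)
        total += sum((y - w * x) ** 2 for x, y in zip(xs[lo:hi], ys[lo:hi])) / fold
    return total / k

# Noisy observations of roughly y = 2x (hypothetical data).
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
best = min([0.0, 0.1, 1.0, 10.0], key=lambda lam: cv_mse(xs, ys, lam))
print("chosen lambda:", best)
```

With clean, representative data, cross-validation tends to select little shrinkage; the point of the DC-AI view is that core-set selection and data diversity do much of the work that heavy regularization would otherwise have to do.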
    Risk-sensitive objectives are related to the downside risk of the outcome of an AI model. For example, in reinforcement learning, a rational policy is required to find the reward when an agent interacts with the environment. In many real-world situations, outcomes where an AI model keeps making correct predictions are desirable and carry low downside risk. In contrast, outcomes where AI models make incorrect predictions/classifications despite being given correct inputs carry higher downside risk. Maintaining the balance for the conditional value at risk (CVaR) is a challenging problem. In DC-AI, domain experts are kept in the loop throughout system development, and therefore the balance for CVaR is achieved, leading to better robustness. For instance, in the medical domain, domain experts are aware of the risk of misclassifying a cancer patient as normal and vice versa. Therefore, optimal policies can be developed and governed to keep the downside risk small. Furthermore, domain experts are aware of the social implications of AI decisions, so outcomes can be adjusted in high-stakes applications. In DC-AI, influence modeling of the data is also performed, which can contribute to lowering the risks from AI models. DC-AI techniques like encoding human priors, model-aware cleaning, and noise removal from data contribute to modeling risk-sensitive objectives, leading to significantly enhanced robustness of AI models.
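The CVaR quantity discussed above has a simple empirical form: the mean of the worst (1 − α) share of observed losses. The loss values below are made up for illustration.

```python
def cvar(losses, alpha=0.9):
    """Empirical conditional value at risk: the mean of the worst
    (1 - alpha) fraction of losses (the downside tail)."""
    srt = sorted(losses)
    k = int(len(srt) * alpha)
    tail = srt[k:] or [srt[-1]]  # guard against an empty tail
    return sum(tail) / len(tail)

# Hypothetical per-decision losses; one decision went badly wrong.
losses = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 5.0, 0.1, 0.3]
print(round(cvar(losses, 0.9), 2))  # 5.0
```

The average loss here is small, but CVaR at α = 0.9 isolates the single catastrophic outcome, which is exactly the kind of downside risk a domain expert would want an optimized policy to keep small.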
    Robust inference can be accomplished by tailoring the inference process to the nature of the application. In normal applications, curating test data that are sound from most perspectives, so that repetitive tests are not needed, can contribute to robustness. In safety-critical or high-stakes applications, creating both balanced and imbalanced distributions of test data to monitor performance drifts, etc., can contribute to achieving robustness in AI systems. In some cases, training can be performed on real data and inference on synthetically generated data to enhance the robustness of AI systems. Lastly, values that carry more risk can be given preference in test data to verify decisions by exploiting probabilistic reasoning or weighting concepts.
    There are four ideas under unknown unknowns: detecting model failures, the use of causal models, portfolio methods, and expanding the AI models. An AI model can yield deficient performance when there is a huge difference between the training and test data. However, DC-AI curates data of the best quality and ensures higher alignment between the testing and training data, leading to higher robustness. In addition, DC-AI evaluates data N times rather than once, and therefore ambiguous representations are pruned beforehand. Causal models help identify the features that are causally connected with the target class rather than merely correlated. DC-AI sorts the examples complexity-wise, making it easier to identify causal relationships between features and leading to higher robustness. Portfolio methods (also known as ensemble methods) are used to improve prediction/classification performance via multifaceted learning. Some AI models can learn too little knowledge and be incomplete, leading to poor robustness. DC-AI employs domain experts to determine suitable models, and majority consensus is used to pick a model; therefore, robustness can be enhanced. Just like the SATzilla pipeline [99], DC-AI involves data screening by multiple domain experts, and the chance of imbalanced learning is restrained. In addition, data reduction and optimization techniques are used to fine-tune the data, making it more suitable for portfolio methods. Lastly, most portfolio methods draw samples from the data, and these can be more balanced owing to DC-AI practices (e.g., retaining higher diversity, lower imbalance, and a large number of facets), leading to higher robustness. For the last idea (i.e., model expansion), DC-AI can help by providing a knowledge base and feature stores, which enable broader learning of models, leading to higher robustness.
For example, in normal settings, only limited features are used in training, and AI models produce wrong inference results even with small changes in the test data. In contrast, a feature store encompasses a variety of features, all of which can be reused by other models simultaneously, reducing model training time and operational costs. Furthermore, AutoML techniques are amalgamated with DC-AI, leading to suitable model selection and expansion that increase robustness [100,101]. All of the above-cited DC-AI techniques and concepts can pave the way to providing the robustness that the MC-AI approach lacks. Recoverability can be achieved by performing error analysis and modeling multiple corner cases. Through such data-based actions, DC-AI can successfully resolve many performance-related problems of MC-AI.
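The majority-consensus idea behind portfolio (ensemble) methods mentioned above can be sketched minimally; the model outputs here are hypothetical labels rather than real model predictions.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the predictions of several models by simple majority
    consensus, the basic mechanism of portfolio/ensemble methods."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of three models on the same input.
votes = ["fault", "ok", "fault"]
print(majority_vote(votes))  # fault
```

Even this toy version shows why ensembles help robustness: a single model's error is overruled as long as the other members remain correct.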
  • Addressing drift problems: To solve data drift, DC-AI enforces proper data versioning and completeness via augmentation, and therefore the possibility of drift can be reduced. In addition, data quality is guaranteed and evaluated at multiple stages during the lifecycle of an AI system. Hence, data drift can be significantly reduced in real-world cases. Concept drift can be addressed by choosing the most suitable AI models based on data types. Higher visibility into training data, control and understandability of data, and making data accessible throughout the AI lifecycle can lower the possibility of concept/model drift. In addition, sending high-quality data to the training process and modeling beforehand all corner cases that can lead to concept drift can help prevent drift issues. Data scarcity is resolved by imputing missing features in the data from synthetically generated samples. Furthermore, paying ample attention to data from all aspects prevents all these drift-related problems. Providing visibility, availability, versioning, and control over all parts of the data during AI system development comes through DC-AI, and therefore it is fair to say that DC-AI can overcome the drift-related problems encountered by MC-AI.
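A minimal data-drift check of the kind described above might compare live feature statistics against the training distribution. The threshold, readings, and standardized-mean-shift score below are illustrative choices, not a prescribed method.

```python
from statistics import mean, stdev

def drift_score(train_vals, live_vals):
    """Standardized shift of the live mean from the training mean,
    a simple proxy for data drift in one feature."""
    mu, sd = mean(train_vals), stdev(train_vals)
    return abs(mean(live_vals) - mu) / sd if sd else float("inf")

def check_drift(train_vals, live_vals, threshold=2.0):
    """Flag drift when the live mean moves beyond `threshold`
    training standard deviations."""
    return drift_score(train_vals, live_vals) > threshold

# Hypothetical sensor feature: training values vs. two live windows.
train = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0]
stable = [10.1, 9.95, 10.05]
shifted = [12.5, 12.8, 12.6]
print(check_drift(train, stable), check_drift(train, shifted))  # False True
```

Running such a check continuously on deployed features is one concrete way the monitoring stage of a DC-AI pipeline can trigger data inspection before model quality silently degrades.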
  • Addressing sustainability problems: CO2 emissions can be significantly lowered by curating better data instead of needlessly rebuilding AI models after minor modifications to hyper-parameters. Excessive rebuilding can also be avoided by inspecting the data when performance is poor, rather than modifying parts of the code and rebuilding the model. In addition, pre-trained models are widely adopted in DC-AI, which lowers energy costs and the corresponding environmental impact; a model is built only a limited number of times, whereas MC-AI keeps recompiling code after minor modifications. Man-made disasters can be minimized by raising alarms when there are ambiguities in the data, and periodic data assessments help prevent human error. The challenge of climate change can be addressed using data that are complete from most perspectives and obtained from diverse territories/regions; to that end, appointing data stewards and creating data stores can contribute to overcoming this challenge. Providing data relevance, seamless access, and data availability at the right place enables optimized material flows, lowering the impact on environmental health. Approaches such as needs-based model development, fair data use, identifying data problems and issuing alerts, obtaining only the necessary data, and providing better access to data are all part of DC-AI. Hence, DC-AI can address most sustainability-related problems found in MC-AI, mitigating the negative consequences of AI technology for the environment.
  • Addressing affordance problems: DC-AI enables data-tailored actions and strategies that simplify the AI paradox, which, in turn, can help overcome the affordance-related problems of MC-AI for small businesses. For example, DC-AI stresses the creation of data engineering jobs and training for those professions, which can lower barriers to the adoption of AI technology in the commercial sector. To solve the fragmented-tools problem, DC-AI focuses on developing new data-enhanced pipelines and prototypes from which small companies can benefit. In addition, DC-AI is better suited to industrial problems where data scarcity exists, and can therefore effectively solve affordance-related problems. DC-AI also benefits from existing AI technology developments that can enhance AI governance/use in multiple sectors. Developers can be empowered with sophisticated knowledge of the data used in AI systems and the prioritization of AI use, which can address problems stemming from the autonomous nature of AI technology. Furthermore, improving most parts of the data helps unlock the potential of DC-AI in the next era and can overcome the affordance problems of using MC-AI in commercial sectors. Lastly, DC-AI can expand AI applicability to many sectors/problems where MC-AI was not applicable before, making AI technology more affordable and easily accessible worldwide.
  • Addressing controllability problems: Five types of control are vital to preventing negative consequences of AI technology for human beings. Explicit control can be provided by adding result verifiability to AI technology: since DC-AI inspects data multiple times in the AI development lifecycle, the results are verifiable at each step. To achieve better explicit control, DC-AI provides various data-focused rules, and based on those rules, AI decisions can be modified to prevent catastrophic results. Furthermore, DC-AI can identify vulnerabilities in data, leading to better control over the AI results generated from those data. Implicit control can be achieved by repurposing AI decisions via data-driven rules and human knowledge, and by combining precomputed decisions; in this regard, practical rules can be created by exploiting the fine-grained knowledge of the underlying data that DC-AI offers. Aligned control can be achieved by training AI models with diverse data concerning a particular problem. For instance, recalling the earlier example, the autonomous car could stop close to any convenience store, because the driver might be in search of food rather than a toilet. Hence, aligned control can be accomplished via tailored DC-AI data practices and data-driven knowledge. Delegated control can be provided by DC-AI offering a set of candidate solutions for different combinations of data; DC-AI also combines data from multiple contexts, and can therefore assist in achieving effective delegated control. Hybrid control can be achieved by fulfilling the properties of all the above-mentioned controls. Based on our analysis, we conclude that DC-AI provides higher controllability of AI technology than MC-AI, which overlooks important data-based aspects.
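To make the drift discussion above concrete, the following sketch (an illustrative example, not drawn from the cited works) computes the population stability index (PSI), a common statistic for detecting data drift between a training-time distribution and the data later seen in production; the bin count and the 0.1/0.25 thresholds are conventional rules of thumb rather than values from the text.

```python
import math

def psi(expected, actual, bins=5):
    """Population stability index between a reference sample (e.g. training
    data) and a fresh sample (e.g. serving data). Rule of thumb:
    < 0.1 no drift, 0.1-0.25 moderate drift, > 0.25 severe drift."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[0], edges[-1] = float("-inf"), float("inf")  # catch out-of-range values

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            for i in range(bins):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
        return [max(c / len(sample), 1e-6) for c in counts]  # avoid log(0)

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [0.1 * i for i in range(100)]        # training-time distribution
shifted = [5 + 0.1 * i for i in range(100)]  # drifted serving distribution
```

Here `psi(train, train)` is near zero, while `psi(train, shifted)` is far above 0.25 and would trigger a drift alarm, which is the kind of data-side monitoring DC-AI advocates.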
The analyses cited above can help steer AI technology development toward the benefit of humankind in various ways. The rationale behind DC-AI solutions to MC-AI problems is that the characteristics of the whole training data, or of parts of it, are what give rise to those problems. For example, fairness can be guaranteed only when the training data contain information on diverse cohorts/subjects; it cannot be achieved solely by improving the AI model's code. Similarly, higher accuracy can be accomplished only when the training data have adequate variables/features or when data quality (e.g., size, resolution, diversity, etc.) is high. It is important to note that parameter tuning can sometimes enhance the accuracy of a DL/ML model, but such a model cannot generalize to unseen data [102]. Concept and data drift can be prevented when features that are likely to depreciate are identified beforehand, rather than by optimizing the AI model's code. Carbon emissions can be reduced by using less data via condensation methods or by adopting strategies like curriculum learning. DC-AI tools with fewer commands or lines of code can be used to fix data-quality problems and make such fixes affordable for low-resourced organizations. Similarly, control can be added to AI models by acquiring data from different contexts/situations rather than only modifying model code. Since most of these solutions are derived from data characteristics or composition, they can assist in solving MC-AI problems.
To further highlight the promise of DC-AI, in Table 6 we map three major components of the DC-AI approach, namely the data-first strategy (DFS), the intelligent data architecture (IDA), and data compliance (DC), to the six crucial problems of MC-AI. The second column in Table 6 highlights how the identified crucial problems arise under the MC-AI approach. For example, societal problems occur due to the hidden processing of data and the neglect of its critical properties, such as diversity, completeness, and timeliness, before building AI models. Furthermore, in MC-AI, the data are cleaned/pre-processed only once, and the rest of the development process usually focuses on code optimization. Performance-related problems are caused by the limited valuation of data while most attention goes to model-related aspects, as shown in Figure 1. In some cases, parameter optimization or data augmentation can solve performance-related problems, but the underlying AI model still cannot generalize well to unseen data/observations. Drift-related problems arise because MC-AI does not maintain versions of data and does not use 'datasheets for datasets' [60]. Sustainability problems are caused by extensive re-training of the AI model after minor tweaks to the code, with little attention to data optimizations, as shown in Figure 2. Repetitive training of AI models degrades environmental health and also incurs high energy costs during the development of large-scale models like LLMs. The affordance problem arises from solving data-related problems via adjustments in the code (i.e., learning rate, loss functions, averaging concepts, etc.); the black-box nature of models is also a main cause of poor affordance in some cases. Controllability problems are due to poor inductive bias and static decision-making rules, and they also occur when the AI model is not exposed to data from different situations/scenarios.
Based on the above analysis, it is fair to say that all these identified problems are either fully or partially caused by the MC-AI approach. The information in Table 6 can assist in grasping the efficacy of DC-AI in each problem domain of conventional MC-AI. From the analysis in Table 6, DC-AI can indeed solve many problems found in conventional MC-AI because data play an indispensable role in the lifecycle of AI systems. Furthermore, many innovative concepts, such as diverse data collection, aggregation, cleaning, and labeling, plus quality assurance, debugging, valuation, and accuracy/error analysis, make DC-AI well suited for AI applications worldwide. This analysis can pave the way toward understanding active and sustainable research trajectories concerning DC-AI around the world.

7.3. Insight into Suggested Solutions and Their Feasibility in Addressing the Identified Problems

In this subsection, we delve deeper into each solution and demonstrate its feasibility in addressing the crucial MC-AI problems. It is worth noting that most of the suggested solutions are tailored to data, and therefore, they can help address the crucial MC-AI problems listed in the preceding sections. Effective solutions to societal problems can increase the adoption of AI technology and its applications; suggested solutions such as data diversity enhancement, effective data management, data flow monitoring, feature map generation (or computing the influence of data points), regulatory measures such as the GDPR, privacy mechanisms like differential privacy, and responsible data science can help overcome the societal risks of MC-AI. Resolving performance-related issues can reduce the costs of AI development and assist in democratizing AI; suggested solutions such as constrained data augmentation, the right data for the right model, data documentation, encryption and anonymization methods, and corner-case analysis can help overcome performance-related problems. Avoiding drift problems can enhance people's trust in AI-based systems and reduce operational costs; suggested solutions like data visibility, availability, control, data completeness, freshness, relevance, and error analysis can help resolve drift-related problems. Solving the sustainability problems stemming from AI technology can improve environmental health, leading to improved quality of life; solutions like need-based model development, flagging vulnerabilities in data, obtaining data from all possible cases, and data condensation can help here. Recently, frameworks such as core-set selection, active learning, knowledge transfer, curriculum learning, data augmentation, depth-wise separable convolution, parameter pruning, and weight sharing have also been suggested to resolve the sustainability problems of AI/ML models [103,104].
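As an illustration of one of the performance-side solutions mentioned above, constrained data augmentation can be sketched as follows (a hypothetical example: small, bounded noise is added to numeric features so that the augmented samples stay within physically plausible limits and keep their labels valid):

```python
import random

def constrained_augment(sample, n_new, noise=0.05, bounds=(0.0, 1.0)):
    """Create label-preserving variants of a numeric feature vector by adding
    small random jitter; the clamp keeps values inside plausible bounds."""
    lo, hi = bounds
    variants = []
    for _ in range(n_new):
        jittered = [min(hi, max(lo, x + random.uniform(-noise, noise)))
                    for x in sample]
        variants.append(jittered)
    return variants

random.seed(0)  # reproducible augmentation
augmented = constrained_augment([0.20, 0.80, 0.99], n_new=3)
```

The constraint (the `bounds` clamp) is what distinguishes this from naive noise injection: a normalized feature can never leave [0, 1], so no augmented sample contradicts the data's semantics.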
Improved affordance can overcome the imbalance in technology adoption between small and big businesses. Suggested solutions like preparing the relevant workforce, developing automated data engineering tools, extensive use and evaluation of the developed tools, data customization, and governance can help address affordance problems. Furthermore, the implementation of NAIRR-like initiatives helps address the affordance problems of AI technology for low-resource organizations [88]. Lastly, solutions to control-related problems in AI can limit unintended consequences/harm to humans from AI technology. Solutions like context addition, data-based intelligence, repurposing of AI decisions, verifiability of AI results, improving data purity, resolving the overfitting/underfitting issues of AI models, and adding human intelligence can help resolve controllability problems. The enclosed knowledge sheds light on improving AI technology from most perspectives, which, in turn, can contribute to unlocking the potential of this technology in modern society.

7.4. Analysis of the Feasibility and Affordability of DC-AI-Based Solutions

Although DC-AI-based solutions are handy for resolving the problems of MC-AI, some of these solutions may not be easy to accomplish or reasonably priced. For example, developing AI products by following the DC-AI paradigm requires domain experts, which can be costly compared to the MC-AI approach. Similarly, DC-AI requires data collection from diverse domains/environments, which may increase data collection budgets (in some cases, data curation costs can spiral out of control). The rigorous application of DC-AI techniques may also increase development-to-deployment time if evaluation criteria are not well defined. In some cases, domain experts are not skillful across diverse domains or data modalities, which may lead to only marginal improvements in results. On the other hand, much effort has recently been devoted to increasing data quality with the help of automated pipelines, data engineering frameworks, and prototypes. The implementation of technical solutions like data lakes [105], low-latency data infrastructure [106], feature stores [107], data warehouses [108,109], data branching [110], AutoML for data management [111], data stewardships [112], data fusion techniques [113], data taxonomies [114], data-quality enhancement pipelines [115], data mesh and fabric [116], addressing imbalances in data [117], smart bots for data quality enhancement [118], data ontologies [119], data quality evaluation metrics [120], synthetic data generation tools [120], data profiling [121], reference stores for data quality [122], and data validation pipelines [123,124], to name a few, is contributing greatly to the feasibility and affordability of DC-AI-based solutions. In the future, more developments are expected in data quality enhancement, leading to the realization of DC-AI across many enterprises.
We analyzed the feasibility and affordability of the DC-AI-based solutions discussed in this article based on five criteria and a detailed investigation. The criteria were (i) technical papers published since 2021 centering on DC-AI-based solutions for MC-AI problems; (ii) the priority area(s), AI adoption methods, and computing infrastructure enhancement plans mentioned in the national AI strategies of the world's top 10 countries; (iii) critiques/viewpoints/future scope published in review articles about DC-AI; (iv) conferences organized under the theme of DC-AI, or special sections with a DC-AI theme organized within conferences; and (v) initiatives taken around the world to increase AI adoption and governance. For instance, many papers have already been published centering on the explainability of AI, improving either the data or algorithmic aspects of AI models, and therefore, it is fair to say that solutions proposed to enhance the explainability of AI are feasible and affordable [125]. Similarly, the USA is the leading country in AI technology development according to the global AI index (https://www.tortoisemedia.com/intelligence/global-ai/, accessed on 5 April 2024); it has the world's best talent, research and development environment, and commercialization of AI products, and can therefore implement any DC-AI-based solution to MC-AI problems with the least difficulty. As a result, DC-AI techniques can contribute widely to solving MC-AI problems, leading to higher feasibility and affordability of the DC-AI approach. Thus far, most of the papers published on DC-AI are surveys, reviews, or perspective articles with clear indications of the scope of DC-AI [126]. Therefore, one can adopt knowledge from those papers to develop DC-AI-based products/solutions, leading to higher feasibility and affordability of DC-AI-based solutions.
Recently, many conferences (https://nips.cc/Conferences/2021/Schedule?showEvent=21860, accessed on 5 April 2024) have been organized around the theme of DC-AI so that the academic community can explore its hidden potential. Interestingly, many competitions have also been organized around DC-AI, which can increase awareness of the paradigm and thereby its feasibility and affordability.
Lastly, many countries have taken cross-border initiatives (https://www.climatechange.ai/, accessed on 10 April 2024) to apply AI to global problems, transfer AI technologies, exchange data, and train the workforce. In some countries, DC-AI has also become part of the curriculum in data science and AI majors. Therefore, a huge change in the understanding, development, adoption, and governance of AI technology is expected. Table 7 presents the analysis in terms of the feasibility and affordability of DC-AI-based solutions for MC-AI problems. Specifically, we analyze the feasibility in terms of complexity, and affordability in terms of cost. Moreover, the ‘yes’ in Table 7 means that solutions are feasible/affordable.
From the results given in Table 7, it can be seen that most of the DC-AI-based solutions discussed in this paper are feasible and affordable, and can effectively address the problems faced by the MC-AI approach. For example, they are handy for solving the performance-, social-, and drift-related problems of MC-AI by optimizing data, providing better visibility into all parts of the data, and ensuring data completeness. However, some of the solutions suggested for the controllability, sustainability, and affordance of AI technology may not be fully feasible/affordable, owing to lower investment in terms of both money and research. Nevertheless, many DC-AI-powered developments are expected for all six types of problems, and therefore, it is fair to say that most solutions are feasible and affordable. Lastly, some of the solutions discussed in this paper highlight future research trajectories in DC-AI, which can pave the way for realizing the DC-AI paradigm.

7.5. Existing DC-AI Implementations or Success Stories

Of late, many efforts have been devoted to bringing DC-AI into practice and harnessing its potential. Detailed information related to the generic implementation of DC-AI can be learned from the GitHub repository (https://github.com/Data-Centric-AI-Community/awesome-data-centric-ai, accessed on 22 May 2024). This repository encompasses valuable resources such as tutorials, open-source libraries, and small-scale tools for DC-AI developments. Most of the tools and libraries have been developed in the Python language and can be adapted to curate good data for AI/ML models. The resources in this repository are categorized into data profiling, synthetic data generation, data labeling, and data preparation, with nine, eight, six, and one successful implementations, respectively. The successful implementations in data profiling are YData Profiling (https://github.com/ydataai/ydata-profiling, accessed on 22 May 2024), SweetViz (https://github.com/fbdesignpro/sweetviz, accessed on 22 May 2024), DataPrep.EDA (https://github.com/sfu-db/dataprep, accessed on 22 May 2024), AutoViz (https://github.com/AutoViML/AutoViz, accessed on 22 May 2024), Lux (https://github.com/lux-org/lux, accessed on 22 May 2024), Great Expectations (https://github.com/great-expectations/great_expectations, accessed on 22 May 2024), D-Tale (https://github.com/man-group/dtale, accessed on 22 May 2024), Data Profiler (https://github.com/capitalone/DataProfiler, accessed on 22 May 2024), and whylogs (https://github.com/whylabs/whylogs, accessed on 22 May 2024). Most of these tools are very easy to use and require few commands.
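The profiling tools listed above are full-featured, but the kind of column-wise summary they produce can be sketched in a few lines of standard-library Python (a simplified illustration, not the API of any of the tools above):

```python
from collections import Counter

def profile(rows):
    """Column-wise summary in the spirit of data-profiling tools:
    missing counts, distinct counts, and observed value types per column."""
    columns = rows[0].keys()
    report = {}
    for col in columns:
        values = [r.get(col) for r in rows]
        present = [v for v in values if v is not None]
        report[col] = {
            "missing": len(values) - len(present),   # None treated as missing
            "distinct": len(set(present)),
            "types": dict(Counter(type(v).__name__ for v in present)),
        }
    return report

rows = [
    {"age": 34, "city": "Seoul"},
    {"age": None, "city": "Seoul"},
    {"age": 29, "city": "Busan"},
]
report = profile(rows)
```

Such a report immediately surfaces data-quality issues (here, one missing `age`) before any model is trained, which is the DC-AI workflow these tools support at scale.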
The successful implementations in synthetic data generation are YData Synthetic (https://github.com/ydataai/ydata-synthetic, accessed on 22 May 2024), Synthpop (https://cran.r-project.org/web/packages/synthpop/index.html, accessed on 22 May 2024), DataSynthesizer (https://github.com/DataResponsibly/DataSynthesizer, accessed on 22 May 2024), SDV (https://github.com/sdv-dev/SDV, accessed on 22 May 2024), Pomegranate (https://github.com/jmschrei/pomegranate, accessed on 22 May 2024), Gretel Synthetics (https://github.com/gretelai/gretel-synthetics, accessed on 22 May 2024), Time-Series-Generator (https://github.com/Nike-Inc/timeseries-generator, accessed on 22 May 2024), and Zpy (https://github.com/ZumoLabs/zpy/, accessed on 22 May 2024). The successful implementations in data labeling are LabelImg (https://github.com/HumanSignal/labelImg, accessed on 22 May 2024), LabelMe (https://github.com/labelmeai/labelme, accessed on 22 May 2024), TagAnomaly (https://github.com/Microsoft/TagAnomaly, accessed on 22 May 2024), EchoML (https://github.com/ritazh/EchoML, accessed on 22 May 2024), LabelStudio (https://github.com/HumanSignal/label-studio, accessed on 22 May 2024), and Awesome Open Source Data Annotation & Labeling Tools (https://github.com/zenml-io/awesome-open-data-annotation, accessed on 22 May 2024). DataFix (https://github.com/AI-sandbox/DataFix, accessed on 22 May 2024) is a very helpful tool for data preparation, as it can detect shifts, identify the relevant features that cause a shift, and efficiently correct it. The above-mentioned tools can assist in preparing the best-quality data for AI/ML models and fixing quality-related problems in data of diverse types. Some insight into practical developments can also be gained from the Data-centric AI Resource Hub (https://datacentricai.org/, accessed on 22 May 2024).
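The generators above rely on GANs or rich statistical models; the core idea of statistical-model-based synthesis can be sketched minimally (an illustrative example under strong simplifying assumptions: each feature is fitted with an independent Gaussian and new records are sampled from it):

```python
import random
import statistics

def fit_gaussians(rows):
    """Fit a per-feature Gaussian (mean, stdev) to real tabular data."""
    columns = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]

def sample_synthetic(params, n, seed=42):
    """Draw n synthetic rows from the fitted per-feature Gaussians."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sigma) for mu, sigma in params] for _ in range(n)]

real = [[170, 65], [180, 80], [165, 60], [175, 72]]  # e.g. height/weight records
params = fit_gaussians(real)
synthetic = sample_synthetic(params, n=100)
```

Real tools model inter-feature correlations as well (e.g., via copulas or GANs); this sketch only shows why synthetic rows can stand in for scarce or privacy-sensitive originals while preserving marginal statistics.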
A DC-AI tool collection for unstructured data is available in a GitHub repository (https://github.com/Renumics/awesome-open-data-centric-ai, accessed on 23 May 2024). The tools are grouped into eleven categories: data visualization and interaction, data versioning, outlier and noise detection, embeddings and pre-trained models, data explainability, active learning, bias and fairness, uncertainty quantification, observability and monitoring of data, security and robustness, and augmentation and synthetic data. Four success stories for DC-AI are Landing AI (https://github.com/HazyResearch/data-centric-ai/blob/main/case-studies/landingai.md, accessed on 23 May 2024), Snorkel AI (https://github.com/HazyResearch/data-centric-ai/blob/main/case-studies/snorkelai.md, accessed on 23 May 2024), Gmail Extraction (https://github.com/HazyResearch/data-centric-ai/blob/main/case-studies/gmail_extraction.md, accessed on 23 May 2024), and drift controls [127]. Apart from these tools, libraries like Influenciae [61] have also been developed to identify under-performing sub-populations in training data. Recently developed tools like dcbench (https://github.com/data-centric-ai/dcbench, accessed on 23 May 2024) can perform many data-related operations (e.g., feature cleaning, core-set selection, and slice discovery) across the ML lifecycle. Tools for dataset condensation (https://github.com/justincui03/dc_benchmark, accessed on 23 May 2024) have also been developed to yield comparable performance on much smaller datasets [128]. Lastly, two working examples of DC-AI, fire risk prediction and diabetic retinopathy detection, underscore the success of the DC-AI paradigm [129]. We believe that more successful implementations and success stories will emerge as researchers/practitioners delve into this new paradigm in the near future.
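Dataset condensation and core-set selection, mentioned above, aim to replace a large dataset with a small, representative subset. A minimal sketch of greedy k-center core-set selection follows (an illustrative algorithm, not the implementation used in the cited benchmark): each step picks the point farthest from the points chosen so far, so a few points cover the data's spread.

```python
def kcenter_coreset(points, k):
    """Greedy k-center core-set selection: repeatedly add the point that is
    farthest from its nearest already-selected centre."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    selected = [0]  # seed with the first point
    while len(selected) < k:
        farthest = max(
            range(len(points)),
            key=lambda i: min(dist(points[i], points[j]) for j in selected),
        )
        selected.append(farthest)
    return selected

# Two tight clusters plus one outlier: the core-set should touch all three.
points = [(0, 0), (0.1, 0.1), (5, 5), (5.1, 5.0), (10, 0)]
idx = kcenter_coreset(points, k=3)
```

Training on the selected indices instead of the full set is one way to cut the repeated-training energy costs discussed in the sustainability analysis above.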

8. Enabling Technologies and Noteworthy Use Cases of DC-AI

In this section, we present DC-AI-enabling technologies, noteworthy use cases, drawbacks, and the future scope of DC-AI.

8.1. DC-AI Enabling Technologies

The DC-AI approach is an emerging paradigm within the AI community and is still in its infancy, providing much room for technical as well as theoretical development. In particular, there is a serious lack of practical tools/technologies in DC-AI compared to MC-AI, although some existing technologies have recently been adopted to realize DC-AI. Representative enabling technologies that have proven successful in DC-AI are highlighted in Figure 11.
Apart from the enabling technologies cited above, conventional pre-processing techniques such as feature engineering, exploratory data analysis, outlier detection, missing-value imputation, consistency analysis, variation modeling, and domain-specific techniques can be adapted for DC-AI. Furthermore, data literacy techniques can be used as enabling technologies in DC-AI [130]. Lastly, large-scale databases such as Elasticsearch, Neo4j, and MongoDB can be used as enabling technologies for storing clean versions of data.
Recently, some new tools have been developed, and existing technologies have been adopted to realize DC-AI; however, a substantial number of additional technologies are required to unlock its full potential. The most representative enabling technologies are discussed below.
  • Synthetic data: Data generated with the help of AI tools play a vital role in many AI applications, such as healthcare, finance, and biomedical research [131]. Synthetic data can be generated using generative adversarial networks (GANs) or statistical models that mirror the properties of the original data, and can potentially serve as an enabling technology in DC-AI. For example, synthetic data can overcome imbalance and scarcity issues [132], provide data augmentation by generating the required samples, and replace the original data when privacy requirements are strict [133]. Synthetic data can also increase diversity in the data, which can lead to bias-free, equity-aware, and reliable decision-making with AI approaches; address issues stemming from noisy data; and assist in realizing DC-AI on a wide scale.
  • Transfer learning: Transfer learning is a state-of-the-art approach that exchanges knowledge across related domains, with promising applications in healthcare, recommender systems, biomedical informatics, transportation, and urban computing, to name a few [134]. It aligns with fundamental concepts of DC-AI (taking as much advantage as possible of rich data sources and existing AI models) and can play a vital role as an enabling technology. For example, there is no need to develop a language model from scratch if a trained model (e.g., BERT) already exists; such a model can be loaded into a Python program with a few lines of code (e.g., via the Hugging Face transformers library), saving a significant amount of time and effort. Transfer learning has been successfully used in many real-world applications, such as text analysis [135]. In some cases, minor modifications to an already-developed model can do the job. In DC-AI, more emphasis is placed on the data; the model can simply be reused or imported, and therefore, transfer learning can play a vital role in this new paradigm.
  • Data-enhanced pipelines (or instruments): To solve noisy data and scarcity issues, many sophisticated pipelines have been developed recently to enhance the performance of machine learning techniques [136]. These pipelines can significantly enhance the accuracy of ML methods by utilizing a small portion of data. Mazumder et al. [38] developed the DataPerf tool (a scientific instrument) for addressing the data fragmentation problem and gauging the quality of testing and training data in real-world ML applications. The tool performs various data-focused operations to develop high-quality ML systems. Huang et al. [137] developed a platform that performs multiple operations on data by using open APIs to make data well-suited to computer vision tasks. This open-source platform (YMIR) has shown reliable results in many real-world applications. Eyuboglu et al. [138] developed an instrument named dcbench for improving the quality of training data. Specifically, these authors introduced a data cycle and model cycle to successfully build a data-centric application. Furthermore, dcbench can assist in evaluating whole systems for DC-AI development. These data-enhanced pipelines (or instruments) can act as enabling technologies for further developments in DC-AI.
  • Visualization tools: Since DC-AI often resembles the pre-processing step performed in most conventional MC-AI approaches, visualization tools can play a vital role in exploratory data analysis [139]. Visualization tools are good at identifying mislabeled data, missing values, and incomplete categories [139]. Sharma et al. [140] developed a pluggable and standalone visual tool for DC-AI that can assist in resolving many ambiguities stemming from data. Similarly, Jupyter notebook visual tools can be used to analyze image quality before feeding images into AI models [141]. Visual tools are handy for drawing researchers'/practitioners' attention to where labels are noisy and need to be fixed. Since visual tools can easily be used by non-AI experts, they can be effectively integrated into AI developments. Paiva et al. [142] developed a visual tool to gather useful and important insights (i.e., comprehensive observations) about the training data and an AI model's performance throughout its lifecycle. Visualization tools can help developers in debugging, and therefore, can be used as enabling technologies for DC-AI.
  • Pre-trained AI models: In many real-world AI applications, the same model is employed to solve a particular problem. For example, a CNN model is used to identify the gender and age of a person from a facial image. Similarly, a CNN model can also be used to recognize emotions from facial images. Because both problems are somewhat related, a pre-trained model can be used with minor modifications [143]. The main focus of DC-AI is on data, and therefore, the potential of pre-trained models can be exploited to realize DC-AI. For example, AI models that are used to diagnose one disease can be adapted to another with slight modifications. Similarly, AI models/methods to identify suitable drugs for one disease can be adapted to another disease with slight fine-tuning of the code. The same applies to natural language processing and text analysis [144]. In most cases, the pre-trained model can outperform models without pre-training and is therefore very useful in many domains [145]. Recently, pre-trained AI models have demonstrated effectiveness in resource-constrained devices [146]. Considering the benefits of pre-trained AI models, they can significantly help to accomplish some DC-AI-related tasks (fixing the code or fine-tuning the data).
  • Data compilers and debuggers: To unlock the upcoming era of DC-AI, sophisticated tools that can compile data and guide developers to its flaws are of pivotal importance. In addition, allowing developers to debug data as needed is a handy step toward extracting the full potential from the data. To this end, data compilers and debuggers with tailored functionalities can play a vital role in enabling DC-AI. Ziogas et al. [147] developed a data-centric Python that enables high-performance workflows across different architectures by exploiting DC-AI features; the proposed workflow yielded excellent results in terms of scalability, computing time, and accuracy. Karla et al. [148] developed a data debugging tool named DataScope for ML applications; DataScope is four times faster than existing SOTA workflows for data debugging in ML pipelines. Furthermore, recent experiments suggest that a closer look at the data (i.e., discarding misleading examples) can enhance the performance of existing AI technologies by forwarding only relevant data points to the training process [149]. Tools for automated data distribution analysis also assist in ML technology development [150]. Hence, data compilers and debuggers that spot vulnerabilities in data can play a vital role in DC-AI as an enabling technology.
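The data-debugging idea of discarding misleading examples can be illustrated with a simple label-noise check (a hypothetical sketch, unrelated to DataScope's actual method): flag training points whose label disagrees with the majority label of their nearest neighbours.

```python
def flag_suspect_labels(points, labels, k=3):
    """Flag indices whose label disagrees with the majority label of their
    k nearest neighbours (a crude mislabel detector for a data debugger)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))  # squared Euclidean

    suspects = []
    for i, p in enumerate(points):
        neighbours = sorted((j for j in range(len(points)) if j != i),
                            key=lambda j: dist(p, points[j]))[:k]
        votes = sum(labels[j] for j in neighbours)       # binary labels
        majority = 1 if votes * 2 > k else 0
        if majority != labels[i]:
            suspects.append(i)
    return suspects

# Two clean clusters; the last point carries a label inconsistent with its cluster.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 6), (6, 5)]
labels = [0, 0, 0, 0, 1, 1, 1, 0]  # index 7 looks mislabeled
suspects = flag_suspect_labels(points, labels)
```

Removing or relabeling the flagged points before training is exactly the kind of data-side fix that DC-AI prefers over tweaking model code.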
Apart from the enabling technologies cited above, developing appropriate data pruning strategies while sustaining accuracy is another promising avenue to be explored in the near future [151]. Lastly, upgrading the conventional pre-processing techniques to enhance data quality and consistency is required to unlock the full potential of DC-AI approaches in future endeavors.
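As a concrete, hedged illustration of the data-debugging idea (this is not the DataScope API, just a confident-learning-style check built from standard scikit-learn calls), the sketch below uses out-of-fold predicted probabilities to flag examples whose given label looks implausible; the 0.2 threshold is an arbitrary choice for this toy example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy dataset with a handful of deliberately corrupted labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
y_noisy = y.copy()
flipped = np.random.RandomState(0).choice(len(y), size=25, replace=False)
y_noisy[flipped] ^= 1                    # flip 25 binary labels

# Out-of-fold predicted probabilities: each example is scored by a model
# that never saw it during training, so confident disagreement is telling.
proba = cross_val_predict(LogisticRegression(max_iter=500), X, y_noisy,
                          cv=5, method="predict_proba")

# Flag examples whose given label receives very low out-of-fold probability.
suspect = np.where(proba[np.arange(len(y_noisy)), y_noisy] < 0.2)[0]
print(f"{len(suspect)} suspicious labels flagged for review")
```

A data debugger built on this principle would route the flagged indices to a human annotator rather than silently discarding them.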

8.2. Noteworthy Use Cases for DC-AI

Although DC-AI is applicable to any real-world problem involving limited or low-quality data, it is especially suitable for sectors where the MC-AI approach has failed. We therefore identified multiple use cases where applying DC-AI can yield more fruitful results than the MC-AI approach alone. These use cases include pandemic/epidemic management, fault analysis in machinery, scenarios involving noise during data collection and aggregation, analysis of plant diseases, monitoring illegal parking in congested urban areas, behavior analysis, vaccine side-effect analysis, medical applications, automated decision-making, chatbots, mobile doctors, fintech, extraction of textual data from images, activity recognition, the automotive industry, predictive maintenance, rare disease analysis, stock market analysis, data-powered streaming applications, and time-series analysis, to name a few. Furthermore, DC-AI can be utilized when greater changes are expected in data over a short period (i.e., when drift issues are likely to occur).

9. Discussion

In this section, we discuss the main results obtained, the limitations, and the main trends and challenges identified in relation to the DC-AI paradigm. We also provide promising future directions for DC-AI research and development.

9.1. Main Results Obtained

In this paper, we discussed the two main tracks of AI technology development: MC-AI and DC-AI. The former is well known in the AI community and has been extensively researched. In contrast, the latter is relatively new and gained momentum in March 2021, when Andrew Ng envisioned that 50 thoughtfully engineered images/examples can be sufficient to train a neural network [141]. Since then, DC-AI approaches have been tested in different domains, and DC-AI has become a fascinating research topic. Though MC-AI is beneficial in some scenarios, DC-AI is an ideal solution when data are limited or curating more/fresh data is challenging. We described the workflows of both tracks to help researchers/practitioners understand them. We identified different problems of MC-AI and grouped them into six broad categories: social, performance, drifts, sustainability, affordance, and controllability, identifying 7, 5, 3, 4, 4, and 5 sub-problems in these categories, respectively (as given in Figure 6b). We then discussed the DC-AI paradigm, including its workflow, benefits, tools/techniques, implementation framework, best practices, and the datasets used in DC-AI research. Afterward, we explored the features/characteristics of DC-AI that might solve the problems of conventional MC-AI, as shown in Figure 8, and provided a feasibility and affordability analysis of the proposed solutions in Table 7. It is worth noting that most of the DC-AI solutions proposed for solving MC-AI problems are theoretical, and experimental evaluation/analysis is yet to be done. We also discussed the use cases and enabling technologies concerning DC-AI, listed some valuable sources offering insight into the successful implementation of DC-AI techniques, and, lastly, discussed important challenges/issues with DC-AI and avenues for future research.
The presented analysis can be very useful for researchers and practitioners who intend to explore this emerging paradigm.

9.2. Limitations/Drawbacks of DC-AI

It is worth noting that DC-AI has certain limitations/drawbacks that require the immediate attention of the AI community. For example, DC-AI emphasizes diversity in the data, but diversity is sometimes very difficult to achieve, owing to data collection from similar groups of people or domains. Similarly, data completeness checks are necessary in DC-AI, but they can be very complex, particularly when the data engineers are not experts in the application's field. Effective data utilization and quality enhancement, including data augmentation, preparation, and cleaning, are not possible in DC-AI without sophisticated expertise and skills [93]. Furthermore, handling massive amounts of data in which most records/images are manually labeled poses stringent challenges for the data auditing/governance process, and fixing data quality issues by identifying bias in such data is challenging in the absence of automated tools. Likewise, identifying and fixing biases across all data modalities is tricky because data engineers may not be familiar with every modality/format. There also exists a misconception in the AI community about DC-AI and pre-processing: many AI practitioners regard DC-AI as merely an alternative to pre-processing, which may limit DC-AI adoption worldwide. Finally, in the absence of practical tools, DC-AI may not be effective in improving the quality of real-time data stemming from heterogeneous sources. These challenges (the misconception that DC-AI is simply pre-processing or merely a complementary approach to MC-AI, the lack of a DC-AI workforce, and the lack of practical tools for real-time data) remain for now, but they are likely to be overcome as DC-AI development matures.
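Several of the limitations above (completeness checks, bias identification, auditing at scale) reduce in practice to systematically profiling a dataset. The fragment below is a minimal, illustrative audit on a small made-up table; real DC-AI tooling would automate far richer checks, and the column names here are our own assumptions.

```python
import numpy as np
import pandas as pd

# A small patient-style table with gaps, duplicates, and a skewed
# group distribution (all values invented for illustration).
df = pd.DataFrame({
    "age":    [34, 51, np.nan, 29, 41, 41, 60, np.nan],
    "region": ["north", "north", "north", "north",
               "north", "north", "south", "north"],
    "label":  [0, 0, 0, 1, 0, 0, 0, 0],
})

report = {
    # completeness: missing values per column
    "missing_per_column": df.isna().sum().to_dict(),
    # redundancy: exact duplicate rows (pandas treats NaN == NaN here)
    "duplicate_rows": int(df.duplicated().sum()),
    # representation bias: share of each demographic group
    "group_shares": df["region"].value_counts(normalize=True).round(2).to_dict(),
    # label skew: positive-class rate
    "positive_rate": float(df["label"].mean()),
}
print(report)
```

Even this crude report surfaces the issues the text warns about: two missing ages, duplicated records, a region distribution dominated by one group, and a heavily imbalanced label.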

9.3. Main Trends in DC-AI Research

DC-AI has been gaining compelling interest from both academia and industry, as MC-AI alone is insufficient to solve all problems. Through a detailed analysis of recently published studies, we identified twelve main trends in DC-AI research/development: (i) accuracy enhancement toward the optimal limit (e.g., ∼100%), (ii) transitioning the structure of confusion matrices from skewed to balanced, (iii) training models with less data without losing accuracy, (iv) modifying the structure of AI models based on data characteristics, (v) enhancing the generalizability and robustness of AI models through data manipulation, (vi) overcoming bias and unfairness in AI decisions, (vii) applying AI methods to unexplored topics/domains, (viii) benchmark dataset preparation for AI research, (ix) conducting competitions on data optimization and quality assessment, (x) tool development for addressing data quality-related issues, (xi) synthetic data generation to resolve class imbalance problems, and (xii) low-cost strategies to fuse synthetic data with real data. Apart from these trends, some studies are focusing on making AI models more interpretable and explainable. Lastly, some studies are focusing on lowering the harms of AI through this paradigm and extending AI benefits to marginalized groups of our society.
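Trend (xi), synthetic data generation for class imbalance, can be sketched as simple pairwise interpolation between minority samples (a SMOTE-style toy, not a production method; note that naive oversampling can be unreliable in many cases [154]). The data and counts below are arbitrary assumptions.

```python
import numpy as np

def interpolate_minority(X_min, n_new, rng):
    """Generate synthetic minority samples by linearly interpolating
    randomly chosen pairs of real minority samples (SMOTE-style sketch)."""
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_min), size=n_new)
    t = rng.random((n_new, 1))           # interpolation weights in [0, 1)
    return X_min[i] + t * (X_min[j] - X_min[i])

rng = np.random.default_rng(0)
X_majority = rng.normal(0.0, 1.0, size=(500, 4))   # abundant class
X_minority = rng.normal(3.0, 1.0, size=(30, 4))    # rare class

# Top the minority class up to parity with the majority class.
X_new = interpolate_minority(X_minority, n_new=470, rng=rng)
balanced = np.vstack([X_majority, X_minority, X_new])
print(balanced.shape)
```

Because each synthetic point lies on a segment between two real minority points, the generated data stay inside the minority region, which is both the appeal and the limitation of this family of techniques.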

9.4. Key Challenges in DC-AI

Although some real-world examples of DC-AI’s successful implementation exist, it is still very challenging to apply this paradigm in real settings. In this work, we identify seven crucial challenges/issues of the DC-AI paradigm, which are discussed below:
  • Identification of pertinent DC-AI scenarios: In practical AI applications, there exist many scenarios in which large-scale data are imperative to serve users. For example, large language models (LLMs) require a large amount of data for training and for generating answers to users' prompts. In contrast, some scenarios like stroke prediction (or anomaly detection) may require small-scale data of better quality. Hence, it is very challenging to identify the pertinent scenarios in which DC-AI is more suitable than MC-AI.
  • Time-consuming operations: In DC-AI, the inspection of each data point (or observation) is required to identify the faulty parts of data. Furthermore, extensive collaboration is required between AI experts and domain experts to prepare reliable data, which can prolong the development lifecycle of AI projects. In addition, some operations like data labeling, correcting wrong labels, and removing redundancies are very time-consuming, particularly when automated tools are unavailable.
  • Lack of benchmark scores and practical tools: Thus far, DC-AI has been applied to only a few scenarios, and its full potential in diverse scenarios is yet to be explored. Hence, there is a lack of benchmark scores in most cases. In addition, only a few practical tools exist that can assist practitioners in curating data of the best quality. In the absence of benchmark scores and practical tools, it is very challenging to verify the effectiveness of DC-AI approaches and to lower the difficulty of applying DC-AI to real-world problems.
  • Lack of generalized solutions: Most of the available DC-AI solutions are specific to some scenarios only, and cannot be generically applied to different scenarios. Hence, applying existing DC-AI techniques/solutions to diverse scenarios is very challenging. Furthermore, selecting optimized combinations of DC-AI solutions for different AI models is very challenging.
  • Application of DC-AI to data from heterogeneous sources: Applying DC-AI techniques to heterogeneous data stemming from diverse sources is very challenging. Integrating heterogeneous data sources requires consistency, reliability, synergism, and alignment, and therefore careful attention to data pre-processing, cleaning, formatting, and integration. By using a data-centric approach, we can identify and address data quality issues, such as missing or inconsistent data, and ensure that the final merged data product is accurate, robust, and dependable. DC-AI can also contribute to curating fused data that can be used to develop data-driven solutions for industrial applications. However, it can be challenging to curate good-quality data when different sources are involved in the data generation process. For example, autonomous vehicles rely on multiple sources of data, such as cameras, lidars, radars, and GPS, to perceive their environment and make decisions, and applying DC-AI to resolve all kinds of data-related problems in such settings is challenging. Similarly, DC-AI approaches are imperative in industrial applications to identify faults in machines or predict their remaining useful life by exploiting information from multiple sensors. In these scenarios, applying DC-AI is challenging due to a higher imbalance in the data and various kinds of noise.
  • Lack of domain experts: DC-AI encompasses a broad range of operations and steps, which can vary from domain to domain and from data to data. For instance, DC-AI techniques that are well suited to tabular data may not be good for image data, and domain experts may lack expertise in dealing with data of diverse modalities. Therefore, it is very challenging to engage domain experts with deep knowledge of DC-AI tools and techniques. In some cases, the domain experts may not be familiar with the relevant technical tools, which can lead to longer development times for AI products.
  • Optimal combinations of DC-AI techniques for enhancing model robustness and generalizability: In recent times, AI models with higher generalizability and robustness have been in huge demand, and DC-AI can contribute to achieving these goals. However, choosing optimal combinations of DC-AI techniques is very challenging, as some combinations may lead to deficient performance. For example, the concepts of augmentation for tabular data and image data are quite different, and distinct techniques are required for these two modalities. To this end, the selection of optimal combinations of DC-AI techniques for enhancing model robustness and generalizability is very challenging.
Apart from the above-cited challenges, it is very difficult to separate the good and faulty parts of data, particularly when the data size is large. Lastly, ensuring the diversity and quality of data in some scenarios might be very challenging because of limited data collection budgets or low-fidelity synthetic data.
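To make the heterogeneous-sources challenge above concrete, the sketch below fuses two small source tables (the machine IDs, column names, and values are hypothetical) with an outer merge and surfaces the alignment and completeness issues that a data-centric workflow would route to review before any model sees the data.

```python
import pandas as pd

# Two hypothetical sources describing the same fleet of machines.
sensors = pd.DataFrame({"machine_id": [1, 2, 3, 4],
                        "temp_c": [61.2, 58.9, None, 75.4]})
logs = pd.DataFrame({"machine_id": [1, 2, 3, 5],
                     "fault_code": ["OK", "OK", "E12", "E07"]})

# An outer merge with an indicator column exposes alignment problems:
# records present in only one source, plus missing sensor readings.
merged = sensors.merge(logs, on="machine_id", how="outer", indicator=True)
issues = merged[(merged["_merge"] != "both") | merged["temp_c"].isna()]
print(issues[["machine_id", "_merge"]])
```

Here machine 3 has a missing temperature, machine 4 never appears in the fault logs, and machine 5 has no sensor record at all: exactly the kind of inconsistencies that must be resolved before the fused dataset is trustworthy.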

9.5. Future Directions

In this subsection, we list ten emerging avenues for future research focusing on DC-AI:
  • Developing synthetic data generation methods/techniques that can generate high-fidelity data for data diversity and quantity enhancement is an important topic for future research. Furthermore, developing tools that can figure out data quality-related problems at low cost is a promising avenue for future research.
  • The development of learning algorithms for fixing problems in training data will be an emerging area of research in the coming years. Recently, techniques such as active learning, confident learning, and isolation forests have been helping to address data quality problems. However, more comparable tools are needed to address data quality-related problems in diverse datasets.
  • Development of generalized DC-AI solutions that can work with diverse AI models is a vibrant area of research. Most of the current solutions are data-type- or model-specific, and therefore, it is imperative to make existing tools more generic and model-agnostic.
  • Developing low-code or no-code tools like KNIME (https://www.knime.com/, accessed on 23 May 2024) to reduce the time data engineers spend on data preparation and quality enhancement. These tools are mainly GUI-based and can be leveraged to prepare good data for training diverse AI models.
  • Developing a new set of DC-AI methods that can open up the black box of AI models and assist in enhancing the explainability and interpretability of AI model results. The current DC-AI solutions address only limited aspects of XAI/interpretation, and therefore, enhancements are needed in such tools.
  • Development of low-cost methods to pinpoint and address data quality problems/vulnerabilities in large-scale and heterogeneous datasets. Most of the existing developments are applied to small-scale datasets or simple data modalities (e.g., tabular, time series, etc.), and therefore, the promise of applying DC-AI to complex datasets remains unexplored. To this end, it is vital to develop DC-AI techniques for large-scale and complex datasets.
  • Exploring the potentials of DC-AI-enabling technologies (e.g., transfer learning, pre-trained models, data compilers, knowledge distillation, etc.) is another important avenue for future research. The utilization of advanced computing infrastructure such as GPUs, TPUs (tensor processing units), accelerators, etc., for DC-AI implementation is an important direction for future research, particularly when the data size is large.
  • The development of end-to-end pipelines for data quality enhancement is an important area for future research. Such pipelines should identify and address data quality problems emerging at different stages (data preparation, model training, model testing, model deployment, etc.) of AI model development.
  • Exploring the use of generative AI for identifying and solving data quality-related problems in complex scenarios, such as in the areas of smart cities, urban planning, healthcare, autonomous systems, etc., is an important topic for future research. Similarly, developing small-scale prototypes for fixing data vulnerabilities by relying on assistance from generative AI is another promising area for future research.
  • Optimizing existing DC-AI techniques and extending their applicability to other data modalities or diverse scenarios is an important avenue for future research. Most of the existing tools are time-consuming or address only limited aspects of data quality enhancement. Hence, optimizing these methods by introducing new methods/procedures is a vibrant area of research.
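The isolation forests mentioned in the second bullet above can be applied to training data itself: records that are easy to isolate are candidates for manual data-quality review. The sketch below is illustrative only; the synthetic data and the contamination rate are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Mostly clean sensor-like readings plus a few grossly corrupted records.
clean = rng.normal(0.0, 1.0, size=(300, 3))
corrupt = rng.normal(8.0, 1.0, size=(6, 3))
X = np.vstack([clean, corrupt])

# An isolation forest scores how easily each point can be isolated by
# random splits; the most easily isolated points get flagged (-1).
iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
flags = iso.predict(X)
flagged_idx = np.where(flags == -1)[0]
print(f"{len(flagged_idx)} records flagged for data-quality review")
```

Unlike a hard rule-based filter, the flagged records are not deleted automatically; in a DC-AI workflow they would be routed to a human for inspection, since an "outlier" may be a rare but valid example.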
Apart from the above-cited solutions, exploring practical ways to visualize equitable data utilization during AI model training through feature maps or similar methods is an important area for future research [152,153]. Exploring ways to reduce the complexity or parameter size of AI models by exploiting characteristics/knowledge of data is a very hot research topic nowadays. Similarly, reducing data dimensions and size without compromising the performance of AI models is a promising area of research in the ML community. In addition, developing compiler-like tools that find problems in data of different modalities at the lowest possible computing cost is a vibrant area of research. Lastly, evaluating the training data used in AI/ML models is very challenging, and therefore, developing comprehensive methodologies/strategies for this purpose is a vibrant area of research.
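One way to read the end-to-end-pipeline direction above is that data-quality repairs should live inside the model pipeline, so the same repairs run identically at training and deployment time. The sketch below, a scikit-learn Pipeline on synthetic data with injected missing values, is a minimal illustration rather than a complete DC-AI pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
# Simulate a data-quality problem that appears upstream of training.
X[::17, 2] = np.nan

# Each stage repairs or normalises the data before the model sees it,
# and the fitted stages are reused verbatim at inference time.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fix missing values
    ("scale", StandardScaler()),                   # normalise features
    ("model", LogisticRegression(max_iter=500)),
])
pipe.fit(X[:300], y[:300])
acc = pipe.score(X[300:], y[300:])
print(f"held-out accuracy with quality repairs in the loop: {acc:.2f}")
```

Packaging the repairs and the model as one object avoids training/serving skew: the exact imputation statistics and scaling parameters learned during training are applied to every future record.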

10. Conclusions

This work has presented a detailed analysis of the two main tracks used in AI technology development. Specifically, we presented different topics associated with model-centric AI (MC-AI) and data-centric AI (DC-AI) that can help researchers/practitioners understand the changing landscape of AI developments. We analyzed various crucial problems of MC-AI and classified them into six categories. Through an analysis of the published literature, we found that some of these problems are directly caused by MC-AI, while others are worsened by it. We then discussed the DC-AI paradigm and its associated topics and found that DC-AI can likely solve the crucial problems of MC-AI; we suggested potential solutions, case by case, by discussing the features/characteristics of this new paradigm. We also provided different DC-AI techniques, implementation frameworks, practical guidelines, DC-AI activities for different stages of project development, and sources of datasets used in DC-AI research. To foster DC-AI, we described various enabling technologies with procedures to bring DC-AI into practice. Furthermore, we highlighted the challenges/issues with DC-AI and suggested future avenues for research. Lastly, we explored the possibilities of solving global challenges such as climate change and supply chain management, which are not yet addressable with conventional MC-AI. The potential value of DC-AI lies in counteracting some dominant research trends in AI that may not yield desirable results; for example, it has been suggested to limit the use of oversampling for imbalanced learning problems, as it is unreliable in most cases [154]. DC-AI can contribute to lowering the energy cost of training large models by training them with less but good-quality data, and to enhancing the general public's trust in AI systems by solving bias and unfairness issues.
It can also foster technological development for fixing errors in data, which can advance AI systems from a technical perspective. It can assist in accomplishing data governance and the responsible use of AI systems, which is a growing demand in modern times. Lastly, it can enhance generalization in AI models, which is a vital step toward the realization of artificial general intelligence (AGI). However, to realize the possibilities of DC-AI, more work is needed on solving MC-AI problems as well as global problems. It is important to note that DC-AI is not a silver bullet, and MC-AI does yield good results in specific areas. Therefore, multidisciplinary collaborations are vital for identifying suitable application areas and applying the relevant AI approach (DC-AI or MC-AI) in the coming years.
To navigate the potential harms of AI and to advance AI solutions, interdisciplinary collaborations and teamwork are imperative. Examples of successful interdisciplinary collaborations in the AI field include GPAI (https://gpai.ai/, accessed on 22 May 2024), Oxford Insights (https://oxfordinsights.com/, accessed on 22 May 2024), the Center for AI and Digital Policy (CAIDP) (https://www.caidp.org/reports/aidv-2023/aidv-maps-2023/, accessed on 22 May 2024), and the NCSC (https://www.ncsc.gov.uk/, accessed on 22 May 2024). In all these initiatives, experts from diverse fields discuss and propose ways to advance AI while lowering its unintended consequences for humans. For example, GPAI is a global initiative in which experts from different regions of the world collaborate and develop plans for responsible AI; its many international projects on themes such as data governance, responsible AI, and algorithm transparency are shaping policies in view of the rapidly changing AI landscape. Oxford Insights discusses and proposes ways to harness the potential of AI technology through teamwork. The CAIDP analyzes the national AI strategies of different countries, develops country rankings through national and international collaborations, and visualizes the status of AI development on geographic maps. The NCSC develops policies for securing AI systems without compromising quality of service (QoS) through collaborations with associated states. All these examples indicate the value of interdisciplinary collaborations and teamwork in AI, whether to lower its harms or to advance existing solutions for better QoS. Additionally, some practical DC-AI tools have also been developed through teamwork and interdisciplinary collaborations to advance AI solutions [155].
Recently, a global AI summit (https://www.reuters.com/technology/south-korea-uk-co-host-second-global-ai-summit-boom-fans-risks-2024-05-20/, accessed on 23 May 2024) under the theme of ‘AI safety’ was arranged through an international network of ten countries and the European Union to highlight the AI models’ risks, opportunities, and limitations. Based on the above analysis, it is fair to say that teamwork and interdisciplinary collaborations are vital and significantly contribute to either advancing AI solutions or curbing undesirable outcomes. Our detailed analysis of the recent development in the DC-AI paradigm while keeping MC-AI in the loop will be very helpful in understanding these two main tracks from a broader perspective. Lastly, DC-AI will have a very big impact on our society [22], and therefore, our work can lay a solid foundation for further studies in this line of work.

Author Contributions

All authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (Ministry of Science and ICT) (RS-2024-00340882).

Data Availability Statement

Data, URLs, and supportive studies are contained within this article.

Acknowledgments

The authors thank the four anonymous reviewers who examined this article and provided very constructive feedback, which significantly enhanced the quality of this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kumar, S.; Datta, S.; Singh, V.; Singh, S.K.; Sharma, R. Opportunities and Challenges in Data-Centric AI. IEEE Access 2024, 12, 33173–33189. [Google Scholar] [CrossRef]
  2. Motamedi, M.; Sakharnykh, N.; Kaldewey, T. A data-centric approach for training deep neural networks with less data. arXiv 2021, arXiv:2110.03613. [Google Scholar]
  3. Schmarje, L.; Grossmann, V.; Zelenka, C.; Dippel, S.; Kiko, R.; Oszust, M.; Pastell, M.; Stracke, J.; Valros, A.; Volkmann, N.; et al. Is one annotation enough? A data-centric image classification benchmark for noisy and ambiguous label estimation. arXiv 2022, arXiv:2207.06214. [Google Scholar]
  4. Kumar, P.; Chauhan, S.; Awasthi, L.K. Artificial intelligence in healthcare: Review, ethics, trust challenges & future research directions. Eng. Appl. Artif. Intell. 2023, 120, 105894. [Google Scholar]
  5. Moor, M.; Banerjee, O.; Abad, Z.S.H.; Krumholz, H.M.; Leskovec, J.; Topol, E.J.; Rajpurkar, P. Foundation models for generalist medical artificial intelligence. Nature 2023, 616, 259–265. [Google Scholar] [CrossRef] [PubMed]
  6. Rajaraman, S.; Zamzmi, G.; Yang, F.; Xue, Z.; Antani, S.K. Data Characterization for Reliable AI in Medicine. In Recent Trends in Image Processing and Pattern Recognition: Proceedings of the 5th International Conference, RTIP2R 2022, Kingsville, TX, USA, 1–2 December 2022; Revised Selected Papers; Springer: Berlin/Heidelberg, Germany, 2023; pp. 3–11. [Google Scholar]
  7. Nevarez, Y.; Beering, A.; Najafi, A.; Najafi, A.; Yu, W.; Chen, Y.; Krieger, K.L.; Garcia-Ortiz, A. CNN Sensor Analytics with Hybrid-Float6 Quantization on Low-Power Embedded FPGAs. IEEE Access 2023, 11, 4852–4868. [Google Scholar] [CrossRef]
  8. Jin, H.; Wu, D.; Zhang, S.; Zou, X.; Jin, S.; Tao, D.; Liao, Q.; Xia, W. Design of a Quantization-based DNN Delta Compression Framework for Model Snapshots and Federated Learning. IEEE Trans. Parallel Distrib. Syst. 2023, 34, 1–15. [Google Scholar] [CrossRef]
  9. Liang, Y.; Wu, C.; Song, T.; Wu, W.; Xia, Y.; Liu, Y.; Ou, Y.; Lu, S.; Ji, L.; Mao, S.; et al. Taskmatrix.AI: Completing tasks by connecting foundation models with millions of apis. arXiv 2023, arXiv:2303.16434. [Google Scholar] [CrossRef]
  10. Houston, A.; Cosma, G. A genetically-optimised artificial life algorithm for complexity-based synthetic dataset generation. Inf. Sci. 2023, 619, 540–561. [Google Scholar] [CrossRef]
  11. Li, M.; Zhuang, D.; Chang, J.M. MC-GEN: Multi-level clustering for private synthetic data generation. Knowl.-Based Syst. 2023, 264, 110239. [Google Scholar] [CrossRef]
  12. Majeed, A.; Hwang, S.O. Data-Centric Artificial Intelligence, Preprocessing, and the Quest for Transformative Artificial Intelligence Systems Development. Computer 2023, 56, 109–115. [Google Scholar] [CrossRef]
  13. Kreuzberger, D.; Kühl, N.; Hirschl, S. Machine learning operations (mlops): Overview, definition, and architecture. IEEE Access 2023, 11, 31866–31879. [Google Scholar] [CrossRef]
  14. Steidl, M.; Felderer, M.; Ramler, R. The pipeline for the continuous development of artificial intelligence models—Current state of research and practice. J. Syst. Softw. 2023, 199, 111615. [Google Scholar] [CrossRef]
  15. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 130, 2337–2348. [Google Scholar] [CrossRef]
  16. Liang, W.; Tadesse, G.A.; Ho, D.; Fei-Fei, L.; Zaharia, M.; Zhang, C.; Zou, J. Advances, challenges and opportunities in creating data for trustworthy AI. Nat. Mach. Intell. 2022, 4, 669–677. [Google Scholar] [CrossRef]
  17. Jakubik, J.; Vössing, M.; Kühl, N.; Walk, J.; Satzger, G. Data-centric artificial intelligence. Bus. Inf. Syst. Eng. 2024, 1–9. [Google Scholar] [CrossRef]
  18. Clemente, F.; Ribeiro, G.M.; Quemy, A.; Santos, M.S.; Pereira, R.C.; Barros, A. ydata-profiling: Accelerating data-centric AI with high-quality data. Neurocomputing 2023, 554, 126585. [Google Scholar] [CrossRef]
  19. Luley, P.P.; Deriu, J.M.; Yan, P.; Schatte, G.A.; Stadelmann, T. From concept to implementation: The data-centric development process for AI in industry. In Proceedings of the 2023 10th IEEE Swiss Conference on Data Science (SDS), Zurich, Switzerland, 22–23 June 2023; pp. 73–76. [Google Scholar]
  20. Holstein, J. Bridging Domain Expertise and AI through Data Understanding. In Proceedings of the IUI’24 Companion: 29th International Conference on Intelligent User Interfaces, Greenville, SC, USA, 18–21 March 2024; pp. 163–165. [Google Scholar]
  21. Angelakis, A.; Rass, A. A data-centric approach to class-specific bias in image data augmentation. arXiv 2024, arXiv:2403.04120. [Google Scholar]
  22. Kumar, S.; Sharma, R.; Singh, V.; Tiwari, S.; Singh, S.K.; Datta, S. Potential Impact of Data-Centric AI on Society. IEEE Technol. Soc. Mag. 2023, 42, 98–107. [Google Scholar] [CrossRef]
  23. Zha, D.; Lai, K.H.; Yang, F.; Zou, N.; Gao, H.; Hu, X. Data-centric AI: Techniques and Future Perspectives. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023; pp. 5839–5840. [Google Scholar]
  24. Huynh, N.; Berrevoets, J.; Seedat, N.; Crabbé, J.; Qian, Z.; van der Schaar, M. DAGnosis: Localized Identification of Data Inconsistencies using Structures. arXiv 2024, arXiv:2402.17599. [Google Scholar]
  25. Ilager, S.; De Maio, V.; Lujic, I.; Brandic, I. Data-centric Edge-AI: A Symbolic Representation Use Case. In Proceedings of the 2023 IEEE International Conference on Edge Computing and Communications (EDGE), Chicago, IL, USA, 2–8 July 2023; pp. 301–308. [Google Scholar]
  26. Elhefnawy, M.; Ouali, M.S.; Ragab, A.; Amazouz, M. Fusion of heterogeneous industrial data using polygon generation & deep learning. Results Eng. 2023, 19, 101234. [Google Scholar]
  27. Zha, D.; Bhat, Z.P.; Lai, K.H.; Yang, F.; Jiang, Z.; Zhong, S.; Hu, X. Data-centric artificial intelligence: A survey. arXiv 2023, arXiv:2303.10158. [Google Scholar]
  28. Whang, S.E.; Roh, Y.; Song, H.; Lee, J.G. Data collection and quality challenges in deep learning: A data-centric ai perspective. VLDB J. 2023, 32, 791–813. [Google Scholar] [CrossRef]
  29. Zeiser, A.; Özcan, B.; Kracke, C.; van Stein, B.; Bäck, T. A data-centric approach to anomaly detection in layer-based additive manufacturing. AT-Automatisierungstechnik 2023, 71, 81–89. [Google Scholar] [CrossRef]
  30. Hamid, O.H. Data-Centric and Model-Centric AI: Twin Drivers of Compact and Robust Industry 4.0 Solutions. Appl. Sci. 2023, 13, 2753. [Google Scholar] [CrossRef]
  31. Hamid, O.H. From Model-Centric to Data-Centric AI: A Paradigm Shift or Rather a Complementary Approach? In Proceedings of the 2022 8th International Conference on Information Technology Trends (ITT), Dubai, United Arab Emirates, 25–26 May 2022; pp. 196–199. [Google Scholar]
  32. Majeed, A.; Hwang, S.O. Technical Analysis of Data-Centric and Model-Centric Artificial Intelligence. IT Prof. 2023, 25, 62–70. [Google Scholar] [CrossRef]
  33. Hegde, C. Anomaly Detection in Time Series Data using Data-Centric AI. In Proceedings of the 2022 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), Bangalore, India, 8–10 July 2022; pp. 1–6. [Google Scholar]
  34. Zaharia, M.; Chen, A.; Davidson, A.; Ghodsi, A.; Hong, S.A.; Konwinski, A.; Murching, S.; Nykodym, T.; Ogilvie, P.; Parkhe, M.; et al. Accelerating the machine learning lifecycle with MLflow. IEEE Data Eng. Bull. 2018, 41, 39–45. [Google Scholar]
  35. Barron-Lugo, J.A.; Gonzalez-Compean, J.; Lopez-Arevalo, I.; Carretero, J.; Martinez-Rodriguez, J.L. Xel: A cloud-agnostic data platform for the design-driven building of high-availability data science services. Future Gener. Comput. Syst. 2023, 145, 87–103. [Google Scholar] [CrossRef]
  36. Morcillo-Jimenez, R.; Gutiérrez-Batista, K.; Gómez-Romero, J. TSxtend: A Tool for Batch Analysis of Temporal Sensor Data. Energies 2023, 16, 1581. [Google Scholar] [CrossRef]
  37. Erden, C. Machine Learning Experiment Management with MLFlow. In Encyclopedia of Data Science and Machine Learning; IGI Global: Hershey, PA, USA, 2023; pp. 1215–1234. [Google Scholar]
  38. Mazumder, M.; Banbury, C.; Yao, X.; Karlaš, B.; Rojas, W.G.; Diamos, S.; Diamos, G.; He, L.; Kiela, D.; Jurado, D.; et al. DataPerf: Benchmarks for Data-Centric AI Development. arXiv 2022, arXiv:2207.10062. [Google Scholar]
  39. Seedat, N.; Crabbé, J.; van der Schaar, M. Data-SUITE: Data-centric identification of in-distribution incongruous examples. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 19467–19496. [Google Scholar]
  40. Jarrahi, M.H.; Memariani, A.; Guha, S. The Principles of Data-Centric AI. Commun. ACM 2023, 66, 84–92. [Google Scholar] [CrossRef]
  41. Huang, Y.; Zhang, H.; Li, Y.; Lau, C.T.; You, Y. Active-Learning-as-a-Service: An Efficient MLOps System for Data-Centric AI. arXiv 2022, arXiv:2207.09109. [Google Scholar]
  42. Barati, R.; Safabakhsh, R.; Rahmati, M. On Continuity of Robust and Accurate Classifiers. arXiv 2023, arXiv:2309.17048. [Google Scholar]
  43. Zeiser, A.; Özcan, B.; van Stein, B.; Bäck, T. Evaluation of deep unsupervised anomaly detection methods with a data-centric approach for on-line inspection. Comput. Ind. 2023, 146, 103852. [Google Scholar] [CrossRef]
  44. Zaidi, F.S.; Dai, H.L.; Imran, M.; Tran, K.P. Analyzing abnormal pattern of hotelling T2 control chart for compositional data using artificial neural networks. Comput. Ind. Eng. 2023, 180, 109254. [Google Scholar] [CrossRef]
  45. Dhar, T.; Dey, N.; Borra, S.; Sherratt, R.S. Challenges of Deep Learning in Medical Image Analysis-Improving Explainability and Trust. IEEE Trans. Technol. Soc. 2023, 4, 68–75. [Google Scholar] [CrossRef]
  46. Abdelaal, M.; Hammacher, C.; Schoening, H. Rein: A comprehensive benchmark framework for data cleaning methods in ML Pipelines. arXiv 2023, arXiv:2302.04702. [Google Scholar]
  47. Fries, J.; Weber, L.; Seelam, N.; Altay, G.; Datta, D.; Garda, S.; Kang, S.; Su, R.; Kusa, W.; Cahyawijaya, S.; et al. Bigbio: A framework for data-centric biomedical natural language processing. Adv. Neural Inf. Process. Syst. 2022, 35, 25792–25806. [Google Scholar]
  48. Wan, Z.; Wang, Z.; Chung, C.; Wang, Z. A Survey of Data Optimization for Problems in Computer Vision Datasets. arXiv 2022, arXiv:2210.11717. [Google Scholar]
  49. Zhou, L.; Rudin, C.; Gombolay, M.; Spohrer, J.; Zhou, M.; Paul, S. From Artificial Intelligence (AI) to Intelligence Augmentation (IA): Design Principles, Potential Risks, and Emerging Issues. AIS Trans. Hum.-Comput. Interact. 2023, 15, 111–135. [Google Scholar] [CrossRef]
  50. Zhang, B.; Zhu, J.; Su, H. Toward the third generation artificial intelligence. Sci. China Inf. Sci. 2023, 66, 1–19. [Google Scholar] [CrossRef]
  51. Chen, Y.; Jin, C.; Li, G.; Li, T.H.; Gao, W. Mitigating Label Noise in GANs via Enhanced Spectral Normalization. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 3924–3934. [Google Scholar] [CrossRef]
  52. Hashmi, A.A.; Agafonov, A.; Zhumabayeva, A.; Yaqub, M.; Takáč, M. In Quest of Ground Truth: Learning Confident Models and Estimating Uncertainty in the Presence of Annotator Noise. arXiv 2023, arXiv:2301.00524. [Google Scholar]
  53. Cordeiro, F.R.; Sachdeva, R.; Belagiannis, V.; Reid, I.; Carneiro, G. Longremix: Robust learning with high confidence samples in a noisy label environment. Pattern Recognit. 2023, 133, 109013. [Google Scholar] [CrossRef]
  54. Zhang, L.; Gao, G.; Zhang, H. Towards Data-Efficient Continuous Learning for Edge Video Analytics via Smart Caching. In Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, Boston, MA, USA, 6–9 November 2022; pp. 1136–1140. [Google Scholar]
  55. Gangadharan, K.; Zhang, Q. Deep Transferable Intelligence for Spatial Variability Characterization and Data-efficient Learning in Biomechanical Measurement. IEEE Trans. Instrum. Meas. 2023, 72, 2509812. [Google Scholar] [CrossRef]
  56. Ge, X.; Fang, C.; Liu, J.; Qing, M.; Li, X.; Zhao, Z. An unsupervised feature selection approach for actionable warning identification. Expert Syst. Appl. 2023, 227, 120152. [Google Scholar] [CrossRef]
  57. McGregor, S.; Hostetler, J. Data-Centric Governance. arXiv 2023, arXiv:2302.07872. [Google Scholar]
  58. Bruendl, S.A.; Fang, H.; Ngo, H.; Boyer, E.W.; Wang, H. A new emulation platform for real-time machine learning in substance use data streams. In Proceedings of the 2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI), Las Vegas, NV, USA, 11–13 August 2020; pp. 325–332. [Google Scholar]
  59. Zhu, H.; Zhou, M.; Liu, G.; Xie, Y.; Liu, S.; Guo, C. NUS: Noisy-Sample-Removed Undersampling Scheme for Imbalanced Classification and Application to Credit Card Fraud Detection. IEEE Trans. Comput. Soc. Syst. 2023, 11, 1793–1804. [Google Scholar] [CrossRef]
  60. Gebru, T.; Morgenstern, J.; Vecchione, B.; Vaughan, J.W.; Wallach, H.; Daumé, H., III; Crawford, K. Datasheets for datasets. Commun. ACM 2021, 64, 86–92. [Google Scholar] [CrossRef]
  61. Picard, A.M.; Hervier, L.; Fel, T.; Vigouroux, D. Influenciæ: A Library for Tracing the Influence Back to the Data-Points. 2023. Available online: https://pasteur.hal.science/IRT_SAINT-EXUPERY/hal-04284178v1 (accessed on 16 May 2024).
  62. Zhang, T.; Wang, D.; Lu, Y. A data-centric strategy to improve performance of automatic pavement defects detection. Autom. Constr. 2024, 160, 105334. [Google Scholar] [CrossRef]
  63. Wasatkar, N.N.; Chavhan, P.G. Case Study Medical Images Analysis and Classification with Data-Centric Approach. In Data-Centric Artificial Intelligence for Multidisciplinary Applications; Chapman and Hall/CRC: Boca Raton, FL, USA, 2024; pp. 79–87. [Google Scholar]
  64. Cao, P.; Li, D.; Ma, K. Image Quality Assessment: Integrating Model-Centric and Data-Centric Approaches. PMLR 2024, 234, 529–541. [Google Scholar]
  65. Zhong, Y.; Wu, L.; Liu, X.; Jiang, J. Exploiting the potential of datasets: A data-centric approach for model robustness. arXiv 2022, arXiv:2203.05323. [Google Scholar]
  66. Sharma, R.; Ahmad, N.; Ali, S.; Bilal, A.; Fatima, S.; Kshetri, N.; Salem, F.A.; Sibal, P. Technomoral Affordances of Artificial Intelligence in Data-Driven Systems. Computer 2022, 55, 76–81. [Google Scholar] [CrossRef]
  67. Fatima, S.; Desouza, K.C.; Dawson, G.S. National strategic artificial intelligence plans: A multi-dimensional analysis. Econ. Anal. Policy 2020, 67, 178–194. [Google Scholar] [CrossRef]
  68. Zhang, J.; Budhdeo, S.; William, W.; Cerrato, P.; Shuaib, H.; Sood, H.; Ashrafian, H.; Halamka, J.; Teo, J.T. Moving towards vertically integrated artificial intelligence development. NPJ Digit. Med. 2022, 5, 1–9. [Google Scholar] [CrossRef] [PubMed]
  69. Adadi, A.; Lahmer, M.; Nasiri, S. Artificial Intelligence and COVID-19: A Systematic umbrella review and roads ahead. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 5898–5920. [Google Scholar] [CrossRef] [PubMed]
  70. van Wynsberghe, A. Sustainable AI: AI for sustainability and the sustainability of AI. AI Ethics 2021, 1, 213–218. [Google Scholar] [CrossRef]
  71. Leal Filho, W.; Wall, T.; Mucova, S.A.R.; Nagy, G.J.; Balogun, A.L.; Luetz, J.M.; Ng, A.W.; Kovaleva, M.; Azam, F.M.S.; Alves, F.; et al. Deploying artificial intelligence for climate change adaptation. Technol. Forecast. Soc. Chang. 2022, 180, 121662. [Google Scholar] [CrossRef]
  72. Patterson, D.; Gonzalez, J.; Hölzle, U.; Le, Q.; Liang, C.; Munguia, L.M.; Rothchild, D.; So, D.R.; Texier, M.; Dean, J. The carbon footprint of machine learning training will plateau, then shrink. Computer 2022, 55, 18–28. [Google Scholar] [CrossRef]
  73. Sodhi, M.S.; Seyedghorban, Z.; Tahernejad, H.; Samson, D. Why emerging supply chain technologies initially disappoint: Blockchain, IoT, and AI. Prod. Oper. Manag. 2022, 31, 2517–2537. [Google Scholar] [CrossRef]
  74. Yampolskiy, R.V. On Controllability of AI. arXiv 2020, arXiv:2008.04071. [Google Scholar]
  75. Barbosa, G.D.J.; Barbosa, S.D.J. You should not control what you do not understand: The risks of controllability in AI. In Human Computer Interaction and Emerging Technologies: Adjunct Proceedings from; Cardiff University Press: London, UK, 2020; pp. 231–236. [Google Scholar]
  76. Abiodun, K.M.; Awotunde, J.B.; Aremu, D.R.; Adeniyi, E.A. Explainable AI for fighting COVID-19 pandemic: Opportunities, challenges, and future prospects. In Computational Intelligence for COVID-19 and Future Pandemics; Springer: Berlin/Heidelberg, Germany, 2022; pp. 315–332. [Google Scholar]
  77. Sovrano, F.; Vitali, F. Explanatory artificial intelligence (YAI): Human-centered explanations of explainable AI and complex data. Data Min. Knowl. Discov. 2022, 1–28. [Google Scholar] [CrossRef]
  78. Baeza-Yates, R. Ethical Challenges in AI. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, Virtual, 21–25 February 2022; pp. 1–2. [Google Scholar]
  79. Polonetsky, J.; Sparapani, T. A review of the privacy-enhancing technologies software market. IEEE Secur. Priv. 2021, 19, 119–122. [Google Scholar] [CrossRef]
  80. Malgieri, G.; Pasquale, F.A. From Transparency to Justification: Toward Ex Ante Accountability for AI. Brooklyn Law School, Legal Studies Paper. 2022. Available online: https://ssrn.com/abstract=4099657 (accessed on 16 April 2024).
  81. Ng, A. MLOps: From Model-Centric to Data-Centric AI. DeepLearning.AI; IEEE Spectr. 2021. [Google Scholar]
  82. Chi, S.; Tian, Y.; Wang, F.; Zhou, T.; Jin, S.; Li, J. A novel lifelong machine learning-based method to eliminate calibration drift in clinical prediction models. Artif. Intell. Med. 2022, 125, 102256. [Google Scholar] [CrossRef] [PubMed]
  83. Li, W.; Yang, X.; Liu, W.; Xia, Y.; Bian, J. DDG-Da: Data distribution generation for predictable concept drift adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 4092–4100. [Google Scholar]
  84. Li, J.; Lim, K.; Yang, H.; Ren, Z.; Raghavan, S.; Chen, P.Y.; Buonassisi, T.; Wang, X. AI applications through the whole life cycle of material discovery. Matter 2020, 3, 393–432. [Google Scholar] [CrossRef]
  85. Durán, J.M.; Jongsma, K.R. Who is afraid of black box algorithms? On the epistemological and ethical basis of trust in medical AI. J. Med. Ethics 2021, 47, 329–335. [Google Scholar] [CrossRef] [PubMed]
  86. Wang, Y. AI vs. NI (Natural Intelligence): How will Brain-Inspired Systems Lead to Autonomous AI and Cognitive Computers? In Proceedings of the 13th International Conference on Brain-Inspired Cognitive Architectures for AI, Guadalajara, Mexico, 22–25 September 2022. [Google Scholar]
  87. Totschnig, W. Fully autonomous AI. Sci. Eng. Ethics 2020, 26, 2473–2485. [Google Scholar] [CrossRef]
  88. Parashar, M.; DeBlanc-Knowles, T.; Gianchandani, E.; Parker, L.E. Strengthening and democratizing artificial intelligence research and development. Computer 2023, 56, 85–90. [Google Scholar] [CrossRef]
  89. Hu, H.; Cui, Y.; Liu, Z.; Lian, S. A Data-Centric AI Paradigm Based on Application-Driven Fine-grained Dataset Design. arXiv 2022, arXiv:2209.09449. [Google Scholar]
  90. Liu, X.; Wang, H.; Zhang, Y.; Wu, F.; Hu, S. Towards efficient data-centric robust machine learning with noise-based augmentation. arXiv 2022, arXiv:2203.03810. [Google Scholar]
  91. Khan, M.; Mehran, M.T.; Haq, Z.U.; Ullah, Z.; Naqvi, S.R.; Ihsan, M.; Abbass, H. Applications of artificial intelligence in COVID-19 pandemic: A comprehensive review. Expert Syst. Appl. 2021, 185, 115695. [Google Scholar] [CrossRef] [PubMed]
  92. van de Poel, I.; de Wildt, T.; Oosterlaken, E.; van den Hoven, M. Ethical and Societal Challenges of the Approaching Technological Storm; Think Tank European Parliamentary: London, UK, 2022. [Google Scholar]
  93. Seedat, N.; Imrie, F.; van der Schaar, M. Navigating Data-Centric Artificial Intelligence with DC-Check: Advances, Challenges, and Opportunities. IEEE Trans. Artif. Intell. 2023, 1–15. [Google Scholar] [CrossRef]
  94. Pan, I.; Mason, L.R.; Matar, O.K. Data-centric Engineering: Integrating simulation, machine learning and statistics. Challenges and opportunities. Chem. Eng. Sci. 2022, 249, 117271. [Google Scholar] [CrossRef]
  95. Liu, X.Y.; Xia, Z.; Yang, H.; Gao, J.; Zha, D.; Zhu, M.; Wang, C.D.; Wang, Z.; Guo, J. Dynamic Datasets and Market Environments for Financial Reinforcement Learning. arXiv 2023, arXiv:2304.13174. [Google Scholar] [CrossRef]
  96. Zahid, A.; Poulsen, J.K.; Sharma, R.; Wingreen, S.C. A systematic review of emerging information technologies for sustainable data-centric health-care. Int. J. Med. Inform. 2021, 149, 104420. [Google Scholar] [CrossRef] [PubMed]
  97. Emmert-Streib, F.; Dehmer, M. Taxonomy of machine learning paradigms: A data-centric perspective. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2022, 12, e1470. [Google Scholar] [CrossRef]
  98. Dietterich, T.G. Steps toward robust artificial intelligence. AI Mag. 2017, 38, 3–24. [Google Scholar] [CrossRef]
  99. Scott, J.; Niemetz, A.; Preiner, M.; Nejati, S.; Ganesh, V. Algorithm selection for SMT: MachSMT: Machine Learning Driven Algorithm Selection for SMT Solvers. Int. J. Softw. Tools Technol. Transf. 2023, 25, 219–239. [Google Scholar] [CrossRef]
  100. Liuliakov, A.; Hermes, L.; Hammer, B. AutoML technologies for the identification of sparse classification and outlier detection models. Appl. Soft Comput. 2023, 133, 109942. [Google Scholar] [CrossRef]
  101. Jin, H.; Chollet, F.; Song, Q.; Hu, X. AutoKeras: An AutoML Library for Deep Learning. J. Mach. Learn. Res. 2023, 24, 1–6. [Google Scholar]
  102. Bian, K.; Priyadarshi, R. Machine learning optimization techniques: A Survey, classification, challenges, and Future Research Issues. In Archives of Computational Methods in Engineering; Springer: Berlin/Heidelberg, Germany, 2024; pp. 1–25. [Google Scholar]
  103. Salehi, S.; Schmeink, A. Data-centric green artificial intelligence: A survey. IEEE Trans. Artif. Intell. 2023, 5, 1973–1989. [Google Scholar] [CrossRef]
  104. Barbierato, E.; Gatti, A. Towards Green AI. A methodological survey of the scientific literature. IEEE Access 2024, 12, 23989–24013. [Google Scholar] [CrossRef]
  105. Kumar, A.; Chundi, P. Data Lakes. In Encyclopedia of Data Science and Machine Learning; IGI Global: Hershey, PA, USA, 2023; pp. 410–424. [Google Scholar]
  106. Chen, F.; Yan, Z.; Gu, L. Towards Low-Latency Big Data Infrastructure at Sangfor. In Emerging Information Security and Applications: Third International Conference, EISA 2022, Wuhan, China, 29–30 October 2022; Springer: Berlin/Heidelberg, Germany, 2023; pp. 37–54. [Google Scholar]
  107. Cvetkov-Iliev, A.; Allauzen, A.; Varoquaux, G. Relational data embeddings for feature enrichment with background information. In Machine Learning; Springer: Berlin/Heidelberg, Germany, 2023; pp. 1–34. [Google Scholar]
  108. Lacroix, S.; Ostermeyer, E.; Le Duigou, J.; Bornard, F.; Rival, S.; Mary, M.F.; Eynard, B. Lessons learnt in industrial data platform integration. Procedia Comput. Sci. 2023, 217, 1660–1669. [Google Scholar] [CrossRef]
  109. Taherdoost, H. Machine Learning Algorithms: Features and Applications. In Encyclopedia of Data Science and Machine Learning; IGI Global: Hershey, PA, USA, 2023; pp. 938–960. [Google Scholar]
  110. Kolukuluri, M.; Devi, V.K.; Tejaswini, S.S.; Anusha, K. Business Intelligence Using Data Mining Techniques and Predictive Analytics. J. Pharm. Negat. Results 2023, 13, 6923–6932. [Google Scholar]
  111. Mengi, G.; Singh, S.K.; Kumar, S.; Mahto, D.; Sharma, A. Automated Machine Learning (AutoML): The Future of Computational Intelligence. In Proceedings of the International Conference on Cyber Security, Privacy and Networking (ICSPN 2022), Bangkok, Thailand, 9–11 September 2022; Springer: Berlin/Heidelberg, Germany, 2023; pp. 309–317. [Google Scholar]
  112. Schultes, E. Data Stewardship Plan templates designed to support the FAIR principles. Fair Connect 2023, 1, 1–3. [Google Scholar] [CrossRef]
  113. Fawzy, D.; Moussa, S.M.; Badr, N.L. An IoT-based resource utilization framework using data fusion for smart environments. Internet Things 2023, 21, 100645. [Google Scholar] [CrossRef]
  114. Quindroit, P.; Fruchart, M.; Degoul, S.; Perichon, R.; Martignène, N.; Soula, J.; Marcilly, R.; Lamer, A. Definition of a Practical Taxonomy for Referencing Data Quality Problems in Health Care Databases. Methods Inf. Med. 2023, 62, 19–30. [Google Scholar] [CrossRef] [PubMed]
  115. Gounaris, A.; Michailidou, A.V.; Dustdar, S. Toward building edge learning pipelines. IEEE Internet Comput. 2023, 27, 61–69. [Google Scholar] [CrossRef]
  116. Hechler, E.; Weihrauch, M.; Wu, Y. Terminology: Data Fabric and Data Mesh. In Data Fabric and Data Mesh Approaches with AI: A Guide to AI-based Data Cataloging, Governance, Integration, Orchestration, and Consumption; Springer: Berlin/Heidelberg, Germany, 2023; pp. 17–42. [Google Scholar]
  117. Dablain, D.; Krawczyk, B.; Chawla, N.V. DeepSMOTE: Fusing deep learning and SMOTE for imbalanced data. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 6390–6404. [Google Scholar] [CrossRef]
  118. Singh, P.K.; Verma, R.K.; Krishna Prasad, P. IoT-based smartbots for smart city using MCC and big data. In Smart Intelligent Computing and Applications: Proceedings of the Second International Conference on SCI 2018; Springer: Berlin/Heidelberg, Germany, 2019; Volume 1, pp. 525–534. [Google Scholar]
  119. Arora, M.; Sharma, R.L. Artificial intelligence and big data: Ontological and communicative perspectives in multi-sectoral scenarios of modern businesses. Foresight 2023, 25, 126–143. [Google Scholar] [CrossRef]
  120. Kiran, A.; Kumar, S.S. Synthetic Data and Its Evaluation Metrics for Machine Learning. In Information Systems for Intelligent Systems: Proceedings of ISBM 2022; Springer: Berlin/Heidelberg, Germany, 2023; pp. 485–494. [Google Scholar]
  121. Ruddle, R.A.; Cheshire, J.; Fernstad, S.J. Tasks and Visualizations Used for Data Profiling: A Survey and Interview Study. IEEE Trans. Vis. Comput. Graph. 2023, 1–12. [Google Scholar] [CrossRef] [PubMed]
  122. Wéber, A.; Mery, L.; Nagy, P.; Polgár, C.; Bray, F.; Kenessey, I. Evaluation of data quality at the Hungarian National Cancer Registry, 2000–2019. Cancer Epidemiol. 2023, 82, 102306. [Google Scholar] [CrossRef] [PubMed]
  123. García-Peñalvo, F.; Vázquez-Ingelmo, A.; García-Holgado, A.; Sampedro-Gómez, J.; Sánchez-Puente, A.; Vicente-Palacios, V.; Dorado-Díaz, P.I.; Sánchez, P.L. KoopaML: A graphical platform for building machine learning pipelines adapted to health professionals. Int. J. Interact. Multimed. Artif. Intell. 2023, in press. [Google Scholar] [CrossRef]
  124. Diamantopoulos, A.; Schlegelmilch, B.B.; Halkias, G. Taking the Fear out of Data Analysis; Edward Elgar Publishing: Cheltenham, UK, 2023. [Google Scholar]
  125. Berenji, A.; Nowaczyk, S.; Taghiyarrenani, Z. Data-Centric Perspective on Explainability Versus Performance Trade-Off. In Advances in Intelligent Data Analysis XXI: 21st International Symposium on Intelligent Data Analysis, IDA 2023, Louvain-la-Neuve, Belgium, 12–14 April 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 42–54. [Google Scholar]
  126. Lau, S.L. Towards a sustainable future through data-centric solutions: Potentials and challenges. Procedia Comput. Sci. 2023, 216, 2. [Google Scholar] [CrossRef]
  127. Oala, L.; Aversa, M.; Nobis, G.; Willis, K.; Neuenschwander, Y.; Buck, M.; Matek, C.; Extermann, J.; Pomarico, E.; Samek, W.; et al. Data models for dataset drift controls in machine learning with optical images. arXiv 2022, arXiv:2211.02578. [Google Scholar]
  128. Cui, J.; Wang, R.; Si, S.; Hsieh, C.J. DC-BENCH: Dataset Condensation Benchmark. arXiv 2022, arXiv:2207.09639. [Google Scholar]
  129. Seedat, N.; Imrie, F.; van der Schaar, M. Dc-check: A data-centric ai checklist to guide the development of reliable machine learning systems. arXiv 2022, arXiv:2211.05764. [Google Scholar]
  130. Abedjan, Z. Enabling data-centric AI through data quality management and data literacy. IT-Inf. Technol. 2022, 64, 67–70. [Google Scholar] [CrossRef]
  131. Rajotte, J.F.; Bergen, R.; Buckeridge, D.L.; El Emam, K.; Ng, R.; Strome, E. Synthetic data as an enabler for machine learning applications in medicine. Iscience 2022, 25, 105331. [Google Scholar] [CrossRef]
  132. Ferreira, F.; Lourenço, N.; Cabral, B.; Fernandes, J.P. When Two are Better Than One: Synthesizing Heavily Unbalanced Data. IEEE Access 2021, 9, 150459–150469. [Google Scholar] [CrossRef]
  133. Hu, L.; Li, J.; Lin, G.; Peng, S.; Zhang, Z.; Zhang, Y.; Dong, C. Defending against Membership Inference Attacks with High Utility by GAN. IEEE Trans. Dependable Secur. Comput. 2022, 20, 2144–2157. [Google Scholar] [CrossRef]
  134. Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A comprehensive survey on transfer learning. Proc. IEEE 2020, 109, 43–76. [Google Scholar] [CrossRef]
  135. Bashath, S.; Perera, N.; Tripathi, S.; Manjang, K.; Dehmer, M.; Streib, F.E. A data-centric review of deep transfer learning with applications to text data. Inf. Sci. 2022, 585, 498–528. [Google Scholar] [CrossRef]
  136. Lee, Y.; Kwon, O.J.; Lee, H.; Kim, J.; Lee, K.; Kim, K.E. Augment & Valuate: A Data Enhancement Pipeline for Data-Centric AI. arXiv 2021, arXiv:2112.03837. [Google Scholar]
  137. Huang, P.X.; Hu, W.; Brendel, W.; Chandraker, M.; Li, L.J.; Wang, X. YMIR: A Rapid Data-centric Development Platform for Vision Applications. arXiv 2021, arXiv:2111.10046. [Google Scholar]
  138. Eyuboglu, S.; Karlaš, B.; Ré, C.; Zhang, C.; Zou, J. dcbench: A benchmark for data-centric AI systems. In Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning, Philadelphia, PA, USA, 12 June 2022; pp. 1–4. [Google Scholar]
  139. Patel, H.; Guttula, S.; Mittal, R.S.; Manwani, N.; Berti-Equille, L.; Manatkar, A. Advances in exploratory data analysis, visualisation and quality for data centric AI systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 4814–4815. [Google Scholar]
  140. Sharma, P.; Kurban, H.; Dalkilic, M. DCEM: An R package for clustering big data via data-centric modification of Expectation Maximization. SoftwareX 2022, 17, 100944. [Google Scholar] [CrossRef]
  141. Strickland, E. Andrew Ng, AI Minimalist: The Machine-Learning Pioneer Says Small is the New Big. IEEE Spectr. 2022, 59, 22–50. [Google Scholar] [CrossRef]
  142. Paiva, P.Y.A.; Smith-Miles, K.; Valeriano, M.G.; Lorena, A.C. PyHard: A novel tool for generating hardness embeddings to support data-centric analysis. arXiv 2021, arXiv:2109.14430. [Google Scholar]
  143. Kim, S.; Cho, S.; Cho, K.; Seo, J.; Nam, Y.; Park, J.; Kim, K.; Kim, D.; Hwang, J.; Yun, J.; et al. An Open Medical Platform to Share Source Code and Various Pre-Trained Weights for Models to Use in Deep Learning Research. Korean J. Radiol. 2021, 22, 2073. [Google Scholar] [CrossRef]
  144. Agarwal, O.; Nenkova, A. Temporal effects on pre-trained models for language processing tasks. Trans. Assoc. Comput. Linguist. 2022, 10, 904–921. [Google Scholar] [CrossRef]
  145. Salza, P.; Schwizer, C.; Gu, J.; Gall, H.C. On the effectiveness of transfer learning for code search. IEEE Trans. Softw. Eng. 2022, 49, 1804–1822. [Google Scholar] [CrossRef]
  146. Profentzas, C.; Almgren, M.; Landsiedel, O. MicroTL: Transfer Learning on Low-Power IoT Devices. In Proceedings of the 2022 IEEE 47th Conference on Local Computer Networks (LCN), Edmonton, AB, Canada, 26–29 September 2022; pp. 1–8. [Google Scholar]
  147. Ziogas, A.N.; Schneider, T.; Ben-Nun, T.; Calotoiu, A.; De Matteis, T.; de Fine Licht, J.; Lavarini, L.; Hoefler, T. Productivity, portability, performance: Data-centric Python. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, St. Louis, MO, USA, 14–19 November 2021; pp. 1–13. [Google Scholar]
  148. Karlaš, B.; Dao, D.; Interlandi, M.; Li, B.; Schelter, S.; Wu, W.; Zhang, C. Data Debugging with Shapley Importance over End-to-End Machine Learning Pipelines. arXiv 2022, arXiv:2204.11131. [Google Scholar]
  149. Jain, S.; Salman, H.; Khaddaj, A.; Wong, E.; Park, S.M.; Madry, A. A Data-Based Perspective on Transfer Learning. arXiv 2022, arXiv:2207.05739. [Google Scholar]
  150. Grafberger, S.; Groth, P.; Stoyanovich, J.; Schelter, S. Data distribution debugging in machine learning pipelines. VLDB J. 2022, 31, 1103–1126. [Google Scholar] [CrossRef]
  151. Sorscher, B.; Geirhos, R.; Shekhar, S.; Ganguli, S.; Morcos, A.S. Beyond neural scaling laws: Beating power law scaling via data pruning. arXiv 2022, arXiv:2206.14486. [Google Scholar]
  152. Bello, M.; Nápoles, G.; Sánchez, R.; Bello, R.; Vanhoof, K. Deep neural network to extract high-level features and labels in multi-label classification problems. Neurocomputing 2020, 413, 259–270. [Google Scholar] [CrossRef]
  153. Roman-Rangel, E.; Marchand-Maillet, S. Inductive t-SNE via deep learning to visualize multi-label images. Eng. Appl. Artif. Intell. 2019, 81, 336–345. [Google Scholar] [CrossRef]
  154. Tarawneh, A.S.; Hassanat, A.B.; Altarawneh, G.A.; Almuhaimeed, A. Stop oversampling for class imbalance learning: A review. IEEE Access 2022, 10, 47643–47660. [Google Scholar] [CrossRef]
  155. Patel, H.; Guttula, S.; Gupta, N.; Hans, S.; Mittal, R.S. A Data-centric AI Framework for Automating Exploratory Data Analysis and Data Quality Tasks. ACM J. Data Inf. Qual. 2023, 15, 1–26. [Google Scholar] [CrossRef]
Figure 1. Overview of AI technology development following the model-centric approach. This schematic demonstrates the various aspects of AI technology development (research trajectories) by following the MC-AI approach, which is widely used to solve real-world problems around the globe.
Figure 2. Overview of the conventional model-centric AI. This schematic shows the workflow of the conventional MC-AI approach widely used to solve real-world problems (Adapted from [32]).
Figure 3. Overview of data-centric AI. Steps 3 and 8 highlight the essence of DC-AI and how it differs from the conventional MC-AI (adapted from [32]).
Figure 4. Implementation of a DC-AI-based system for a specific problem (e.g., stroke prediction from demographic features). This schematic depicts the implementation of a DC-AI-based system in the medical domain, along with the supportive strategies.
Figure 5. Six main principles of DC-AI. This schematic lists the six key principles that have been suggested to improve data for AI developments (adapted from [40]).
Figure 6. Components of AI quality, and crucial problems of conventional MC-AI. (a) The four components upon which the quality of any AI system depends, and (b) crucial problems of the MC-AI approach (along with their subtypes), meticulously reporting the drawbacks of the MC-AI approach. The plus sign (+) indicates the relationship between the two parts, showing the problems that result from MC-AI-based systems.
Figure 7. Major benefits of DC-AI compared to traditional methods. This schematic shows the benefits of DC-AI in some specific problems compared to traditional methods along with examples.
Figure 8. DC-AI provides solutions to problems in the MC-AI approach. This schematic demonstrates the key problems with MC-AI and the corresponding solutions offered by DC-AI, highlighting the rectifications and modifications to the MC-AI approach that are needed to enhance the quality of AI systems.
Figure 9. Overview of the algorithmic framework of DC-AI. This schematic demonstrates the generic workflow of DC-AI as an algorithmic framework that can be applied to any real-world problem with slight modifications.
Figure 10. DC-AI activities that can be applied to different stages of an AI system. This schematic shows key activities of DC-AI and their mapping to different stages of an entire AI system.
Figure 11. Enabling technologies for the DC-AI approach. This schematic demonstrates the promising technologies that can assist in realizing the DC-AI approach in real-world scenarios.
Table 1. Basic comparison between DC-AI and MC-AI.
Comparison Criteria | MC-AI Approach | DC-AI Approach
Main focus | Code | Data
Researchers' focus | 90% | <10%
Research span | 3 decades | ∼3 years
Data analysis | One time | Continuous (N times)
Accuracy | Low | High
Quality assurance | No | Yes
Practices | Code-first | Data-first
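The "Data analysis" row (one time vs. continuous) is the operational heart of this comparison. The toy sketch below illustrates that data-first loop under stated assumptions: the model is deliberately frozen, and accuracy improves only by repeatedly repairing labels. The dataset, the `fixed_model` threshold classifier, and the `clean_step` heuristic are all hypothetical; in a real DC-AI pipeline the relabeling decision would come from human review or confident-learning tools, not from the frozen model's own predictions.

```python
def fixed_model(x, threshold=0.5):
    """A deliberately frozen 'model': predict class 1 if x >= threshold."""
    return 1 if x >= threshold else 0

def accuracy(data):
    """Fraction of (feature, label) pairs the frozen model gets right."""
    return sum(1 for x, y in data if fixed_model(x) == y) / len(data)

def clean_step(data):
    """One DC-AI iteration: repair the single most suspicious label.

    A label is 'suspicious' when it disagrees with the frozen model on an
    unambiguous example (feature far from the decision threshold). Here the
    repair simply adopts the model's prediction; in practice it would be a
    human review or a confident-learning check.
    """
    cleaned = list(data)
    worst = max(range(len(cleaned)),
                key=lambda i: abs(cleaned[i][0] - 0.5)
                if fixed_model(cleaned[i][0]) != cleaned[i][1] else -1.0)
    x, y = cleaned[worst]
    if fixed_model(x) != y:          # only relabel genuine disagreements
        cleaned[worst] = (x, fixed_model(x))
    return cleaned

# Toy dataset with two deliberately flipped labels (noisy annotations).
data = [(0.9, 1), (0.8, 0), (0.7, 1), (0.2, 0), (0.1, 1), (0.05, 0)]
for _ in range(3):                   # "Continuous (N times)" data analysis
    data = clean_step(data)
print(accuracy(data))                # → 1.0 once both labels are repaired
```

Running the loop three times repairs both flipped labels, so the advantage of N cleaning passes over a single one shows up directly in the accuracy, with the model code untouched throughout.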
Table 2. Technical comparison between DC-AI and MC-AI.
Comparison Criteria | MC-AI Approach | DC-AI Approach
Drift susceptibility | Both concept and data | None
Data checks | Before training only | Throughout the whole lifecycle
Feedback | Slow and inadequate | Timely
Explainability of results | Complex | Easy
Steps in data preparation | Limited | Comprehensive
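The "Data checks" row above can be sketched with a minimal example: the same validation routine guards records before training and again on every record at serving time, instead of a single pre-training pass. The schema below (field names, value ranges) is an illustrative assumption, not something prescribed by the survey.

```python
# Hypothetical schema for a medical record; the fields and ranges are
# illustrative assumptions only.
SCHEMA = {
    "age":     {"type": (int, float), "min": 0,  "max": 120},
    "glucose": {"type": (int, float), "min": 40, "max": 400},
}

def validate(record, schema=SCHEMA):
    """Return a list of violations for one record (empty list = clean)."""
    issues = []
    for field, rule in schema.items():
        if field not in record:
            issues.append(f"missing field: {field}")
            continue
        value = record[field]
        if not isinstance(value, rule["type"]):
            issues.append(f"{field}: wrong type {type(value).__name__}")
        elif not (rule["min"] <= value <= rule["max"]):
            issues.append(f"{field}: {value} outside [{rule['min']}, {rule['max']}]")
    return issues

# The same check runs at two lifecycle stages:
train_record = {"age": 54, "glucose": 110}   # passes the pre-training check
serve_record = {"age": -3, "glucose": 950}   # caught again at inference time
print(validate(train_record))                # → []
print(validate(serve_record))                # → two range violations
```

Because the check is a plain function rather than a one-off preprocessing script, it can be reused at ingestion, training, and serving, which is the lifecycle-wide discipline the DC-AI column describes.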
Table 3. Summary of the important techniques that are vital in DC-AI-based developments.
Technique(s) | Significance/Utility of the Technique in the Context of DC-AI
Data discovery | Searching and indexing datasets that are scattered across many sources or platforms (e.g., the Goods system)
Outlier detection | Identifying abnormal samples (or examples) in the data (e.g., min–max analysis)
Outlier elimination | Removing abnormal samples (or examples) from the data
Data augmentation | Curating more data to compensate for data deficiency (e.g., augmented data = real data + new data)
Confident learning | A technical solution for label errors, characterizing label noise, learning with noisy labels, etc.
Domain randomization | A simulator for generating data that are very close to realistic data (e.g., Amazon Mechanical Turk)
Data labeling | Adding labels to the data using existing information from other available labels (e.g., Google Cloud Labeling)
Error detection | Addressing issues related to wrong labels or incorrect feature values in the dataset
Data re-labeling | Improving the quality of the labels via re-labeling (e.g., taking a majority vote for each sample)
Python labeling functions | Accelerating the process of data labeling by validating the majority-voting or generative-model process
Feature engineering | Enhancing data quality using cleaning, pre-processing, and wrangling techniques
Feature selection | Determining the set of candidate features that improve the accuracy of AI models
Curriculum learning | Ordering and arranging the examples in the dataset by complexity (e.g., easiest to hardest)
Data validation | Detecting and fixing errors in a dataset using platforms like TensorFlow Extended (TFX)
Noise removal | Removing redundant samples/examples from the dataset (e.g., coin-throwing algorithm [59])
Data cleaning | Checking the relevance of data to the underlying problem and removing bias from them (e.g., ActiveClean)
Data sanitization | Addressing the problem of data poisoning, especially when data are crawled from the web
Active learning | Selecting highly informative labels from data in order to decide which examples/samples to label next
Data integration | Enhancing data quality when data emerge from multiple sources (e.g., a sensor network)
Consensus labels | Finding true labels from crowd-sourced annotations to improve the learning ability of models
Feature stores | Creating unified data/features that are generally usable across many AI/ML models
Data pipelines | Determining the correct order of DC-AI technique application in diverse domains
Data influence modeling | Analyzing the impact of individual data points on model performance to prune less informative points
Core-set selection | Reducing the data to a representative subset, mainly in modern ML systems
Encoding human priors | Including human knowledge while improving data to prevent imbalanced/wrong learning
Data sampling | Identifying and removing the class-imbalance problem by curating more data using GANs or existing samples
Data distribution shift | Analyzing the statistical properties of the data over time and re-training models
Uncertainty estimation | Identifying ontological and label issues to clean both training and test sets (e.g., pervasive label errors)
Data imputation | Enhancing data quality by imputing missing values or outliers with mean values or other representative values
Model-aware cleaningImproving quality of data by considering the characteristics of the underlying model to be used subsequently
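Several of the techniques listed in Table 3 (min–max outlier analysis, mean-value imputation, and majority-vote consensus labels) can be combined in a few lines. The sketch below is a simplified illustration under our own assumptions (function names, thresholds, and the toy sensor data are invented for exposition):

```python
from collections import Counter
from statistics import fmean

def clean_feature(values, lo, hi):
    """Min-max outlier analysis combined with mean imputation:
    out-of-range readings are treated like missing (None) values
    and replaced by the mean of the plausible readings."""
    observed = [v for v in values if v is not None and lo <= v <= hi]
    mean = fmean(observed)
    return [v if (v is not None and lo <= v <= hi) else mean for v in values]

def consensus_label(annotations):
    """Consensus labels: majority vote over crowd-sourced annotations
    for a single sample."""
    return Counter(annotations).most_common(1)[0][0]

# One missing reading and one physically implausible spike (400.0).
readings = [21.0, None, 22.5, 400.0, 20.5]
cleaned = clean_feature(readings, lo=0.0, hi=100.0)

# Three annotators disagree; the majority label wins.
assert consensus_label(["cat", "dog", "cat"]) == "cat"
```

Note the ordering: the outlier is excluded before the imputation mean is computed, so a single corrupted reading cannot distort the imputed values. Deciding such orderings is exactly the "data pipelines" concern from the table.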
Table 4. Marginal/zero enhancement in accuracy while fiddling with complex AI code in the MC-AI approach (adapted from [81]).

Approach | Defect Detection in Steel | Solar Panel | Inspection of Surface
Baseline | 76.20% | 75.68% | 85.05%
MC-AI | +0.00% | +0.04% | +0.00%
New values | 76.20% | 75.72% | 85.05%
Table 5. Accuracy enhancements from using DC-AI (adapted from [81]).

Approach | Defect Detection in Steel | Solar Panel | Inspection of Surface
Baseline | 76.20% | 75.68% | 85.05%
DC-AI | +16.9% | +3.06% | +0.40%
New values | 93.1% | 78.74% | 85.45%
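The gains in Table 5 came from systematically improving the training data rather than the model code. The exact interventions used in [81] are not reproduced here; as a hedged illustration of the simplest data-side step for an image task such as steel-defect detection, the augmentation idea from Table 3 (augmented data = real data + new data) can look like this, with images simplified to 2D lists of pixel values:

```python
def augment_flip(images):
    """Data augmentation by horizontal flipping: each image (a 2D list
    of pixel values) yields one extra training example, so the
    augmented set = real data + new data."""
    flipped = [[row[::-1] for row in img] for img in images]
    return images + flipped

real = [[[1, 2],
         [3, 4]]]                      # one 2x2 "image"
augmented = augment_flip(real)

assert len(augmented) == 2 * len(real)
assert augmented[1] == [[2, 1], [4, 3]]
```

Flipping doubles the dataset without collecting new samples; real deployments would combine it with label-consistency checks and the other Table 3 techniques.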
Table 6. Analysis of the main problems in MC-AI, their root causes, and the relevant DC-AI components to solve them.

Crucial Problems | Main Cause(s) of the Problem | Recommended DC-AI Component
Societal | Black-box nature of AI models; poor data quality | DFS & IDA
Performance | Incomplete, imbalanced, and incorrectly labeled data | DFS & IDA & DC
Drift | Changes in data over time; poor model design | IDA & DC
Sustainability | Fine-tuning code; extensive model (re)building | DFS & IDA
Affordances | Curiosity about how AI works; lack of domain knowledge | IDA & DC
Controllability | Inadequate knowledge about how AI gives results; skill gap | DFS & IDA & DC

Abbreviations: DFS: data-first strategy; DC: data compliance; IDA: intelligent data architecture.
Table 7. Analysis of DC-AI solutions in terms of feasibility and affordability to solve MC-AI problems.

MC-AI Problems | MC-AI Sub-Problems | Feasibility of DC-AI Solution | Affordability of DC-AI Solution
Social problems | Fairness | Yes | Yes
 | Trustworthiness | Yes | Yes
 | Transparency | High complexity | Yes
 | Explainability | Medium complexity | Expensive
 | Unethical AI | Medium complexity | Yes
 | Privacy and security | Yes | Yes
 | Accountability | Yes | Yes
Performance problems | Accuracy | Yes | Yes
 | Stability | Yes | Yes
 | Conceptual soundness | Yes | Yes
 | Robustness | Yes | Yes
 | Recoverability | Yes | Yes
Drift problems | Concept drift | Yes | Yes
 | Data drift | Yes | Yes
 | Data scarcity | Medium complexity | Least expensive
Sustainability problems | CO2 emissions | Yes | Yes
 | Man-made disasters | High complexity | Expensive
 | Climate change | High complexity | Expensive
 | Material flows | High complexity | Expensive
Affordance problems | Fewer domain experts | Yes | Expensive
 | Fragmented tools | Medium complexity | Least expensive
 | Limited knowledge about AI | High complexity | Expensive
 | Autonomous AI | High complexity | Very expensive
Controllability problems | Explicit control | Medium complexity | Least expensive
 | Implicit control | High complexity | Least expensive
 | Aligned control | Medium complexity | Least expensive
 | Delegated control | Medium complexity | Expensive
 | Hybrid control | Medium complexity | Yes
Majeed, A.; Hwang, S.O. A Data-Centric AI Paradigm for Socio-Industrial and Global Challenges. Electronics 2024, 13, 2156. https://doi.org/10.3390/electronics13112156