Re-Thinking Data Strategy and Integration for Artificial Intelligence: Concepts, Opportunities, and Challenges

Aldoseri, Abdulaziz; Al-Khalifa, Khalifa N.; Hamouda, Abdel Magid

doi:10.3390/app13127082

Open AccessReview

Re-Thinking Data Strategy and Integration for Artificial Intelligence: Concepts, Opportunities, and Challenges

by

Abdulaziz Aldoseri

,

Khalifa N. Al-Khalifa

and

Abdel Magid Hamouda

^*

Engineering Management Program, College of Engineering, Qatar University, Doha P.O. Box 2713, Qatar

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(12), 7082; https://doi.org/10.3390/app13127082

Submission received: 3 May 2023 / Revised: 30 May 2023 / Accepted: 7 June 2023 / Published: 13 June 2023

(This article belongs to the Special Issue AI Applications in the Industrial Technologies)

Download

Browse Figures

Versions Notes

Abstract

:

The use of artificial intelligence (AI) is becoming more prevalent across industries such as healthcare, finance, and transportation. Artificial intelligence is based on the analysis of large datasets and requires a continuous supply of high-quality data. However, using data for AI is not without challenges. This paper comprehensively reviews and critically examines the challenges of using data for AI, including data quality, data volume, privacy and security, bias and fairness, interpretability and explainability, ethical concerns, and technical expertise and skills. This paper examines these challenges in detail and offers recommendations on how companies and organizations can address them. By understanding and addressing these challenges, organizations can harness the power of AI to make smarter decisions and gain competitive advantage in the digital age. It is expected, since this review article provides and discusses various strategies for data challenges for AI over the last decade, that it will be very helpful to the scientific research community to create new and novel ideas to rethink our approaches to data strategies for AI.

Keywords:

Artificial Intelligence (AI); data strategies and learning approaches; challenges and opportunities

1. Introduction

Artificial Intelligence (AI) refers to the ability of machines to mimic human intelligence and perform tasks that typically require human intelligence, such as learning, problem-solving, decision-making, and natural language understanding [1]. Figure 1 depicts AI technologies including machine learning, natural language processing, robotics, and computer vision. Machine learning is a subset of AI that involves training computer algorithms to learn patterns in data and make predictions or decisions based on the data [2]. Deep learning is a type of machine learning that uses neural networks with multiple layers to process complex data such as images or speech [3]. Natural language processing is the ability of computers to understand, interpret, and generate human language, including speech and text [4]. Computer vision is the ability of computers to analyze and interpret visual information such as images and videos [5].

AI is a rapidly expanding field with the potential to revolutionize the way we live and work. From healthcare to finance and transportation, AI has the potential to transform a wide range of industries, creating new opportunities for businesses and organizations. AI has been transforming various sectors, including healthcare, finance, and transportation, with significant advancements in machine learning and deep learning techniques [6,7]. The heart of this transformation is data, which are essential for training and testing the AI models. AI models rely on large datasets to identify patterns and trends that are difficult to detect using traditional data-analysis methods. This allows them to learn and make predictions based on the data on which they have been trained.

However, using AI data is challenging. Data quality, quantity, diversity, and privacy are critical components of data-driven AI applications, and each presents its own set of challenges. Poor data quality can lead to inaccurate or biased AI models, which can have serious consequences in areas such as healthcare and finance. Insufficient data can lead to models that are too simplistic and incapable of accurately predicting real-world outcomes. A lack of data diversity can also lead to biased models that do not accurately represent the population they are designed to serve. Lastly, data privacy is a major concern, as AI models may require access to sensitive data, which raises concerns about data privacy and security.

In this article, we address the challenges of using data for AI and offer recommendations for companies seeking to address them. To address these challenges, businesses and organizations need to develop strategies and frameworks that promote data quality, quantity, diversity, and privacy. This may involve implementing data cleaning and validation processes to ensure data quality, collecting and managing large quantities of diverse data, and implementing data privacy policies and procedures to protect the sensitive data. By focusing on these challenges, businesses and organizations can leverage the power of data to create accurate, effective, and fair AI applications that benefit society.

2. Materials and Methods

2.1. Data for AI

Data are critical for AI because they are the foundation upon which machine learning algorithms learn, make predictions, and improve their performance over time. To train an AI model, large amounts of data are required to enable the model to recognize patterns, make predictions, and improve its performance over time.

2.1.1. Data Learning Approaches

AI algorithms require data to learn patterns and make predictions or decisions based on the data. AI machine learning techniques are algorithms that allow machines to learn patterns and make predictions from data without explicit programming [8]. These techniques are widely used in a variety of applications, such as natural language processing, image and speech recognition, and recommendation systems. In general, the more data available for an AI algorithm to learn, the more accurate its predictions or decisions will be. There are several data-learning approaches to building AI systems [8,9]; for the comprehensiveness of the article, we include the following, as shown in Figure 2.

Supervised Learning: In supervised learning, an AI system is trained on a labeled dataset, where each data point is associated with a label or a target variable. The goal is to develop a model that can accurately predict the label or target variable for new data points. This approach is commonly used for tasks such as image classification, speech recognition, and natural language processing [10].

Unsupervised Learning: In unsupervised learning, an AI system is trained on an unlabeled dataset where there is no target variable to predict. The goal is to identify the patterns, relationships, and structures in the data. This approach is commonly used for tasks such as clustering, anomaly detection, and dimensionality reduction [11].

Reinforcement Learning: In reinforcement learning, an AI system learns to make decisions based on feedback from the environment. The system receives rewards or penalties based on its actions and adjusts its behavior accordingly. This approach is commonly used for tasks such as gaming, robotics, and autonomous driving [12].

Transfer Learning: In transfer learning, an AI system leverages the knowledge gained from one task to improve the performance in another related task. The system is pre-trained on a large dataset and then fine-tuned on a smaller dataset for a specific task at hand. This approach can help to reduce the amount of data required to train an AI model and improve its accuracy and performance [13].

Deep Learning: Deep learning is a type of neural-network-based machine learning that is particularly effective for tasks involving large amounts of data and complex relationships. Deep learning models are composed of multiple layers of interconnected nodes that can learn increasingly complex representations of data. This approach is commonly used for tasks such as image and speech recognition, natural language processing, and computer vision [14].

Ensemble Learning: Ensemble learning is a technique in which multiple models are trained and combined to make predictions or decisions. Combining the predictions of multiple models can improve the accuracy and reliability of the final output [15].

Overall, the choice of the data learning approach depends on the specific task, data, and resources available. It is important to carefully evaluate the benefits and limitations of each approach and select the one that best fits the requirements of the AI application being developed.

2.1.2. Data-Centric and Data-Driven AI

Data-centric and data-driven are two related but distinct concepts in the world of data analysis and decision making. By leveraging data, organizations can gain a deeper understanding of their operations, customers, and markets and make more informed decisions based on data-driven insights. Data-centric approaches are commonly used in industries such as finance, healthcare, and retail, where accurate and timely data are critical for decision making. For example, in the healthcare industry, data-centric approaches are used to analyze patient data to improve outcomes, identify disease patterns, and optimize treatment plans. Data-centric and data-driven approaches are two approaches for building AI systems that rely on data [16,17].

Data-Centric Approach: This refers to an approach in which data are the central focus of a system or process [16,18]. A data-centric approach involves a relatively fixed model that prioritizes the collection, storage, and analysis of high-quality data to train AI algorithms, improve their performance, and leverage data to inform decision-making and problem-solving processes [16,18,19]. This approach often involves using advanced analytics such as machine learning or artificial intelligence to uncover patterns, trends, or insights that may not be immediately apparent from the data [19]. The data-driven approach focuses on building robust and reliable data infrastructure that can support a wide range of AI applications. The goal is to create a centralized data repository that can serve as a single source of truth for all AI applications within an organization [20]. This approach is particularly useful when there is a large volume of data from different sources, or when the data are complex and difficult to work with.

In recent years, the rise of big data and advanced analytics has led to a growing emphasis on data-centric approaches across various industries, from healthcare to finance and to retail [21]. By adopting a data-centric approach, organizations can gain a competitive advantage by improving decision making, increasing efficiency, and reducing costs [22]. A data-centric approach is particularly important in the context of big data, where the 7 Vs of big data (velocity, volume, value, variety, veracity, volatility, and validity) can make it challenging to extract meaningful insights [23]. AI algorithms must be designed to handle large volumes of data, which must be carefully curated to ensure accuracy and relevance [24]. A data-centric approach can lead to improved decision-making, increased efficiency, reduced costs, improved customer experience, competitive advantage, and risk mitigation [25,26]. It requires a strong data management infrastructure, a skilled workforce, and advanced analytics and AI techniques to extract valuable insights from the data.

Overall, a data-centric approach is essential for effective AI decision-making and problem solving. By placing data at the center of the AI system and following best practices for data quality, processing, governance, and integration, organizations can unlock the full potential of AI and drive better outcomes.

Data-Driven Approach: This focuses on building AI models that are specifically designed to make predictions or decisions based on data. This approach emphasizes the selection, processing, and analysis of data to identify patterns, relationships, and insights that can be used to improve the accuracy and performance of an AI model [27]. The goal was to develop an AI model that can learn and adapt to new data without being constrained by a predefined set of rules or assumptions. This approach is particularly useful when data are relatively homogeneous or when the goal is to automate a specific decision-making process [28]. A data-driven approach to AI involves the use of data as the primary source of information for training and improving AI models. In this approach, the AI system learns directly from the data rather than being programmed by humans [29]. Data-driven AI involves several key steps, as illustrated in Figure 3.

Data Collection: Collecting relevant data from various sources is the first step in a data-driven approach. This may involve capturing data from internal systems, external sources, or user-generated content [30].

Data Preparation: Once the data have been collected, they need to be cleaned, preprocessed, and transformed to make them suitable for analysis. This may involve data cleansing, normalization, and feature engineering [31].

Machine Learning: Machine learning algorithms are applied to preprocessed data to develop predictive models that can be used to make decisions or automate processes [32].

Model Validation: The models developed through machine learning are validated using various techniques to ensure accuracy and reliability [33].

Model Deployment: Once models have been validated, they are deployed in production environments to automate the decision-making processes or provide insights [34].

Continuous Improvement: A data-driven approach involves continuous improvement, with feedback from the models used to refine the data collection, analysis, and decision-making processes [35].

A key advantage of a data-driven approach to AI is that it allows the AI model to adapt and improve over time as new data become available. This means that the model can continue to learn and refine its predictions and performance, leading to better outcomes over time. Data-driven AI has several advantages over other approaches, including the ability to learn from large amounts of data, detect complex patterns and relationships, and adapt to changing conditions [36]. However, it also requires careful attention to data quality, privacy, and ethical considerations.

Overall, a data-driven AI approach emphasizes the importance of data at every stage of the AI development and decision-making process, from data collection to model deployment. This approach can help organizations make informed decisions and improve the accuracy and effectiveness of their AI models.

Generally, both data-centric and data-driven approaches are important for building effective AI systems. A strong data-centric approach can help ensure that the data used to train AI models are of high quality, whereas a data-driven approach can help identify patterns and insights that can improve the accuracy and performance of the AI model. The choice between these two approaches depends on the specific needs and requirements of the organization and AI application being developed.

2.2. Dimensions of Data Challenges for AI

Several data challenges are associated with AI. Figure 4 depicts the dimensions of the data challenges for AI. This section covers the most important and essential and major ones, as illustrated in Figure 4.

2.2.1. Dimension I: Data Quality

Data quality is a critical aspect of AI. The accuracy, completeness, and consistency of the data used for training and testing AI models directly affects the performance and effectiveness of the AI system. Low-quality data can lead to biased, inaccurate, or irrelevant results, negatively affecting decision-making processes based on AI outputs. Therefore, ensuring the high quality of data is crucial for AI systems to produce reliable and valuable results. This may include data cleansing, validation, enrichment, and management. AI applications require high-quality relevant, representative, and reliable data to produce optimal outcomes. AI systems also require ongoing monitoring and maintenance to ensure that data quality is consistent over time. The performance of AI systems is heavily reliant on the quality of the data used for training and validation [37]. Data quality is a multidimensional concept that encompasses factors such as accuracy, completeness, consistency, and timeliness [38]. Ensuring data quality is a challenging task given the vast amount of data generated daily and the inherent complexity of data structures [39]. Figure 5 shows the challenging elements of data quality.

Challenging Measures of Data Quality and Implications on AI Systems

This section presents different measures to ensure the data quality for AI applications. Figure 6 represents the challenging measures of Data Quality.

Accuracy: Data accuracy is critical for AI to function effectively. Accuracy refers to the degree to which data are correct and error-free. In other words, it aims to describe the degree to which data correctly represent real-world phenomena [40]. AI systems require accurate data for training and validation to ensure accurate predictions and decisions [41]. Inaccurate data can lead to biased or erroneous outcomes, undermining the reliability and usefulness of AI systems [42].

Completeness: Completeness refers to the extent to which all relevant and sufficient data coverage is present in the dataset to provide insight and meaningful results for AI [37]. Incomplete data can lead to biased or unrepresentative AI models because the algorithms may not have sufficient information to learn the underlying patterns and relationships [43]. Missing data can be attributed to various factors such as data collection or data entry errors [44].

Consistency: Consistency refers to the uniformity of data representations and formats across a dataset [38]. In other words, this is the extent to which the data are free from conflicts, inaccuracies, or discrepancies when compared to other sources or systems. Inconsistent data can lead to confusion and misinterpretation by AI algorithms, resulting in suboptimal performance [45]. Ensuring consistency requires the standardization and harmonization of data formats, units, and terminologies [46].

Timeliness: Timeliness refers to the degree to which data are updated and relevant to the current context [37]. AI systems require timely data to adapt to dynamic environments and provide accurate predictions [47]. Outdated data may lead to poor performance and even harmful consequences because AI systems may not account for recent changes in the underlying phenomena [48]. Timeliness is particularly important in domains such as finance, healthcare, and transportation, where real-time insights offer significant advantages. For instance, if the data used to build a weather forecasting model are outdated, the model might not be able to make accurate predictions. Similarly, if the data used for training a stock market predictor are not timely, the model can make decisions based on outdated information that may not be relevant to the current state of the market. Therefore, data timeliness is an important dimension of AI data quality.

Integrity: Data integrity refers to the maintenance of data accuracy and consistency throughout its lifecycle, including during storage, retrieval, and processing [46]. In other words, it is the degree to which data are reliable and trusted to be correct. Compromised data integrity can result in AI systems making decisions based on corrupt or inconsistent data, leading to unreliable or flawed outcomes.

Relevance: Relevant data refer to the degree to which the data used for training and building machine learning models are appropriate and applicable to the task or problem being addressed [37]. This is directly related to a specific problem or task being addressed by an AI system. Irrelevant data can introduce noise or bias into a system, thereby reducing their performance and effectiveness [39].

By considering these dimensions of data quality and their implications for AI systems, organizations can better understand the challenges they face in maintaining high-quality data for AI applications. This understanding can inform the development of strategies and best practices to address data quality issues, thereby ensuring that AI systems can deliver accurate, reliable, and valuable insights and outcomes.

Challenges in Data Collection, Pre-Processing, and Management

Data Collection: Data collection for AI applications is often driven by the need to solve the problem of ensuring that relevant data are collected, which is a challenge because of the sheer volume of data, the diversity of data sources, and the need for representative samples [49]. Data collection to ensure data quality is determined by data requirements and identifying the types of data needed for the application, data sources, and data quantity [50]. To ensure data quality in the data collection phase, the following considerations must be considered:

Data Pre-Processing: Data pre-processing is a crucial step in ensuring data quality, as it involves cleaning, transforming, and integrating the data to facilitate analysis [51]. Pre-processing can be time-consuming and resource-intensive, given the need to handle missing values, outliers, inconsistencies, and other data quality issues [52]. Moreover, pre-processing decisions can have significant implications for AI model performance, as they influence the characteristics of the input data [53].

Data Quality Management: Effective data management is essential for maintaining data quality and ensuring that AI systems can access and process data efficiently [41]. Data management challenges include maintaining data storage and retrieval systems, implementing version control, and ensuring data security and privacy [46].

The Role of Data Governance in Ensuring Data Quality

Data governance plays a critical role in maintaining, ensuring, and enhancing data quality in organizations [54]. It encompasses the processes, policies, standards, and technologies that manage the availability, usability, integrity, and security of data [55]. Effective data governance helps organizations make better decisions, optimize operations, comply with regulations, and create a competitive advantage [56]. Implementing a comprehensive data governance framework is essential for addressing data quality challenges in AI [57].

Data governance includes several aspects such as data stewardship, data quality management, data privacy and security, and data architecture [58]. Data stewardship involves assigning responsibility and accountability for data quality to designated data stewards to ensure that data meet organizational standards [59]. Data quality management refers to the processes and tools used to measure, monitor, and improve data quality, such as data profiling, cleansing, and enrichment [60]. Data privacy and security are concerned with protecting sensitive information and ensuring compliance with relevant regulations [61]. Data architecture includes the design, organization, and management of data structures, storage systems, and data integration technologies [57].

Implementing an effective data governance framework requires a clear understanding of the organization’s goals, data quality requirements, and existing data management practices [62]. Organizations must establish data quality metrics, set data quality targets, and monitor data quality performance regularly [63]. In addition, organizations should invest in data governance technologies such as data catalogs, data lineage tools, and data quality management systems to support the data governance process [54]. By adopting a robust data governance framework, organizations can significantly improve data quality and unleash the full potential of AI.

Furthermore, effective data governance is essential for fostering a data-driven culture within an organization. By promoting collaboration and communication between different departments and stakeholders, data governance helps to break down data silos and facilitates the sharing of data assets [54]. This enables organizations to leverage their data more effectively and gain valuable insights into strategic decision-making [58].

Training and education are critical components of data governance [60]. Ensuring that employees have a solid understanding of data quality concepts, tools, and best practices helps create a shared vision and commitment to maintaining high-quality data. This can lead to more accurate and reliable AI models that drive innovation and create a competitive advantage [59].

In addition to internal data governance efforts, organizations should consider the importance of external data quality. As AI systems often rely on data from various sources, including third-party providers and public datasets, ensuring the quality of external data is crucial for the success of AI initiatives [57]. Collaborating with data providers and establishing data quality agreements can help mitigate potential data quality issues stemming from external sources [58].

In summary, implementing a comprehensive data governance framework is vital for addressing data quality challenges in AI. Organizations that prioritize data governance can enhance their decision making, optimize operations, and unlock the full potential of AI technologies. By fostering a data-driven culture, investing in data governance technologies, and ensuring both internal and external data quality, organizations can build a solid foundation for AI success.

Addressing Data Quality Challenges: Techniques and Strategies

This section proposes solutions to address the challenges in data quality. The results are summarized in Figure 7.

Data Cleaning: Data cleaning is an essential step in improving data quality for AI. This involves identifying and correcting errors, missing values, inconsistencies, and outliers in data. Techniques such as data profiling and validation can help identify areas of data that require cleaning.

Data Profiling and Data Preparation: Data profiling involves analyzing datasets to identify data quality issues such as missing data, duplicate records, and inconsistent values. Data preparation involves cleaning and transforming the raw data into a usable format for AI algorithms. These processes are essential to ensure that the data used to train the AI models are accurate, complete, and consistent.

Data Labeling: Data labeling involves tagging data with relevant metadata that describe its characteristics, which can help ensure that AI models are trained with high-quality data. For example, in image recognition, data labeling may involve identifying objects in the images and adding descriptive labels to the data.

Imputation Techniques for Missing Data: Missing data are a significant challenge in ensuring the data quality of AI systems. Various imputation techniques have been proposed to handle missing data, including mean imputation, regression imputation, and multiple imputation [54]. Advanced techniques, such as matrix completion methods, have been explored in recent years [55]. These methods aim to provide reasonable estimates of the missing values and ensure the completeness of the dataset.

Feature Selection and Engineering: Feature selection and engineering play a crucial role in addressing data quality challenges, as they involve identifying relevant features and transforming raw data into a format suitable for analysis [64]. Techniques such as Recursive Feature Elimination (RFE), LASSO, and principal component analysis can be employed to reduce the dimensionality of the data and eliminate noise [65]. Moreover, domain knowledge can be leveraged to create new features that better capture the underlying patterns and relationships.

Data Augmentation: Data augmentation techniques can be used to address the data quality challenges related to limited or unbalanced datasets. These techniques generate synthetic data samples by applying various transformations such as rotation, scaling, and flipping to the original data [66]. Data augmentation has been particularly successful in improving the performance of deep learning models in computer vision and natural language processing tasks [67].

Active Learning: Active learning is an approach that can help address data-quality challenges by guiding the data-collection process. In active learning, AI models iteratively select the most informative samples to be labeled and added to the training set, thereby reducing the amount of labeled data required and improving model performance [68]. Active learning has shown promise in various applications, including text classification and object recognition [69,70].

Data Validation and Testing: Data validation involves checking data for accuracy and completeness, whereas testing involves assessing the performance of AI models using various metrics. These processes can help identify and address data quality issues that may affect the accuracy and effectiveness of AI models.

Algorithmic Fairness: Algorithmic fairness involves ensuring that AI models are not biased towards specific groups or individuals. This can be achieved by carefully selecting training datasets and implementing algorithms designed to reduce bias.

Data Bias Mitigation: Data bias can lead to inaccurate or unfair AI predictions and decisions. Mitigating data bias involves identifying and addressing bias in the training data through techniques such as dataset balancing, which involves adjusting the distribution of data to reduce bias.

Continuous Monitoring and Maintenance: AI models need to be continuously monitored and maintained to ensure that they remain accurate and effective. This involves ongoing data-quality checks, updating models with new data, and retraining models as needed.

Data Lineage: Data lineage involves tracking the history of the data and ensuring that it is used appropriately. This can help prevent issues such as data drift, where the quality of data changes over time.

2.2.2. Dimension II: Data Volume

Challenging Elements of Data Volume

The data volume challenge is a key aspect of AI research and application. Large datasets are critical for training AI models, and continue to grow in size and complexity. This growth brings with it some challenges that must be addressed to ensure the effective use of AI in various domains [71]. The challenging elements of data volume are presented in Figure 8.

Data Deluge: A double-edged sword. Exponential data growth is the driving force behind the success of AI, particularly in deep learning techniques [72]. However, the massive amount of data poses several challenges, including in storing, processing, and managing data [73].

Storage Challenges: The huge amount of data generated today requires more efficient storage solutions to support artificial intelligence applications [74]. Traditional storage architectures may not be able to meet the scalability, performance, and cost requirements of AI workloads [75]. New storage technologies, such as nonvolatile memory (NVM) and distributed storage systems, have been proposed as possible solutions [76].

Processing Challenges: AI models, particularly deep learning algorithms, require enormous computing resources to process large datasets [77]. This has led to an increased need for specialized hardware such as GPUs and TPUs to accelerate AI training and inference [78]. In addition, new techniques such as model compression, pruning, and quantization have been explored to optimize AI models for more efficient processing [79].

Data Management Challenges: From the perspective of big data volume, effective data management is critical for AI systems to handle massive amounts of data. This includes data cleaning, preprocessing, labeling, and curation [80]. Techniques such as active learning, weak supervision, and transfer learning have been proposed to alleviate the burden of manual data annotation [81].

Data Heterogeneity: Large datasets may contain data from multiple sources, which can be challenging to integrate and harmonize, particularly when the data are in different formats or structures.

Data Privacy and Security: Large data volumes can increase the risk of data breaches and privacy violations, particularly when sensitive data are involved. These issues need to be addressed as the amount of data increases [82].

Bias and Representativeness: Large volumes of data do not necessarily guarantee representativeness or lack bias, as they may still contain demographic, cultural, or other biases that can impact the accuracy of AI models.

Data Access: In some cases, organizations may have access to large datasets but may not be able to use them due to legal or regulatory constraints. Organizations must ensure that they have the necessary permissions and licenses to access and use data.

Mitigation Solutions for Challenges of Data Volume

Several solutions have been proposed to address the data volume challenges in artificial intelligence. Figure 9 depicts the proposed solutions that this includes.

(a): Transfer Learning: Transfer learning involves leveraging pre-trained AI models to improve the performance of new models. Using pre-trained models, organizations can reduce the amount of training data required and improve the efficiency of the training process.
(b): Federated Learning: This enables collaborative model training across multiple devices without sharing raw data [83].
(c): Edge Computing: This brings data processing closer to the data source, thereby reducing the network latency and bandwidth usage [84].
(d): Multimodal Learning: This leverages multiple data sources to improve the model performance and reduce reliance on large datasets [85].

Federated Learning as a proposed solution:

Federated learning was introduced as a solution to preserve user privacy while benefiting from the collective knowledge of multiple data sources [86]. In this framework, devices (also known as clients) train machine learning models using local data, and then share model updates with a central server. The server aggregates these updates, improves the global model, and distributes the updated model back to the clients. This process is repeated iteratively until the model converges. The benefits of Federated Learning include several benefits that can help address data volume challenges in AI.

(a): Privacy: Because raw data remain on client devices, federated learning inherently provides a higher level of privacy compared to centralized approaches [83].
(b): Reduced Data Transfer: By sharing only model updates rather than raw data, federated learning can significantly reduce the amount of data that needs to be transferred over the network, reducing bandwidth and latency issues [87].
(c): Scalability: Federated learning can accommodate a large number of client devices, allowing the use of different data sources without overloading the central server [75].
(d): Real-time learning: By allowing clients to learn from local data, federated learning enables real-time adaptation and improves model performance [82].

Challenges and Future Directions: Despite its benefits, federated learning also presents some challenges that need to be addressed.

(a): Heterogeneity: The heterogeneity of client devices and data distribution may lead to an unbalanced contribution to the global model, which may affect convergence and model performance [75].
(b): Communication Overhead: The iterative process of exchanging model updates incurs significant communication overhead and may negate the benefits of reduced data transfer [87].
(c): Security: Federated learning is vulnerable to various security threats, including model poisoning, inference attacks, and Sybil attacks [88].

To overcome these challenges, researchers are exploring various techniques such as weighted averaging to deal with heterogeneity [89], communication-efficient algorithms for reducing overhead [90], and differential privacy for enhancing security [91]. Continued research in these areas will be crucial to fully realize the potential of federated learning in addressing the data-volume challenge in AI.

-

Edge Computing as a Proposed Solution: Edge computing is a distributed computing paradigm that aims to bring computation and data storage closer to the data source, or the “edge” of the network, where the data are generated [84]. By performing data processing on edge devices, such as smartphones, IoT devices, or edge servers, edge computing can reduce the amount of data that must be transmitted to the cloud or a centralized data center. This approach enables real-time data processing, reduces latency, and conserves the bandwidth.

-

Advantages of Edge Computing: Edge computing offers several benefits that can help address the data volume challenge in AI.

(a): Reduced Latency: By processing data closer to the source, edge computing can significantly reduce latency and enable real-time AI applications [84].
(b): Bandwidth Efficiency: Edge computing helps conserve bandwidth by reducing the amount of data transmitted over the network, which is particularly useful in situations where the network bandwidth is limited or expensive [76].
(c): Enhanced privacy and security: Because data are processed and stored locally, edge computing can provide improved data privacy and security compared to centralized approaches [92].
(d): Scalability: Edge computing can support many devices and applications, making it suitable for the growing demands of AI and IoT [84].

Challenges and Future Directions:

Despite its advantages, edge computing presents some challenges that need to be addressed.

(a): Resource Constraints: Edge devices typically have limited computational resources, which may hinder the performance of complex AI models [93].
(b): Model Deployment and Management: Deploying and managing AI models across a large number of edge devices can be challenging because it requires efficient model distribution, updates, and monitoring [94].
(c): Heterogeneity: The heterogeneity of edge devices in terms of hardware, software, and network connectivity can pose challenges for implementing consistent and efficient AI solutions [95].

Researchers have explored various techniques to overcome these challenges, such as model compression and hardware-aware neural architecture searches for resource-constrained devices [79], edge-cloud collaborative learning for model deployment and management [96], and federated edge learning to address heterogeneity [97]. Continued research in these areas will be crucial to fully realize the potential of edge computing in addressing the data-volume challenge in AI.

2.2.3. Dimension III: Data Privacy and Security

The use of personal or sensitive data in AI can raise concerns regarding privacy and security. It is important to ensure that the data are stored and processed securely and that privacy regulations are followed. To address this challenge, businesses should implement data privacy and security policies and procedures such as data encryption and access control.

This section provides a comprehensive review of the challenges associated with data privacy and security in AI, and discusses data collection and sharing, inference attacks, differential privacy, adversarial attacks, data poisoning, and model and data tampering. It also presents state-of-the-art mitigation strategies, such as privacy-preserving AI techniques, robustness and adversarial training, monitoring and anomaly detection, and compliance with data protection regulations. The results are summarized in Figure 10.

Data Privacy Challenges in AI

Data Collection and Sharing: AI systems require large quantities of data to train effectively, often leading organizations to aggregate data from various sources, potentially exposing sensitive user information [98]. Data-sharing agreements and collaborative data analysis projects can exacerbate these concerns, especially when data are shared across international borders with differing privacy regulations [99].

Inference Attacks: AI models can inadvertently reveal sensitive information about training data, even when the data are anonymized [100]. For example, attackers can use model inversion or membership inference attacks to extract private information from a model’s predictions or to learn whether a specific data point is included in the training set [101].

Differential Privacy: Differential privacy (DP) is a popular approach for preserving privacy during data analysis by adding controlled noise to the data [102]. Although DP provides strong privacy guarantees, it can be challenging to implement in practice, especially when balancing privacy protections and model utility [103].

Data Security Challenges in AI

Adversarial Attacks: Adversarial attacks, in which small perturbations are introduced to input data to deceive AI models, pose significant security risks [104]. These attacks can lead to incorrect predictions or classifications, undermining the reliability of AI systems in critical applications, such as healthcare, finance, and autonomous vehicles [105].

Data Poisoning: Data poisoning attacks involve tampering with training data to degrade the performance of an AI model [106]. These attacks can be difficult to detect because they often target a small subset of the training data and require minimal modifications to the poisoned data points [107].

Model and Data Tampering: Attackers can also target AI models and data directly by altering model parameters, weights, or the data itself [108]. Techniques such as backdoor attacks or Trojan neural networks can introduce hidden malicious behavior into AI models, thereby posing significant security risks [109]. In backdoor attacks, an attacker injects malicious code into the model during the training process, causing the model to produce incorrect outputs or exhibit unintended behavior when triggered by a specific input pattern [110]. Trojan neural networks, on the other hand, involve embedding a hidden trigger within the model, which, when activated by a specific input, causes the model to perform unauthorized actions or provide incorrect predictions [111].

Model extraction attacks involve model tampering, in which an attacker seeks to replicate a target model by querying it and learning from the responses [112]. This can lead to intellectual property theft, or even the creation of duplicate models with malicious intent [113].

To defend against model and data tampering, organizations can employ various strategies such as model hardening, secure model storage, and input validation. Model hardening techniques, such as fine pruning, can help remove malicious components from a model while maintaining its overall performance [114]. Secure model storage using encryption and access controls can protect a model from unauthorized modifications [115]. Input validation can be employed to ensure that only legitimate inputs are processed by the AI system, mitigating the risk of triggering hidden backdoors or Trojan networks [116]. By implementing these countermeasures, organizations can enhance the security and trustworthiness of their AI systems in the face of model and data-tampering threats.

Mitigation Strategies for Data Privacy and Security Challenges

Several mitigation techniques are proposed, as depicted in Figure 11.

Privacy-Preserving AI Techniques: To address privacy concerns, organizations can employ privacy-preserving AI techniques, such as federated learning, secure multi-party computation, and homomorphic encryption [117]. These methods allow organizations to train AI models on distributed data without sharing raw data between parties, thereby reducing the risk of data breaches or leakage [118].

Robustness and Adversarial Training: To defend against adversarial attacks and improve model robustness, researchers have developed adversarial training techniques that involve augmenting a training dataset with adversarial examples [119]. By training the model on a combination of clean and adversarial data, it becomes more resilient to adversarial perturbations [120].

Monitoring and Anomaly Detection: To detect data poisoning and model tampering, organizations can employ monitoring and anomaly detection techniques to identify deviations from the expected behavior in model performance, training data, or model parameters [59]. Early detection can help prevent further damage to AI systems and provide valuable insights for improving security measures [121].

Compliance with Data Protection Regulations: Organizations should adhere to data protection regulations, such as the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in the United States, to ensure that they collect, store, and process data in a secure and compliant manner [122]. Compliance with these regulations can help minimize the risk of data breaches and protect user privacy [123].

Data privacy and security challenges in AI are significant concerns for organizations that develop and deploy AI systems. By understanding these challenges and implementing mitigation strategies, such as privacy-preserving AI techniques, robustness training, and compliance with data protection regulations, organizations can enhance the privacy and security of their AI systems. As AI continues to evolve and impact various industries, it is crucial for researchers, practitioners, and policymakers to work together to address these challenges and ensure that it serves the greater good without compromising user privacy and security.

2.2.4. Dimension IV: Bias and Fairness

AI has seen rapid advancements over the past decade, transforming many aspects of our lives. However, as AI systems have become more prevalent, concerns regarding data bias and fairness have emerged. The performance of AI models depends heavily on the quality of the data used for training, and biased data can lead to biased outcomes [124]. This section provides a comprehensive review of the challenges related to data bias and fairness in AI, with a focus on recent research and solutions.

Bias, in the context of artificial intelligence and machine learning, refers to systematic errors in the algorithms’ predictions or decisions resulting from skewed training data, flawed algorithms, or the influence of pre-existing assumptions. These biases can lead to unfair or discriminatory outcomes, impacting individuals or groups based on attributes such as race, gender, age, or socioeconomic status. Addressing and mitigating bias in AI systems is crucial for ensuring the fair and ethical deployment of these technologies in various domains, from healthcare and finance to criminal justice and social media.

Amazon’s AI Recruiting Tool: In 2018, Amazon discontinued its AI recruiting tool after it was found to be biased against female candidates. The system was designed to review resumes and rank candidates based on their qualifications. However, the model was trained on resumes submitted to the company over a ten-year period, which predominantly belonged to male candidates. Consequently, the AI system preferred male candidates over equally qualified female candidates [125].

Google Photos’ Racial Bias: In 2015, Google Photos was criticized for its image recognition algorithm, which mistakenly labeled African Americans as gorillas. This incident highlighted racial bias in the AI system, which was attributed to the lack of diversity in the training data. Google apologized for this mistake and worked on improving the algorithm to avoid such issues in the future [126].

Microsoft’s Tay Chatbot: In 2016, Microsoft launched a Twitter-based AI chatbot called Tay. The chatbot was designed to learn from user interactions and to mimic human conversations. However, within 24 h of its launch, Tay started posting offensive and racist tweets, as it had learned from malicious users who intentionally fed biased and inappropriate content. Microsoft quickly took Tay offline and apologized for the incident [127].

Apple Card’s Gender Bias: In 2019, Apple faced a backlash when its Apple Card, a credit card service powered by Goldman Sachs, was accused of gender bias. Several users reported that the credit limit offered to male applicants was significantly higher than that offered to their female counterparts despite having similar or even worse financial profiles. This issue raised concerns about the fairness and transparency of AI algorithms used for credit assessment [128].

COMPAS Risk Assessment Tool: The Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) tool is an AI-based system used in the United States to assess the risk of recidivism in criminal defendants. A 2016 investigation by ProPublica revealed that the tool exhibited racial bias, with African American defendants being more likely to be incorrectly labeled as high-risk compared to white defendants with similar criminal records. This controversy led to increased scrutiny of AI-based risk assessment tools in the criminal justice system [129].

Types of Data Bias

Figure 12 represents ten types of data bias.

Measurement Bias: Measurement bias occurs when data collection methods systematically over- or under-represent certain features or aspects of data [130]. This can lead to AI models that generate biased predictions that are not representative of the true population.

Label Bias: Label bias arises when labels assigned to data instances are incorrect or unrepresentative of true outcomes. This can result from human errors, subjective judgments, or systemic issues during the labeling process [131].

Sampling Bias: Sampling bias occurs when the collected data are not representative of the population of interest. This can lead to biased AI models, as they learn from a non-representative sample [132].

Aggregation Bias: Aggregation bias emerges when data are combined from multiple sources with different characteristics or distributions. This can cause AI models to learn patterns that are not generalizable to the entire population [133].

Confirmation Bias: Confirmation bias emerges when data or information are selectively chosen or weighted to support pre-existing beliefs or expectations. This can inadvertently affect AI model outcomes because the training data may disproportionately represent certain aspects or patterns [134].

Group attribution bias: This bias arises when AI systems generalize, or stereotype individual behaviors based on the perceived characteristics of the group to which they belong. This can lead to biased predictions that do not accurately represent an individual’s unique attributes [135].

Temporal Bias: Temporal bias occurs when AI models are trained on historical data that no longer reflects current trends or patterns. This can lead to biased predictions because the models fail to adapt to changes in the underlying data distribution over time [136].

Feature Selection Bias: Feature selection bias emerges when certain features are given greater importance or focus during the model development process, leading to biased outcomes. This can be a result of domain-specific biases or biases inherent to the algorithms used for feature selection [137].

Anchoring Bias: Anchoring bias occurs when AI models rely heavily on initial information or data points to make predictions. This can result in biased outcomes, as the models may not sufficiently consider other relevant factors or adjust their predictions based on new information [138].

Automation Bias: Automation bias refers to the tendency of humans to over-rely on AI system outputs, even when they are flawed or biased. This can exacerbate the consequences of biased AI models because users may not question or scrutinize biased decisions or recommendations generated by these systems [139].

Consequences of Data Bias and Unfairness in AI

Discrimination: Biased AI systems can inadvertently discriminate against certain groups or individuals, thereby leading to unfair treatment. For example, biased facial recognition systems can misidentify individuals from minority groups at a higher rate than those from majority groups [140].

Misinformation: AI systems trained on biased data may propagate false or misleading information, exacerbating existing stereotypes and prejudices [141].

Legal and Ethical Implications: Biased AI systems can pose legal and ethical challenges as they may violate anti-discrimination laws or ethical guidelines [142].

Addressing Data Bias and Fairness in AI

This section proposes a solution to address data bias and fairness in AI. These results are shown in Figure 13.

Data Collection and Pre-processing: Collecting diverse and representative data is crucial for mitigating data bias [143]. Additionally, preprocessing techniques such as resampling, reweighting, or data augmentation can help reduce bias in the dataset [144].

Algorithmic Fairness: Researchers have proposed various fairness-aware machine-learning algorithms that aim to minimize discriminatory outcomes (Friedler et al., 2019). These techniques typically incorporate fairness constraints into the model training process or post-process model predictions to ensure fairness [145].

Fairness Metrics: Developing appropriate fairness metrics is essential for quantifying and comparing the performance of AI models in terms of fairness [146]. Some commonly used metrics include demographic parity, equalized odds, and disparate impact ratios [147].

Explainable AI: Explainable AI (XAI) techniques can provide insights into the decision-making process of AI models, helping identify potential sources of bias and unfairness [148]. Researchers can develop more effective interventions to improve fairness by understanding the underlying reasons for the biased outcomes.

Interdisciplinary Collaboration: Addressing data bias and fairness requires the collaboration of experts from various fields, including computer science, social sciences, and ethics [149]. Interdisciplinary efforts can help develop comprehensive strategies that consider the complex interplay between data, algorithms, and social contexts.

2.2.5. Dimension V: Interpretability and Explainability

AI models can be difficult to interpret and explain, which can make it difficult for organizations to understand how decisions are made. It is important to ensure that AI models are transparent and explainable. To address this challenge, organizations should implement interpretability and explainability controls such as feature importance analysis and model visualization tools. Figure 14 shows the challenging elements of interpretability and explainability.

Artificial intelligence has become an integral part of modern society, with its influence seen in various domains, including healthcare, finance, transportation, and many others [150]. The rise of AI has been largely driven by advances in machine learning and deep learning techniques, which have demonstrated impressive results in solving complex problems [151]. However, these techniques have also given rise to “black-box” models, characterized by their lack of interpretability and explainability [150].

The Necessity of Interpretability and Explainability

The demand for interpretability and explainability in AI systems is driven by the need for trust, accountability, and ethical considerations [152]. Trust is essential for the adoption and successful integration of AI systems, as users need to understand and believe in the decisions made by these systems [153]. Accountability ensures that AI systems comply with legal and ethical standards and can be audited when necessary [154]. Ethical considerations call for AI systems to adhere to the principles of fairness, transparency, and nondiscrimination [155].

Current Techniques for Interpretability and Explainability

Various techniques have been proposed to enhance the interpretability and explainability of AI systems, ranging from inherently interpretable models to post hoc explanations for black-box models [156]. Some of these techniques include the following:

Inherently interpretable models, such as decision trees, linear regression, and rule-based systems, are designed to be easily understood by humans, providing a direct relationship between input features and the model’s output [157].

Local Explanations: Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive explanations (SHAP) are methods that explain individual predictions by approximating a complex model’s behavior using simpler, interpretable models for a specific instance [158,159,160].

Visualization Techniques: Techniques such as t-distributed Stochastic Neighbor Embedding (t-SNE) and activation can provide visual representations of high-dimensional data processed by AI models, enabling human users to understand the relationships and patterns present in the data [161,162].

Remaining Challenges and Future Directions

Despite progress in developing techniques for interpretability and explainability, several challenges still need to be addressed [150]:

Trade-Off between Performance and Interpretability: Highly interpretable models often come at the cost of reduced predictive performance. Future research should focus on developing models that balance interpretability and performance [157].

Evaluation Metrics: The development of standardized evaluation metrics to assess the interpretability and explainability of AI models remains a challenge. Establishing universally accepted metrics will enable researchers to compare different techniques more effectively and drive further innovation [162].

Domain-Specific Solutions: Certain application domains require specific interpretability and explainability techniques. For example, in the medical field, explanations must be tailored to the knowledge and understanding of both clinicians and patients [163]. Further research is required to develop domain-specific solutions that satisfy unique requirements.

Ethical Considerations: As explainable AI techniques become more advanced, there is a risk of generating explanations that may be misleading or biased, leading to potentially harmful consequences [164]. Future research should address the ethical implications of explainability and develop guidelines to ensure that the explanations are accurate, unbiased, and useful.

Thus, interpretability and explainability are crucial components of the successful integration of AI systems into our daily lives. By addressing the challenges related to these concepts, trust can be fostered, accountability ensured, and ethical AI deployment promoted. The development of techniques to enhance interpretability and explainability remains an active area of research, with significant progress already being achieved. However, several challenges still need to be addressed, including balancing performance with interpretability, developing standardized evaluation metrics, creating domain-specific solutions, and considering ethical implications. By addressing these challenges, we can bridge the gap between AI and human understanding, paving the way for a more transparent and trustworthy AI-powered future.

2.2.6. Dimension VI: Technical Expertise

Building and deploying AI models requires technical expertise, which can be challenging for companies that do not have the necessary skills in-house. To address this challenge, organizations can hire data scientists or partners with external vendors who provide the required expertise.

The growth of artificial intelligence (AI) has revolutionized multiple industries, including healthcare, finance, and manufacturing [165]. AI-driven systems have demonstrated exceptional performance in various tasks, such as natural language processing, computer vision, and robotics [166]. However, the increasing complexity and sophistication of AI algorithms have resulted in new challenges in terms of technical expertise [167]. This section investigates these challenges and proposes potential solutions.

Scarcity of Skilled Professionals

One of the primary challenges in the AI domain is the scarcity of skilled professionals [168]. The rapid development of AI technologies has outpaced the growth of a workforce capable of handling the complexity and diversity of AI systems [169]. This talent gap can hinder further progress in AI and impede the adoption of AI technologies in various ways [167]. Potential solutions to address this talent gap include increasing investment in education and training, promoting interdisciplinary collaboration, and developing AI-driven tools for education [169].

Ethical Concerns

AI technologies have raised various ethical concerns that require careful consideration and technical expertise [170]. These concerns include bias, fairness, transparency, accountability, and potential misuse of AI technologies [171]. Addressing these ethical issues necessitates the development of robust AI systems that align with human values and ethical principles, as well as fostering interdisciplinary collaboration between AI researchers, ethicists, and policymakers [170,172].

The Growing Demand for AI-Related Expertise

The increasing adoption of AI technologies across various industries has led to a surge in the demand for AI-related expertise [173]. According to [174], AI and machine learning are among the top 10 emerging professions, with a projected growth rate of 41% between 2020 and 2025. This increasing demand for AI professionals is driven by the need for specialized knowledge in areas such as algorithm development, data analysis, and systems integration [175]. Addressing this demand requires concerted efforts in education and training as well as the fostering of interdisciplinary collaboration to develop a workforce capable of tackling the complex challenges associated with AI technologies [169].

Key Disciplines in High Demand for AI-Related Expertise

As AI technologies continue to evolve, the demand for expertise in various disciplines is expected to grow. Some of the key disciplines that are needed in the AI domain are:

(a): Computer Science and Computer Engineering: Professionals with skills in algorithm development, machine learning, deep learning, natural language processing, and computer vision are essential for designing, building, and maintaining AI systems [175].
(b): Data Science and Analytics: AI systems often rely on large volumes of data. Experts in data science and analytics are required to preprocess, analyze, and interpret data to generate actionable insights and improve AI models [173].
(c): Human–computer Interaction (HCI) and Cognitive Science: As AI technologies become more integrated into our daily lives, understanding how humans interact with these systems is becoming increasingly important. HCI and cognitive science experts can help design AI systems that are intuitive, user-friendly, and adaptable to human needs [176].
(d): Ethics, Philosophy, and Policy: The growing influence of AI technology raises several ethical and philosophical questions. Experts in these fields are needed to address issues related to fairness, transparency, and accountability and to develop policies and frameworks that ensure responsible AI development and deployment [170].
(e): Cybersecurity and Privacy: Protecting sensitive data and maintaining the security of AI systems is a critical concern. Professionals skilled in cryptography, secure multiparty computation, and privacy-preserving machine learning techniques are essential to ensure data privacy and security [177].
(f): Robotics and Autonomous Systems: As AI-powered robotics and autonomous systems become more prevalent, expertise in areas such as control systems, sensor fusion, and robotics software engineering will become increasingly valuable [178].

Collaboration between Humans and AI

AI systems are becoming increasingly more autonomous. As a result, there is a growing need for effective collaboration between humans and AI [178]. This collaboration requires developing AI systems that can understand and adapt to human preferences, communicate effectively, and support human decision making [178]. Technical expertise in human–computer interactions, cognitive science, and explainable AI is crucial for designing AI systems that can seamlessly integrate into human workflows [176].

The references used in this study to address the challenges of data in AI applications are listed in Table 1.

3. Results

In our quest to understand the hurdles to data quality in terms of artificial intelligence, we stumbled upon a few noteworthy insights. Specifically, we noticed that data quality plays a vital role in AI systems, which span multiple dimensions with significant implications. Therefore, ensuring the quality of data requires not only appropriate governance, but also adherence to best practices to ensure optimal results.

3.1. Dimension of Data Quality and Implication for AI systems

Particularly pertinent to AI systems, our analysis disclosed the following dimensions of data quality: Accuracy, Completeness, Consistency, Timeliness, Relevance, and Integrity.

AI predictions and decisions are influenced by crucial dimensions of performance, reliability, and trustworthiness. The quality of data across these dimensions must be diligently maintained to minimize the hazards of distorted or prejudiced outcomes and optimize the efficiency of AI programs.

3.2. The Role of Data Governance in Ensuring Data Quality

Ensuring data quality is a crucial aspect of data governance. It involves managing and monitoring the data to maintain accuracy, completeness, and consistency. Data governance plays a vital role in this process by establishing policies and standards to regulate how data are collected, used, and shared. By implementing effective data governance, organizations can reduce the risk of errors and inconsistencies in their data that could lead to costly mistakes and poor decision making. Overall, data governance is essential for maintaining high-quality data and ensuring their usefulness and reliability for various purposes.

Maintaining the data quality for AI systems can be ensured through data governance, an aspect that our analysis has also highlighted. Organizational benefits from establishing a strong data governance system include the following:

(a): Quality standards and policies for data must be defined and put into action.
(b): Throughout the lifecycle of the data, it is important to keep a close eye on their quality and maintain control.
(c): Quality data and holding ourselves accountable should be part of a culture we strive to create.
(d): The sharing, integration, and management of data can be enhanced through various means. The optimization of data management techniques should be prioritized. Improved data sharing is crucial for seamless exchanges between different systems. The integration of various data types can be achieved using appropriate methods.
(e): Regulations and laws must be followed carefully to maintain compliance.

Systematically addressing data quality challenges and minimizing the risks associated with poor data quality can be achieved by integrating data governance into the AI development process.

3.3. Best Practices to Ensure Data Quality for AI

AI systems’ data quality can be ensured by adopting the following best practices:

(a): Implementing an effective data management strategy that includes data curation and preprocessing before usage.
(b): Fostering transparency and accountability in the data collection process, including defining data sources and conducting regular audits.
(c): Conducting diversity checks on the collected dataset to avoid bias, and making sure that it is representative of the target population.
(d): Ensuring the security and privacy of the data by implementing the necessary security protocols and obtaining consent from the data subjects.
(e): Proactively monitoring and updating the dataset to maintain accuracy and relevance, especially when it comes to dynamic or constantly changing environments.

AI systems can provide reliable data quality if organizations implement certain best practices, and we have identified quality data frameworks and strategies that need to be developed and implemented. The structures and processes for data governance must be established to ensure proper management. Governance data structures and processes provide accountability and responsibility for data management. Establishing governance structures and processes is critical for ensuring the proper use of data. To effectively manage data, clear guidelines and procedures must be in place for those responsible for handling them. Proper governance ensures that data are properly managed, and that rights permissions and access are granted. Without these structures and processes, data can be lost or misused, affecting an organization’s overall performance. Regular assessments and audits must be conducted to ensure data quality. Do not forget to conduct these examinations sporadically. Enrichment tools, coupled with data cleansing and validation, should be used. Traceability solutions and data lineages are means of implementing change. Machine learning and AI-based solutions offer powerful tools for improving data quality; therefore, their adoption is highly recommended. The best data quality practices should be taught to employees through training sessions. To enhance the quality of the data, alliances can be formed with colleagues and associates. Organizations can improve their AI performance and reliability by embracing these best practices, which in turn will increase the quality of the data used. Implementing data governance and best practices across various dimensions is necessary to unlock the full potential of AI and to drive better outcomes. Organizations must address data-quality challenges to ensure the successful development and deployment of AI systems, as evidenced by our analysis. Data quality plays a critical role and should be a top priority for organizations that utilize AI.

4. Discussion

The comprehensive analysis of the data quality challenges for artificial intelligence presented in this article underscores the importance of understanding and addressing data quality issues in the development and deployment of AI systems. In this discussion section, we emphasize the broader implications of our findings, highlight the limitations of the current study, and suggest future research directions.

4.1. Broader Implications

Our analysis has several broader implications for organizations, policymakers, and researchers involved in AI development and deployment. First, by identifying and understanding the key dimensions of data quality and their implications for AI systems, organizations can prioritize their efforts to address data quality challenges, ensuring that their AI systems deliver accurate, reliable, and unbiased results. Second, our analysis highlights the significance of data governance in maintaining and improving data quality, emphasizing the need for organizations to invest in robust data governance structures and processes. Finally, the best practices identified in this study can serve as a practical guide for organizations seeking to enhance their data quality management efforts and optimize the performance and reliability of AI systems.

4.2. Limitations

Despite providing a comprehensive analysis of the data quality challenges for AI, this study had several limitations. First, the scope of our analysis primarily focused on the dimensions of data quality, data governance, and best practices. Additional factors, such as organizational culture and technical infrastructure, could impact data quality and AI performance. Second, although we have drawn upon a wide range of literature sources, there may be other relevant publications that were not considered in this study. Finally, our findings were largely based on a synthesis of the existing literature, and future empirical research is needed to further validate and expand these findings.

4.3. Real-Time Time Environment Challenges

Real-time data pose unique challenges in AI applications because of the need to process and analyze data in near real-time or with minimal delays. Some key challenges associated with real-time data in AI applications are as follows:

(a): Real-Time Data Volume and Velocity: Real-time data often come in high volumes and at high velocities, requiring efficient processing and analysis techniques. AI systems must handle incoming data streams and make timely decisions or predictions based on these data.
(b): Latency and Response Times: Real-time applications demand low latency and fast response times. AI models must be designed and optimized to provide quick insights and actions in real-time scenarios. High computational requirements and complex algorithms can hinder real-time performance.
(c): Real-Time Data Quality and Noise: Real-time data can be noisy, incomplete, or contain outliers. Ensuring data quality becomes challenging as there is limited time for data validation and cleaning. AI models must be robust in order to handle such noisy data and make accurate predictions or decisions.
(d): Scalability and Resource Constraints: Real-time AI applications often require scalability to handle large volumes of incoming data. Scaling AI models and infrastructure to handle the increased workload can be challenging considering resource constraints such as computing power, memory, and network bandwidth.
(e): Data Synchronization: Real-time data may come from multiple sources and must be synchronized for accurate analysis and decision-making. Aligning and integrating data streams from various sources in real time can be complex, particularly when dealing with data in different formats or time zones.
(f): Real-Time Model Training and Adaptation: Updating or retraining AI models in real-time can be challenging. Continuous learning and model adaptation may be required to incorporate new data and adjust model parameters to changing conditions. Balancing the model stability with the need for real-time updates is crucial.
(g): Real-Time Analytics and Visualization: Effectively analyzing and visualizing real-time data to extract meaningful insights and support decision making is a challenge. Real-time analytics techniques and interactive visualization tools are required to process and present data in a timely and actionable manner.

Addressing these challenges requires the careful design, optimization, and integration of AI algorithms, infrastructure, and data-processing pipelines. It often involves a combination of efficient data streaming techniques, distributed computing, real-time analytics frameworks, and scalable AI architectures to enable real-time decision making and insights.

4.4. Future Research Directions

Based on the limitations and findings of this study, we suggest several directions for future research:

(a): Investigate the role of organizational culture, leadership, and technical infrastructure in ensuring data quality for AI systems.
(b): Conduct empirical research to assess the effectiveness of different data governance practices and data quality management strategies in real-world AI applications.
(c): Examine the relationship between specific dimensions of data quality and AI performance across different industries and use cases.
(d): Develop novel AI and machine learning techniques to automatically detect, diagnose, and resolve data quality issues.
(e): Explore the ethical and legal implications of data quality challenges in AI, particularly in relation to privacy, transparency, and fairness.

By addressing these future research directions, we can further deepen our understanding of the challenges of data quality for AI, develop more effective strategies to overcome these challenges, and harness the full potential of AI technologies.

5. Conclusions

This work addresses the challenges surrounding data for AI technology and applications, which businesses and organizations are recommended to develop strategies and frameworks to meet, and handles these challenges in the following different dimensions. (1) Data Quality covers accuracy, completeness, consistency, timeliness, integrity, relevance, data collection, pre-processing, management, data governance, data labeling, etc. (2) Data Volume covers data deluge, storage challenges, processing challenges, data management challenges, data heterogeneity, data privacy and security, bias and representativeness, data access, etc. (3) Data Privacy and Security cover inference attacks, differential privacy, adversarial attacks, data poisoning, model and data tampering, privacy-preserving AI techniques, robustness and adversarial training, monitoring and anomaly detection, and compliance with data protection regulations. (4) Bias and Fairness cover measurement bias, label bias, sampling bias, aggregation bias, confirmation bias, temporal bias, feature selection bias, etc. (5) Interpretability and Explainability cover local explanations, visualization techniques, trade-offs between performance and interpretability, evaluation metrics, domain-specific solutions, and ethical considerations. (6) Technical Expertise covers computer science and computer engineering, data science and analytics, human–computer interactions, ethics, philosophy and policy, cybersecurity and privacy, etc. Technical expertise in AI is essential to address the challenges posed by the rapid development of AI technologies. By fostering interdisciplinary collaboration, investing in education and training, and promoting the development of ethical, secure, and human-centered AI systems, researchers and policymakers can overcome these challenges and pave the way for further advancements in AI.

Author Contributions

Conceptualization, A.A., A.M.H. and K.N.A.-K.; writing—original draft preparation, A.A., A.M.H. and K.N.A.-K.; writing—review and editing, A.M.H. and A.A.; supervision, K.N.A.-K. and A.M.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

Russell, S.J.; Norvig, P. Artificial Intelligence: A Modern Approach; Pearson Education Limited: London, UK, 2016. [Google Scholar]
Sharma, L.; Garg, P.K. Artificial Intelligence: Technologies, Applications, and Challenges; Taylor & Francis: New York, NY, USA, 2021. [Google Scholar]
Aguiar-Pérez, J.M.; Pérez-Juárez, M.A.; Alonso-Felipe, M.; Del-Pozo-Velázquez, J.; Rozada-Raneros, S.; Barrio-Conde, M. Understanding Machine Learning Concepts. In Encyclopedia of Data Science and Machine Learning; IGI Global: Hershey, PA, USA, 2023; pp. 1007–1022. [Google Scholar]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA; 2019; Volume 1, pp. 4171–4186. [Google Scholar]
Gumbs, A.A.; Grasso, V.; Bourdel, N.; Croner, R.; Spolverato, G.; Frigerio, I.; Illanes, A.; Abu Hilal, M.; Park, A.; Elyan, E. The advances in computer vision that are enabling more autonomous actions in surgery: A systematic review of the literature. Sensors 2022, 22, 4918. [Google Scholar] [CrossRef] [PubMed]
Enholm, I.M.; Papagiannidis, E.; Mikalef, P.; Krogstie, J. Artificial intelligence and business value: A literature review. Inf. Syst. Front. 2022, 24, 1709–1734. [Google Scholar] [CrossRef]
Wang, Z.; Li, M.; Lu, J.; Cheng, X. Business Innovation based on artificial intelligence and Blockchain technology. Inf. Process. Manag. 2022, 59, 102759. [Google Scholar] [CrossRef]
Dahiya, N.; Sheifali, G.; Sartajvir, S. A Review Paper on Machine Learning Applications, Advantages, and Techniques. ECS Trans. 2022, 107, 6137. [Google Scholar] [CrossRef]
Marr, B. Artificial Intelligence in Practice: How 50 Successful Companies Used AI and Machine Learning to Solve Problems; John Wiley & Sons: New York, NY, USA, 2018. [Google Scholar]
Géron, A. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2019. [Google Scholar]
Liu, X.; Yoo, C.; Xing, F.; Oh, H.; El Fakhri, G.; Kang, J.-W.; Woo, J. Deep unsupervised domain adaptation: A review of recent advances and perspectives. APSIPA Trans. Signal Inf. Process. 2022, 11, e25. [Google Scholar] [CrossRef]
Li, Y. Deep reinforcement learning: An overview. arXiv 2017, arXiv:1701.07274. [Google Scholar]
Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A comprehensive survey on transfer learning. Proc. IEEE 2020, 109, 43–76. [Google Scholar] [CrossRef]
Pouyanfar, S.; Sadiq, S.; Yan, Y.; Tian, H.; Tao, Y.; Reyes, M.P.; Shyu, M.L.; Chen, S.C.; Iyengar, S.S. A survey on deep learning: Algorithms, techniques, and applications. ACM Comput. Surv. (CSUR) 2018, 51, 1–36. [Google Scholar] [CrossRef]
Sun, X.; Liu, Y.; Liu, J. Ensemble learning for multi-source remote sensing data classification based on different feature extraction methods. IEEE Access 2018, 6, 50861–50869. [Google Scholar]
Zha, D.; Bhat, Z.P.; Lai, K.H.; Yang, F.; Jiang, Z.; Zhong, S.; Hu, X. Data-centric artificial intelligence: A survey. arXiv 2023, arXiv:2303.10158. [Google Scholar]
Ntoutsi, E.; Fafalios, P.; Gadiraju, U.; Iosifidis, V.; Nejdl, W.; Vidal, M.E.; Ruggieri, S.; Turini, F.; Papadopoulos, S.; Krasanakis, E.; et al. Bias in data-driven artificial intelligence systems—An introductory survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2020, 10, e1356. [Google Scholar] [CrossRef] [Green Version]
Jarrahi, M.H.; Ali, M.; Shion, G. The Principles of Data-Centric AI (DCAI). arXiv 2022, arXiv:2211.14611. [Google Scholar]
Zha, D.; Bhat, Z.P.; Lai, K.-H.; Yang, F.; Hu, X. Data-centric AI: Perspectives and Challenges. arXiv 2023, arXiv:2301.04819. [Google Scholar]
Mazumder, M.; Banbury, C.; Yao, X.; Karlaš, B.; Rojas, W.G.; Diamos, S.; Diamos, G.; He, L.; Kiela, D.; Jurado, D.; et al. Dataperf: Benchmarks for data-centric ai development. arXiv 2022, arXiv:2207.10062. [Google Scholar]
Miranda, L.J. Towards Data-Centric Machine Learning: A Short Review. Available online: https://ljvmiranda921.github.io/notebook/2021/07/30/data-centric-ml/ (accessed on 15 April 2023).
Alvarez-Coello, D.; Wilms, D.; Bekan, A.; Gómez, J.M. Towards a data-centric architecture in the automotive industry. Procedia Comput. Sci. 2021, 181, 658–663. [Google Scholar] [CrossRef]
Uddin, M.F.; Navarun, G. Seven V’s of Big Data understanding Big Data to extract value. In Proceedings of the 2014 Zone 1 Conference of the American Society for Engineering Education, Bridgeport, CT, USA, 3–5 April 2014. [Google Scholar]
O’Leary, D.E. Artificial intelligence and big data. IEEE Intell. Syst. 2013, 28, 96–99. [Google Scholar] [CrossRef]
Broo, D.G.; Jennifer, S. Towards data-centric decision making for smart infrastructure: Data and its challenges. IFAC-Pap. 2020, 53, 90–94. [Google Scholar] [CrossRef]
Jakubik, J.; Vössing, M.; Kühl, N.; Walk, J.; Satzger, G. Data-centric Artificial Intelligence. arXiv 2022, arXiv:2212.11854. [Google Scholar]
Li, X.-H.; Cao, C.C.; Shi, Y.; Bai, W.; Gao, H.; Qiu, L.; Wang, C.; Gao, Y.; Zhang, S.; Xue, X.; et al. A survey of data-driven and knowledge-aware explainable ai. IEEE Trans. Knowl. Data Eng. 2020, 34, 29–49. [Google Scholar] [CrossRef]
Hajian, S.; Bonchi, F.; Castillo, C. Algorithmic bias: From discrimination discovery to fairness-aware data mining. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 2125–2126. [Google Scholar]
Kanter, J.M.; Benjamin, S.; Kalyan, V. Machine Learning 2.0: Engineering Data Driven AI Products. arXiv 2018, arXiv:1807.00401. [Google Scholar]
Xu, K.; Li, Y.; Liu, C.; Liu, X.; Hao, X.; Gao, J.; Maropoulos, P.G. Maropoulos. Advanced data collection and analysis in data-driven manufacturing process. Chin. J. Mech. Eng. 2020, 33, 1–21. [Google Scholar] [CrossRef]
Maranghi, M.; Anagnostopoulos, A.; Cannistraci, I.; Chatzigiannakis, I.; Croce, F.; Di Teodoro, G.; Gentile, M.; Grani, G.; Lenzerini, M.; Leonardi, S.; et al. AI-based Data Preparation and Data Analytics in Healthcare: The Case of Diabetes. arXiv 2022, arXiv:2206.06182. [Google Scholar]
Bergen, K.J.; Johnson, P.A.; de Hoop, M.V.; Beroza, G.C. Machine learning for data-driven discovery in solid Earth geoscience. Science 2019, 363, eaau0323. [Google Scholar] [CrossRef] [PubMed]
Jöckel, L.; Michael, K. Increasing Trust in Data-Driven Model Validation: A Framework for Probabilistic Augmentation of Images and Meta-data Generation Using Application Scope Characteristics. In Computer Safety, Reliability, and Security, Proceeding of the 38th International Conference, SAFECOMP 2019, Turku, Finland, 11–13 September 2019; Springer International Publishing: New York, NY, USA, 2019. [Google Scholar]
Burr, C.; Leslie, D. Ethical assurance: A practical approach to the responsible design, development, and deployment of data-driven technologies. AI Ethics 2023, 3, 73–98. [Google Scholar] [CrossRef]
Lomas, J.; Nirmal, P.; Jodi, F. Continuous improvement: How systems design can benefit the data-driven design community. In Proceedings of the RSD7, Relating Systems Thinking and Design 7, Turin, Italy, 23–26 October 2018. [Google Scholar]
Yablonsky, S. Multidimensional data-driven artificial intelligence innovation. Technol. Innov. Manag. Rev. 2019, 9, 16–28. [Google Scholar] [CrossRef]
Batista, G.E.; Monard, M.C. Data quality in machine learning: A study in the context of imbalanced data. Neurocomputing 2018, 275, 1665–1679. [Google Scholar]
Pipino, L.L.; Lee, Y.W.; Wang, R.Y. Data quality assessment. In Data and Information Quality; Springer: Cham, Switzerland, 2018; pp. 219–253. [Google Scholar]
Halevy, A.; Korn, F.; Noy, N.; Olston, C.; Polyzotis, N.; Roy, S.; Whang, S. Goods: Organizing Google’s datasets. Commun. ACM 2020, 63, 50–57. [Google Scholar]
Redman, T.C. Data Quality for the Information Age; Artech House, Inc.: Norwood, MA, USA, 1996. [Google Scholar]
Juran, J.M.; Godfrey, A.B. Juran’s Quality Handbook: The Complete Guide to Performance Excellence; McGraw-Hill Education: New York, NY, USA, 2018. [Google Scholar]
Yang, Y.; Zheng, L.; Zhang, J.; Cui, Q.; Li, Z.; Yu, P.S. TI-CNN: Convolutional neural networks for fake news detection. arXiv 2018, arXiv:1806.00749. [Google Scholar]
Barocas, S.; Hardt, M.; Narayanan, A. Fairness and machine learning. Limit. Oppor. 2021, 1, 1–269. [Google Scholar]
Little, R.J.; Rubin, D.B. Statistical Analysis with Missing Data; John Wiley & Sons: New York, NY, USA, 2019. [Google Scholar]
Hassan, N.U.; Asghar, M.Z.; Ahmed, S.; Zafar, H. A survey on data quality issues in big data. ACM Comput. Surv. (CSUR) 2021, 54, 1–37. [Google Scholar]
Chen, H.; Chiang, R.H.; Storey, V.C. Business intelligence and analytics: From big data to big impact. MIS Q. 2012, 36, 1165–1188. [Google Scholar] [CrossRef]
Gandomi, A.; Haider, M. Beyond the hype: Big data concepts, methods, and analytics. Int. J. Inf. Manag. 2015, 35, 137–144. [Google Scholar] [CrossRef] [Green Version]
Karkouch, A.; Mousannif, H.; Al Moatassime, H.; Noel, T. Data quality in the Internet of Things: A state-of-the-art survey. J. Netw. Comput. Appl. 2018, 124, 289–310. [Google Scholar] [CrossRef]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Petersen, S. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
Daries, J.P.; Reich, J.; Waldo, J.; Young, E.M.; Whittinghill, J.; Ho, A.D.; Chuang, I. Privacy, anonymity, and big data in the social sciences. Commun. ACM 2014, 57, 56–63. [Google Scholar] [CrossRef] [Green Version]
García, S.; Luengo, J.; Herrera, F. Data Preprocessing in Data Mining; Springer: New York, NY, USA, 2016. [Google Scholar]
Kelleher, J.D.; Mac Namee, B.; D’Arcy, A. Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies; MIT Press: Boston, MA, USA, 2018. [Google Scholar]
Guyon, I.; Gunn, S.; Ben-Hur, A. Result analysis of the NIPS 2003 feature selection challenge. Adv. Neural Inf. Process. Syst. 2004, 17, 545–552. [Google Scholar]
Khatri, V.; Brown, C.V. Designing data governance. Commun. ACM 2010, 53, 148–152. [Google Scholar] [CrossRef]
Otto, B. Organizing data quality management in enterprises. In Proceedings of the 17th Americas Conference on Information Systems (AMCIS), Detroit, MI, USA, 4–8 August 2011; pp. 1–9. [Google Scholar]
Weill, P.; Ross, J.W. IT Governance: How Top Performers Manage IT Decision Rights for Superior Results; Harvard Business Press: Boston, MA, USA, 2004. [Google Scholar]
Tallon, P.P. Corporate governance of big data: Perspectives on value, risk, and cost. IEEE Comput. 2013, 46, 32–38. [Google Scholar] [CrossRef]
Panian, Z. Some practical experiences in data governance. World Acad. Sci. Eng. Technol. 2010, 66, 1248–1253. [Google Scholar]
Laney, D.B. Infonomics: How to Monetize, Manage, and Measure Information as an Asset for Competitive Advantage; Routledge: Oxfordshire, UK, 2017. [Google Scholar]
Thomas, G.; Griffin, R. Data governance: A taxonomy of data quality interventions. Int. J. Inf. Qual. 2015, 4, 4–17. [Google Scholar]
Begg, C.; Caira, T. Data governance: More than just keeping data clean. J. Enterp. Inf. Manag. 2013, 26, 595–610. [Google Scholar]
Rubin, D.B. Multiple Imputation for Nonresponse in Surveys; John Wiley & Sons: New York, NY, USA, 2004. [Google Scholar]
Candès, E.J.; Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math. 2009, 9, 717–772. [Google Scholar] [CrossRef] [Green Version]
Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2018, 66, 31–47. [Google Scholar] [CrossRef]
Hastie, T.; Tibshirani, R.; Wainwright, M. Statistical Learning with Sparsity: The Lasso and Generalizations; Chapman and Hall/CRC: New York, NY, USA, 2019. [Google Scholar]
Wong, S.C.; Gatt, A.; Stamatescu, V.; McDonnell, M.D. Understanding data augmentation for classification: When to warp? In Proceedings of the 2018 International Conference on Digital Image Computing: Techniques and Applications (DICTA), Canberra, Australia, 10–13 December 2018; pp. 1–8. [Google Scholar]
Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. Autoaugment: Learning augmentation policies from data. arXiv 2018, arXiv:1805.09501. [Google Scholar]
Yang, Y.; Loog, M.; Hospedales, T.M. Active Learning by Querying Informative and Representative Examples. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2436–2450. [Google Scholar] [CrossRef] [Green Version]
Li, Y.; Guo, Y. Adaptive Active Learning for Image Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 15–20 June 2019; pp. 7663–7671. [Google Scholar] [CrossRef]
Siddiquie, B.; Gupta, A. Human Effort Estimation for Visual Tasks. Int. J. Comput. Vis. 2019, 127, 1161–1179. [Google Scholar]
Zhang, Y.; Chen, T.; Zhang, Y. Challenges and countermeasures of big data in artificial intelligence. J. Phys. Conf. Ser. 2019, 1237, 032023. [Google Scholar] [CrossRef] [Green Version]
Zhu, Y.; Lapata, M. Learning to attend, copy, and generate for session-based query suggestion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, 31 October–4 November 2018. [Google Scholar]
Halevy, A.; Norvig, P.; Pereira, F. The unreasonable effectiveness of data. IEEE Intell. Syst. 2020, 24, 8–12. [Google Scholar] [CrossRef]
Li, Q.; Dong, S.; Ye, S. Storage challenges and solutions in the AI era. Front. Inf. Technol. Electron. Eng. 2021, 22, 743–767. [Google Scholar]
Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Process. Mag. 2020, 37, 50–60. [Google Scholar] [CrossRef]
Wu, Y.; Liu, J.; He, H.; Chen, H.; Chen, J. Data storage technology in artificial intelligence. IEEE Access 2021, 9, 37864–37881. [Google Scholar]
Hutter, F.; Kotthoff, L.; Vanschoren, J. (Eds.) Automated Machine Learning: Methods, Systems, Challenges; Springer Nature: New York, NY, USA, 2019. [Google Scholar]
Sharma, H.; Park, J.; Mahajan, D.; Amaro, E.; Kaeli, D.; Kim, Y. From high-level deep neural models to FPGAs. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Athens, Greece, 17–21 October 2020. [Google Scholar]
Chen, Y.; Wang, T.; Yang, Y.; Zhang, B. Deep model compression: Distilling knowledge from noisy teachers. arXiv 2020, arXiv:1610.09650. [Google Scholar]
Ratner, A.; Bach, S.; Ehrenberg, H.; Fries, J.; Wu, S.; Ré, C. Snorkel: Rapid training data creation with weak supervision. Proc. VLDB Endow. 2019, 11, 269–282. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Maheshwari, A.; Killamsetty, K.; Ramakrishnan, G.; Iyer, R.; Danilevsky, M.; Popa, L. Learning to Robustly Aggregate Labeling Functions for Semi-supervised Data Programming. arXiv 2021, arXiv:2109.11410. [Google Scholar]
Yang, Q.; Liu, Y.; Chen, T.; Tong, Y. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol. (TIST) 2019, 10, 1–19. [Google Scholar] [CrossRef]
Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Zhang, Y. Advances and open problems in federated learning. Found. Trends® Mach. Learn. 2021, 14, 1–210. [Google Scholar] [CrossRef]
Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; Xu, L. Edge computing: Vision and challenges. IEEE Internet Things J. 2020, 3, 637–646. [Google Scholar] [CrossRef]
Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443. [Google Scholar] [CrossRef] [Green Version]
McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, 20–22 April 2017. [Google Scholar]
Sattler, F.; Wiedemann, S.; Müller, K.R.; Samek, W. Robust and communication-efficient federated learning from non-IID data. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 3400–3413. [Google Scholar] [CrossRef] [Green Version]
Bagdasaryan, E.; Veit, A.; Hua, Y.; Estrin, D.; Shmatikov, V. How to backdoor federated learning. In Proceedings of the 2020 International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
Yurochkin, M.; Agarwal, N.; Ghosh, S.; Greenewald, K.; Hoang, L.; Khazaeni, Y. Bayesian nonparametric federated learning of neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019. [Google Scholar]
Konečný, J.; McMahan, H.B.; Ramage, D.; Richtárik, P. Federated optimization: Distributed machine learning for on-device intelligence. arXiv 2016, arXiv:1610.02527. [Google Scholar]
Truex, S.; Baracaldo, N.; Anwar, A.; Steinke, T.; Ludwig, H. A hybrid approach to privacy-preserving federated learning. In Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security (AISec), Virtual Event, 13 November 2020. [Google Scholar]
Roman, R.; Lopez, J.; Mambo, M. Mobile edge computing, Fog et al.: A survey and analysis of security threats and challenges. Future Gener. Comput. Syst. 2018, 78, 680–698. [Google Scholar] [CrossRef] [Green Version]
Wang, S.; Tuor, T.; Salonidis, T.; Leung, K.K.; Makaya, C.; He, T.; Chan, K. Adaptive deep learning model selection on embedded systems. In Proceedings of the 3rd ACM/IEEE Symposium on Edge Computing (SEC), Arlington, VA, USA, 7–9 November 2019. [Google Scholar]
Kumar, A.; Goyal, S.; Varma, M.; Jain, P. Resource-constrained distributed machine learning: A survey. ACM Comput. Surv. (CSUR) 2021, 54, 1–34. [Google Scholar]
Zhang, Z.; Mao, Y.; Letaief, K.B. Energy-efficient user association and resource allocation in heterogeneous cloud radio access networks. IEEE J. Sel. Areas Commun. 2019, 37, 1107–1121. [Google Scholar]
Zhang, H.; Wu, J.; Zhang, Z.; Yang, Q. Collaborative learning for data privacy and data utility. IEEE Trans. Knowl. Data Eng. 2021. [Google Scholar]
Liang, X.; Zhao, J.; Shetty, S.; Liu, J.; Li, D. Integrating blockchain for data sharing and collaboration in mobile healthcare applications. In Proceedings of the IEEE 28th Annual International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC), London, UK, 31 August–3 September 2020. [Google Scholar]
Chen, X.; Zhang, W.; Wang, X.; Li, T. Privacy-Preserving Federated Learning for IoT Applications: A Review. IEEE Internet Things J. 2021, 8, 6078–6093. [Google Scholar]
Zhao, Y.; Fan, L. A secure data sharing scheme for cross-border cooperation in the artificial intelligence era. Secur. Commun. Netw. 2021, 2021, 1–12. [Google Scholar]
Carlini, N.; Liu, C.; Erlingsson, U.; Kos, J.; Song, D.; Wicker, M. The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks. In Proceedings of the 28th USENIX Security Symposium, Santa Clara, CA, USA, 14–16 August 2019; pp. 267–284. Available online: https://www.usenix.org/system/files/sec19-carlini.pdf (accessed on 20 March 2023).
Jayaraman, B.; Evans, D. Evaluating Membership Inference Attacks in Machine Learning: An Information Theoretic Framework. IEEE Trans. Inf. Secur. 2020, 15, 1875–1890. [Google Scholar]
Dwork, C.; Roth, A.; Naor, M. Differential Privacy: A Survey of Results. In Theory and Applications of Models of Computation; Springer: New York, NY, USA, 2018; pp. 1–19. [Google Scholar]
Truex, S.; Xu, C.; Calandrino, J.; Boneh, D. The Limitations of Differential Privacy in Practice. In Proceedings of the 28th USENIX Security Symposium, Santa Clara, CA, USA, 14–16 August 2019; pp. 1045–1062. [Google Scholar]
Goodfellow, I.; Shlens, J.; Szegedy, C. Explaining and Harnessing Adversarial Examples. Commun. ACM 2022, 65, 56–65. [Google Scholar]
Akhtar, N.; Mian, A. Threat of Adversarial Attacks on Deep Learning in Computer Vision: A Survey. IEEE Access 2018, 6, 14410–14430. [Google Scholar] [CrossRef]
Steinhardt, J.; Koh, P.W.; Liang, P. Certified Defenses against Adversarial Examples. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
Zhu, M.; Yin, H.; Yang, X. A Comprehensive Survey of Poisoning Attacks in Federated Learning. IEEE Access 2021, 9, 57427–57447. [Google Scholar]
Sun, Y.; Zhang, T.; Wang, J.; Wang, X. A Survey of Deep Neural Network Backdoor Attacks and Defenses. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 4150–4169. [Google Scholar]
Gu, T.; Dolan-Gavitt, B.; Garg, S. BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. In Proceedings of the 28th USENIX Security Symposium, Santa Clara, CA, USA, 14–16 August 2019; pp. 1965–1980. [Google Scholar]
Liu, Y.; Ma, X.; Ateniese, G.; Hsu, W.L. Trojaning Attack on Neural Networks. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, Toronto, ON, Canada, 15–19 October 2018; pp. 27–41. [Google Scholar] [CrossRef]
Gao, Y.; Sun, X.; Zhang, Y.; Liu, J. Trojan Attacks on Federated Learning Systems: An Overview. IEEE Netw. 2021, 35, 144–150. [Google Scholar]
Tramèr, F.; Zhang, F.; Juels, A.; Reiter, M.K.; Ristenpart, T. Stealing Machine Learning Models via Prediction APIs. In USENIX Security Symposium. 2016, 16, pp. 601–618. Available online: https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_tramer.pdf (accessed on 25 March 2023).
Jagielski, M.; Severi, G.; Pousette Harger, N.; Oprea, A. Subpopulation Data Poisoning Attacks. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, 15–19 November 2021; pp. 31–3122. [Google Scholar] [CrossRef]
Liu, Y.; Chen, J.; Liu, T.; Yang, Y. Trojan Detection via Fine-Pruning. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, 9–13 November 2020; pp. 1151–1168. [Google Scholar] [CrossRef]
Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H.B.; Mironov, I.; Talwar, K.; Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; pp. 308–318. [Google Scholar] [CrossRef] [Green Version]
Shneiderman, B. Bridging the Gap Between Ethics and Practice: Guidelines for Reliable, Safe, and Trustworthy Human-centered AI Systems. ACM Trans. Interact. Intell. Syst. 2020, 10, 1–31. [Google Scholar] [CrossRef]
Bonawitz, K.; Eichner, H.; Grieskamp, W.; Huba, D.; Ingerman, A.; Ivanov, V.; Kiddon, C.; Konečný, J.; Mazzocchi, S.; McMahan, B.; et al. Towards federated learning at scale: System design. In Proceedings of the 2nd Workshop on Systems for ML at Scale; 2019; pp. 1–6. [Google Scholar]
Liu, P.; Wang, L.; Ranjan, R.; He, G.; Zhao, L. A Survey on Active Deep Learning: From Model Driven to Data Driven. ACM Comput. Surv. 2022, 54, 1–34. [Google Scholar] [CrossRef]
Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. In Proceedings of the 35th International Conference on Machine Learning, Vienna, Austria, 25–31 July 2018; pp. 297–306. [Google Scholar]
Yuan, X.; He, P.; Zhu, Q.; Li, X. Adversarial Examples: Attacks and Defenses for Deep Learning. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 2805–2824. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Onwuzurike, L.; Mariconti, E.; Andriotis, P.; Cristofaro, E.D.; Ross, G.; Stringhini, G. Mamadroid: Detecting android malware by building markov chains of behavioral models (extended version). ACM Trans. Priv. Secur. (TOPS) 2019, 22, 1–34. [Google Scholar] [CrossRef] [Green Version]
Ramirez, M.A.; Kim, S.K.; Hamadi, H.A.; Damiani, E.; Byon, Y.J.; Kim, T.Y.; Cho, C.S.; Yeun, C.Y. Poisoning attacks and defenses on artificial intelligence: A survey. arXiv 2022, arXiv:2202.10276. [Google Scholar]
Polonetsky, J.; Tene, O. GDPR and AI: Friends or Foes? IEEE Secur. Priv. 2018, 16, 26–33. [Google Scholar]
Barocas, S.; Hardt, M.; Narayanan, A. Fairness and Machine Learning. 2019. Available online: FairMLBook.org (accessed on 20 May 2023).
Dastin, J. Amazon Scraps Secret AI Recruiting Tool That Showed Bias against Women; Reuters: London, UK, 2018. [Google Scholar]
Simonite, T. When It Comes to Gorillas, Google Photos Remains Blind; Wired: San Francisco, CA, USA, 2018. [Google Scholar]
Vincent, J. Twitter Taught Microsoft’s AI Chatbot to Be a Racist in Less Than a Day; The Verge: Sant Monica, CA, USA, 2016. [Google Scholar]
Harding, S. Apple’s Credit Card Gender Bias Draws Regulatory Scrutiny; Forbes: New York, NY, USA, 2019. [Google Scholar]
Angwin, J.; Larson, J.; Mattu, S.; Kirchner, L. Machine Bias: There’s Software Used Across the Country to Predict Future Criminals. And It’s Biased against Blacks; ProPublica: New York, NY, USA, 2016. [Google Scholar]
Olteanu, A.; Castillo, C.; Diaz, F.; Kıcıman, E. Social data: Biases, methodological pitfalls, and ethical boundaries. Front. Big Data Sci. 2019, 2, 13. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Sun, T.; Gaut, A.; Tang, S.; Huang, Y.; ElSherief, M.; Zhao, J.; Wang, W.Y. Mitigating gender bias in natural language processing: Literature review. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 1630–1640. [Google Scholar]
Torralba, A.; Efros, A.A. Unbiased look at dataset bias. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Springs, CO, USA, 20–25 June 2011; pp. 1521–1528. [Google Scholar]
Zhao, Z.; Wallace, B.C.; Jang, E.; Choi, Y.; Lease, M. Combating human trafficking: A survey of AI techniques and opportunities for technology-enabled counter-trafficking. ACM Comput. Surv. 2021, 54, 1–35. [Google Scholar]
Nickerson, R.S. Confirmation bias: A ubiquitous phenomenon in many guises. Rev. Gen. Psychol. 1998, 2, 175–220. [Google Scholar] [CrossRef]
Krueger, J.I.; Funder, D.C. Towards a balanced social psychology: Causes, consequences, and cures for the problem-seeking approach to social behavior and cognition. Behav. Brain Sci. 2004, 27, 313–327. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Gupta, P.; Raghavan, H. Temporal bias in machine learning. arXiv 2021, arXiv:2104.12843. [Google Scholar]
Gutierrez, M.; Serrano-Guerrero, J. Bias-aware feature selection in machine learning. arXiv 2020, arXiv:2007.07956. [Google Scholar]
Kahneman, D. Thinking, Fast and Slow; Farrar, Straus, and Giroux: New York, NY, USA, 2011. [Google Scholar]
Lee, J.D.; See, K.A. Trust in automation: Designing for appropriate reliance. Hum. Factors 2004, 46, 50–80. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Buolamwini, J.; Gebru, T. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Proceedings of the Conference on Fairness, Accountability and Transparency, New York, NY, USA, 23–24 February 2018; pp. 77–91. [Google Scholar]
Crawford, K. Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence; Yale University Press: New Haven, CT, USA, 2021. [Google Scholar]
Kriebitz, A.; Lütge, C. Artificial intelligence and human rights: A business ethical assessment. Bus. Hum. Rights J. 2020, 5, 84–104. [Google Scholar] [CrossRef]
Pleiss, G.; Raghavan, M.; Wu, F.; Kleinberg, J.; Weinberger, K.Q. On fairness and calibration. Adv. Neural Inf. Process. Syst. 2020, 33, 2–4. [Google Scholar]
Zhao, J.; Wang, T.; Yatskar, M.; Ordonez, V.; Chang, K.W. Gender bias in contextualized word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 629–634. [Google Scholar]
Bellamy, R.K.E.; Dey, K.; Hind, M.; Hoffman, S.C.; Houde, S.; Kannan, K.; Nagar, S. AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. IBM J. Res. Dev. 2018, 63, 4. [Google Scholar]
Verma, S.; Rubin, J. Fairness definitions explained. In Proceedings of the International Workshop on Software Fairness, Gothenburg, Sweden, 29 May 2018; pp. 1–7. [Google Scholar]
Dwork, C.; Hardt, M.; Pitassi, T.; Reingold, O.; Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, Cambridge, MA, USA, 8–10 January 2012; pp. 214–226. [Google Scholar]
Arrieta, A.B.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Herrera, F. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115. [Google Scholar] [CrossRef] [Green Version]
Hao, K. This Is How AI Bias Really Happens—And Why It’s So Hard to Fix; MIT Technology Review; MIT Press: Cambridge, MA, USA, 2020. [Google Scholar]
Adadi, A.; Berrada, M. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access 2018, 6, 52138–52160. [Google Scholar] [CrossRef]
Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
Gilpin, L.H.; Bau, D.; Yuan, B.Z.; Bajwa, A.; Specter, M.; Kagal, L. Explaining explanations: An overview of interpretability of machine learning. In Proceedings of the 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 1–3 October 2018; pp. 80–89. [Google Scholar]
Bhatt, U.; Xiang, A.; Sharma, S.; Weller, A.; Taly, A.; Jia, Y.; Eckersley, P. Explainable machine learning in deployment. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, Barcelona, Spain, 27–30 January 2020; pp. 648–657. [Google Scholar]
Wachter, S.; Mittelstadt, B.; Russell, C. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. J. Law Technol. 2017, 31, 841–887. [Google Scholar] [CrossRef] [Green Version]
Jobin, A.; Ienca, M.; Vayena, E. The global landscape of AI ethics guidelines. Nat. Mach. Intell. 2019, 1, 389–399. [Google Scholar] [CrossRef] [Green Version]
Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; Pedreschi, D. A survey of methods for explaining black box models. ACM Comput. Surv. (CSUR) 2018, 51, 1–42. [Google Scholar] [CrossRef] [Green Version]
Rudin, C. Stop explaining black-box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774. [Google Scholar]
Maaten, L.V.D.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Carter, S.; Armstrong, Z.; Schönberger, L.; Olah, C. Activation atlases: Unsupervised exploration of high-dimensional model internals. Distill 2019, 4, e00020. [Google Scholar]
Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv 2017, arXiv:1702.08608. [Google Scholar]
Holzinger, A.; Langs, G.; Denk, H.; Zatloukal, K.; Müller, H. Causability and explainability of artificial intelligence in medicine. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2019, 9, e1312. [Google Scholar] [CrossRef] [Green Version]
Mittelstadt, B.; Russell, C.; Wachter, S. Explaining explanations in AI. In Proceedings of the Conference on Fairness, Accountability, and Transparency, Atlanta, GA, USA, 29–31 January 2019; pp. 279–288. [Google Scholar]
Vinuesa, R.; Azizpour, H.; Leite, I.; Balaam, M.; Dignum, V.; Domisch, S.; Langhans, S.D. The role of artificial intelligence in achieving the Sustainable Development Goals. Nat. Commun. 2020, 11, 233. [Google Scholar] [CrossRef] [Green Version]
Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Agarwal, S. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
Knight, W. The Future of AI Depends on a Huge Workforce of Human Teachers; Wired: San Francisco, CA, USA, 2021. [Google Scholar]
Kaplan, A.; Haenlein, M. Siri, Siri, in my hand: Who’s the fairest in the land? On the interpretations, illustrations, and implications of artificial intelligence. Bus. Horiz. 2019, 62, 15–25. [Google Scholar] [CrossRef]
Wang, S.; Fisch, A.; Oh, J.; Liang, P. Data Programming for Learning with Noisy Labels. Adv. Neural Inf. Process. Syst. 2020, 33, 14883–14894. [Google Scholar]
Crawford, K.; Calo, R. There is a blind spot in AI research. Nature 2021, 538, 311–313. [Google Scholar] [CrossRef] [Green Version]
McDermid, J.A.; Jia, Y.; Porter, Z.; Habli, I. Artificial intelligence explainability: The technical and ethical dimensions. Philos. Trans. R. Soc. A 2021, 379, 20200363. [Google Scholar] [CrossRef] [PubMed]
Whittlestone, J.; Nyrup, R.; Alexandrova, A.; Dihal, K.; Cave, S. Ethical and Societal Implications of Algorithms, Data, and Artificial Intelligence: A Roadmap for Research; Nuffield Foundation: London, UK, 2019. [Google Scholar]
Bughin, J.; Hazan, E.; Ramaswamy, S.; Chui, M.; Allas, T.; Dahlström, P.; Trench, M. Skill Shift: Automation and the Future of the Workforce; McKinsey Global Institute: Washington, DC, USA, 2018. [Google Scholar]
World Economic Forum. Jobs of Tomorrow: Mapping Opportunity in the New Economy. 2021. Available online: http://www3.weforum.org/docs/WEF_Jobs_of_Tomorrow_2020.pdf (accessed on 12 February 2023).
Bessen, J.E.; Impink, S.M.; Reichensperger, L.; Seamans, R. The Business of AI Startups; NBER Working Paper No. 24255; Boston University School of Law: Boston, MA, USA, 2019. [Google Scholar]
Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell. 2019, 267, 1–38. [Google Scholar] [CrossRef]
Xu, H.; Gu, L.; Choi, E.; Zhang, Y. Secure and privacy-preserving machine learning: A survey. Front. Comput. Sci. 2021, 15, 1–38. [Google Scholar]
Yang, G.Z.; Bellingham, J.; Dupont, P.E.; Fischer, P.; Floridi, L.; Full, R.; Wood, R. The grand challenges of Science Robotics. Sci. Robot. 2020, 3, 7650. [Google Scholar] [CrossRef] [PubMed]
Holzinger, A.; Keiblinger, K.; Holub, P.; Zatloukal, K.; Müller, H. AI for life: Trends in artificial intelligence for biotechnology. New Biotechnol. 2023, 74, 16–24. [Google Scholar] [CrossRef]
Jawad, K.; Mahto, R.; Das, A.; Ahmed, S.U.; Aziz, R.M.; Kumar, P. Novel Cuckoo Search-Based Metaheuristic Approach for Deep Learning Prediction of Depression. Appl. Sci. 2023, 13, 5322. [Google Scholar] [CrossRef]
Yaqoob, A.; Aziz, R.M.; Verma, N.K.; Lalwani, P.; Makrariya, A.; Kumar, P. A review on nature-inspired algorithms for cancer disease prediction and classification. Mathematics 2023, 11, 1081. [Google Scholar] [CrossRef]
Aziz, R.M.; Mahto, R.; Goel, K.; Das, A.; Kumar, P.; Saxena, A. Modified Genetic Algorithm with Deep Learning for Fraud Transactions of Ethereum Smart Contract. Appl. Sci. 2023, 13, 697. [Google Scholar] [CrossRef]
Aziz, R.M.; Desai, N.P.; Baluch, M.F. Computer vision model with novel cuckoo search based deep learning approach for classification of fish image. Multimed. Tools Appl. 2023, 82, 3677–3696. [Google Scholar] [CrossRef]
Aziz, R.M.; Baluch, M.F.; Patel, S.; Kumar, P. A Machine Learning based Approach to Detect the Ethereum Fraud Transactions with Limited Attributes. Karbala Int. J. Mod. Sci. 2022, 8, 13. [Google Scholar] [CrossRef]
Thayyib, P.V.; Mamilla, R.; Khan, M.; Fatima, H.; Asim, M.; Anwar, I.; Shamsudheen, M.K.; Khan, M.A. State-of-the-Art of Artificial Intelligence and Big Data Analytics Reviews in Five Different Domains: A Bibliometric Summary. Sustainability 2023, 15, 4026. [Google Scholar] [CrossRef]
Saghiri, A.M.; Vahidipour, S.M.; Jabbarpour, M.R.; Sookhak, M.; Forestiero, A. A Survey of Artificial Intelligence Challenges: Analyzing the Definitions, Relationships, and Evolutions. Appl. Sci. 2022, 12, 4054. [Google Scholar] [CrossRef]
Serey, J.; Quezada, L.; Alfaro, M.; Fuertes, G.; Vargas, M.; Ternero, R.; Sabattin, J.; Duran, C.; Gutierrez, S. Artificial Intelligence Methodologies for Data Management. Symmetry 2021, 13, 2040. [Google Scholar] [CrossRef]

Figure 1. AI Technology Landscape.

Figure 2. Machine Learning Approaches.

Figure 3. Data-Driven AI Process.

Figure 4. Dimensions of Data Challenges for AI.

Figure 5. Challenging elements of data quality.

Figure 6. Data quality measures.

Figure 7. Proposed solution to data quality challenges.

Figure 8. Challenging elements of data volume.

Figure 9. Proposed solutions for challenges of data volume.

Figure 10. Challenges’ elements of data privacy and security.

Figure 11. Mitigation strategies for data privacy and security.

Figure 12. Types of data bias.

Figure 13. Proposed solutions to the data bias and fairness challenge.

Figure 14. Challenging elements of interpretability and explainability.

Table 1. References for challenges of data in AI.

Challenge		Reference
Data Quality	Introduction	[37,38,39]
	Quality Measures	[37,38,39,40,41,42,43,44,45,46,47]
	Collection and Management	[41,46,49,50,51,52,53]
	Data Governance	[54,55,56,57,58,59,60,61,62,63]
	Proposed solutions	[54,55,64,65,66,67,68,69,70]
Data Volume	Introduction	[71]
	Data Deluge	[72,73]
	Storage Challenges	[74,75,76]
	Processing Challenges	[77,78,79]
	Data Management Challenges	[80,81]
	Data Privacy and Security	[82]
	Proposed Solutions	[75,76,79,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97]
Data Privacy and Security	Data Privacy	[98,99,100,101,102,103]
	Data Security	[104,105,106,107,108,109,110,111,112,113,114,115,116]
	Mitigation Strategies	[117,118,119,120,121,122,123]
Bias and Fairness	Introduction	[124,125,126,127,128,129]
	Types of Data Bias	[130,131,132,133,134,135,136,137,138,139]
	Consequences of Data Bias and Unfairness	[140,141,142]
	Proposed Solutions	[143,144,145,146,147,148,149]
Interpretability and Explainability	Introduction	[150,151]
	The Necessity	[152,153,154,155]
	Current Techniques	[156,157,158,159,160,161,162]
	Remaining Challenges and Future Directions	[150,157,162,163,164]
Technical Expertise	Introduction	[165,166,167]
	Scarcity of Skilled Professionals	[168,169]
	Ethical Concerns	[170,171,172]
	The Growing Demand for AI-Related Expertise	[169,173,174,175]
	Key Disciplines in High Demand for AI-Related Expertise	[170,173,179,180,181,182,183]
	Collaboration between Humans and AI	[176,178,184,185,186,187]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Aldoseri, A.; Al-Khalifa, K.N.; Hamouda, A.M. Re-Thinking Data Strategy and Integration for Artificial Intelligence: Concepts, Opportunities, and Challenges. Appl. Sci. 2023, 13, 7082. https://doi.org/10.3390/app13127082

AMA Style

Aldoseri A, Al-Khalifa KN, Hamouda AM. Re-Thinking Data Strategy and Integration for Artificial Intelligence: Concepts, Opportunities, and Challenges. Applied Sciences. 2023; 13(12):7082. https://doi.org/10.3390/app13127082

Chicago/Turabian Style

Aldoseri, Abdulaziz, Khalifa N. Al-Khalifa, and Abdel Magid Hamouda. 2023. "Re-Thinking Data Strategy and Integration for Artificial Intelligence: Concepts, Opportunities, and Challenges" Applied Sciences 13, no. 12: 7082. https://doi.org/10.3390/app13127082

APA Style

Aldoseri, A., Al-Khalifa, K. N., & Hamouda, A. M. (2023). Re-Thinking Data Strategy and Integration for Artificial Intelligence: Concepts, Opportunities, and Challenges. Applied Sciences, 13(12), 7082. https://doi.org/10.3390/app13127082

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Re-Thinking Data Strategy and Integration for Artificial Intelligence: Concepts, Opportunities, and Challenges

Abstract

1. Introduction

2. Materials and Methods

2.1. Data for AI

2.1.1. Data Learning Approaches

2.1.2. Data-Centric and Data-Driven AI

2.2. Dimensions of Data Challenges for AI

2.2.1. Dimension I: Data Quality

Challenging Measures of Data Quality and Implications on AI Systems

Challenges in Data Collection, Pre-Processing, and Management

The Role of Data Governance in Ensuring Data Quality

Addressing Data Quality Challenges: Techniques and Strategies

2.2.2. Dimension II: Data Volume

Challenging Elements of Data Volume

Mitigation Solutions for Challenges of Data Volume

2.2.3. Dimension III: Data Privacy and Security

Data Privacy Challenges in AI

Data Security Challenges in AI

Mitigation Strategies for Data Privacy and Security Challenges

2.2.4. Dimension IV: Bias and Fairness

Types of Data Bias

Consequences of Data Bias and Unfairness in AI

Addressing Data Bias and Fairness in AI

2.2.5. Dimension V: Interpretability and Explainability

The Necessity of Interpretability and Explainability

Current Techniques for Interpretability and Explainability

Remaining Challenges and Future Directions

2.2.6. Dimension VI: Technical Expertise

Scarcity of Skilled Professionals

Ethical Concerns

The Growing Demand for AI-Related Expertise

Key Disciplines in High Demand for AI-Related Expertise

Collaboration between Humans and AI

3. Results

3.1. Dimension of Data Quality and Implication for AI systems

3.2. The Role of Data Governance in Ensuring Data Quality

3.3. Best Practices to Ensure Data Quality for AI

4. Discussion

4.1. Broader Implications

4.2. Limitations

4.3. Real-Time Time Environment Challenges

4.4. Future Research Directions

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI