Big Data and Cognitive Computing doi: 10.3390/bdcc8030033
Authors: Sepideh Molaei Stefano Cirillo Giandomenico Solimando
MicroRNAs (miRNAs) play a crucial role in cancer development, but not all miRNAs are equally significant in cancer detection. Traditional methods face challenges in effectively identifying cancer-associated miRNAs due to data complexity and volume. This study introduces a novel, feature-based technique for detecting attributes related to cancer-affecting microRNAs. It aims to enhance cancer diagnosis accuracy by identifying the most relevant miRNAs for various cancer types using a hybrid approach. In particular, we used a combination of particle swarm optimization (PSO) and artificial neural networks (ANNs) for this purpose. PSO was employed for feature selection, focusing on identifying the most informative miRNAs, while ANNs were used for recognizing patterns within the miRNA data. This hybrid method aims to overcome limitations in traditional miRNA analysis by reducing data redundancy and focusing on key genetic markers. The application of this method showed a significant improvement in the detection accuracy for various cancers, including breast and lung cancer and melanoma. Our approach demonstrated a higher precision in identifying relevant miRNAs compared to existing methods, as evidenced by the analysis of different datasets. The study concludes that the integration of PSO and ANNs provides a more efficient, cost-effective, and accurate method for cancer detection via miRNA analysis. This method can serve as a supplementary tool for cancer diagnosis and potentially aid in developing personalized cancer treatments.
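The PSO-for-feature-selection step can be sketched as a binary PSO over miRNA masks. This is a minimal illustration under assumptions, not the authors' implementation: the toy `fitness` function stands in for the ANN's validation accuracy on the selected miRNAs, and the constants (inertia 0.7, acceleration coefficients 1.5) are generic defaults.

```python
import math
import random

random.seed(0)

def fitness(mask):
    """Toy objective standing in for ANN validation accuracy: here,
    features 0 and 1 are (by construction) the informative miRNAs,
    and selecting extra features incurs a small penalty."""
    informative = sum(1 for i, keep in enumerate(mask) if keep and i < 2)
    return informative - 0.1 * sum(mask)

def binary_pso(n_features, n_particles=10, n_iter=30):
    # Each particle is a binary feature mask with a real-valued velocity.
    swarm = [[random.randint(0, 1) for _ in range(n_features)]
             for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in swarm]
    gbest = max(pbest, key=fitness)[:]
    for _ in range(n_iter):
        for i, p in enumerate(swarm):
            for d in range(n_features):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - p[d])
                             + 1.5 * r2 * (gbest[d] - p[d]))
                # Sigmoid transfer turns the velocity into a bit probability.
                p[d] = 1 if random.random() < 1 / (1 + math.exp(-vel[i][d])) else 0
            if fitness(p) > fitness(pbest[i]):
                pbest[i] = p[:]
            if fitness(p) > fitness(gbest):
                gbest = p[:]
    return gbest

best_mask = binary_pso(n_features=8)
print(best_mask)  # binary mask over the 8 candidate features
```

In the paper's setting, evaluating `fitness` would mean training and validating the ANN on the subset of miRNA expression features the mask selects.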
Big Data and Cognitive Computing doi: 10.3390/bdcc8030032
Authors: Hamed Alshammari Ahmed El-Sayed Khaled Elleithy
The effectiveness of existing AI detectors is notably hampered when processing Arabic texts. This study introduces a novel AI text classifier designed specifically for Arabic, tackling the distinct challenges inherent in processing this language. A particular focus is placed on accurately recognizing human-written texts (HWTs), an area where existing AI detectors have demonstrated significant limitations. To achieve this goal, this paper utilized and fine-tuned two Transformer-based models, AraELECTRA and XLM-R, by training them on two distinct datasets: a large dataset comprising 43,958 examples and a custom dataset with 3078 examples that contain HWT and AI-generated texts (AIGTs) from various sources, including ChatGPT 3.5, ChatGPT-4, and BARD. The proposed architecture is adaptable to any language, but this work evaluates these models’ efficiency in recognizing HWTs versus AIGTs in Arabic as an example of Semitic languages. The performance of the proposed models has been compared against the two prominent existing AI detectors, GPTZero and OpenAI Text Classifier, particularly on the AIRABIC benchmark dataset. The results reveal that the proposed classifiers outperform both GPTZero and OpenAI Text Classifier with 81% accuracy compared to 63% and 50% for GPTZero and OpenAI Text Classifier, respectively. Furthermore, integrating a Dediacritization Layer prior to the classification model demonstrated a significant enhancement in the detection accuracy of both HWTs and AIGTs. This Dediacritization step markedly improved the classification accuracy, elevating it from 81% to as high as 99% and, in some instances, even achieving 100%.
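A dediacritization step of the kind described can be sketched with the Python standard library: Arabic short-vowel marks (harakat) are Unicode combining characters, so decomposing and dropping combining marks removes them. Whether the paper's Dediacritization Layer is implemented this way is an assumption.

```python
import unicodedata

def dediacritize(text: str) -> str:
    """Remove Arabic diacritical marks (harakat) by dropping
    combining characters after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch))

# "كَتَبَ" ("he wrote", with fatha marks) becomes "كتب" (marks removed)
print(dediacritize("كَتَبَ"))
```

Normalizing diacritics away before classification means a diacritized HWT and its undiacritized AI-generated counterpart are compared on the same footing, which is consistent with the accuracy gain the abstract reports.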
Big Data and Cognitive Computing doi: 10.3390/bdcc8030031
Authors: Dhiaa Musleh Ali Alkhwaja Ibrahim Alkhwaja Mohammed Alghamdi Hussam Abahussain Mohammed Albugami Faisal Alfawaz Said El-Ashker Mohammed Al-Hariri
Obesity is increasingly becoming a prevalent health concern among adolescents, leading to significant risks like cardiometabolic diseases (CMDs). The early discovery and diagnosis of CMD is essential for better outcomes. This study aims to build a reliable artificial intelligence model that can predict CMD using various machine learning techniques. Support vector machines (SVMs), K-Nearest Neighbor (KNN), Logistic Regression (LR), Random Forest (RF), and Gradient Boosting are the five robust classifiers compared in this study. A novel “risk level” feature, derived through fuzzy logic applied to the Conicity Index and not previously used for this purpose, is introduced to enhance the interpretability and discriminatory properties of the proposed models. As Conicity Index scores indicate CMD risk differently for males and females, two separate models are developed, one for each gender. The performance of the proposed models is assessed using two datasets obtained from 295 records of undergraduate students in Saudi Arabia. The dataset comprises 121 male and 174 female students with diverse risk levels. Notably, Logistic Regression emerges as the top performer among males, achieving an accuracy score of 91%, while Gradient Boosting lags with a score of 72%. Among females, both Support Vector Machine and Logistic Regression lead with an accuracy score of 87%, while Random Forest performs least optimally with a score of 80%.
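The fuzzy-logic derivation of the "risk level" feature can be sketched with triangular membership functions over the Conicity Index. The breakpoints below are illustrative assumptions only; the paper's actual membership functions and thresholds are not given in the abstract.

```python
def triangular(x, a, b, c):
    """Triangular membership: rises from a to a peak at b, falls to c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def risk_level(conicity_index):
    """Fuzzify a Conicity Index value into a crisp risk label.
    The breakpoints (0.90 .. 1.45) are hypothetical, for illustration."""
    memberships = {
        "low":    triangular(conicity_index, 0.90, 1.00, 1.18),
        "medium": triangular(conicity_index, 1.00, 1.18, 1.30),
        "high":   triangular(conicity_index, 1.18, 1.30, 1.45),
    }
    # Defuzzify by taking the label with the largest membership degree.
    return max(memberships, key=memberships.get)

print(risk_level(1.05))  # "low"
print(risk_level(1.35))  # "high"
```

The resulting label is then appended as an extra input feature, which is how a fuzzy-derived feature can improve both interpretability and class separation for the downstream classifiers.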
Big Data and Cognitive Computing doi: 10.3390/bdcc8030030
Authors: Keundug Park Heung-Youl Youm
The volume of the asset investment and trading market can be expanded through the issuance and management of blockchain-based security tokens that logically divide the value of assets and guarantee ownership. This paper proposes a service model to solve a problem with the existing investment service model, identifies security threats to the service model, and specifies security requirements countering the identified security threats for privacy protection and anti-money laundering (AML) involving security tokens. The identified security threats and specified security requirements should be taken into consideration when implementing the proposed service model. The proposed service model allows users to invest in tokenized tangible and intangible assets and trade in blockchain-based security tokens. This paper discusses considerations to prevent excessive regulation and market monopoly in the issuance of and trading in security tokens when implementing the proposed service model and concludes with future works.
Big Data and Cognitive Computing doi: 10.3390/bdcc8030029
Authors: Wei-Ling Hsu Yi-Jheng Chang Lin Mou Juan-Wen Huang Hsin-Lung Liu
Historic urban areas are the foundations of urban development. Due to rapid urbanization, the sustainable development of historic urban areas has become challenging for many cities. Elements of tourism and tourism service facilities play an important role in the sustainable development of historic areas. This study analyzed policies related to tourism in Panguifang and Meixian districts in Meizhou, Guangdong, China. Kernel density estimation was used to study the clustering characteristics of tourism elements through point of interest (POI) data, while space syntax was used to study the accessibility of roads. In addition, the Pearson correlation coefficient and regression were used to analyze the correlation between the elements and accessibility. The results show the following: (1) the overall number of tourism elements was high on the western side of the districts and low on the eastern one, and the elements were predominantly distributed along the main transportation arteries; (2) according to the integration degree and depth value, the western side was easier to access than the eastern one; and (3) the depth value of the area negatively correlated with kernel density, while the degree of integration positively correlated with it. Based on the results, the study put forward measures for optimizing the elements of tourism in Meizhou’s historic urban area to improve cultural tourism and emphasize the importance of the elements.
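The kernel density estimation step over POI data can be sketched as follows: the density at any map point is the sum of Gaussian kernels centered at each POI. This is a generic KDE sketch with made-up coordinates, not the study's actual GIS workflow or bandwidth choice.

```python
import math

def kde_density(point, pois, bandwidth=1.0):
    """2D Gaussian kernel density estimate at `point`, given POI
    coordinates: nearby POIs contribute most, distant ones decay fast."""
    px, py = point
    total = 0.0
    for x, y in pois:
        d2 = (px - x) ** 2 + (py - y) ** 2
        total += math.exp(-d2 / (2 * bandwidth ** 2))
    # Normalize so the surface integrates to 1 over the plane.
    return total / (2 * math.pi * bandwidth ** 2 * len(pois))

# Hypothetical tourism POIs clustered in the "west" (near x=0),
# with a lone facility far to the "east" (x=10).
pois = [(0, 0), (0.5, 0), (0, 0.5), (0.3, 0.3), (10, 0)]
west = kde_density((0.2, 0.2), pois)
east = kde_density((9.0, 0.0), pois)
print(west > east)  # True: density peaks over the western cluster
```

High-density cells of such a surface are what the study reads as clusters of tourism elements along the main transportation arteries.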
Big Data and Cognitive Computing doi: 10.3390/bdcc8030028
Authors: Niwan Wattanakitrungroj Pimchanok Wijitkajee Saichon Jaiyen Sunisa Sathapornvajana Sasiporn Tongman
For the financial health of lenders and institutions, credit risk assessment, i.e., correctly deciding whether or not a borrower will fail to repay a loan, is crucial. It not only helps in the approval or denial of loan applications but also aids in managing the non-performing loan (NPL) trend. In this study, a dataset provided by the LendingClub company based in San Francisco, CA, USA, covering 2007 to 2020 and consisting of 2,925,492 records with 141 attributes, was used for experimentation. The loan status was categorized as “Good” or “Risk”. To yield highly effective credit risk predictions, experiments were performed using three widely adopted supervised machine learning techniques: logistic regression, random forest, and gradient boosting. In addition, to address the imbalanced data problem, three sampling algorithms, namely under-sampling, over-sampling, and combined sampling, were employed. The results show that the gradient boosting technique achieves nearly perfect Accuracy, Precision, Recall, and F1-score values, all above 99.92%, with MCC values above 99.77%. All three imbalanced-data handling approaches enhance the performance of models trained with the three algorithms. Moreover, reducing the number of features based on mutual information revealed only slightly decreased performance for 50 features, with Accuracy values above 99.86%. For 25 features, the smallest size tested, the random forest model yielded 99.15% Accuracy. Both sampling strategies and feature selection help to improve the supervised models for accurately predicting credit risk, which may be beneficial in the lending business.
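Of the three sampling strategies, under-sampling is the simplest to illustrate: majority-class records are randomly discarded until the classes are balanced. This is a generic sketch (with toy data), not the study's exact procedure or implementation.

```python
import random

def random_undersample(records, labels, seed=42):
    """Randomly drop majority-class records until every class has as
    many records as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for rec, lab in zip(records, labels):
        by_class.setdefault(lab, []).append(rec)
    n_min = min(len(v) for v in by_class.values())
    out = []
    for lab, recs in by_class.items():
        for rec in rng.sample(recs, n_min):  # keep n_min per class
            out.append((rec, lab))
    rng.shuffle(out)
    return out

# Toy loan book: 90 "Good" loans vs 10 "Risk" loans (9:1 imbalance)
loans = list(range(100))
status = ["Good"] * 90 + ["Risk"] * 10
balanced = random_undersample(loans, status)
print(len(balanced))  # 20 records, 10 per class
```

Over-sampling duplicates (or synthesizes) minority records instead of dropping majority ones, and combined sampling does both; all three reshape the class ratio the classifiers see during training.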
Big Data and Cognitive Computing doi: 10.3390/bdcc8030027
Authors: Andreas F. Gkontzis Sotiris Kotsiantis Georgios Feretzakis Vassilios S. Verykios
In an epoch characterized by the swift pace of digitalization and urbanization, the essence of community well-being hinges on the efficacy of urban management. As cities burgeon and transform, the need for astute strategies to navigate the complexities of urban life becomes increasingly paramount. This study employs time series analysis to scrutinize citizen interactions with the coordinate-based problem mapping platform in the Municipality of Patras in Greece. The research explores the temporal dynamics of reported urban issues, with a specific focus on identifying recurring patterns through the lens of seasonality. The analysis, employing the seasonal decomposition technique, dissects time series data to expose trends in reported issues and areas of the city that might be obscured in raw big data. It accentuates a distinct seasonal pattern, with concentrations peaking during the summer months. The study extends its approach to forecasting, providing insights into the anticipated evolution of urban issues over time. Projections for the coming years show a consistent upward trend in both overall city issues and those reported in specific areas, with distinct seasonal variations. This comprehensive exploration of time series analysis and seasonality provides valuable insights for city stakeholders, enabling informed decision-making and predictions regarding future urban challenges.
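The seasonal decomposition technique the study applies can be sketched in its classical additive form: estimate the trend with a centered moving average, then average the detrended values at each position in the yearly cycle to get the seasonal component. This simplified sketch uses an odd centered window rather than the textbook 2×12 moving average, and synthetic monthly counts with a July peak; in practice a library routine (e.g., statsmodels' `seasonal_decompose`) would likely be used.

```python
def seasonal_decompose_additive(series, period):
    """Classical additive decomposition: trend via a centered moving
    average, seasonal via per-position means of the detrended series."""
    n = len(series)
    half = period // 2
    trend = [None] * n                     # undefined at the edges
    for i in range(half, n - half):
        window = series[i - half:i + half + 1]
        trend[i] = sum(window) / len(window)
    buckets = [[] for _ in range(period)]  # one bucket per month-of-year
    for i, t in enumerate(trend):
        if t is not None:
            buckets[i % period].append(series[i] - t)
    seasonal = [sum(b) / len(b) if b else 0.0 for b in buckets]
    return trend, seasonal

# Synthetic monthly issue counts: a base level of 100 reports,
# plus a summer bump of 30 every July (month index 6), for 4 years.
series = []
for year in range(4):
    for month in range(12):
        series.append(100 + (30 if month == 6 else 0))
trend, seasonal = seasonal_decompose_additive(series, period=12)
peak_month = max(range(12), key=lambda m: seasonal[m])
print(peak_month)  # 6, i.e., the July peak the decomposition recovers
```

The same decomposition applied to the Patras reports is what exposes the summer concentration of urban issues the abstract describes, and the deseasonalized trend is the natural input for the forecasting step.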
Big Data and Cognitive Computing doi: 10.3390/bdcc8030026
Authors: Igor Calzada
This article investigates the intricate dynamics of data monopolies, referred to as “data-opolies”, and their implications for democratic erosion. Data-opolies, typically embodied by large technology corporations, accumulate extensive datasets, affording them significant influence. The sustainability of such data practices is critically examined within the context of decentralized Web3 technologies amidst Artificial Intelligence (AI) disruption. Additionally, the article explores emancipatory datafication strategies to counterbalance the dominance of data-opolies. It presents an in-depth analysis of two emergent phenomena within the decentralized Web3 emerging landscape: People-Centered Smart Cities and Datafied Network States. The article investigates a paradigm shift in data governance and advocates for joint efforts to establish equitable data ecosystems, with an emphasis on prioritizing data sovereignty and achieving digital self-governance. It elucidates the remarkable roles of (i) blockchain, (ii) decentralized autonomous organizations (DAOs), and (iii) data cooperatives in empowering citizens to have control over their personal data. In conclusion, the article introduces a forward-looking examination of Web3 decentralized technologies, outlining a timely path toward a more transparent, inclusive, and emancipatory data-driven democracy. This approach challenges the prevailing dominance of data-opolies and offers a framework for regenerating datafied democracies through decentralized and emerging Web3 technologies.
Big Data and Cognitive Computing doi: 10.3390/bdcc8030025
Authors: Alvaro A. Teran-Quezada Victor Lopez-Cabrera Jose Carlos Rangel Javier E. Sanchez-Galan
Convolutional neural networks (CNNs) have provided great advances for the task of sign language recognition (SLR). However, recurrent neural networks (RNNs) in the form of long short-term memory (LSTM) have become a means for providing solutions to problems involving sequential data. This research proposes the development of a sign language translation system that converts Panamanian Sign Language (PSL) signs into Spanish text using an LSTM model that, among other things, makes it possible to work with non-static signs (as sequential data). The deep learning model presented focuses on action detection, in this case, the execution of the signs. This involves precisely processing the frames in which a sign language gesture is made. The proposal is a holistic solution that considers, in addition to tracking the signer's hands, facial and pose determinants. These were added because, when communicating through sign languages, visual characteristics beyond hand gestures also matter. For the training of this system, a dataset of 330 videos (of 30 frames each) covering five classes (the different signs considered) was created. In testing, the model achieved an accuracy of 98.8%, making this a valuable base system for effective communication between PSL users and Spanish speakers. In conclusion, this work improves the state of the art for PSL–Spanish translation by demonstrating that non-static signs can be translated via deep learning.
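The data preparation implied by "330 videos of 30 frames each" can be sketched as grouping per-frame keypoint vectors (hands, face, pose) into fixed-length sequences, one per sign execution, as LSTM inputs. The window length and the flattened-keypoint representation below are assumptions for illustration.

```python
def make_windows(frames, window=30):
    """Group per-frame keypoint vectors into fixed-length sequences,
    one per sign execution, suitable as LSTM training samples."""
    return [frames[i:i + window]
            for i in range(0, len(frames) - window + 1, window)]

# 90 dummy frames, each a flattened [hands + face + pose] keypoint vector
frames = [[0.0] * 8 for _ in range(90)]
windows = make_windows(frames)
print(len(windows), len(windows[0]))  # 3 sequences of 30 frames each
```

Each such (30 × features) sequence is then labeled with one of the five sign classes and fed to the recurrent model, which is what lets non-static signs be treated as sequential data.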
Big Data and Cognitive Computing doi: 10.3390/bdcc8030024
Authors: Maxim Kolomeets Olga Tushkanova Vasily Desnitsky Lidia Vitkova Andrey Chechulin
This paper aims to test the hypothesis that the quality of social media bot detection systems based on supervised machine learning may not be as accurate as researchers claim, given that bots have become increasingly sophisticated, making it difficult for human annotators to detect them better than random selection. As a result, obtaining a ground-truth dataset with human annotation is not possible, which leads to supervised machine-learning models inheriting annotation errors. To test this hypothesis, we conducted an experiment where humans were tasked with recognizing malicious bots on the VKontakte social network. We then compared the “human” answers with the “ground-truth” bot labels (‘a bot’/‘not a bot’). Based on the experiment, we evaluated the bot detection efficiency of annotators in three scenarios typical for cybersecurity but differing in their detection difficulty as follows: (1) detection among random accounts, (2) detection among accounts of a social network ‘community’, and (3) detection among verified accounts. The study showed that humans could only detect simple bots in all three scenarios but could not detect more sophisticated ones (p-value = 0.05). The study also evaluates the limits of hypothetical and existing bot detection systems that leverage non-expert-labelled datasets as follows: the balanced accuracy of such systems can drop to 0.5 and lower, depending on bot complexity and detection scenario. The paper also describes the experiment design, collected datasets, statistical evaluation, and machine learning accuracy measures applied to support the results. In the discussion, we raise the question of using human labelling in bot detection systems and its potential cybersecurity issues. We also provide open access to the datasets used, experiment results, and software code for evaluating statistical and machine learning accuracy metrics used in this paper on GitHub.
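The balanced accuracy measure the study relies on can be made concrete with a small sketch: it is the mean of per-class recalls, so an annotator who labels everything "human" scores well on plain accuracy in an imbalanced sample but exactly 0.5 (chance level) on balanced accuracy. The 90/10 split below is illustrative, not the paper's data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of recall over each class: robust to class imbalance."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical sample: 90 genuine accounts, 10 bots. An annotator who
# calls everything "human" looks 90% accurate but is no better than chance:
y_true = ["human"] * 90 + ["bot"] * 10
y_pred = ["human"] * 100
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / 100
print(plain_accuracy, balanced_accuracy(y_true, y_pred))  # 0.9 0.5
```

This is why the paper can report systems trained on non-expert labels dropping to a balanced accuracy of 0.5 or lower: inheriting annotators' misses on sophisticated bots collapses the bot-class recall even when overall accuracy looks acceptable.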
Big Data and Cognitive Computing doi: 10.3390/bdcc8030023
Authors: Abdullah F. Al-Aboosi Aldo Jonathan Muñoz Vazquez Fadhil Y. Al-Aboosi Mahmoud El-Halwagi Wei Zhan
Accurate prediction of renewable energy output is essential for integrating sustainable energy sources into the grid, facilitating a transition towards a more resilient energy infrastructure. Novel applications of machine learning and artificial intelligence are being leveraged to enhance forecasting methodologies, enabling more accurate predictions and optimized decision-making capabilities. Integrating these novel paradigms improves forecasting accuracy, fostering a more efficient and reliable energy grid. These advancements allow better demand management, optimize resource allocation, and improve robustness to potential disruptions. Solar irradiance and wind speed data are typically recorded by sensor-equipped instruments, which may encounter intermittent or permanent faults. Hence, this paper proposes a novel Fourier network regression model to process solar irradiance and wind speed data. The proposed approach enables accurate prediction of the underlying smooth components, facilitating effective reconstruction of missing data and enhancing the overall forecasting performance. The present study focuses on Midland, Texas, as a case study to assess direct normal irradiance (DNI), diffuse horizontal irradiance (DHI), and wind speed. Remarkably, the model exhibits a correlation of 1 with a minimal RMSE (root mean square error) of 0.0007555. This study leverages Fourier analysis for renewable energy applications, with the aim of establishing a methodology that can be applied to new geographic contexts.
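The core idea of fitting a smooth periodic component to irradiance data can be sketched with an ordinary least-squares fit of a first-order Fourier basis. This is a generic Fourier regression sketch on synthetic data, not the paper's network model; the daily frequency and amplitudes below are assumptions.

```python
import math

def fit_fourier(xs, ys, omega):
    """Least-squares fit of y ≈ c0 + c1*cos(omega*x) + c2*sin(omega*x)
    via the normal equations, solved by Gaussian elimination."""
    basis = [[1.0, math.cos(omega * x), math.sin(omega * x)] for x in xs]
    # Normal equations: A = B^T B, b = B^T y
    A = [[sum(r[i] * r[j] for r in basis) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * y for r, y in zip(basis, ys)) for i in range(3)]
    for col in range(3):                      # forward elimination w/ pivoting
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coeffs = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):                       # back-substitution
        coeffs[r] = (b[r] - sum(A[r][c] * coeffs[c]
                                for c in range(r + 1, 3))) / A[r][r]
    return coeffs

# Synthetic irradiance: 500 + 300*cos(2*pi*t/24) over a 24 h cycle
omega = 2 * math.pi / 24
xs = [t * 0.5 for t in range(100)]
ys = [500 + 300 * math.cos(omega * x) for x in xs]
c0, c1, c2 = fit_fourier(xs, ys, omega)
print(round(c0), round(c1), round(c2))  # recovers ≈ 500, 300, 0
```

Once the smooth component is fitted from the healthy samples, evaluating it at the timestamps of faulty sensor readings yields the reconstructed missing values the abstract refers to.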
Big Data and Cognitive Computing doi: 10.3390/bdcc8030022
Authors: Roman Rybka Yury Davydov Danila Vlasov Alexey Serenko Alexander Sboev Vyacheslav Ilyin
Developing a spiking neural network architecture that could prospectively be trained on energy-efficient neuromorphic hardware to solve various data analysis tasks requires satisfying the limitations of prospective analog or digital hardware, i.e., local learning and limited numbers of connections, respectively. In this work, we compare two methods of connectivity reduction that are applicable to spiking networks with local plasticity: instead of a large fully connected network (used as the baseline for comparison), we employ either an ensemble of independent small networks or a network with probabilistic sparse connectivity. We evaluate both methods on a three-layer spiking neural network applied to handwritten and spoken digit classification tasks, using two memristive plasticity models and the classical spike-timing-dependent plasticity (STDP) rule. Both methods achieve an F1-score of 0.93–0.95 on the handwritten digit recognition task and 0.85–0.93 on the spoken digit recognition task. Applying a combination of both methods made it possible to obtain highly accurate models while reducing the number of connections by more than three times compared to the basic model.
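The probabilistic sparse connectivity method can be sketched as generating a random mask in which each pre→post synapse exists independently with some probability. The layer sizes and connection probability below are illustrative assumptions, not the paper's configuration.

```python
import random

def sparse_connections(n_pre, n_post, p_connect, seed=1):
    """Probabilistic sparse connectivity: each pre→post synapse exists
    independently with probability p_connect (1 = present, 0 = absent)."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p_connect else 0
             for _ in range(n_post)] for _ in range(n_pre)]

# Hypothetical layer pair: 784 input neurons → 100 spiking neurons
mask = sparse_connections(784, 100, p_connect=0.3)
n_syn = sum(sum(row) for row in mask)
full = 784 * 100
print(full / n_syn)  # roughly 3.3x fewer synapses than fully connected
```

A fixed mask like this is applied once at construction time, so local plasticity rules (STDP or memristive models) then update only the synapses that exist, which is what makes the approach compatible with hardware connection limits.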
Big Data and Cognitive Computing doi: 10.3390/bdcc8030021
Authors: Ishaani Priyadarshini
The swift proliferation of the Internet of Things (IoT) devices in smart city infrastructures has created an urgent demand for robust cybersecurity measures. These devices are susceptible to various cyberattacks that can jeopardize the security and functionality of urban systems. This research presents an innovative approach to identifying anomalies caused by IoT cyberattacks in smart cities. The proposed method harnesses federated and split learning and addresses the dual challenge of enhancing IoT network security while preserving data privacy. This study conducts extensive experiments using authentic datasets from smart cities. To compare the performance of classical machine learning algorithms and deep learning models for detecting anomalies, model effectiveness is assessed using precision, recall, F-1 score, accuracy, and training/deployment time. The findings demonstrate that federated learning and split learning have the potential to balance data privacy concerns with competitive performance, providing robust solutions for detecting IoT cyberattacks. This study contributes to the ongoing discussion about securing IoT deployments in urban settings. It lays the groundwork for scalable and privacy-conscious cybersecurity strategies. The results underscore the vital role of these techniques in fortifying smart cities and promoting the development of adaptable and resilient cybersecurity measures in the IoT era.
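The federated learning side of the approach can be sketched with its central aggregation step (federated averaging): each IoT gateway trains locally and shares only model weights, which the server combines weighted by local dataset size. This is a generic FedAvg sketch, not the study's architecture; the client counts and weights are made up.

```python
def fed_avg(client_weights, client_sizes):
    """Federated averaging: combine client model weights, weighted by
    each client's local dataset size. Raw IoT traffic never leaves the
    clients; only weight vectors are shared with the server."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [sum(w[i] * s for w, s in zip(client_weights, client_sizes)) / total
            for i in range(n_params)]

# Three hypothetical smart-city gateways with different data volumes
w = fed_avg([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [10, 10, 20])
print(w)  # the third client's weights count double
```

Split learning differs in that the model itself is partitioned between client and server at a cut layer, with only activations crossing the boundary; both techniques keep raw data local, which is the privacy property the study evaluates against detection performance.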
Big Data and Cognitive Computing doi: 10.3390/bdcc8030020
Authors: Róbert Lakatos Gergő Bogacsovics Balázs Harangi István Lakatos Attila Tiba János Tóth Marianna Szabó András Hajdu
The efficiency of natural language processing has improved dramatically with the advent of machine learning models, particularly neural network-based solutions. However, some tasks are still challenging, especially when considering specific domains. This paper presents a model that can extract insights from customer reviews using machine learning methods integrated into a pipeline. For topic modeling, our composite model uses transformer-based neural networks designed for natural language processing, vector-embedding-based keyword extraction, and clustering. The elements of our model have been integrated and tailored to better meet the requirements of efficient information extraction and topic modeling of the extracted information for opinion mining. Our approach was validated and compared with other state-of-the-art methods using publicly available benchmark datasets. The results show that our system performs better than existing topic modeling and keyword extraction methods in this task.
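The vector-embedding-based keyword extraction step can be sketched by ranking candidate phrases by cosine similarity between a phrase vector and the whole-document vector. In this toy sketch, word-count vectors stand in for the transformer embeddings the pipeline actually uses, and the review text and candidate list are invented.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

def keywords(doc, candidates, top_k=2):
    """Rank candidate phrases by similarity of their vector to the
    whole-document vector; count vectors stand in for embeddings."""
    doc_vec = Counter(doc.lower().split())
    scored = [(cosine(Counter(c.lower().split()), doc_vec), c)
              for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)[:top_k]]

review = "battery life is great but the battery charger failed fast"
print(keywords(review, ["battery life", "charger", "screen quality"]))
```

In the pipeline described, the extracted keyword vectors would then be clustered so that reviews sharing similar keywords group into the topics used for opinion mining.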
Big Data and Cognitive Computing doi: 10.3390/bdcc8030019
Authors: Mohammed Suleiman Mohammed Rudwan Jean Vincent Fonou-Dombeu
Ontology merging remains an important task in ontology engineering. However, despite the efforts devoted to ontology merging, the incorporation of relevant features of ontologies such as axioms, individuals and annotations in the output ontologies remains challenging. Consequently, existing ontology-merging solutions produce new ontologies that do not include all the relevant semantic features from the candidate ontologies. To address these limitations, this paper proposes a novel algorithm for multi-criteria ontology merging that automatically builds a new ontology from candidate ontologies by iteratively updating an RDF graph in memory. The proposed algorithm leverages state-of-the-art Natural Language Processing tools as well as a Machine Learning-based framework to assess the similarities and merge various criteria into the resulting output ontology. The key contribution of the proposed algorithm lies in its ability to merge relevant features from the candidate ontologies to build a more accurate, integrated and cohesive output ontology. The proposed algorithm is tested with five ontologies from different computing domains and evaluated in terms of its asymptotic behavior, quality and computational performance. The experimental results indicate that the proposed algorithm produces output ontologies that meet the integrity, accuracy and cohesion quality criteria better than related studies. This performance demonstrates the effectiveness and superior capabilities of the proposed algorithm. Furthermore, the proposed algorithm enables iterative in-memory updating and building of the RDF graph of the resulting output ontology, which enhances the processing speed and improves the computational efficiency, making it an ideal solution for big data applications.
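The in-memory RDF-graph merge can be sketched at its simplest: treat each ontology as a set of (subject, predicate, object) triples, rewrite entities that the NLP/ML similarity stage has judged equivalent, and take the union. The alignment map and example triples below are hypothetical; the paper's algorithm additionally handles axioms, individuals, and annotations.

```python
def merge_graphs(graph_a, graph_b, aligned):
    """Merge two RDF-style graphs (sets of (s, p, o) triples) by
    rewriting entities judged equivalent (the `aligned` map) to a
    canonical name, then unioning the triples."""
    def canon(term):
        return aligned.get(term, term)
    merged = set()
    for g in (graph_a, graph_b):
        for s, p, o in g:
            merged.add((canon(s), canon(p), canon(o)))
    return merged

a = {("Laptop", "subClassOf", "Computer")}
b = {("NotebookPC", "subClassOf", "Computer"),
     ("NotebookPC", "hasPart", "Battery")}
# Similarity assessment (hypothetical) says NotebookPC ≡ Laptop
merged = merge_graphs(a, b, aligned={"NotebookPC": "Laptop"})
print(sorted(merged))
```

Because the merged graph lives in memory as a plain set, each new candidate ontology can be folded in iteratively without re-serializing the whole output, which is the efficiency property the abstract emphasizes.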
Big Data and Cognitive Computing doi: 10.3390/bdcc8020018
Authors: Ouarda Zedadra Antonio Guerrieri Hamid Seridi Aymen Benzaid Giancarlo Fortino
Efficiently searching for multiple targets in complex environments with limited perception and computational capabilities is challenging for multiple robots, which can coordinate their actions indirectly through their environment. In this context, swarm intelligence has been a source of inspiration for addressing multi-target search problems in the literature. So far, several algorithms have been proposed for solving such a problem, and in this study, we propose two novel multi-target search algorithms inspired by the Firefly algorithm. Unlike the conventional Firefly algorithm, where light acts as an attractor, in our proposed algorithms light has a repulsive effect. Upon discovering targets, robots emit light to repel other robots from that region. This repulsive behavior is intended to achieve several objectives: (1) partitioning the search space among different robots, (2) expanding the search region by avoiding areas already explored, and (3) preventing congestion among robots. The proposed algorithms, named Global Lawnmower Firefly Algorithm (GLFA) and Random Bounce Firefly Algorithm (RBFA), integrate inverse light-based behavior with two random walks: random bounce and global lawnmower. These algorithms were implemented and evaluated using the ARGoS simulator, demonstrating promising performance compared to existing approaches.
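The repulsive (inverse-light) behavior can be sketched as a single movement step: the robot sums inverse-square repulsion vectors pointing away from each light source (discovered target) and moves one step along the result. The decay law and step size are illustrative assumptions, not the published update rule.

```python
import math

def repulsive_step(robot_pos, light_sources, step=1.0):
    """Move a robot one step away from the net 'light' emitted at
    already-found targets; repulsion decays with squared distance."""
    fx = fy = 0.0
    for lx, ly in light_sources:
        dx, dy = robot_pos[0] - lx, robot_pos[1] - ly
        d2 = dx * dx + dy * dy + 1e-9      # avoid division by zero
        fx += dx / d2                      # directed away from the light
        fy += dy / d2
    norm = math.hypot(fx, fy)
    if norm < 1e-12:
        return robot_pos                   # no net repulsion: stay put
    return (robot_pos[0] + step * fx / norm,
            robot_pos[1] + step * fy / norm)

# A robot between two discovered targets is pushed away from both
pos = repulsive_step((1.0, 0.0), [(0.0, 0.0), (0.0, 1.0)])
print(pos)
```

In GLFA and RBFA this repulsion is combined with a random walk (global lawnmower or random bounce) so that robots still cover unexplored space rather than merely fleeing lit regions.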
Big Data and Cognitive Computing doi: 10.3390/bdcc8020017
Authors: Marwa Salah Farhan Amira Youssef Laila Abdelhamid
Traditional data warehouses (DWs) have played a key role in business intelligence and decision support systems. However, the rapid growth of the data generated by current applications requires new data warehousing systems. In big data, it is important to adapt the existing warehouse systems to overcome new issues and limitations. The main drawbacks of traditional Extract–Transform–Load (ETL) are that huge amounts of data cannot be processed efficiently and that execution times are very high when the data are unstructured. This paper focuses on a new model consisting of four layers, Extract–Clean–Load–Transform (ECLT), designed for processing unstructured big data, with specific emphasis on text. The model aims to reduce execution time, as verified through experiments. ECLT is implemented and tested using the Spark framework with its Python API. Finally, this paper compares the execution time of ECLT with different models on two datasets. Experimental results showed that for a data size of 1 TB, the execution time of ECLT is 41.8 s. When the data size increases to 1 million articles, the execution time is 119.6 s. These findings demonstrate that ECLT outperforms ETL, ELT, DELT, ELTL, and ELTA in terms of execution time.
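The four-layer ECLT ordering can be sketched as a pipeline of plain functions: cheap cleaning happens before loading, and the expensive transform runs after the data have landed in the warehouse. The stage bodies below are toy stand-ins under assumed semantics, not the paper's Spark implementation.

```python
def extract(source):
    """Pull raw unstructured text records from a source."""
    return list(source)

def clean(records):
    """Cheap text cleaning before loading: trim whitespace, drop empties."""
    return [r.strip().lower() for r in records if r.strip()]

def load(records, warehouse):
    """Land the cleaned records in the warehouse staging area."""
    warehouse.setdefault("staging", []).extend(records)
    return warehouse

def transform(warehouse):
    """Transform after loading (the key inversion vs. classic ETL):
    the heavy work runs over already-loaded data, here simulated."""
    warehouse["facts"] = [{"text": r, "tokens": len(r.split())}
                          for r in warehouse["staging"]]
    return warehouse

wh = {}
raw = ["  Big Data ", "", "Cognitive Computing"]
wh = transform(load(clean(extract(raw)), wh))
print(wh["facts"])
```

Doing only lightweight cleaning before the load keeps the ingestion path fast for unstructured text, while deferring transformation lets an engine like Spark parallelize the heavy step, which is the rationale behind the reported execution-time gains.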
Big Data and Cognitive Computing doi: 10.3390/bdcc8020016
Authors: Maryam Badar Marco Fisichella
Fairness-aware mining of data streams is a challenging concern in the contemporary domain of machine learning. Many stream learning algorithms are used to replace humans in critical decision-making processes, e.g., hiring staff, assessing credit risk, etc. This calls for handling massive amounts of incoming information with minimal response delay while ensuring fair and high-quality decisions. Although deep learning has achieved success in various domains, its computational complexity may hinder real-time processing, making traditional algorithms more suitable. In this context, we propose a novel adaptation of Naïve Bayes to mitigate discrimination embedded in the streams while maintaining high predictive performance through multi-objective optimization (MOO). Class imbalance is an inherent problem in discrimination-aware learning paradigms. To deal with class imbalance, we propose a dynamic instance weighting module that gives more importance to new instances and less importance to obsolete instances based on their membership in a minority or majority class. We have conducted experiments on a range of streaming and static datasets and concluded that our proposed methodology outperforms existing state-of-the-art (SoTA) fairness-aware methods in terms of both discrimination score and balanced accuracy.
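The dynamic instance weighting idea can be sketched as the product of two factors: an exponential recency decay (new instances count more, obsolete ones less) and an inverse class-frequency factor (minority instances are up-weighted). The decay rate and the exact weighting formula are illustrative assumptions, not the paper's module.

```python
def instance_weight(age, class_count, total_count, decay=0.9):
    """Weight a stream instance: newer instances (small age) count more,
    and minority-class instances are up-weighted to counter imbalance."""
    recency = decay ** age                         # exponential forgetting
    imbalance = total_count / (2.0 * class_count)  # >1 for the minority class
    return recency * imbalance

# A fresh minority-class instance outweighs an old majority-class one
w_new_minority = instance_weight(age=0, class_count=10, total_count=100)
w_old_majority = instance_weight(age=20, class_count=90, total_count=100)
print(w_new_minority, w_old_majority)
```

Feeding such weights into the Naïve Bayes count updates lets the model track concept drift (via recency) while keeping minority-class statistics from being swamped, the two pressures the multi-objective optimization balances against discrimination score.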
Big Data and Cognitive Computing doi: 10.3390/bdcc8020015
Authors: Jihong Wang Zhuo Wang Lidong Zhang
Clustering protocols and simultaneous wireless information and power transfer (SWIPT) technology can solve the issue of imbalanced energy consumption among nodes in energy harvesting-cognitive radio sensor networks (EH-CRSNs). However, dynamic energy changes caused by EH/SWIPT and dynamic spectrum availability prevent existing clustering routing protocols from fully leveraging the advantages of EH and SWIPT. Therefore, a multi-hop uneven clustering routing protocol is proposed for EH-CRSNs utilizing SWIPT technology in this paper. Specifically, an EH-based energy state function is proposed to accurately track the dynamic energy variations in nodes. Utilizing this function, dynamic spectrum availability, neighbor count, and other information are integrated to design the criteria for selecting high-quality cluster heads (CHs) and relays, thereby facilitating effective data transfer to the sink. Intra-cluster and inter-cluster SWIPT mechanisms are incorporated to allow for the immediate energy replenishment for CHs or relays with insufficient energy while transmitting data, thereby preventing data transmission failures due to energy depletion. An energy status control mechanism is introduced to avoid the energy waste caused by excessive activation of the SWIPT mechanism. Simulation results indicate that the proposed protocol markedly improves the balance of energy consumption among nodes and enhances network surveillance capabilities when compared to existing clustering routing protocols.
Big Data and Cognitive Computing doi: 10.3390/bdcc8020014
Authors: Chao He Xinghua Zhang Dongqing Song Yingshan Shen Chengjie Mao Huosheng Wen Dingju Zhu Lihua Cai
With the popularization of better network access and the penetration of personal smartphones in today’s world, the explosion of multi-modal data, particularly opinionated video messages, has created urgent demands and immense opportunities for Multi-Modal Sentiment Analysis (MSA). Deep learning with the attention mechanism has served as the foundation technique for most state-of-the-art MSA models due to its ability to learn complex inter- and intra-relationships among different modalities embedded in video messages, both temporally and spatially. However, modal fusion is still a major challenge due to the vast feature space created by the interactions among different data modalities. To address the modal fusion challenge, we propose an MSA algorithm based on deep learning and the attention mechanism, namely the Mixture of Attention Variants for Modal Fusion (MAVMF). The MAVMF algorithm follows a two-stage process: in stage one, self-attention is applied to effectively extract image and text features, and the dependency relationships in the context of video discourse are captured by a bidirectional gated recurrent neural module; in stage two, four multi-modal attention variants are leveraged to learn the emotional contributions of important features from different modalities. Our proposed approach is end-to-end and achieves superior performance to state-of-the-art algorithms when tested on two of the largest public datasets, CMU-MOSI and CMU-MOSEI.
Big Data and Cognitive Computing doi: 10.3390/bdcc8020012
Authors: Anik Baul Gobinda Chandra Sarker Prokash Sikder Utpal Mozumder Ahmed Abdelgawad
Short-term load forecasting (STLF) plays a crucial role in the planning, management, and stability of a country’s power system operation. In this study, we have developed a novel approach that can simultaneously predict the load demand of different regions in Bangladesh. When making predictions for loads from multiple locations simultaneously, the overall accuracy of the forecast can be improved by incorporating features from the various areas while reducing the complexity of using multiple models. Accurate and timely load predictions for specific regions with distinct demographics and economic characteristics can assist transmission and distribution companies in properly allocating their resources. Bangladesh, being a relatively small country, is divided into nine distinct power zones for electricity transmission across the nation. In this study, we have proposed a hybrid model, combining the Convolutional Neural Network (CNN) and Gated Recurrent Unit (GRU), designed to forecast load demand seven days ahead for each of the nine power zones simultaneously. For our study, nine years of data from a historical electricity demand dataset (from January 2014 to April 2023) are collected from the Power Grid Company of Bangladesh (PGCB) website. Considering the nonstationary characteristics of the dataset, the Interquartile Range (IQR) method and load averaging are employed to deal effectively with the outliers. Then, for more granularity, this dataset has been augmented with interpolation at 1 h intervals. The proposed CNN-GRU model, trained on this augmented and refined dataset, is evaluated against established algorithms in the literature, including Long Short-Term Memory Networks (LSTM), GRU, CNN-LSTM, CNN-GRU, and Transformer-based algorithms. Compared to other approaches, the proposed technique demonstrated superior forecasting accuracy in terms of mean absolute percentage error (MAPE) and root mean squared error (RMSE).
The dataset and the source code are openly accessible to motivate further research.
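The IQR outlier step mentioned above can be illustrated with a small NumPy sketch. Replacing each flagged point by the mean of its nearest in-range neighbours is one plausible reading of the load-averaging step, not the authors' exact procedure; the toy demand series is invented.

```python
import numpy as np

def iqr_clean(load, k=1.5):
    """Flag outliers with the IQR rule (outside [Q1 - k*IQR, Q3 + k*IQR])
    and replace each with the mean of its nearest non-outlier neighbours."""
    q1, q3 = np.percentile(load, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    cleaned = load.astype(float).copy()
    bad = (load < lo) | (load > hi)
    good_idx = np.flatnonzero(~bad)
    for i in np.flatnonzero(bad):
        left = good_idx[good_idx < i]
        right = good_idx[good_idx > i]
        neighbours = [cleaned[left[-1]]] if left.size else []
        neighbours += [cleaned[right[0]]] if right.size else []
        cleaned[i] = np.mean(neighbours)
    return cleaned

demand = np.array([510.0, 495.0, 505.0, 9000.0, 500.0, 498.0])  # one spike
print(iqr_clean(demand))  # spike replaced by (505 + 500) / 2 = 502.5
```

The same cleaned series could then be resampled to hourly resolution by interpolation before feeding the forecasting model.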
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc8020013
Authors: Claudia Angelica Rivera-Romero Jorge Ulises Munoz-Minjares Carlos Lastre-Dominguez Misael Lopez-Ramirez
Identifying the posture of patients while they are lying in bed is an important task in medical applications such as monitoring a patient after a surgical intervention, sleep supervision to identify behavioral and physiological markers, or bedsore prevention. An acceptable strategy to identify the patient’s position is the classification of images created from a grid of pressure sensors located in the bed. These samples can be arranged based on supervised learning methods. Usually, image conditioning is required before images are loaded into a learning method to increase classification accuracy. However, continuous monitoring of a person requires large amounts of time and computational resources if complex pre-processing algorithms are used. So, the problem is to classify the image posture of patients with different weights, heights, and positions by using minimal sample conditioning for a specific supervised learning method. In this work, we propose to identify the patient posture from pressure sensor images by using well-known and simple conditioning techniques and selecting the optimal texture descriptors for the Support Vector Machine (SVM) method, in order to obtain the best classification and to avoid image over-processing in the conditioning stage. The experimental stages are performed with the color models Red, Green, and Blue (RGB) and Hue, Saturation, and Value (HSV). The results show an increase in accuracy from 86.9% to 92.9% and in kappa value from 0.825 to 0.904 when the images are conditioned with histogram equalization and a median filter.
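The two conditioning steps the study found effective, histogram equalization and median filtering, can be sketched in plain NumPy. This is an illustrative implementation for 8-bit grayscale pressure maps, not the code used in the study; the 3x3 window size is an assumption, and the downstream texture-descriptor and SVM stages are omitted.

```python
import numpy as np

def hist_equalize(img):
    """Global histogram equalization for an 8-bit grayscale pressure map
    (assumes the image is not constant-valued)."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255).astype(np.uint8)
    return lut[img]

def median_filter3(img):
    """3x3 median filter; edge pixels are handled by clamped padding."""
    padded = np.pad(img, 1, mode='edge')
    windows = np.stack([padded[r:r + img.shape[0], c:c + img.shape[1]]
                        for r in range(3) for c in range(3)])
    return np.median(windows, axis=0).astype(img.dtype)

rng = np.random.default_rng(1)
frame = rng.integers(0, 64, size=(8, 8), dtype=np.uint8)   # low-contrast frame
conditioned = median_filter3(hist_equalize(frame))
print(conditioned.shape)  # (8, 8)
```

Equalization stretches the low-contrast pressure readings across the full intensity range, and the median filter suppresses isolated sensor spikes without blurring edges.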
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc8020011
Authors: Thoralf Reis Lukas Dumberger Sebastian Bruchhaus Thomas Krause Verena Schreyer Marco X. Bornschlegl Matthias L. Hemmje
Manual labeling and categorization are extremely time-consuming and, thus, costly. AI- and ML-supported information systems can bridge this gap and support labor-intensive digital activities. Since it requires categorization, coding-based analysis, such as qualitative content analysis, reaches its limits with large amounts of data and could benefit from AI and ML-based support. Empirical social research, its application domain, benefits from Big Data’s ability to create more extensive models of human behavior and development. A range of applications are available for statistical analysis to serve this purpose. This paper aims to implement an information system that supports researchers in empirical social research in performing AI-supported qualitative content analysis. AI2VIS4BigData is a reference model that standardizes use cases and artifacts for Big Data information systems that integrate AI and ML for user empowerment. Thus, this work’s concepts and implementations aim to achieve an AI2VIS4BigData-compliant information system that supports social researchers in categorizing text data and creating insightful dashboards. The text categorization is based on an existing ML component. Furthermore, this paper presents two evaluations of these concepts and implementations: a qualitative cognitive walkthrough assessing the system’s usability and a quantitative user study with 18 participants. The user study revealed that although users perceive AI support as more efficient, they need more time to reflect on the recommendations. The research revealed that AI support increased the correctness of the users’ categorizations but also slowed down their decision-making. The assumption that this is due to the UI design and the additional information to process requires follow-up research.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc8010010
Authors: Ivan Izonin Tetiana Hovorushchenko Shishir Kumar Shandilya
The amount of information is constantly growing, and thus, the issue of information security is becoming more acute [...]
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc8010009
Authors: Christine Dewi Danny Manongga Hendry Evangs Mailoa Kristoko Dwi Hartomo
Face mask detection is a technological application that employs computer vision methodologies to ascertain the presence or absence of a face mask on an individual depicted in an image or video. This technology gained significant attention and adoption during the COVID-19 pandemic, as wearing face masks became an important measure to prevent the spread of the virus. Face mask detection helps to enforce mask-wearing guidelines, which can significantly reduce the spread of respiratory illnesses, including COVID-19. Wearing masks in densely populated areas provides individuals with protection and hinders the spread of airborne particles that transmit viruses. The application of deep learning models in object recognition has shown significant progress, leading to promising outcomes in the identification and localization of objects within images. The primary aim of this study is to annotate and classify face mask entities depicted in authentic images. To mitigate the spread of COVID-19 within public settings, individuals can use face masks made from materials specifically designed for medical purposes. This study utilizes YOLOv8, a state-of-the-art object detection algorithm, to accurately detect and identify face masks. For our experiments, we combined the Face Mask Dataset (FMD) and the Medical Mask Dataset (MMD) into a single dataset. The proposed model improved the detection performance reported in an earlier study using the FMD and MMD from 98.6% to a “Good” level of 99.1%. Our study demonstrates that the model scheme we have provided is a reliable method for detecting faces that are obscured by medical masks. Additionally, after the completion of the study, a comparative analysis was conducted to examine the findings in conjunction with those of related research. The proposed detector demonstrated superior performance compared to previous research in terms of both accuracy and precision.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc8010008
Authors: William Villegas-Ch Angel Jaramillo-Alcázar Sergio Luján-Mora
This study evaluated the generation of adversarial examples and the subsequent robustness of an image classification model. The attacks were performed using the Fast Gradient Sign method, the Projected Gradient Descent method, and the Carlini and Wagner attack to perturb the original images and analyze their impact on the model’s classification accuracy. Additionally, image manipulation techniques were investigated as defensive measures against adversarial attacks. The results highlighted the model’s vulnerability to adversarial examples: the Fast Gradient Sign Method effectively altered the original classifications, while the Carlini and Wagner method proved less effective. Promising approaches such as noise reduction, image compression, and Gaussian blurring were presented as effective countermeasures. These findings underscore the importance of addressing the vulnerability of machine learning models and the need to develop robust defenses against adversarial examples. This article emphasizes the urgency of addressing the threat posed by adversarial examples to machine learning models, highlighting the relevance of implementing effective countermeasures and image manipulation techniques to mitigate the effects of adversarial attacks. These efforts are crucial to safeguarding model integrity and trust in an environment marked by constantly evolving hostile threats. An average 25% decrease in accuracy was observed for the VGG16 model when exposed to the Fast Gradient Sign Method and Projected Gradient Descent attacks, and an even more significant 35% decrease with the Carlini and Wagner method.
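The Fast Gradient Sign Method perturbs an input by a small step epsilon in the direction of the sign of the loss gradient with respect to the input. The sketch below demonstrates the idea on a toy logistic-regression "model" in NumPy rather than on VGG16; the weights, input, and epsilon are illustrative values chosen so the attack visibly flips the prediction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """Fast Gradient Sign Method against a logistic-regression model.

    For cross-entropy loss, the gradient w.r.t. the input x is
    (p - y) * w, so the attack steps each feature by eps in its sign:
    x_adv = x + eps * sign(grad_x)."""
    p = sigmoid(np.dot(w, x) + b)
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

w = np.array([2.0, -1.0]); b = 0.0
x = np.array([0.6, 0.1])            # clean input: model predicts class 1
y = 1.0                             # true label
x_adv = fgsm(x, y, w, b, eps=0.6)
print(sigmoid(np.dot(w, x) + b) > 0.5, sigmoid(np.dot(w, x_adv) + b) > 0.5)
```

The same one-step rule, applied iteratively with a projection back into an epsilon-ball, yields the Projected Gradient Descent attack mentioned above.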
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc8010007
Authors: Reenu Mohandas Mark Southern Eoin O’Connell Martin Hayes
Deep learning based visual cognition has greatly improved the accuracy of defect detection, reducing processing times and increasing product throughput across a variety of manufacturing use cases. There is, however, a continuing need for rigorous procedures to dynamically update model-based detection methods that use sequential streaming during the training phase. This paper reviews how new process, training or validation information is rigorously incorporated in real time when detection exceptions arise during inspection. In particular, consideration is given to how new tasks, classes or decision pathways are added to existing models or datasets in a controlled fashion. An analysis of studies from the incremental learning literature is presented, where the emphasis is on the mitigation of process complexity challenges such as catastrophic forgetting. Further, practical implementation issues that are known to affect the complexity of deep learning model architecture, including memory allocation for incoming sequential data and incremental learning accuracy, are considered. The paper highlights case study results and methods that have been used to successfully mitigate such real-time manufacturing challenges.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc8010006
Authors: Abdul Rehman Khalid Nsikak Owoh Omair Uthmani Moses Ashawa Jude Osamor John Adejoh
In the era of digital advancements, the escalation of credit card fraud necessitates the development of robust and efficient fraud detection systems. This paper delves into the application of machine learning models, specifically focusing on ensemble methods, to enhance credit card fraud detection. Through an extensive review of existing literature, we identified limitations in current fraud detection technologies, including issues like data imbalance, concept drift, false positives/negatives, limited generalisability, and challenges in real-time processing. To address some of these shortcomings, we propose a novel ensemble model that integrates a Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Random Forest (RF), Bagging, and Boosting classifiers. This ensemble model tackles the dataset imbalance problem associated with most credit card datasets by implementing under-sampling and the Synthetic Minority Over-sampling Technique (SMOTE) on some machine learning algorithms. The evaluation of the model utilises a dataset comprising transaction records from European credit card holders, providing a realistic scenario for assessment. The methodology of the proposed model encompasses data pre-processing, feature engineering, model selection, and evaluation, with Google Colab computational capabilities facilitating efficient model training and testing. Comparative analysis between the proposed ensemble model, traditional machine learning methods, and individual classifiers reveals the superior performance of the ensemble in mitigating challenges associated with credit card fraud detection. Across accuracy, precision, recall, and F1-score metrics, the ensemble outperforms existing models. This paper underscores the efficacy of ensemble methods as a valuable tool in the battle against fraudulent transactions.
The findings presented lay the groundwork for future advancements in the development of more resilient and adaptive fraud detection systems, which will become crucial as credit card fraud techniques continue to evolve.
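The SMOTE step used to rebalance such datasets interpolates new minority-class points between existing ones. Below is a minimal NumPy sketch of that interpolation idea, not the library implementation typically used in practice (e.g. imbalanced-learn); the number of neighbours k and the toy fraud samples are arbitrary.

```python
import numpy as np

def smote(X_minority, n_new, k=3, seed=0):
    """Minimal SMOTE: synthesize n_new samples by interpolating a randomly
    chosen minority sample toward one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        x = X_minority[i]
        dists = np.linalg.norm(X_minority - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip the sample itself
        nn = X_minority[rng.choice(neighbours)]
        lam = rng.random()                        # interpolation factor in [0, 1)
        synthetic.append(x + lam * (nn - x))
    return np.array(synthetic)

# Toy 2-feature fraud transactions (minority class).
fraud = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2]])
new_samples = smote(fraud, n_new=6)
print(new_samples.shape)  # (6, 2)
```

Because each synthetic point lies on a segment between two real minority samples, the oversampled class stays within the region the genuine fraud cases occupy.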
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc8010005
Authors: Hanan M. Alghamdi
Sentiment analysis plays a crucial role in understanding public opinion and social media trends. It involves analyzing the emotional tone and polarity of a given text. When applied to Arabic text, this task becomes particularly challenging due to the language’s complex morphology, right-to-left script, and intricate nuances in expressing emotions. Social media has emerged as a powerful platform for individuals to express their sentiments, especially regarding religious and cultural events. Consequently, studying sentiment analysis in the context of Hajj has become a captivating subject. This research paper presents a comprehensive sentiment analysis of tweets discussing the annual Hajj pilgrimage over a six-year period. By employing a combination of machine learning and deep learning models, this study successfully conducted sentiment analysis on a sizable dataset consisting of Arabic tweets. The process involves pre-processing, feature extraction, and sentiment classification. The objective was to uncover the prevailing sentiments associated with Hajj over different years, before, during, and after each Hajj event. Importantly, the results presented in this study highlight that BERT, an advanced transformer-based model, outperformed other models in accurately classifying sentiment. This underscores its effectiveness in capturing the complexities inherent in Arabic text.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc8010004
Authors: Yiming Chen Shuang Liang
In the field of education, cognitive diagnosis is crucial for achieving personalized learning. The widely adopted DINA (Deterministic Inputs, Noisy And gate) model uncovers students’ mastery of essential skills necessary to answer questions correctly. However, existing DINA-based approaches overlook the dependency between knowledge points, and their model training process is computationally inefficient for large datasets. In this paper, we propose a new cognitive diagnosis model called BNMI-DINA, which stands for Bayesian Network-based Multiprocess Incremental DINA. Our proposed model aims to enhance personalized learning by providing accurate and detailed assessments of students’ cognitive abilities. By incorporating a Bayesian network, BNMI-DINA establishes the dependency relationship between knowledge points, enabling more accurate evaluations of students’ mastery levels. To enhance model convergence speed, key steps of our proposed algorithm are parallelized. We also provide theoretical proof of the convergence of BNMI-DINA. Extensive experiments demonstrate that our approach effectively enhances model accuracy and reduces computational time compared to state-of-the-art cognitive diagnosis models.
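The DINA item response function at the core of such models is compact: with slip parameter s and guess parameter g, P(correct) = (1 - s)^eta * g^(1 - eta), where eta is 1 exactly when the student masters every skill the item requires (per the Q-matrix). A minimal sketch with illustrative parameter values:

```python
import numpy as np

def dina_prob(alpha, q, slip, guess):
    """DINA item response probability.

    alpha: binary skill-mastery vector of a student.
    q:     binary Q-matrix row (skills the item requires).
    eta = 1 iff the student masters every required skill; then
    P(correct) = (1 - slip)**eta * guess**(1 - eta).
    """
    eta = int(np.all(alpha >= q))
    return (1 - slip) ** eta * guess ** (1 - eta)

alpha = np.array([1, 0, 1])      # student masters skills 1 and 3
q_item = np.array([1, 0, 1])     # item requires skills 1 and 3
print(dina_prob(alpha, q_item, slip=0.1, guess=0.2))  # 0.9
```

BNMI-DINA extends this by placing a Bayesian network over the skills, so mastery of one knowledge point can inform the probability of mastering a dependent one; the sketch above covers only the base response model.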
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc8010003
Authors: Marcos Orellana Patricio Santiago García Guillermo Daniel Ramon Jorge Luis Zambrano-Martinez Andrés Patiño-León María Verónica Serrano Priscila Cedillo
Health problems in older adults lead to situations where communication with peers, family and caregivers becomes challenging for seniors; therefore, it is necessary to use alternative methods to facilitate communication. In this context, Augmentative and Alternative Communication (AAC) methods are widely used to support this population segment. Moreover, with Artificial Intelligence (AI), and specifically, machine learning algorithms, AAC can be improved. Although there have been several studies in this field, it is interesting to analyze common phrases used by seniors, depending on their context (i.e., slang and everyday expressions typical of their age). This paper proposes a semantic analysis of the common phrases of older adults and their corresponding meanings through Natural Language Processing (NLP) techniques and a pre-trained language model using semantic textual similarity to represent the older adults’ phrases with their corresponding graphic images (pictograms). The results show good scores achieved in the semantic similarity between the phrases of the older adults and the definitions, so the relationship between the phrase and the pictogram has a high degree of probability.
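Matching a phrase to its pictogram by semantic textual similarity reduces to a nearest-neighbour search over embeddings. The sketch below uses invented toy vectors in place of real sentence-embedding outputs from a pre-trained language model; the function names and the example data are assumptions for illustration.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def best_pictogram(phrase_vec, pictogram_vecs):
    """Return the index of the pictogram whose definition embedding is most
    similar to the phrase embedding, along with the similarity score."""
    scores = [cosine(phrase_vec, p) for p in pictogram_vecs]
    return int(np.argmax(scores)), max(scores)

# Toy 3-dim "embeddings" standing in for sentence-embedding model outputs.
phrase = np.array([0.9, 0.1, 0.0])       # an older adult's colloquial phrase
pictograms = [np.array([1.0, 0.0, 0.0]),  # candidate pictogram definitions
              np.array([0.0, 1.0, 0.0]),
              np.array([0.0, 0.0, 1.0])]
idx, score = best_pictogram(phrase, pictograms)
print(idx)  # 0
```

In a full pipeline, both the seniors' phrases and the pictogram definitions would be embedded by the pre-trained model, and the similarity score would gauge how confidently a pictogram represents the phrase.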
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc8010002
Authors: Matthias Wölfel Mehrnoush Barani Shirzad Andreas Reich Katharina Anderer
The emergence of generative language models (GLMs), such as OpenAI’s ChatGPT, is changing the way we communicate with computers and has a major impact on the educational landscape. While GLMs have great potential to support education, their use is not unproblematic, as they suffer from hallucinations and misinformation. In this paper, we investigate how a very limited amount of domain-specific data, from lecture slides and transcripts, can be used to build knowledge-based and generative educational chatbots. We found that knowledge-based chatbots allow full control over the system’s response but lack the verbosity and flexibility of GLMs. The answers provided by GLMs are more trustworthy and offer greater flexibility, but their correctness cannot be guaranteed. Adapting GLMs to domain-specific data trades flexibility for correctness.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc8010001
Authors: Eleni Vlachou Aristeidis Karras Christos Karras Leonidas Theodorakopoulos Constantinos Halkiopoulos Spyros Sioutas
In this work, we present a Distributed Bayesian Inference Classifier for Large-Scale Systems, where we assess its performance and scalability on distributed environments such as PySpark. The presented classifier consistently showcases efficient inference time, irrespective of the variations in the size of the test set, implying a robust ability to handle escalating data sizes without a proportional increase in computational demands. Notably, throughout the experiments, memory usage increases with growing test set sizes, but this increase is sublinear, demonstrating the proficiency of the classifier in memory resource management. This behavior is consistent with the typical tendencies of PySpark tasks, which witness increasing memory consumption due to data partitioning and various data operations as datasets expand. CPU resource utilization, which is another crucial factor, also remains stable, emphasizing the capability of the classifier to manage larger computational workloads without significant resource strain. From a classification perspective, the Bayesian Logistic Regression Spark Classifier consistently achieves reliable performance metrics, with a particular focus on high specificity, indicating its aptness for applications where pinpointing true negatives is crucial. In summary, based on all experiments conducted under various data sizes, our classifier emerges as a top contender for scalability-driven applications in IoT systems, highlighting its dependable performance, adept resource management, and consistent prediction accuracy.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040184
Authors: Alexander Sboev Roman Rybka Dmitry Kunitsyn Alexey Serenko Vyacheslav Ilyin Vadim Putrolaynen
In this paper, we demonstrate that fixed-weight layers generated from random distributions or logistic functions can effectively extract significant features from input data, resulting in high accuracy on a variety of tasks, including the Fisher’s Iris, Wisconsin Breast Cancer, and MNIST datasets. We have observed that logistic functions yield high accuracy with less dispersion in results. We have also assessed the precision of our approach when the number of spikes generated in the network is minimized, which is practically useful for reducing energy consumption in spiking neural networks. Our findings reveal that the proposed method demonstrates the highest accuracy on the Fisher’s Iris and MNIST datasets with decoding using logistic regression. Furthermore, it surpasses the accuracy of the conventional (non-spiking) approach using only logistic regression in the case of Wisconsin Breast Cancer. We have also investigated the impact of non-stochastic spike generation on accuracy.
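A fixed-weight layer generated from a logistic function can be sketched as follows. This dense NumPy illustration conveys only the fixed, non-trainable-weight idea, not the spiking dynamics of the paper; the logistic-map parameters (r = 3.9, w0 = 0.3), the tanh nonlinearity, and the layer sizes are assumptions.

```python
import numpy as np

def logistic_map_weights(n_in, n_out, r=3.9, w0=0.3):
    """Deterministic fixed weights from the logistic map
    w_{t+1} = r * w_t * (1 - w_t), rescaled from (0, 1) to (-1, 1)."""
    w = np.empty(n_in * n_out)
    x = w0
    for i in range(w.size):
        x = r * x * (1 - x)
        w[i] = 2 * x - 1
    return w.reshape(n_in, n_out)

def fixed_layer_features(X, W):
    # Non-trainable projection; only the downstream decoder
    # (e.g. logistic regression) would be trained on these features.
    return np.tanh(X @ W)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))                 # 10 samples, 4 input features
W = logistic_map_weights(4, 16)              # fixed, never updated
features = fixed_layer_features(X, W)
print(features.shape)  # (10, 16)
```

Because the weights are generated deterministically, the feature extractor needs no training and reproduces identically across runs, which matches the paper's observation that logistic functions give less dispersion than random draws.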
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040183
Authors: Shutian Deng Gang Wang Hongjun Wang Fuliang Chang
Spain possesses a vast body of poetry, and most poems have features that give them significantly different styles. A superficial reading of these poems may confuse readers due to their complexity. Therefore, it is of vital importance to classify the style of the poems in advance. Currently, poetry classification studies are mostly carried out manually, which creates extremely high requirements for the professional quality of classifiers and consumes a large amount of time. Furthermore, the objectivity of the classification cannot be guaranteed because of the influence of the classifier’s subjectivity. To solve these problems, a Spanish poetry classification framework was designed using artificial intelligence technology, which improves the accuracy, efficiency, and objectivity of classification. First, an artificial-intelligence-driven Spanish poetry classification framework is described in detail, and is illustrated by a framework diagram to clearly represent each step in the process. The framework includes many algorithms and models, such as the Term Frequency–Inverse Document Frequency (TF-IDF), Bagging, Support Vector Machines (SVMs), Adaptive Boosting (AdaBoost), logistic regression (LR), Gradient Boosting Decision Trees (GBDT), LightGBM (LGB), eXtreme Gradient Boosting (XGBoost), and Random Forest (RF). The roles of each algorithm in the framework are clearly defined. Finally, experiments were performed for model selection, comparing the results of these algorithms. The Bagging model stood out for its high accuracy, and the experimental results showed that the proposed framework can help researchers carry out poetry research work more efficiently, accurately, and objectively.
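The TF-IDF weighting at the front of such a framework can be sketched in pure Python. This is the textbook formulation (term frequency times log inverse document frequency) with toy Spanish tokens; the authors' exact normalization and smoothing choices may differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Plain TF-IDF: tf = term count / doc length,
    idf = log(N / document frequency)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in counts.items()})
    return vectors

# Toy tokenized "poems".
poems = [["amor", "luna", "noche"],
         ["mar", "luna", "viento"],
         ["amor", "amor", "mar"]]
vecs = tf_idf(poems)
print(vecs[0]["noche"] > vecs[0]["luna"])  # rarer word scores higher: True
```

The resulting sparse vectors are what classifiers such as Bagging or SVMs would consume for style classification.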
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040182
Authors: Jesus Insuasti Felipe Roa Carlos Mario Zapata-Jaramillo
Pre-conceptual schemas are a straightforward way to represent knowledge using controlled language regardless of context. Despite the benefits pre-conceptual schemas offer to humans, they present challenges when interpreted by computers. We propose an approach that enables computers to interpret the basic pre-conceptual schemas made by humans. To do so, a linguistic corpus must be constructed for working with large language models (LLMs). The linguistic corpus was mainly fed using Master’s and doctoral theses from the digital repository of the University of Nariño to produce a training dataset for re-training the BERT model; in addition, we complement this by explaining the elicited sentences in triads from the pre-conceptual schemas using one of the cutting-edge large language models in natural language processing: Llama 2-Chat by Meta AI. The diverse topics covered in these theses allowed us to expand the spectrum of linguistic use in the BERT model and empower the generative capabilities using the fine-tuned Llama 2-Chat model and the proposed solution. As a result, the first version of a computational solution was built to consume the language models based on BERT and Llama 2-Chat and thus automatically interpret pre-conceptual schemas by computers via natural language processing, adding, at the same time, generative capabilities. The validation of the computational solution was performed in two phases: the first one for detecting sentences and interacting with pre-conceptual schemas with students in the Formal Languages and Automata Theory course—the seventh semester of the systems engineering undergraduate program at the University of Nariño’s Tumaco campus. The second phase was for exploring the generative capabilities based on pre-conceptual schemas; this second phase was performed with students in the Object-oriented Design course—the second semester of the systems engineering undergraduate program at the University of Nariño’s Tumaco campus.
This validation yielded favorable results in implementing natural language processing using the BERT and Llama 2-Chat models. In this way, some bases were laid for future developments related to this research topic.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040181
Authors: Hiromu Nakajima Minoru Sasaki
Text classification is the task of estimating the genre of a document based on information such as word co-occurrence and frequency of occurrence. Text classification has been studied by various approaches. In this study, we focused on text classification using graph structure data. Conventional graph-based methods express relationships between words and relationships between words and documents as weights between nodes. Then, a graph neural network is used for learning. However, conventional methods cannot represent the relationships between documents in the graph. In this paper, we propose a graph structure that considers the relationships between documents. In the proposed method, the cosine similarity of document vectors is set as the weight between document nodes. This completes a graph that considers the relationships between documents. The graph is then input into a graph convolutional neural network for training. Therefore, the aim of this study is to improve the text classification performance of conventional methods by using this graph that considers the relationships between document nodes. In this study, we conducted evaluation experiments using five different corpora of English documents. The results showed that the proposed method outperformed the conventional method by up to 1.19%, indicating that the use of relationships between documents is effective. In addition, the proposed method was shown to be particularly effective in classifying long documents.
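The proposed document-document edges can be sketched directly: embed each document as a vector and weight edges by cosine similarity. The thresholding below, used here to keep the toy graph sparse, is an assumption of this sketch; the paper itself simply sets the similarities as weights between document nodes.

```python
import numpy as np

def document_edges(doc_vecs, threshold=0.5):
    """Weight document-document edges by cosine similarity; keep only
    pairs above a threshold so the graph does not become fully dense."""
    unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = unit @ unit.T                       # pairwise cosine similarities
    edges = {}
    n = len(doc_vecs)
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                edges[(i, j)] = float(sim[i, j])
    return edges

# Toy document vectors: docs 0 and 1 are near-duplicates, doc 2 is unrelated.
docs = np.array([[1.0, 0.0, 1.0],
                 [1.0, 0.1, 0.9],
                 [0.0, 1.0, 0.0]])
print(document_edges(docs))  # only the (0, 1) edge survives
```

These edges would be added alongside the usual word-word and word-document edges before feeding the combined adjacency into the graph convolutional network.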
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040180
Authors: Bishal Lamichhane Aniket Kumar Singh Suman Devkota Uttam Dhakal Subham Singh Chandra Dhakal
This study analyzes a network of musical influence using machine learning and network analysis techniques. A directed network model is used to represent the influence relations between artists as nodes and edges. Network properties and centrality measures are analyzed to identify influential patterns. In addition, influence within and outside the genre is quantified using in-genre and out-genre weights. Regression analysis is performed to determine the impact of musical attributes on influence. We find that speechiness, acousticness, and valence are the top features of the most influential artists. We also introduce the IRDI, an algorithm that provides an innovative approach to quantify an artist’s influence by capturing the degree of dominance among their followers. This approach underscores influential artists who drive the evolution of music, setting trends and significantly inspiring a new generation of artists. The independent cascade model is further employed to model the temporal dynamics of influence propagation across the entire musical network, highlighting how initial seeds of influence can contagiously spread through the network. This multidisciplinary approach provides a nuanced understanding of musical influence that refines existing methods and sheds light on influential trends and dynamics.
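The independent cascade model used here can be sketched in pure Python: each newly activated node gets a single chance to activate each of its successors with some probability. The toy influence graph and the activation probability below are illustrative, not data from the study.

```python
import random

def independent_cascade(graph, seeds, p=0.2, seed=42):
    """One run of the independent cascade model: each newly activated
    artist gets exactly one chance to activate each follower with prob p."""
    rng = random.Random(seed)          # seeded for reproducibility
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for node in frontier:
            for neighbour in graph.get(node, []):
                if neighbour not in active and rng.random() < p:
                    active.add(neighbour)
                    nxt.append(neighbour)
        frontier = nxt
    return active

# Toy influence network: edges point from influencer to follower.
influence = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
spread = independent_cascade(influence, seeds={"A"}, p=1.0)
print(sorted(spread))  # ['A', 'B', 'C', 'D'] when p = 1
```

In practice, the expected spread of a seed set is estimated by averaging many such runs, which is how initial seeds of influence are ranked by their reach through the network.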
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040179
Authors: Wieslaw L. Nowinski
Although no nanoscale dataset for the entire human brain has yet been acquired, nor has a nanoscale human whole brain atlas been constructed, tremendous progress in neuroimaging and high-performance computing makes both feasible in the near future. Constructing the human whole brain nanoscale atlas poses several challenges, and here, we address two of them: the morphology modeling of the brain at the nanoscale and the design of a nanoscale brain atlas. A new nanoscale neuronal format is introduced to describe data necessary and sufficient to model the entire human brain at the nanoscale, enabling calculations of the synaptome and connectome. The design of the nanoscale brain atlas covers design principles, content, architecture, navigation, functionality, and user interface. Three novel design principles are introduced supporting navigation, exploration, and calculations, namely, a gross neuroanatomy-guided navigation of micro/nanoscale neuroanatomy; a movable and zoomable sampling volume of interest for navigation and exploration; and nanoscale data processing in a parallel-pipeline mode exploiting parallelism resulting from the decomposition of gross neuroanatomy parcellated into structures and regions as well as nano neuroanatomy decomposed into neurons and synapses, enabling the distributed construction and continual enhancement of the nanoscale atlas. Numerous applications of this atlas can be contemplated, ranging from proofreading and continual multi-site extension to exploration, morphometric and network-related analyses, and knowledge discovery. To the best of my knowledge, this is the first proposed neuronal morphology nanoscale model and the first attempt to design a human whole brain atlas at the nanoscale.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040178
Authors: Anna M. Girardi Elizabeth A. Cardell Stephen P. Bird
Radiological imaging is an essential component of a swallowing assessment. Artificial intelligence (AI), especially deep learning (DL) models, has enhanced the efficiency and efficacy of imaging interpretation, and subsequently, it has important implications for swallow diagnostics and intervention planning. However, the application of AI to the interpretation of videofluoroscopic swallow studies (VFSS) is still emerging. This review showcases the recent literature on the use of AI to interpret VFSS and highlights clinical implications for speech–language pathologists (SLPs). With a surge in AI research, there have been advances in dysphagia assessments. Several studies have demonstrated the successful implementation of DL algorithms to analyze VFSS. Notably, convolutional neural networks (CNNs), which involve training a multi-layered model to recognize specific image or video components, have been used to detect pertinent aspects of the swallowing process with high levels of precision. DL algorithms have the potential to streamline VFSS interpretation, improve efficiency and accuracy, and enable the precise interpretation of an instrumental dysphagia evaluation, which is especially advantageous when access to skilled clinicians is not ubiquitous. By enhancing the precision, speed, and depth of VFSS interpretation, SLPs can obtain a more comprehensive understanding of swallow physiology and deliver a targeted and timely intervention that is tailored towards the individual. This has practical applications for both clinical practice and dysphagia research. As this research area grows and AI technologies progress, the application of DL in the field of VFSS interpretation is clinically beneficial and has the potential to transform dysphagia assessment and management.
With broader validation and inter-disciplinary collaborations, AI-augmented VFSS interpretation will likely transform swallow evaluations and ultimately improve outcomes for individuals with dysphagia. However, despite AI’s potential to streamline imaging interpretation, practitioners still need to consider the challenges and limitations of AI implementation, including the need for large training datasets, interpretability and adaptability issues, and the potential for bias.
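The CNN building block the review highlights can be illustrated with a minimal sketch: a naive 2D cross-correlation in NumPy, the core operation a convolutional layer applies when scanning an image or video frame. This is an assumption-laden toy (the edge-detector kernel and 5x5 "frame" are illustrative), not the models used in the reviewed studies.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation: the sliding-window operation
    a CNN layer applies to each image or video frame."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge detector applied to a toy 5x5 "frame" whose left
# half is dark (0) and right half bright (1).
frame = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
edge_kernel = np.array([[-1.0, 1.0]])
response = conv2d(frame, edge_kernel)  # peaks where brightness jumps
```

A trained CNN learns many such kernels from data instead of hand-crafting them, which is what lets it pick out swallow-relevant structures in VFSS frames.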
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040177
Authors: Peter R. J. Trim Yang-Im Lee
Cyber security is high up on the agenda of senior managers in private and public sector organizations and is likely to remain so for the foreseeable future. [...]
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040176
Authors: Shivashankar Hiremath Eeshan Shetty Allam Jaya Prakash Suraj Prakash Sahoo Kiran Kumar Patro Kandala N. V. P. S. Rajesh Paweł Pławiak
The internet has become an indispensable tool for organizations, permeating every facet of their operations. Virtually all companies leverage Internet services for diverse purposes, including the digital storage of data in databases and on cloud platforms. Furthermore, the rising demand for software and applications has led to a widespread shift toward computer-based activities within the corporate landscape. However, this digital transformation has exposed the information technology (IT) infrastructures of these organizations to a heightened risk of cyber-attacks, endangering sensitive data. Consequently, organizations must identify and address vulnerabilities within their systems, with a primary focus on scrutinizing customer-facing websites and applications. This work aims to tackle this pressing issue by employing data analysis tools, such as Power BI, to assess vulnerabilities within a client’s application or website. A rigorous analysis of the data provides the insights necessary to formulate effective remedial measures against potential attacks. Ultimately, the central goal of this research is to demonstrate that clients can establish a secure environment, shielding their digital assets from potential attackers.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040175
Authors: Deptii Chaudhari Ambika Vishal Pawar
Misinformation, fake news, and various propaganda techniques are increasingly used in digital media. Uncovering propaganda is challenging because it works systematically toward the goal of influencing individuals for predetermined ends. While significant research has been reported on propaganda identification and classification in resource-rich languages such as English, much less effort has been made in resource-deprived languages such as Hindi. The spread of propaganda in the Hindi news media motivated us to devise an approach for the propaganda categorization of Hindi news articles. The unavailability of the necessary language tools makes propaganda classification in Hindi more challenging. This study proposes the effective use of deep learning and transformer-based approaches for Hindi computational propaganda classification. To address the lack of pretrained word embeddings in Hindi, Hindi Word2vec embeddings were created using the H-Prop-News corpus for feature extraction. Subsequently, three deep learning models (CNN, a convolutional neural network; LSTM, long short-term memory; and Bi-LSTM, bidirectional LSTM) and four transformer-based models (multilingual BERT, DistilBERT, Hindi-BERT, and Hindi-TPU-Electra) were evaluated. The experimental outcomes indicate that the multilingual BERT and Hindi-BERT models provide the best performance, with the highest F1 score of 84% on the test data. These results strongly support the efficacy of the proposed solution and indicate its appropriateness for propaganda classification.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040174
Authors: Sherin M. Omran Wessam H. El-Behaidy Aliaa A. A. Youssif
A cryptocurrency is a non-centralized form of money that facilitates financial transactions using cryptographic processes. It can be thought of as a virtual currency or a payment mechanism for sending and receiving money online. Cryptocurrencies have gained wide market acceptance and developed rapidly over the past few years. Due to the volatile nature of the crypto-market, cryptocurrency trading involves a high level of risk. In this paper, a new normalized decomposition-based multi-objective particle swarm optimization (N-MOPSO/D) algorithm is presented for cryptocurrency algorithmic trading. The aim of this algorithm is to help traders find the best Litecoin trading strategies that improve their outcomes. The proposed algorithm manages the trade-offs among three objectives: the return on investment, the Sortino ratio, and the number of trades. A hybrid weight assignment mechanism has also been proposed. The algorithm was compared against trading rules with their standard parameters, against MOPSO/D using normalized weighted Tchebycheff scalarization, and against MOEA/D. The proposed algorithm outperformed the counterpart algorithms on benchmark and real-world problems. Results showed that it is very promising and stable under different market conditions, maintaining the best returns and risk during both training and testing with a moderate number of trades.
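The weighted Tchebycheff scalarization used by the baselines (and by decomposition-based optimizers generally) reduces a vector of objectives to one scalar per weight vector. A minimal sketch, assuming minimization with objectives already normalized and an ideal point z*:

```python
def tchebycheff(objectives, weights, ideal):
    """Weighted Tchebycheff scalarization used in decomposition-based
    multi-objective optimizers (MOPSO/D, MOEA/D): each weight vector
    turns the objective vector f(x) into the scalar
    max_i w_i * |f_i(x) - z*_i| to be minimized."""
    return max(w * abs(f - z) for f, w, z in zip(objectives, weights, ideal))

# Three objectives, e.g. (negated) return, Sortino ratio, and trade
# count, all assumed normalized to [0, 1] with the ideal point at 0.
weights, ideal = [0.5, 0.3, 0.2], [0.0, 0.0, 0.0]
score_a = tchebycheff([0.2, 0.4, 0.1], weights, ideal)  # 0.12
score_b = tchebycheff([0.6, 0.1, 0.1], weights, ideal)  # 0.30
```

Under this weight vector, solution A scores lower (better) because its worst weighted deviation from the ideal point is smaller; sweeping many weight vectors traces out the trade-off surface among the three objectives.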
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040173
Authors: Alexander Shknevsky Yuval Shahar Robert Moskovitch
We propose a new pruning constraint when mining frequent temporal patterns to be used as classification and prediction features, the Semantic Adjacency Criterion [SAC], which filters out temporal patterns that contain potentially semantically contradictory components, exploiting each medical domain’s knowledge. We have defined three SAC versions and tested them within three medical domains (oncology, hepatitis, diabetes) and a frequent-temporal-pattern discovery framework. Previously, we had shown that using SAC enhances the repeatability of discovering the same temporal patterns in similar proportions in different patient groups within the same clinical domain. Here, we focused on SAC’s computational implications for pattern discovery, and for classification and prediction, using the discovered patterns as features, by four different machine-learning methods: Random Forests, Naïve Bayes, SVM, and Logistic Regression. Using SAC resulted in a significant reduction, across all medical domains and classification methods, of up to 97% in the number of discovered temporal patterns, and in the runtime of the discovery process, of up to 98%. Nevertheless, the highly reduced set of only semantically transparent patterns, when used as features, resulted in classification and prediction models whose performance was at least as good as the models resulting from using the complete temporal-pattern set.
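The filtering idea behind SAC can be sketched as follows. This is a hypothetical simplification, not the authors' implementation: a temporal pattern is taken to be a sequence of (concept, state) components, and the domain knowledge is a hand-written set of state transitions that are semantically contradictory; the concept and state names are invented for illustration.

```python
# Illustrative domain knowledge: for the same concept, these adjacent
# state pairs are considered semantically contradictory.
CONTRADICTORY = {("glucose", "very_low", "very_high"),
                 ("glucose", "very_high", "very_low")}

def satisfies_sac(pattern):
    """Reject a temporal pattern if any two adjacent components refer
    to the same concept with a contradictory state transition."""
    for (c1, s1), (c2, s2) in zip(pattern, pattern[1:]):
        if c1 == c2 and (c1, s1, s2) in CONTRADICTORY:
            return False
    return True

patterns = [
    [("glucose", "very_low"), ("glucose", "very_high")],  # contradictory
    [("glucose", "very_low"), ("glucose", "normal")],     # plausible
]
kept = [p for p in patterns if satisfies_sac(p)]  # only the second survives
```

Applying such a check inside the frequent-pattern miner, rather than after it, is what yields the reported reductions in both pattern count and discovery runtime, since contradictory candidates are never extended.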
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040172
Authors: Aibing Jin Prabhat Basnet Shakil Mahtab
In deep engineering, rockburst hazards frequently result in injuries, fatalities, and the destruction of contiguous structures. Due to the complex nature of rockbursts, predicting the severity of rockburst damage (intensity) without the aid of computer models is challenging. Although various predictive models exist, effectively identifying the risk severity in imbalanced data remains crucial. Ensemble boosting methods are often better suited to dealing with unequally distributed classes than classical models. Therefore, this paper employs the ensemble categorical gradient boosting (CGB) method to predict short-term rockburst risk severity. After data collection, principal component analysis (PCA) was employed to avoid the redundancies caused by multi-collinearity. Afterwards, the CGB model was trained on the PCA-transformed data, optimal hyper-parameters were selected via grid search, and performance on the test samples was evaluated using precision, recall, and F1 score metrics. The results showed that the PCA-CGB model achieved better prediction results than the single CGB model or conventional boosting methods. The model achieved an F1 score of 0.8952, indicating that the proposed model is robust in predicting damage severity given an imbalanced dataset. This work provides practical guidance for risk management.
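The PCA preprocessing step that removes multi-collinearity before the boosting stage can be sketched in plain NumPy (a minimal SVD-based projection on synthetic collinear data, not the paper's pipeline or dataset):

```python
import numpy as np

def pca_transform(X, n_components):
    """Project data onto its top principal components, removing the
    redundancy that multi-collinear input columns introduce."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
# Two extra columns that are exact linear combinations of the first
# three, mimicking collinear rockburst indicators.
X = np.hstack([base, base[:, :1] + base[:, 1:2], 2 * base[:, 2:3]])
Z = pca_transform(X, 3)  # 5 collinear features -> 3 informative ones
```

Because the synthetic matrix has rank 3, three components retain all of the variance; the boosting classifier is then trained on `Z` instead of `X`.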
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040171
Authors: Hossein Hassani Nadejda Komendantova Elena Rovenskaya Mohammad Reza Yeganegi
This research underscores the profound implications of Social Intelligence Mining, notably employing open access data and Google Search engine data for trend discernment. Utilizing advanced analytical methodologies, including wavelet coherence analysis and phase difference, hidden relationships and patterns within social data were revealed. These techniques furnish an enriched comprehension of social phenomena dynamics, bolstering decision-making processes. The study’s versatility extends across myriad domains, offering insights into public sentiment and the foresight for strategic approaches. The findings suggest immense potential in Social Intelligence Mining to influence strategies, foster innovation, and add value across diverse sectors.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040170
Authors: Amr Mohamed El Koshiry Entesar Hamed I. Eliwa Tarek Abd El-Hafeez Ahmed Omar
Social media platforms have become the primary means of communication and information sharing, facilitating interactive exchanges among users. Unfortunately, these platforms also witness the dissemination of inappropriate and toxic content, including hate speech and insults. While significant efforts have been made to classify toxic content in the English language, the same level of attention has not been given to Arabic texts. This study addresses this gap by constructing a standardized Arabic dataset specifically designed for toxic tweet classification. The dataset is annotated automatically using Google’s Perspective API and the expertise of three native Arabic speakers and linguists. To evaluate the performance of different models, we conduct a series of experiments using seven models: long short-term memory (LSTM), bidirectional LSTM, a convolutional neural network, a gated recurrent unit (GRU), bidirectional GRU, multilingual bidirectional encoder representations from transformers, and AraBERT. Additionally, we employ word embedding techniques. Our experimental findings demonstrate that the fine-tuned AraBERT model surpasses the performance of other models, achieving an impressive accuracy of 0.9960. Notably, this accuracy value outperforms similar approaches reported in recent literature. This study represents a significant advancement in Arabic toxic tweet classification, shedding light on the importance of addressing toxicity in social media platforms while considering diverse languages and cultures.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040169
Authors: Roman Odarchenko Maksim Iavich Giorgi Iashvili Solomiia Fedushko Yuriy Syerov
It is clear that 5G networks have already become integral to our present. However, a significant issue lies in the fact that current 5G communication systems cannot fully ensure the required quality of service and the security of transmitted data, especially in government networks that operate in the context of the Internet of Things, hostilities, hybrid warfare, and cyberwarfare. The use of 5G extends to critical infrastructure operators and special users such as law enforcement, governments, and the military. Adapting modern cellular networks to meet the specific needs of these special users is not only feasible but also necessary. In doing so, these networks must meet additional stringent requirements for reliability, performance, and, most importantly, data security. This paper is dedicated to addressing the challenges associated with ensuring cybersecurity in this context. To effectively improve or ensure a sufficient level of cybersecurity, it is essential to measure the primary indicators of the effectiveness of the security system. At present, no comprehensive list exists of the key indicators that require priority monitoring. Therefore, this article first analyzes existing indicators and presents a list of them, making it possible to continuously monitor the state of the cybersecurity systems of 5G cellular networks with the aim of using them for groups of special users. Based on this list of cybersecurity KPIs, the article then presents a model to identify and evaluate these indicators. To develop this model, we comprehensively analyzed potential groups of performance indicators, selected the most relevant ones, and introduced a mathematical framework for their quantitative assessment. Furthermore, as part of our research efforts, we proposed enhancements to the core of the 4G/5G network.
These enhancements enable data collection and statistical analysis through specialized sensors and existing servers, contributing to improved cybersecurity within these networks. The proposed approach thus enables continuous monitoring and, accordingly, improvement of the performance indicators of cybersecurity systems, making them suitable for maintaining critical infrastructure and for serving other users whose services impose stricter cybersecurity requirements.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040168
Authors: Andry Alamsyah Nadhif Ditertian Girawan
The disposability of clothing has emerged as a critical concern, precipitating waste accumulation due to product quality degradation. Such consequences exert significant pressure on resources and challenge sustainability efforts. In response, this research focuses on empowering clothing companies to improve product quality by harnessing consumer feedback. Beyond insights, the research extends to sustainability by suggesting how to refine product quality through better material handling, gradually mitigating waste production and extending garment longevity, thereby reducing the volume of discarded clothing. Managing a vast influx of diverse reviews necessitates sophisticated natural language processing (NLP) techniques. Our study introduces a Robustly Optimized BERT Pretraining Approach (RoBERTa) model calibrated for multilabel classification and BERTopic for topic modeling. The model distills vital themes from consumer reviews, identifying concerns across various dimensions of clothing quality with high accuracy. NLP endows companies with insights into consumer reviews, augmented by BERTopic to facilitate exploration of the harvested review topics. This research presents a thorough case for integrating machine learning to foster sustainability and waste reduction. Its contribution is notable for integrating RoBERTa and BERTopic for multilabel classification and topic modeling in the fashion industry. The results indicate that the RoBERTa model performs strongly, with macro-averaged and micro-averaged F1 scores of 0.87, while BERTopic achieves a coherence score of 0.67, meaning the model can form insightful topics.
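The macro- and micro-averaged F1 scores reported above measure multilabel performance in two distinct ways: macro averages per-label F1 values (treating rare labels equally), while micro pools the true/false positive counts across labels first. A small self-contained sketch on toy labels (the three "quality aspects" are invented for illustration):

```python
def f1_counts(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_micro_f1(y_true, y_pred, n_labels):
    """Macro-F1 averages the per-label F1 scores; micro-F1 pools the
    TP/FP/FN counts across all labels before computing a single F1."""
    per_label, totals = [], [0, 0, 0]
    for k in range(n_labels):
        tp = sum(t[k] and p[k] for t, p in zip(y_true, y_pred))
        fp = sum((not t[k]) and p[k] for t, p in zip(y_true, y_pred))
        fn = sum(t[k] and (not p[k]) for t, p in zip(y_true, y_pred))
        per_label.append(f1_counts(tp, fp, fn))
        totals = [totals[0] + tp, totals[1] + fp, totals[2] + fn]
    return sum(per_label) / n_labels, f1_counts(*totals)

# Toy multilabel ground truth and predictions over 3 quality aspects
# (e.g. fit, fabric, durability).
y_true = [(1, 0, 1), (0, 1, 0), (1, 1, 0)]
y_pred = [(1, 0, 1), (0, 1, 1), (1, 0, 0)]
macro, micro = macro_micro_f1(y_true, y_pred, 3)
```

That the paper's macro and micro scores coincide at 0.87 suggests performance is fairly uniform across labels; a large gap between the two usually signals weak performance on infrequent labels.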
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040167
Authors: Yijun Shao Kaitlin Todd Andrew Shutes-David Steven P. Millard Karl Brown Amy Thomas Kathryn Chen Katherine Wilson Qing T. Zeng Debby W. Tsuang
The application of natural language processing and machine learning (ML) in electronic health records (EHRs) may help reduce dementia underdiagnosis, but models that are not designed to reflect minority populations may instead perpetuate underdiagnosis. To improve the identification of undiagnosed dementia, particularly in Black Americans (BAs), we developed support vector machine (SVM) ML models to assign dementia risk scores based on features identified in unstructured EHR data (via latent Dirichlet allocation and stable topic extraction in n = 1 M notes) and structured EHR data. We hypothesized that separate models would show differentiation between racial groups, so the models were fit separately for BAs (n = 5 K with dementia ICD codes, n = 5 K without) and White Americans (WAs; n = 5 K with codes, n = 5 K without). To validate our method, scores were generated for separate samples of BAs (n = 10 K) and WAs (n = 10 K) without dementia codes, and the EHRs of 1.2 K of these patients were reviewed by dementia experts. All subjects were age 65+ and drawn from the VA, which meant that the samples were disproportionately male. A strong positive relationship was observed between SVM-generated risk scores and undiagnosed dementia. BAs were more likely than WAs to have undiagnosed dementia per chart review, both overall (15.3% vs. 9.5%) and among Veterans with >90th percentile cutoff scores (25.6% vs. 15.3%). With chart reviews as the reference standard and varied cutoff scores, the BA model performed slightly better than the WA model (AUC = 0.86 with negative predictive value [NPV] = 0.98, positive predictive value [PPV] = 0.26, sensitivity = 0.61, specificity = 0.92 and accuracy = 0.91 at >90th percentile cutoff vs. AUC = 0.77 with NPV = 0.98, PPV = 0.15, sensitivity = 0.43, specificity = 0.91 and accuracy = 0.89 at >90th). Our findings suggest that race-specific ML models can help identify BAs who may have undiagnosed dementia. 
Future studies should examine model generalizability in settings with more females and test whether incorporating these models into clinical settings increases the referral of undiagnosed BAs to specialists.
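The >90th-percentile cutoff step used to select Veterans for chart review can be sketched as follows. The scores here are synthetic and the nearest-rank percentile is one of several common conventions; the study derives its scores from the SVM models, not from this toy.

```python
def percentile(scores, q):
    """Nearest-rank percentile (no interpolation) of a list of scores."""
    ordered = sorted(scores)
    rank = max(0, min(len(ordered) - 1, round(q / 100 * len(ordered)) - 1))
    return ordered[rank]

# Synthetic risk scores for 100 patients: 0.00, 0.01, ..., 0.99.
scores = {f"patient_{i}": i / 100 for i in range(100)}
cutoff = percentile(list(scores.values()), 90)
# Patients above the 90th-percentile cutoff are flagged for review.
flagged = [p for p, s in scores.items() if s > cutoff]
```

With a uniform synthetic distribution, exactly the top tenth of patients clears the cutoff; in practice, the yield of undiagnosed dementia among the flagged group (the positive predictive value at the cutoff) is what the chart reviews estimate.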
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040166
Authors: Kun Xiang Akihiro Fujii
Climate change (CC) has become a central global topic within multiple branches of the social disciplines. Natural Language Processing (NLP) plays a prominent role here, having achieved remarkable results in various application scenarios. However, CC debates are ambiguous and complicated to interpret even for humans, especially at the aspect-oriented fine-grained level. Furthermore, the lack of large-scale, effectively labeled datasets is a persistent obstacle in NLP. In this work, we propose a novel weakly supervised Hybrid Attention Masking Capsule Neural Network (HAMCap) for fine-grained CC debate analysis. Specifically, we use vectors with different allocated weights instead of scalars, and we design a hybrid attention mechanism to better capture and represent information. By randomly masking with a Partial Context Mask (PCM) mechanism, we can better construct the internal relationship between aspects and entities and easily obtain a large-scale generated dataset. Considering the uniqueness of linguistics, we propose a Reinforcement Learning-based Generator-Selector mechanism to automatically update and select data that benefit model training. Empirical results indicate that our proposed ensemble model outperforms baselines on downstream tasks by up to 50.08% in accuracy and 49.48% in F1 score. Finally, we draw interpretable conclusions about the climate change debate, a widespread global concern.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040165
Authors: Pratik Thantharate Anurag Thantharate
With the digitization of healthcare, an immense amount of sensitive medical data are generated and shared between various healthcare stakeholders. However, traditional health data management mechanisms present interoperability, security, and privacy challenges. The centralized nature of current health information systems leads to single points of failure, making the data vulnerable to cyberattacks. Patients also have little control over their medical records, raising privacy concerns. Blockchain technology presents a promising solution to these challenges through its decentralized, transparent, and immutable properties. This research proposes ZeroTrustBlock, a comprehensive blockchain framework for secure and private health information exchange. The decentralized ledger enhances integrity, while permissioned access and smart contracts enable patient-centric control over medical data sharing. A hybrid on-chain and off-chain storage model balances transparency with confidentiality. Integration gateways bridge ZeroTrustBlock protocols with existing systems such as EHRs. Implemented on Hyperledger Fabric, ZeroTrustBlock demonstrates substantial security improvements over mainstream databases via cryptographic mechanisms, formal privacy-preserving protocols, and access policies enacting patient consent. Results validate the architecture’s effectiveness, achieving an average throughput of 14,200 TPS, an average latency of 480 ms for 100,000 concurrent transactions, and linear scalability up to 20 nodes. Enhancements around performance, advanced cryptography, and real-world pilots remain future work. Overall, ZeroTrustBlock provides a robust application of blockchain capabilities to transform security, privacy, interoperability, and patient agency in health data management.
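The hybrid on-chain/off-chain storage idea can be illustrated with a minimal stdlib sketch. The names and data structures here are invented stand-ins (a dict for the off-chain store, a list for the ledger), not the ZeroTrustBlock or Hyperledger Fabric APIs: only a hash of each record goes "on-chain", so integrity is verifiable without exposing confidential content.

```python
import hashlib
import json

off_chain_store = {}   # confidential payloads stay off-chain
on_chain_ledger = []   # append-only list standing in for the ledger

def store_record(record_id, record):
    """Keep the full record off-chain; anchor only its hash on-chain."""
    payload = json.dumps(record, sort_keys=True).encode()
    digest = hashlib.sha256(payload).hexdigest()
    off_chain_store[record_id] = record
    on_chain_ledger.append({"id": record_id, "sha256": digest})
    return digest

def verify_record(record_id):
    """Recompute the off-chain record's hash and check it against the ledger."""
    payload = json.dumps(off_chain_store[record_id], sort_keys=True).encode()
    digest = hashlib.sha256(payload).hexdigest()
    return any(e["id"] == record_id and e["sha256"] == digest
               for e in on_chain_ledger)

store_record("rec1", {"patient": "p01", "note": "normal ECG"})
ok_before = verify_record("rec1")
off_chain_store["rec1"]["note"] = "tampered"  # simulate tampering
ok_after = verify_record("rec1")
```

Any modification of the off-chain payload breaks the hash match, so tampering is detectable even though the ledger never holds the medical data itself.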
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040163
Authors: Foteini Gramouseni Katerina D. Tzimourta Pantelis Angelidis Nikolaos Giannakeas Markos G. Tsipouras
The objective of this systematic review centers on cognitive assessment based on electroencephalography (EEG) analysis in Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR) environments, projected on Head Mounted Displays (HMD), in healthy individuals. A range of electronic databases were searched (Scopus, ScienceDirect, IEEE Xplore, and PubMed) using the PRISMA method, and 82 experimental studies were included in the final report. Specific aspects of cognitive function were evaluated, including cognitive load, immersion, spatial awareness, interaction with the digital environment, and attention. These were analyzed with respect to the number of participants, stimuli, frequency band ranges, data preprocessing, and data analysis. Based on the analysis conducted, significant findings have emerged both in terms of the experimental structure related to cognitive neuroscience and the key parameters considered in the research. Numerous significant avenues and domains requiring more extensive exploration have also been identified within neuroscience and cognition research in digital environments. These encompass factors such as the experimental setup, including issues like narrow participant populations and the feasibility of using EEG equipment with a limited number of sensors to overcome the challenges posed by the time-consuming placement of a multi-electrode EEG cap. There is a clear need for more in-depth exploration in signal analysis, especially concerning the α, β, and γ sub-bands and their role in providing more precise insights for evaluating cognitive states. Finally, further research into augmented and mixed reality environments will enable the extraction of more accurate conclusions regarding their utility in cognitive neuroscience.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040164
Authors: Omar Adel Karma M. Fathalla Ahmed Abo ElFarag
Emotion recognition is crucial in artificial intelligence, particularly in the domain of human–computer interaction. The ability to accurately discern and interpret emotions plays a critical role in helping machines effectively decipher users’ underlying intentions, allowing for a more streamlined interaction process that translates into an elevated user experience. The recent increase in social media usage, as well as the availability of an immense amount of unstructured data, has resulted in a significant demand for the deployment of automated emotion recognition systems. Artificial intelligence (AI) techniques have emerged as a powerful solution to this pressing concern. In particular, the incorporation of multimodal AI-driven approaches for emotion recognition has proven beneficial in capturing the intricate interplay of diverse human expression cues that manifest across multiple modalities. The current study aims to develop an effective multimodal emotion recognition system known as MM-EMOR in order to improve the efficacy of emotion recognition efforts focused on audio and text modalities. The use of Mel spectrogram features, Chromagram features, and the MobileNet convolutional neural network (CNN) for processing audio data is central to the operation of this system, while an attention-based RoBERTa model caters to the text modality. The methodology of this study is based on an exhaustive evaluation of this approach across three different datasets. Notably, the empirical findings show that MM-EMOR outperforms competing models on the same datasets, with accuracy gains of 7% on one dataset, 8% on another, and, most significantly, 18% on the final dataset.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040162
Authors: Todd Dobbs Abdullah-Al-Raihan Nayeem Isaac Cho Zbigniew Ras
Art authentication is the process of identifying the artist who created a piece of artwork and is manifested through events of provenance, such as art gallery exhibitions and financial transactions. Art authentication draws on visual evidence: the uniqueness of one artist’s style in contrast to the styles of other artists. The significance of this contrast is proportional to the number of artists involved and the degree of uniqueness of an artist’s collection. This visual uniqueness of style can be captured in a mathematical model produced by a machine learning (ML) algorithm applied to painting images. Art authentication is not always possible, as provenance can be obscured or lost through anonymity, forgery, gifting, or theft of artwork. This paper presents an image-only art authentication attribute marker of contemporary art paintings for a very large number of artists. The experiments in this paper demonstrate that it is possible to use ML-generated models to authenticate contemporary art across pools ranging from 2368 artists down to 100 artists, with accuracies of 48.97% and 91.23%, respectively. This is the largest effort for image-only art authentication to date with respect to the number of artists involved and the accuracy of authentication.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040161
Authors: Yunpeng Chen Ying Zhao Wenxuan Xie Yanbo Zhai Xin Zhao Jiang Zhang Jiang Long Fangfang Zhou
Data governance aims to optimize the value derived from data assets and effectively mitigate data-related risks. The rapid growth of data assets increases the risk of data breaches. One key solution to reduce this risk is to classify data assets according to their business value and criticality to the enterprises, allocating limited resources to protect core data assets. The existing methods rely on the experience of professionals and cannot identify core data assets across business scenarios. This work conducts an empirical study to address this issue. First, we utilized data lineage graphs with expert-labeled core data assets to investigate the experience of data users on core data asset identification from a scenario perspective. Then, we explored the structural features of core data assets on data lineage graphs from an abstraction perspective. Finally, one expert seminar was conducted to derive a set of universal indicators to identify core data assets by synthesizing the results from the two perspectives. User and field studies were conducted to demonstrate the effectiveness of the indicators.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040160
Authors: Dauren Ayazbayev Andrey Bogdanchikov Kamila Orynbekova Iraklis Varlamis
This work focuses on determining semantically close words and using semantic similarity in general to improve performance in information retrieval tasks. The semantic similarity of words is an important task with many applications, from information retrieval to spell checking and even document clustering and classification. Although the methods and tools for this task are well established in languages with rich linguistic resources, some languages lack such tools. The first step in our experiment is to represent the words in a collection in vector form and then define the semantic similarity of the terms using a vector similarity method. To tame the complexity of the task, which depends on the number of word (and, consequently, vector) pairs that must be compared to find the semantically closest word pairs, a distributed method that runs on Apache Spark is designed to reduce the calculation time by running comparison tasks in parallel. Three alternative implementations are proposed and tested using a list of target words, seeking the most semantically similar words from a lexicon for each of them. In a second step, we employ pre-trained multilingual sentence transformers to capture content semantics at the sentence level and a vector-based semantic index to accelerate the searches. The code is written in MapReduce, and the experiments and results show that the proposed methods can provide an interesting solution for finding similar words or texts in the Kazakh language.
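The per-pair comparison that the Spark jobs distribute is typically a cosine similarity between word vectors followed by a top-k selection. A minimal single-machine NumPy sketch of that inner task (the toy 3-dimensional embeddings are invented; real word vectors would come from a trained model):

```python
import numpy as np

def most_similar(target_vec, lexicon_vecs, k=2):
    """Return indices of the k lexicon vectors most cosine-similar to
    the target vector; this is the comparison each parallel task runs."""
    lex = np.asarray(lexicon_vecs, dtype=float)
    t = np.asarray(target_vec, dtype=float)
    sims = lex @ t / (np.linalg.norm(lex, axis=1) * np.linalg.norm(t))
    return np.argsort(sims)[::-1][:k]

# Toy 3-d embeddings: vectors 0 and 2 point roughly the same way as
# the target, vector 1 is orthogonal-ish, vector 3 is unrelated.
lexicon = [[1, 0, 0], [0, 1, 0], [2, 0.1, 0], [0, 0, 1]]
top = most_similar([1, 0.05, 0], lexicon, k=2)
```

In the distributed setting, the lexicon is partitioned across workers, each worker computes its local top-k against every target word, and the per-partition results are merged; the sentence-transformer step replaces the word vectors with sentence embeddings while keeping the same similarity search.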
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040159
Authors: Jihua Cui Zhenbang Wang Ziheng Yang Xin Guan
As the number of layers in deep learning models increases, the numbers of parameters and computations increase as well, making such models difficult to deploy on edge devices. Pruning can significantly reduce the number of parameters and computations in a deep learning model. However, existing pruning methods frequently require a specific distribution of network parameters to achieve good results when measuring filter importance. To address this, a pruning method based on feature map similarity scores is proposed. We calculate the similarity score of each feature map to measure the importance of the corresponding filter and guide filter pruning, using the similarity between filter output feature maps to measure the redundancy of the corresponding filter. Pruning experiments with ResNet-56 and ResNet-110 on the CIFAR-10 dataset show that the method can compress the models by more than 70% while maintaining a higher compression ratio and accuracy than traditional methods.
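The redundancy-scoring idea can be sketched in NumPy: flatten each filter's output feature map, score every filter by its average cosine similarity to the others, and treat the highest-scoring filters as pruning candidates. This is a simplified stand-in under stated assumptions (cosine similarity as the score, synthetic feature maps), not the paper's exact criterion.

```python
import numpy as np

def redundancy_scores(feature_maps):
    """Score each filter by the average cosine similarity of its output
    feature map to all other filters' maps; a high score means the
    filter's output is largely duplicated elsewhere."""
    flat = feature_maps.reshape(feature_maps.shape[0], -1).astype(float)
    flat /= np.linalg.norm(flat, axis=1, keepdims=True)
    sims = flat @ flat.T
    np.fill_diagonal(sims, 0.0)          # ignore self-similarity
    return sims.sum(axis=1) / (len(flat) - 1)

rng = np.random.default_rng(1)
maps = rng.normal(size=(4, 8, 8))                    # 4 filters, 8x8 maps
maps[3] = maps[0] + 0.01 * rng.normal(size=(8, 8))   # near-duplicate filter
scores = redundancy_scores(maps)
pruned = int(np.argmax(scores))          # most redundant filter first
```

Because filters 0 and 3 produce nearly identical feature maps, one of them receives the top redundancy score and is the first pruning candidate; distinct filters score near zero.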
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7040158
Authors: Isabella Gagliardi Maria Teresa Artese
When integrating data from different sources, problems arise from synonymy, different languages, and concepts of different granularity. This paper proposes a simple yet effective approach to evaluating the semantic similarity of short texts, especially keywords. The method can match keywords from different sources and languages by exploiting transformers and WordNet-based methods. Key features of the approach include its unsupervised pipeline, mitigation of the lack of context in keywords, scalability for large archives, support for multiple languages, and adaptability to real-world scenarios. The work aims to provide a versatile tool for different cultural heritage archives without requiring complex customization. The paper explores different approaches to identifying similarities in 1- or n-gram tags, evaluates and compares different pre-trained language models, and defines integrated methods to overcome their limitations. Tests to validate the approach were conducted using the QueryLab portal, a search engine for cultural heritage archives, to evaluate the proposed pipeline.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030157
Authors: Khrystyna Lipianina-Honcharenko Carsten Wolff Anatoliy Sachenko Ivan Kit Diana Zahorodnia
Anthropogenic disasters pose a challenge to management in the modern world. At the same time, it is important to have accurate and timely information to assess the level of danger and take appropriate measures to eliminate disasters. Therefore, the purpose of this paper is to develop an effective method for assessing the level of anthropogenic disasters based on information from witnesses to the event. For this purpose, a conceptual model for assessing the consequences of anthropogenic disasters is proposed, whose main components are the analysis of collected data and the modeling and assessment of their consequences. The main characteristics of the intelligent method for classifying the level of anthropogenic disasters are considered, in particular, exploratory data analysis (EDA), classification of textual data using SMOTE, and data classification by the ensemble method of machine learning using boosting. The experimental results confirmed that for textual data, the best classification is at level V and level I, with an error of 0.97 and 0.94, respectively, and an average error estimate of 0.68. For quantitative data, the classification accuracy of Potential Accident Level relative to Industry Sector is 77%, and the f1-score is 0.88, which indicates a fairly high accuracy of the model. The architecture of a mobile application for classifying the level of anthropogenic disasters has been developed, which reduces the time required to assess the consequences of danger in a region. In addition, the proposed approach ensures interaction with dynamic and uncertain environments, which makes it an effective tool for classification.
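The SMOTE step mentioned above balances classes by generating synthetic minority samples, interpolating between a sample and one of its nearest neighbours. A minimal sketch of that idea, on toy 2-D points rather than the disaster data:

```python
import random

def smote(minority, k=2, n_new=4, seed=0):
    """Classic SMOTE idea: synthesize a new point on the segment between
    a minority sample and one of its k nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x (excluding x itself), by squared distance
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        nb = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(x, nb)])
    return synthetic

minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [1.1, 1.1]]
new_points = smote(minority)
print(len(new_points))  # 4 synthetic samples
```

Every synthetic point lies between two real minority samples, so the class region is densified without simple duplication.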
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030156
Authors: Manlika Seefong Panuwat Wisutwattanasak Chamroeun Se Kestsirin Theerathitichaipa Sajjakaj Jomnonkwao Thanapong Champahom Vatanavongs Ratanavaraha Rattanaporn Kasemsri
Machine learning currently holds a vital position in predicting collision severity. Identifying factors associated with heightened risks of injury and fatality aids in enhancing road safety measures and management. Presently, Thailand faces considerable challenges with respect to road traffic accidents. These challenges are particularly acute in industrial zones, where they contribute to a rise in injuries and fatalities. The mixture of heavy traffic, comprising both trucks and non-trucks, significantly amplifies the risk of accidents. This situation generates profound concerns for road safety in Thailand. Consequently, discerning the factors that influence the severity of injuries and fatalities becomes pivotal for formulating effective road safety policies and measures. This study is specifically aimed at predicting the factors contributing to the severity of accidents involving truck and non-truck collisions in industrial zones. It considers a variety of aspects, including roadway characteristics, underlying assumptions of cause, crash characteristics, and weather conditions. Because accident data are big data with specific characteristics and complexity, we employ machine learning in tandem with the Multivariate Adaptive Regression Splines (MARS) technique to make precise predictions and identify the factors influencing the severity of collision outcomes. The analysis demonstrates that various factors augment the severity of accidents involving trucks. These include darting in front of a vehicle, head-on collisions, and pedestrian collisions. Conversely, for non-truck collisions, the significant factors that heighten severity are tailgating, running signs/signals, angle collisions, head-on collisions, overtaking collisions, pedestrian collisions, obstruction collisions, and collisions during overcast conditions.
These findings illuminate the significant factors influencing the severity of accidents involving trucks and non-trucks. Such insights provide invaluable information for developing targeted road safety measures and policies, thereby contributing to the mitigation of injuries and fatalities.
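Multivariate Adaptive Regression Splines build models as weighted sums of hinge basis functions, which is what lets them capture threshold effects such as severity rising sharply past some value. The model terms and numbers below are hypothetical, not fitted to the study's data.

```python
def hinge(x, knot, direction):
    """A MARS basis function: max(0, x - knot) or max(0, knot - x)."""
    return max(0.0, (x - knot) if direction > 0 else (knot - x))

def mars_predict(x, intercept, terms):
    """A fitted MARS model is a weighted sum of hinge functions.
    terms: list of (coefficient, knot, direction) triples."""
    return intercept + sum(c * hinge(x, k, d) for c, k, d in terms)

# Hypothetical fitted model: severity rises sharply once speed exceeds 60,
# and falls off gently below it.
model_terms = [(0.5, 60.0, +1), (0.1, 60.0, -1)]
print(mars_predict(80.0, 2.0, model_terms))  # 2.0 + 0.5 * 20 = 12.0
print(mars_predict(50.0, 2.0, model_terms))  # 2.0 + 0.1 * 10 = 3.0
```

The kink at the knot (60 here) is what MARS searches for when it selects basis functions, giving piecewise-linear fits that remain easy to interpret.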
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030155
Authors: Bernardo Panichi Alessandro Lazzeri
This paper addresses the time-intensive task of assigning accurate account labels to invoice entries within corporate bookkeeping. Despite the advent of electronic invoicing, many software solutions still rely on rule-based approaches that fail to address the multifaceted nature of this challenge. While machine learning holds promise for such repetitive tasks, the presence of low-quality training data often poses a hurdle. Frequently, labels pertain to invoice rows at a group level rather than an individual level, leading to the exclusion of numerous records during preprocessing. To enhance the efficiency of an invoice entry classifier within a semi-supervised context, this study proposes an innovative approach that combines the classifier with the A* graph search algorithm. Experiments across various classifiers consistently demonstrated a noteworthy increase in accuracy, ranging between 1% and 4%. This improvement is primarily attributed to a marked reduction in the discard rate of data, which decreased from 39% to 14%. This paper contributes to the literature by presenting a method that leverages the synergy of a classifier and A* graph search to overcome challenges posed by limited and group-level label information in the realm of electronic invoicing classification.
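The A* component can be illustrated generically: expand the frontier node with the lowest cost-so-far plus an admissible heuristic estimate of the remaining cost. The graph, costs, and heuristic below are toy values, not invoice-classification states.

```python
import heapq

def a_star(graph, h, start, goal):
    """Generic A*: graph maps node -> [(neighbour, cost)], h is an
    admissible heuristic estimating the remaining cost to the goal."""
    frontier = [(h(start), 0, start, [start])]  # (f = g + h, g, node, path)
    best = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        for nb, cost in graph.get(node, []):
            ng = g + cost
            if ng < best.get(nb, float("inf")):  # found a cheaper route to nb
                best[nb] = ng
                heapq.heappush(frontier, (ng + h(nb), ng, nb, path + [nb]))
    return None

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)], "C": [("D", 1)]}
h = lambda n: 0  # trivial (always admissible) heuristic -> behaves like Dijkstra
print(a_star(graph, h, "A", "D"))  # (3, ['A', 'B', 'C', 'D'])
```

In a labeling setting, nodes would encode partial assignments of account labels to invoice rows and the heuristic would score how promising a partial assignment is.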
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030154
Authors: Leila Malihi Gunther Heidemann
Efficient model deployment is a key focus in deep learning. This has led to the exploration of methods such as knowledge distillation and network pruning to compress models and increase their performance. In this study, we investigate the potential synergy between knowledge distillation and network pruning to achieve optimal model efficiency and improved generalization. We introduce an innovative framework for model compression that combines knowledge distillation, pruning, and fine-tuning to achieve enhanced compression while providing control over the degree of compactness. Our research is conducted on popular datasets, CIFAR-10 and CIFAR-100, employing diverse model architectures, including ResNet, DenseNet, and EfficientNet. The framework lets us calibrate the amount of compression, producing models with different degrees of compactness that remain just as accurate, or even more so. Notably, we demonstrate its efficacy by producing two compressed variants of ResNet-101: ResNet-50 and ResNet-18. Our results reveal intriguing findings. In most cases, the pruned and distilled student models exhibit comparable or superior accuracy to the distilled student models while utilizing significantly fewer parameters.
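The distillation half of such a framework typically minimizes a Hinton-style loss: a soft-target term matching the teacher's temperature-softened outputs, mixed with the ordinary hard-label term. A plain-Python sketch (the logits, temperature, and mixing weight are illustrative, not the paper's settings):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax; larger T spreads the distribution."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(p, q, eps=1e-12):
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def distillation_loss(student_logits, teacher_logits, onehot, T=4.0, alpha=0.7):
    """KD loss: alpha * soft-target term (teacher at temperature T,
    scaled by T^2 to keep gradient magnitudes comparable)
    + (1 - alpha) * hard-label cross-entropy."""
    soft = cross_entropy(softmax(teacher_logits, T), softmax(student_logits, T))
    hard = cross_entropy(onehot, softmax(student_logits))
    return alpha * (T ** 2) * soft + (1 - alpha) * hard

loss = distillation_loss([2.0, 0.5, -1.0], [3.0, 1.0, -2.0], [1, 0, 0])
print(round(loss, 3))
```

Pruning then removes parameters from the student, and fine-tuning with this same loss recovers accuracy at the chosen compactness level.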
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030153
Authors: Cornelia A. Győrödi Tudor Turtureanu Robert Ş. Győrödi Doina R. Zmaranda
The accelerating pace of application development requires more frequent database switching, as technological advancements demand agile adaptation. The growth in data volume and, at the same time, in the number of transactions has led some applications to migrate from one database to another, especially from a relational database to a non-relational (NoSQL) alternative. In this transition phase, the coexistence of both databases becomes necessary. In addition, certain users choose to keep both databases permanently updated in order to exploit the individual strengths of each and streamline operations. Existing solutions mainly focus on replication, failing to adequately address the management of synchronization between a relational and a non-relational (NoSQL) database. This paper proposes a practical IT approach to this problem and tests the feasibility of the proposed solution by developing an application that maintains synchronization between MySQL as a relational database and MongoDB as a non-relational database. The performance and capabilities of the solution are analyzed to ensure data consistency and correctness. In addition, problems that arose during the development of the application are highlighted, and solutions are proposed to solve them.
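A minimal sketch of the synchronization idea: map each relational row to a document shape, diff the two stores, and emit the upserts and deletes needed to reconcile them. The schema and field names are hypothetical, and a real implementation would use MySQL/MongoDB drivers and some form of change capture rather than full diffs.

```python
def row_to_document(row):
    """Map a relational row (flat dict) to a MongoDB-style nested
    document; field names here are hypothetical."""
    return {
        "_id": row["id"],
        "name": row["name"],
        "address": {"city": row["city"], "zip": row["zip"]},
    }

def sync_actions(mysql_rows, mongo_docs):
    """Diff the two stores and return the upserts/deletes needed to bring
    the document store in line with the relational one (the source of truth)."""
    desired = {d["_id"]: d for d in map(row_to_document, mysql_rows)}
    current = {d["_id"]: d for d in mongo_docs}
    upserts = [d for _id, d in desired.items() if current.get(_id) != d]
    deletes = [_id for _id in current if _id not in desired]
    return upserts, deletes

rows = [{"id": 1, "name": "Ana", "city": "Oradea", "zip": "410087"}]
docs = [{"_id": 1, "name": "Ana", "address": {"city": "Cluj", "zip": "400001"}},
        {"_id": 2, "name": "Old", "address": {"city": "X", "zip": "0"}}]
ups, dels = sync_actions(rows, docs)
print(len(ups), dels)  # 1 [2]: one stale document to update, one to delete
```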
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030152
Authors: Christos Bormpotsis Mohamed Sedky Asma Patel
In the realm of foreign exchange (Forex) market prediction, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been commonly employed. However, these models often exhibit instability due to vulnerability to data perturbations, attributable to their monolithic architecture. Hence, this study proposes a novel neuroscience-informed modular network that harnesses closing prices and sentiments from the Yahoo Finance and Twitter APIs, with the objective of predicting price fluctuations in the Euro to British Pound Sterling (EUR/GBP) pair more effectively than monolithic methods. The proposed model offers a unique methodology based on a reinvigorated modular CNN, replacing pooling layers with orthogonal-kernel-initialised RNNs coupled with Monte Carlo Dropout (MCoRNNMCD). It integrates two pivotal modules: a convolutional simple RNN and a convolutional Gated Recurrent Unit (GRU). These modules incorporate orthogonal kernel initialisation and Monte Carlo Dropout techniques to mitigate overfitting and assess each module’s uncertainty. The synthesis of these parallel feature extraction modules culminates in a three-layer Artificial Neural Network (ANN) decision-making module. Rigorous evaluation on objective metrics such as the Mean Square Error (MSE) underscores the proposed MCoRNNMCD–ANN’s exceptional performance: it surpasses single CNNs, LSTMs, GRUs, and the state-of-the-art hybrid BiCuDNNLSTM, CLSTM, CNN–LSTM, and LSTM–GRU models in predicting hourly EUR/GBP closing price fluctuations.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030151
Authors: Hana Alostad Shoug Dawiek Hasan Davulcu
The Kuwaiti dialect is a particular dialect of Arabic spoken in Kuwait; it differs significantly from standard Arabic and from the dialects of neighboring countries in the same region. Few research papers with a focus on the Kuwaiti dialect have been published in the field of NLP. In this study, we created Kuwaiti dialect language resources using Q8VaxStance, a vaccine stance labeling system for a large dataset of tweets. This dataset fills this gap and provides a valuable resource for researchers studying vaccine hesitancy in Kuwait. Furthermore, it contributes to the Arabic natural language processing field by providing a dataset for developing and evaluating machine learning models for stance detection in the Kuwaiti dialect. The proposed vaccine stance labeling system combines the benefits of weakly supervised learning and zero-shot learning; for this purpose, we implemented 52 experiments on 42,815 unlabeled tweets extracted between December 2020 and July 2022. The results show that using keyword detection in conjunction with zero-shot model labeling functions is significantly better than using only keyword detection labeling functions or only zero-shot model labeling functions. Furthermore, in terms of the total number of generated labels, using Arabic in both the labels and the prompt, or a mix of Arabic labels and an English prompt, generates significantly more labels than using English in both the labels and the prompt. The best Macro-F1 value in our experiments, 0.83, was achieved when using keyword and hashtag detection labeling functions in conjunction with zero-shot model labeling functions, specifically in experiments KHZSLF-EE4 and KHZSLF-EA1. Experiment KHZSLF-EE4 labeled 42,270 tweets, while experiment KHZSLF-EA1 labeled 42,764 tweets.
Finally, the average value of annotation agreement between the generated labels and human labels ranges between 0.61 and 0.64, which is considered a good level of agreement.
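The labeling-function idea behind such weak supervision can be sketched as follows: several noisy functions vote on each tweet, and a combination rule (here a simple signed vote) produces the label or abstains. The keywords and vote rule are illustrative stand-ins for the paper's keyword, hashtag, and zero-shot labeling functions.

```python
ABSTAIN, PRO, ANTI = 0, 1, -1

# Hypothetical keyword labeling functions; a real system would add hashtag
# detectors and zero-shot model predictions as further voters.
def lf_pro(tweet):
    return PRO if any(w in tweet for w in ("vaccinated", "safe")) else ABSTAIN

def lf_anti(tweet):
    return ANTI if any(w in tweet for w in ("refuse", "hoax")) else ABSTAIN

def label(tweet, lfs):
    """Combine labeling-function votes; abstain when they cancel out."""
    score = sum(lf(tweet) for lf in lfs)
    if score > 0:
        return "pro-vaccine"
    if score < 0:
        return "anti-vaccine"
    return "abstain"  # no confident label is generated for this tweet

lfs = [lf_pro, lf_anti]
print(label("just got vaccinated, feeling safe", lfs))  # pro-vaccine
print(label("i refuse this hoax", lfs))                 # anti-vaccine
print(label("weather is nice today", lfs))              # abstain
```

Tweets that every function abstains on stay unlabeled, which is why the number of generated labels is itself a metric worth comparing across configurations.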
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030150
Authors: Manar M. F. Donia Wessam H. El-Behaidy Aliaa A. A. Youssif
The study of human behaviors aims to gain a deeper perception of the stimuli that control decision making. To describe, explain, predict, and control behavior, human behavior can be classified as either non-aggressive or anomalous. Anomalous behavior is any unusual activity; impulsive, aggressive, or violent behaviors are the most harmful. Detecting such behaviors at the initial spark is critical for guiding public safety decisions and maintaining security. This paper proposes an automatic aggressive-event recognition method based on effective feature representation and analysis. The proposed approach depends on a spatiotemporal discriminative feature that combines histograms of oriented gradients and dense optical flow features. In addition, the principal component analysis (PCA) and linear discriminant analysis (LDA) techniques are used for complexity reduction. The performance of the proposed approach is analyzed on three datasets: Hockey-Fight (HF), Stony Brook University (SBU)-Kinect, and Movie-Fight (MF), with accuracy rates of 96.5%, 97.8%, and 99.6%, respectively. This paper also assesses and contrasts feature engineering and learned features for impulsive aggressive event recognition. Experiments show promising results of the proposed method compared to the state of the art. The implementation of the proposed work is available here.
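The orientation-histogram computation at the heart of the HOG descriptor can be sketched for a single cell: accumulate gradient magnitudes into bins indexed by gradient direction. The 4x4 patch is a toy example, and a full descriptor would additionally normalize over blocks of cells.

```python
import math

def hog_cell(patch, bins=9):
    """Orientation histogram of gradients for one cell (the core of a
    HOG descriptor), computed with simple central differences."""
    h, w = len(patch), len(patch[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180  # unsigned gradient
            hist[int(ang / (180 / bins)) % bins] += mag
    return hist

# A patch with a vertical edge: all gradients point horizontally,
# so the histogram mass lands in the 0-degree bin.
patch = [[0, 0, 10, 10]] * 4
hist = hog_cell(patch)
print([round(v, 1) for v in hist])
```

Stacking such histograms over a grid of cells per frame, concatenated with dense optical flow statistics, yields the kind of spatiotemporal feature vector that PCA/LDA can then compress.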
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030149
Authors: Mario Michelessa Christophe Hurter Brian Y. Lim Jamie Ng Suat Ling Bogdan Cautis Carol Anne Hargreaves
Social networks have become important objects of study in recent years. Social media marketing, for example, has greatly benefited from the vast literature developed in the past two decades, and the study of social networks has taken advantage of recent advances in machine learning to process these immense amounts of data. Automatic emotional labeling of content on social media, for example, has been made possible by recent progress in natural language processing. In this work, we are interested in the influence maximization problem, which consists of finding the most influential nodes in a social network. The problem is classically tackled by optimizing performance metrics such as accuracy or recall, which are not the end goal of influence maximization. Our work presents an end-to-end learning model, SGREEDYNN, for the selection of the most influential nodes in a social network, given a history of information diffusion. In addition, this work proposes data visualization techniques to interpret the performance improvements of our method compared to classical training. The results of this method are confirmed by visualizing the final influence of the selected nodes on network instances with edge bundling techniques. Edge bundling is a visual aggregation technique that makes patterns emerge and has been shown to be an interesting asset for decision making. Using edge bundling, we observe that our method chooses more diverse and higher-degree nodes compared to classical training.
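The classic greedy baseline for influence maximization, which learned selectors are measured against, picks seeds one at a time by marginal gain in estimated spread. The diffusion log below is a hypothetical stand-in for a real diffusion history.

```python
def spread(seed_set, diffusion_log):
    """Estimated influence: users reached by at least one seed, per a
    (hypothetical) diffusion history mapping node -> set of reached nodes."""
    reached = set(seed_set)
    for s in seed_set:
        reached |= diffusion_log.get(s, set())
    return len(reached)

def greedy_seeds(nodes, diffusion_log, k):
    """Classic greedy: repeatedly add the node with the largest
    marginal gain in spread."""
    seeds = []
    for _ in range(k):
        best = max((n for n in nodes if n not in seeds),
                   key=lambda n: spread(seeds + [n], diffusion_log))
        seeds.append(best)
    return seeds

log = {"a": {"b", "c", "d"}, "b": {"c"}, "e": {"f", "g"}}
print(greedy_seeds(["a", "b", "e", "f"], log, 2))  # ['a', 'e']
```

Note that "b" is never picked even though it spreads information: its reach is already covered by "a", so its marginal gain is zero. Diversity of coverage, not raw reach, drives the selection.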
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030148
Authors: Georgios Trichopoulos Markos Konstantakis George Caridakis Akrivi Katifori Myrto Koukouli
This paper introduces a groundbreaking approach to enriching the museum experience using ChatGPT4, a state-of-the-art language model by OpenAI. By developing a museum guide powered by ChatGPT4, we aimed to address the challenges visitors face in navigating vast collections of artifacts and interpreting their significance. Leveraging the model’s natural-language-understanding and -generation capabilities, our guide offers personalized, informative, and engaging experiences. However, caution must be exercised as the generated information may lack scientific integrity and accuracy. To mitigate this, we propose incorporating human oversight and validation mechanisms. The subsequent sections present our own case study, detailing the design, architecture, and experimental evaluation of the museum guide system, highlighting its practical implementation and insights into the benefits and limitations of employing ChatGPT4 in the cultural heritage context.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030147
Authors: Maryna Stasevych Viktor Zvarych
The future of innovative robotic technologies and artificial intelligence (AI) in pharmacy and medicine is promising, with the potential to revolutionize various aspects of health care. These advances aim to increase efficiency, improve patient outcomes, and reduce costs while addressing pressing challenges such as personalized medicine and the need for more effective therapies. This review examines the major advances in robotics and AI in the pharmaceutical and medical fields, analyzing the advantages, obstacles, and potential implications for future health care. In addition, prominent organizations and research institutions leading the way in these technological advancements are highlighted, showcasing their pioneering efforts in creating and utilizing state-of-the-art robotic solutions in pharmacy and medicine. By thoroughly analyzing the current state of robotic technologies in health care and exploring the possibilities for further progress, this work aims to provide readers with a comprehensive understanding of the transformative power of robotics and AI in the evolution of the healthcare sector. Striking a balance between embracing technology and preserving the human touch, investing in R&D, and establishing regulatory frameworks within ethical guidelines will shape the future of robotics and AI systems. The future of pharmacy and medicine lies in the seamless integration of robotics and AI systems to benefit patients and healthcare providers.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030146
Authors: Matthieu Saumard
Speech Emotion Recognition (SER) has gained significant attention in the fields of human–computer interaction and speech processing. In this article, we present a novel approach to improving SER performance by interpreting Mel Frequency Cepstral Coefficients (MFCCs) as multivariate functional data objects, which accelerates learning while maintaining high accuracy. To treat MFCCs as functional data, we preprocess them as images and apply resizing techniques. Representing MFCCs as functional data lets us leverage the temporal dynamics of speech, capturing essential emotional cues more effectively and significantly speeding up the learning process of SER methods without compromising performance. We then employ a supervised learning model, specifically a functional Support Vector Machine (SVM), directly on the MFCCs represented as functional data. This enables the utilization of the full functional information, allowing for more accurate emotion recognition. The proposed approach is rigorously evaluated on two distinct databases, EMO-DB and IEMOCAP, which serve as benchmarks for SER evaluation. Our method demonstrates competitive accuracy, showcasing its effectiveness in emotion recognition, and it substantially reduces the learning time, making it computationally efficient and practical for real-world applications. In conclusion, treating MFCCs as multivariate functional data objects yields superior performance in SER tasks, delivering both improved accuracy and substantial time savings during the learning process. This advancement holds great potential for enhancing human–computer interaction and enabling more sophisticated emotion-aware applications.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030145
Authors: Ekaterina Lesnyak Tabea Belkot Johannes Hurka Jan Philipp Hörding Lea Kuhlmann Pavel Paulau Marvin Schnabel Patrik Schönfeldt Jan Middelberg
The heat transition is a central pillar of the energy transition, aiming to decarbonize and improve the energy efficiency of the heat supply in both the private and industrial sectors. On the one hand, this is achieved by substituting fossil fuels with renewable energy. On the other hand, it involves reducing overall heat consumption and associated transmission and ventilation losses. In addition to refurbishment, digitalization contributes significantly. Despite substantial research on Digital Twins (DTs) for heat transition at different scales, a cross-scale perspective on heat optimization still needs to be developed. In response to this research gap, the present study examines four instances of applied DTs across various scales: building, campus, neighborhood, and urban. The study compares their objectives and conceptual frameworks while also identifying common challenges and potential synergies. The study’s findings indicate that all DT scales face similar data-related challenges, such as gathering, ownership, connectivity, and reliability. Also, hierarchical synergy is identified among the DTs, implying the need for collaboration and exchange. In response to this, the “Wärmewende” data platform, whose objectives and concepts are presented in the paper, promotes research data and knowledge exchange with internal and external stakeholders.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030144
Authors: Muhammad Shoaib Arif Aiman Mukheimer Daniyal Asif
Clinical decision-making in chronic disorder prognosis is often hampered by high variance, leading to uncertainty and negative outcomes, especially in cases such as chronic kidney disease (CKD). Machine learning (ML) techniques have emerged as valuable tools for reducing randomness and enhancing clinical decision-making. However, conventional methods for CKD detection often lack accuracy due to their reliance on limited sets of biological attributes. This research proposes a novel ML model for predicting CKD, incorporating various preprocessing steps, feature selection, a hyperparameter optimization technique, and ML algorithms. To address challenges in medical datasets, we employ iterative imputation for missing values and a novel sequential approach for data scaling, combining robust scaling, z-standardization, and min-max scaling. Feature selection is performed using the Boruta algorithm, and the model is developed using ML algorithms. The proposed model was validated on the UCI CKD dataset, achieving outstanding performance with 100% accuracy. Our approach, combining innovative preprocessing steps, the Boruta feature selection, and the k-nearest neighbors algorithm, along with a hyperparameter optimization using grid-search cross-validation (CV), demonstrates its effectiveness in enhancing the early detection of CKD. This research highlights the potential of ML techniques in improving clinical support systems and reducing the impact of uncertainty in chronic disorder prognosis.
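The sequential scaling idea described above (robust scaling, then z-standardization, then min-max) can be sketched in a few lines. This is a generic illustration with crude quartile indices, not the authors' exact preprocessing pipeline.

```python
def robust_scale(xs):
    """Center on the median and divide by the interquartile range,
    so outliers have limited leverage (quartile indices are crude here)."""
    s = sorted(xs)
    n = len(s)
    med = s[n // 2]
    iqr = (s[(3 * n) // 4] - s[n // 4]) or 1.0
    return [(x - med) / iqr for x in xs]

def z_standardize(xs):
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5 or 1.0
    return [(x - mean) / std for x in xs]

def min_max(xs):
    lo, hi = min(xs), max(xs)
    rng = (hi - lo) or 1.0
    return [(x - lo) / rng for x in xs]

def sequential_scale(xs):
    """Chain the three scalers: robust -> z-standardize -> min-max,
    ending with all values in [0, 1]."""
    return min_max(z_standardize(robust_scale(xs)))

scaled = sequential_scale([1.0, 2.0, 2.5, 3.0, 120.0])  # 120 is an outlier
print(min(scaled), max(scaled))  # 0.0 1.0
```

The final min-max pass guarantees a fixed [0, 1] range, while the earlier robust step keeps the outlier from crushing the spacing of the ordinary values.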
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030143
Authors: Amjad Alraizza Abdulmohsen Algarni
Ransomware attacks pose significant security threats to personal and corporate data and information. The owners of computer-based resources suffer from verification and privacy violations, monetary losses, and reputational damage due to successful ransomware assaults. As a result, it is critical to identify ransomware accurately and swiftly. Numerous methods have been proposed for identifying ransomware, each with its own advantages and disadvantages. The main objective of this research is to discuss current trends in and potential future debates on automated ransomware detection. This document includes an overview of ransomware, a timeline of assaults, and details on their background. It also provides comprehensive research on existing methods for identifying, avoiding, minimizing, and recovering from ransomware attacks. In addition, this research analyzes studies published between 2017 and 2022, providing readers with up-to-date knowledge of the most recent developments in ransomware detection and highlighting advancements in methods for combating ransomware attacks. In conclusion, this research highlights unanswered concerns and potential research challenges in ransomware detection.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030142
Authors: Mohammad H. Alshayeji Jassim Al-Buloushi
Improved disease prediction accuracy and reliability are the main concerns in the development of models for the medical field. This study examined methods for increasing classification accuracy and proposed a precise and reliable framework for categorizing breast cancers using mammography scans. Concatenated Convolutional Neural Networks (CNNs) were developed based on three models: two built via transfer learning and one trained entirely from scratch. Misclassification of lesions from mammography images can also be reduced using this approach. Bayesian optimization performs hyperparameter tuning of the layers, and data augmentation refines the model by providing more training samples. Analysis of the model’s accuracy revealed that it can predict disease with 97.26% accuracy in binary cases and 99.13% accuracy in multi-classification cases. Compared with recent studies on the same issue using the same dataset, these findings demonstrate a 16% increase in multi-classification accuracy. In addition, an accuracy improvement of 6.4% was achieved after hyperparameter modification and augmentation. Thus, the model tested in this study was deemed superior to those presented in the extant literature. Hence, concatenating three different CNNs, from scratch and via transfer learning, allows the extraction of distinct and significant features without leaving any out, enabling the model to make exact diagnoses.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030141
Authors: Ahmed Ramzy Marwan Torki Mohamed Abdeen Omar Saif Mustafa ElNainay AbdAllah Alshanqiti Emad Nabil
Religious studies are a rich land for Natural Language Processing (NLP), since all religions have their instructions as written texts. In this paper, we apply NLP to Islamic Hadiths, which are the written traditions, sayings, actions, approvals, and discussions of the Prophet Muhammad, his companions, or his followers. A Hadith is composed of two parts: the chain of narrators (Sanad) and the content of the Hadith (Matn). A Hadith is transmitted from its author to a Hadith book author through a chain of narrators. The problem we solve is the classification of Hadiths based on their origin of narration. This is important for several reasons. First, it helps determine the authenticity and reliability of the Hadiths. Second, it helps trace the chain of narration and identify the narrators involved in transmitting Hadiths. Finally, it helps in understanding the historical and cultural contexts in which Hadiths were transmitted, and the different levels of authority attributed to the narrators. To the best of our knowledge, and based on our literature review, this problem has not previously been solved using machine/deep learning approaches. To solve this classification problem, we created a novel Author-Based Hadith Classification Dataset (ABCD) collected from classical Hadith books. The ABCD contains 29 K Hadiths and 18 K unique narrators, with all their information. We applied machine learning (ML) and deep learning (DL) approaches, each on the Sanad and the Matn separately. The results revealed that ML performs better than DL on the Matn input data, with a 77% F1-score, while DL performs better than ML on the Sanad input data, with a 92% F1-score. We used precision and recall alongside the F1-score; details of the results are explained at the end of the paper. We claim that the ABCD and the reported results will motivate the community to work in this new area.
Our dataset and results will represent a baseline for further research on the same problem.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030140
Authors: Abdelhak Etchiali Fethallah Hadjila Amina Bekkouche
Currently, the selection of web services with an uncertain quality of service (QoS) is gaining much attention in the service-oriented computing (SOC) paradigm. In fact, searching for a service composition that fulfills a complex user request is known to be NP-complete. The search time mainly depends on the number of requested tasks, the number of available services, and the size of the QoS realizations (i.e., the sample size). To handle this problem, we propose a two-stage approach that reduces the search space using heuristics for ranking the task services and a bat algorithm metaheuristic for selecting the final near-optimal compositions. The fitness function used by the metaheuristic aims to satisfy all of the user’s global constraints. The experimental study showed that the ranking heuristics termed “fuzzy Pareto dominance” and “zero-order stochastic dominance” are highly effective compared to the other heuristics and to most of the existing state-of-the-art methods.
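Pareto dominance ranking, the crisp idea underlying the "fuzzy Pareto dominance" heuristic, can be sketched as follows: a service dominates another if it is no worse in every QoS dimension and strictly better in at least one, and candidates are ranked by how many rivals they dominate. The QoS vectors are illustrative, and the fuzzy variant would replace the boolean test with a graded degree of dominance.

```python
def dominates(a, b):
    """a Pareto-dominates b if it is no worse in every QoS dimension
    and strictly better in at least one (lower is better here)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_rank(services):
    """Rank candidate services for one task by how many rivals each
    one dominates (a crisp stand-in for the fuzzy heuristic)."""
    names = list(services)
    score = {n: sum(dominates(services[n], services[m])
                    for m in names if m != n) for n in names}
    return sorted(names, key=lambda n: score[n], reverse=True)

# QoS vectors: (response time in ms, cost) -- lower is better in both
candidates = {"s1": (120, 0.9), "s2": (80, 0.5), "s3": (200, 1.5)}
print(pareto_rank(candidates))  # ['s2', 's1', 's3']
```

Keeping only the top-ranked services per task is what shrinks the search space before the metaheuristic explores compositions.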
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030139
Authors: Flavio Corradini Sara Pettinari Barbara Re Lorenzo Rossi Francesco Tiezzi
The development of process-driven systems and the advancements in digital twins have led to new ways of monitoring and analyzing systems, i.e., digital process twins. Specifically, a digital process twin can allow the monitoring of system behavior and the analysis of the execution status to improve the whole system. However, the concept of the digital process twin is still theoretical, and process-driven systems cannot yet fully benefit from it. In this regard, this work discusses how to effectively exploit a digital process twin and proposes an implementation that combines the monitoring, refinement, and enactment of system behavior. We demonstrate the proposed solution in a multi-robot scenario.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030138
Authors: Péter Dobra János Jósvai
Nowadays, one of the important and indispensable conditions for the effectiveness and competitiveness of industrial companies is high manufacturing and assembly efficiency. These enterprises systematically monitor their efficiency metrics with Key Performance Indicators (KPIs), based on different methods and tools. One of the most frequently used metrics is Overall Equipment Effectiveness (OEE), the product of availability, performance, and quality. In addition to monitoring, it is also necessary to predict efficiency, which can be implemented with the support of machine learning techniques. This paper presents and compares several supervised machine learning techniques, among others polynomial regression, lasso regression, ridge regression, and gradient boosting regression. The aim of this article is to determine the best estimation method for a semi-automatic assembly line with large batch sizes. The case study, presented with a real industrial example, answers the question of whether the cumulative or the rolling horizon prediction method is more accurate.
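OEE itself is a simple product of three factors, each derivable from shop-floor counters. A sketch with illustrative numbers (not taken from the case study):

```python
def oee(planned_time, run_time, ideal_cycle_time, total_count, good_count):
    """Overall Equipment Effectiveness = availability x performance x quality."""
    availability = run_time / planned_time                      # uptime share
    performance = (ideal_cycle_time * total_count) / run_time   # speed vs ideal
    quality = good_count / total_count                          # good-part share
    return availability * performance * quality

# Illustrative shift: 480 min planned, 432 min actually running,
# 1 min ideal cycle time, 400 parts produced of which 392 were good.
print(round(oee(480, 432, 1.0, 400, 392), 3))  # ~0.817, i.e. ~82% effective
```

It is this scalar, logged shift by shift, that the regression models (polynomial, lasso, ridge, gradient boosting) are trained to forecast under the cumulative or rolling horizon schemes.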
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030137
Authors: Markus Frohmann Manuel Karner Said Khudoyan Robert Wagner Markus Schedl
Recently, various methods to predict the future price of financial assets have emerged. One promising approach is to combine the historic price with sentiment scores derived via sentiment analysis techniques. In this article, we focus on predicting the future price of Bitcoin, which is currently the most popular cryptocurrency. More precisely, we propose a hybrid approach, combining time series forecasting and sentiment prediction from microblogs, to predict the intraday price of Bitcoin. Moreover, in addition to standard sentiment analysis methods, we are the first to employ a fine-tuned BERT model for this task. We also introduce a novel weighting scheme in which the weight of the sentiment of each tweet depends on the number of its creator’s followers. For evaluation, we consider periods with strongly varying ranges of Bitcoin prices. This enables us to assess the models w.r.t. robustness and generalization to varied market conditions. Our experiments demonstrate that BERT-based sentiment analysis and the proposed weighting scheme improve upon previous methods. Specifically, our hybrid models that use linear regression as the underlying forecasting algorithm perform best in terms of the mean absolute error (MAE of 2.67) and root mean squared error (RMSE of 3.28). However, more complicated models, particularly long short-term memory networks and temporal convolutional networks, tend to have generalization and overfitting issues, resulting in considerably higher MAE and RMSE scores.
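A follower-dependent weighting of tweet sentiment could, for instance, take the following form; the log(1 + followers) weight is our illustrative choice and may differ from the paper's exact scheme.

```python
import math

def weighted_sentiment(tweets):
    """Aggregate per-tweet sentiment scores, weighting each tweet by
    log(1 + followers) of its author, so that reach matters but a single
    huge account cannot dominate the aggregate (weighting is illustrative)."""
    num = sum(s * math.log1p(f) for s, f in tweets)
    den = sum(math.log1p(f) for _, f in tweets)
    return num / den if den else 0.0

# (sentiment score in [-1, 1], author follower count)
tweets = [(0.8, 100_000), (-0.5, 50), (0.2, 1_000)]
print(round(weighted_sentiment(tweets), 3))  # positive, pulled up by the big account
```

The resulting aggregate per time window would then be fed, alongside historical prices, into the forecasting model (e.g., the linear regression variant that performed best).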
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030136
Authors: William Villegas-Ch Joselin García-Ortiz Angel Jaramillo-Alcazar
This paper investigated the importance of explainability in artificial intelligence models and its application to prediction in Formula 1. A step-by-step analysis was carried out, including collecting and preparing data from previous races, training an AI model to make predictions, and applying explainability techniques to that model. Two approaches were used: the attention technique, which visualizes the most relevant parts of the input data using heat maps, and the permutation importance technique, which evaluates the relative importance of each feature. The results revealed that feature length and qualifying performance are crucial variables for position predictions in Formula 1. These findings highlight the relevance of explainability in AI models, not only in Formula 1 but also in other fields and sectors, by ensuring fairness, transparency, and accountability in AI-based decision making, and they provide a practical methodology for implementing explainability in Formula 1 and other domains.
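Permutation importance, one of the two explainability techniques named above, can be sketched in a few lines: a feature's importance is the average drop in model score after that feature's column is randomly shuffled, breaking its link with the target. This is a generic illustration of the technique, not the paper's code:

```python
import random

def permutation_importance(model, X, y, metric, n_repeats=10, seed=0):
    """Importance of each feature = average drop in the score when that
    feature's column is shuffled.  model: callable row -> prediction."""
    rng = random.Random(seed)
    base = metric(y, [model(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(base - metric(y, [model(row) for row in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances

def accuracy(y_true, y_pred):
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)

# Toy check: y depends only on the first feature, so shuffling it should
# hurt the score far more than shuffling the second (pure noise) feature.
X = [[i, random.random()] for i in range(20)]
y = [1 if row[0] >= 10 else 0 for row in X]
model = lambda row: 1 if row[0] >= 10 else 0
imp = permutation_importance(model, X, y, accuracy)
print(imp[0] > imp[1])  # the informative feature dominates
```

The same idea works with any model and any scoring metric, which is why it is a popular model-agnostic explainability tool.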
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030135
Authors: Elena Mastria Francesco Pacenza Jessica Zangari Francesco Calimeri Simona Perri Giorgio Terracina
Stream Reasoning (SR) focuses on developing advanced approaches for applying inference to dynamic data streams; despite being a relatively new field of research, it has become increasingly relevant in application scenarios such as IoT, Smart Cities, Emergency Management, and Healthcare. The current lack of standardized formalisms and benchmarks has been hindering the comparison of different SR approaches. We propose a new benchmark, called EnviroStream, for evaluating SR systems on weather and environmental data; it includes queries and datasets of different sizes. We adopted I-DLV-sr, a recently released SR system based on Answer Set Programming, as a baseline for query modelling and experimentation, and we also showcased continuous online reasoning via a web application.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030134
Authors: Hossein Hassani Steve MacFeely
With the ubiquitous use of digital technologies and the consequent data deluge, official statistics faces new challenges and opportunities. In this context, strengthening official statistics through effective data governance will be crucial to ensure reliability, quality, and access to data. This paper presents a comprehensive framework for digital data governance for official statistics, addressing key components, such as data collection and management, processing and analysis, data sharing and dissemination, as well as privacy and ethical considerations. The framework integrates principles of data governance into digital statistical processes, enabling statistical organizations to navigate the complexities of the digital environment. Drawing on case studies and best practices, the paper highlights successful implementations of digital data governance in official statistics. The paper concludes by discussing future trends and directions, including emerging technologies and opportunities for advancing digital data governance.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030133
Authors: Ze-Yang Tang Qi-Biao Hu Yi-Bo Cui Lei Hu Yi-Wen Li Yu-Jie Li
This paper aims to address the issue of evaluating the operation of electric vehicle charging stations (EVCSs). Previous studies have commonly employed the method of constructing comprehensive evaluation systems, which greatly relies on manual experience for index selection and weight allocation. To overcome this limitation, this paper proposes an evaluation method based on natural language models for assessing the operation of charging stations. By utilizing the proposed SimCSEBERT model, this study analyzes the operational data, user charging data, and basic information of charging stations to predict the operational status and identify influential factors. Additionally, this study compared the evaluation accuracy and impact factor analysis accuracy of the baseline and the proposed model. The experimental results demonstrate that our model achieves a higher evaluation accuracy (operation evaluation accuracy = 0.9464; impact factor analysis accuracy = 0.9492) and effectively assesses the operation of EVCSs. Compared with traditional evaluation methods, this approach exhibits improved universality and a higher level of intelligence. It provides insights into the operation of EVCSs and user demands, allowing for the resolution of supply–demand contradictions that are caused by power supply constraints and the uneven distribution of charging demands. Furthermore, it offers guidance for more efficient and targeted strategies for the operation of charging stations.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030132
Authors: Nurgali Kadyrbek Madina Mansurova Adai Shomanov Gaukhar Makharova
This study is devoted to the transcription of human speech in the Kazakh language in dynamically changing conditions. It discusses key aspects related to the phonetic structure of the Kazakh language, technical considerations in collecting the transcribed audio corpus, and the use of deep neural networks for speech modeling. A high-quality transcribed audio corpus was collected, containing 554 h of data, giving an idea of the frequencies of letters and syllables, as well as demographic parameters such as the gender, age, and region of residence of native speakers. The corpus contains a universal vocabulary and serves as a valuable resource for the development of speech-related modules. Machine learning experiments were conducted using the DeepSpeech2 model, which includes a sequence-to-sequence architecture with an encoder, decoder, and attention mechanism. To increase the reliability of the model, filters initialized with symbol-level embeddings were introduced to reduce the dependence on accurate positioning on object maps. The training process included the simultaneous preparation of convolutional filters for spectrograms and symbolic objects. The proposed approach, using a combination of supervised and unsupervised learning methods, resulted in a 66.7% reduction in model size while maintaining relative accuracy. The evaluation on the test sample showed a 7.6% lower character error rate (CER) compared to existing models, demonstrating state-of-the-art performance. The proposed architecture enables deployment on platforms with limited resources. Overall, this study presents a high-quality audio corpus, an improved speech recognition model, and promising results applicable to speech-related applications and to languages beyond Kazakh.
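The character error rate (CER) reported above is a standard speech recognition metric: the Levenshtein edit distance between the hypothesis and the reference transcript, normalized by the reference length. A minimal sketch:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance between hypothesis
    and reference, divided by the reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[n] / m if m else 0.0

print(cer("kazakh", "kozakh"))  # one substitution in six characters
```

Because CER counts at the character level, it suits agglutinative languages like Kazakh better than word error rate, where a single wrong affix would count a whole word as wrong.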
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030131
Authors: Xinyu Tian Qinghe Zheng Zhiguo Yu Mingqiang Yang Yao Ding Abdussalam Elhanashi Sergio Saponara Kidiyo Kpalma
At present, the design of modern vehicles requires improving driving performance while meeting emission standards, leading to increasingly complex power systems. In autonomous driving systems, accurate, real-time vehicle speed prediction is one of the key factors in achieving automated driving. Accurate prediction and optimal control based on future vehicle speeds are key strategies for dealing with ever-changing and complex real driving environments. However, driver behavior is uncertain and may be influenced by the surrounding driving environment, such as weather and road conditions. To overcome these limitations, we propose a real-time vehicle speed prediction method based on a lightweight deep learning model driven by big temporal data. Firstly, the temporal data collected by automotive sensors are decomposed into a feature matrix through empirical mode decomposition (EMD). Then, an Informer model based on the attention mechanism is designed to extract key information for learning and prediction. During the iterative training process of the Informer, redundant parameters are removed through importance measurement criteria to achieve real-time inference. Finally, experimental results demonstrate that the proposed method achieves superior speed prediction performance in comparison with state-of-the-art statistical modelling methods and deep learning models. Tests on edge computing devices also confirmed that the designed model can meet the requirements of actual tasks.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030130
Authors: Alberto del Rio Giuseppe Conti Sandra Castano-Solis Javier Serrano David Jimenez Jesus Fraile-Ardanuy
The digital transition that drives the new industrial revolution is largely powered by the application of intelligence and data. This boost leads to an increase in energy consumption, much of it associated with computing in data centers. This fact clashes with the growing need to save energy and improve energy efficiency, and it requires a more optimized use of resources. The deployment of new services in edge and cloud computing, virtualization, and software-defined networks requires a better understanding of consumption patterns, aimed at more efficient and sustainable models and a reduction in carbon footprints. These patterns can be exploited by machine, deep, and reinforcement learning techniques in pursuit of energy consumption optimization, which can ideally improve the energy efficiency of data centers and large computing servers providing these kinds of services. For the application of these techniques, it is essential to investigate data collection processes to create initial information points. Datasets also need to be created to analyze how to diagnose systems and explore new avenues of optimization. This work describes a data collection methodology used to create datasets of consumption data from a real-world environment dedicated to data centers, server farms, or similar architectures. Specifically, it covers the entire process of energy stimuli generation, data extraction, and data preprocessing. The evaluation and reproduction of this method are offered to the scientific community through an online repository created for this work, which hosts all the code, available for download.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030129
Authors: Hilmil Pradana
Predicting traffic risk incidents from a first-person perspective helps ensure that a safe reaction can occur before an incident happens, across a wide range of driving scenarios and conditions. One challenge in building advanced driver assistance systems is creating an early warning system that lets the driver react safely and accurately while accommodating the diversity of traffic-risk predictions in real-world applications. In this paper, we aim to bridge the gap by investigating two key research questions: the driver's current driving status as observed through online videos, and the types of other moving objects that lead to dangerous situations. To address these problems, we propose an end-to-end two-stage architecture: in the first stage, unsupervised learning is applied to collect all suspicious events from actual driving; in the second stage, supervised learning is used to classify all suspicious events from the first stage into a common event type. To enrich the classification types, metadata from the first stage is passed to the second stage to mitigate data limitations while training our classification model. In the online setting, our method runs at 9.60 fps on average, with a standard deviation of 1.44 fps. Our quantitative evaluation shows that our method reaches 81.87% and 73.43% average F1-scores on labeled data from the CST-S3D and real driving datasets, respectively. Furthermore, the proposed method has the potential to assist distribution companies in evaluating the driving performance of their drivers by automatically monitoring near-miss events and analyzing driving patterns for training programs to reduce future accidents.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030128
Authors: Nehad M. Ibrahim Dalia G. Gabr Atta Rahman Dhiaa Musleh Dania AlKhulaifi Mariam AlKharraa
Plant taxonomy is the scientific study of the classification and naming of various plant species. It is a branch of biology that aims to categorize and organize the diverse variety of plant life on earth. Traditionally, plant taxonomy has been performed using morphological and anatomical characteristics, such as leaf shape, flower structure, and seed and fruit characters. Artificial intelligence (AI), machine learning, and especially deep learning can also play an instrumental role in plant taxonomy by automating the process of categorizing plant species based on the available features. This study investigated transfer learning techniques to analyze images of plants and extract features that can be used to cluster the species hierarchically using the k-means clustering algorithm. Several pretrained deep learning models were employed and evaluated. Two separate datasets were used in the study, comprising seed images of wild plants collected from Egypt. Extensive experiments using the transfer learning method (DenseNet201) demonstrated that the proposed methods achieved superior accuracy compared to traditional methods, with a highest accuracy of 93% and an F1-score and area under the curve (AUC) of 95% each. This is considerable in contrast to the state-of-the-art approaches in the literature.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030127
Authors: Dhiaa A. Musleh Ibrahim Alkhwaja Ali Alkhwaja Mohammed Alghamdi Hussam Abahussain Faisal Alfawaz Nasro Min-Allah Mamoun Masoud Abdulqader
YouTube is a popular video-sharing platform that offers a diverse range of content. Assessing the quality of a video without watching it poses a significant challenge, especially considering the recent removal of the dislike count feature on YouTube. Although comments have the potential to provide insights into video content quality, navigating through the comments section can be time-consuming and overwhelming for both content creators and viewers. This paper proposes an NLP-based model to classify Arabic comments as positive or negative. It was trained on a novel dataset of 4212 labeled comments, with a Kappa score of 0.818. The model uses six classifiers: SVM, Naïve Bayes, Logistic Regression, KNN, Decision Tree, and Random Forest. It achieved 94.62% accuracy and an MCC score of 91.46% with Naïve Bayes (NB). Precision, Recall, and F1-measure for NB were 94.64%, 94.64%, and 94.62%, respectively. The Decision Tree had suboptimal performance, with 84.10% accuracy and an MCC score of 69.64% without TF-IDF. This study provides valuable insights for content creators, who can improve their content and audience engagement by analyzing viewers’ sentiments toward their videos. Furthermore, it bridges a literature gap by offering a comprehensive approach to Arabic sentiment analysis, which is currently limited in the field.
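The TF-IDF weighting referenced above down-weights terms that appear in many comments, so discriminative words dominate the feature vectors fed to the classifiers. A minimal sketch of the computation (illustrative, not the paper's pipeline):

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF vectors for a list of tokenized documents: term frequency
    scaled by inverse document frequency, which down-weights terms that
    occur in many documents."""
    n = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        vectors.append({t: (c / total) * idf[t] for t, c in tf.items()})
    return vectors

docs = [["great", "video"], ["bad", "video"], ["great", "great", "content"]]
vecs = tf_idf(docs)
# "video" appears in 2 of 3 docs, "bad" in only 1, so "bad" scores higher
print(vecs[1]["bad"] > vecs[1]["video"])
```

In practice the sparse vectors produced this way are the inputs to classifiers such as Naïve Bayes or SVM; the abstract's Decision Tree result shows how much the weighting can matter.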
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030126
Authors: Rosario Davide D’Amico Sri Addepalli John Ahmet Erkoyuncu
The digital twin (DT) research field is experiencing rapid expansion; yet, industrial practice in this area remains poorly understood. This paper aims to address this knowledge gap by sharing feedback and future requirements from the manufacturing industry. The methodology employed in this study combines a survey that received 99 responses with interviews of 14 experts from 10 prominent UK organisations, most of which are involved in the UK defence industry. The survey and interviews explored topics such as DT design, return on investment, drivers, inhibitors, and future directions for DT development in manufacturing. The study’s findings indicate that DTs should possess characteristics such as adaptability, scalability, interoperability, and the ability to support assets throughout their entire life cycle. On average, completed DT projects reach the break-even point in less than two years. The primary motivators behind DT development were identified as autonomy, customer satisfaction, safety, awareness, optimisation, and sustainability, while the main obstacles include a lack of expertise, funding, and interoperability. This study concludes that the federation of twins and a paradigm shift in industrial thinking are essential components for the future of DT development.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030125
Authors: Omar Mohammed Horani Ali Khatibi Anas Ratib AL-Soud Jacquline Tham Ahmad Samed Al-Adwan
The adoption of business analytics (BA) has become increasingly important for organizations seeking to gain a competitive edge in today’s data-driven business landscape. Hence, understanding the key factors influencing the adoption of BA at the organizational level is critical for the successful implementation of these technologies. This paper presents a systematic literature review, conducted following the PRISMA methodology, that investigates the organizational, technological, and environmental factors affecting the adoption of BA. By conducting a thorough examination of pertinent research, this review consolidates the current understanding and pinpoints the essential elements that shape the adoption process. Out of a total of 614 articles published between 2012 and 2022, 29 final articles were carefully chosen. The findings highlight the significance of organizational, technological, and environmental factors in shaping the adoption of BA. By consolidating and analyzing the current body of research, this paper offers valuable insights for organizations aiming to adopt BA successfully and maximize its benefits at the organizational level. The synthesized findings also contribute to the existing literature and provide a foundation for future research in this field.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030124
Authors: Katherine Abramski Salvatore Citraro Luigi Lombardi Giulio Rossetti Massimo Stella
Large Language Models (LLMs) are becoming increasingly integrated into our lives. Hence, it is important to understand the biases present in their outputs in order to avoid perpetuating harmful stereotypes, which originate in our own flawed ways of thinking. This challenge requires developing new benchmarks and methods for quantifying affective and semantic bias, keeping in mind that LLMs act as psycho-social mirrors that reflect the views and tendencies that are prevalent in society. One such tendency that has harmful negative effects is the global phenomenon of anxiety toward math and STEM subjects. In this study, we introduce a novel application of network science and cognitive psychology to understand biases towards math and STEM fields in LLMs from ChatGPT, such as GPT-3, GPT-3.5, and GPT-4. Specifically, we use behavioral forma mentis networks (BFMNs) to understand how these LLMs frame math and STEM disciplines in relation to other concepts. We use data obtained by probing the three LLMs in a language generation task that has previously been applied to humans. Our findings indicate that LLMs have negative perceptions of math and STEM fields, associating math with negative concepts in 6 cases out of 10. We observe significant differences across OpenAI’s models: newer versions (i.e., GPT-4) produce 5× semantically richer, more emotionally polarized perceptions with fewer negative associations compared to older versions and N=159 high-school students. These findings suggest that advances in the architecture of LLMs may lead to increasingly less biased models that could even perhaps someday aid in reducing harmful stereotypes in society rather than perpetuating them.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030123
Authors: Kang-Ren Leow Meng-Chew Leow Lee-Yeng Ong
The Online Roadshow, a new type of web application, is a digital marketing approach that aims to maximize contactless business engagement. It leverages web computing to conduct interactive game sessions via the internet. As a result, massive amounts of personal data are generated during the engagement process between the audience and the Online Roadshow (e.g., gameplay data and clickstream information). The high volume of data collected is valuable for more effective market segmentation in strategic business planning through data-driven processes such as web personalization and trend evaluation. However, the data storage and processing techniques used in conventional data analytic approaches are typically overwhelmed in such a computing environment. Hence, this paper proposes a new big data processing framework to improve the processing, handling, and storing of these large amounts of data. The proposed framework aims to provide a dual-mode solution for processing the data generated by the Online Roadshow engagement process in both historical and real-time scenarios. Multiple functional modules, such as the Application Controller, the Message Broker, the Data Processing Module, and the Data Storage Module, were reformulated to provide a more efficient solution that matches the needs of Online Roadshow data analytics procedures. Tests were conducted to compare the performance of the proposed framework against existing similar frameworks and to verify that it fulfils the data processing requirements of the Online Roadshow. The experimental results evidenced multiple advantages of the proposed framework compared to similar existing big data processing frameworks.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030122
Authors: Wael H. Gomaa Abdelrahman E. Nagib Mostafa M. Saeed Abdulmohsen Algarni Emad Nabil
Automated scoring systems have been revolutionized by natural language processing, enabling the evaluation of students’ diverse answers across various academic disciplines. However, this presents a challenge, as students’ responses may vary significantly in length, structure, and content. To tackle this challenge, this research introduces a novel automated model for short answer grading. The proposed model uses pretrained Transformer models, specifically T5, in conjunction with a Bi-LSTM architecture, which is effective in processing sequential data by considering both past and future context. This research evaluated several preprocessing techniques and different hyperparameters to identify the most efficient architecture. Experiments were conducted using a standard benchmark dataset, the North Texas Dataset. This research achieved a state-of-the-art correlation value of 92.5%. The proposed model’s accuracy has significant implications for education, as it has the potential to save educators considerable time and effort while providing a reliable and fair evaluation for students, ultimately leading to improved learning outcomes.
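Short answer grading systems are conventionally evaluated by correlating predicted scores with human grades. A minimal sketch of the Pearson coefficient, assuming Pearson is the correlation measure used (the abstract does not specify):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences,
    e.g. model-predicted scores and human reference grades."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical grades on a 0-5 scale, for illustration only
human = [4.0, 3.5, 5.0, 2.0, 4.5]
model = [3.8, 3.6, 4.9, 2.2, 4.4]
print(pearson(human, model) > 0.95)
```

A correlation near 1 means the model ranks and spaces answers almost exactly as human graders do, which is what the 92.5% figure above reports.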
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7030121
Authors: Gianluca Barbera Luiz Araujo Silvia Fernandes
Social Media Analytics (SMA) is increasingly relevant in today’s market dynamics. However, it must be used wisely, whether for promoting a product or brand or for interacting with customers, and this requires effective understanding and monitoring. One way is through web data scraping (WDS) tools, which allow users to select sites and platforms and compare their performance. These tools can optimize the extraction of big data published on social media. Given current challenges, a sector that can particularly take advantage of this source is tourism (and its related sectors). This year carries the hope of tourism’s revival after a pandemic whose impacts still affect several activities. Many traders and entrepreneurs have already used these versatile tools, but do they really know their potential? The present study highlights the use of WDS to collect data from TripAdvisor’s social pages. Besides comparing competitors’ performance, companies also gain new knowledge of previously unnoticed preferences and habits. This contributes to more interesting innovations and results for them and for their customers. The approach used here is based on a project for smart tourism consultancy, starting from the identification of a gap in our region, to help tourism organizations enhance their digital presence and business model. Many things can be detected in this large source of unstructured data quickly and easily, without programming. Moreover, exploring code, either to refine the web scraper or to connect it with other platforms and apps, can be an object of future research to leverage consumer behavior prediction for more advanced interactions.
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7020120
Authors: Muhammad Hussain
The aim of this research is to develop an automated pallet inspection architecture with two key objectives: high performance with respect to defect classification and computational efficiency, i.e., a lightweight footprint. As automated pallet racking inspection via machine vision is a developing field, procuring racking datasets can be difficult. Therefore, the first contribution of this study is the proposal of several tailored augmentations, generated by modelling production floor conditions and variances within warehouses. Secondly, a variant selection algorithm is proposed, starting with extreme-end analysis and providing a protocol for selecting the optimal architecture with respect to accuracy and computational efficiency. The proposed YOLO-v5n architecture generated the highest mAP@0.5 of 96.8% compared to previous works in the racking domain, while also having the smallest computational footprint, with 1.9 M parameters compared to 86.7 M for YOLO-v5x.
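The mAP@0.5 figure above counts a detection as correct when its Intersection over Union (IoU) with a ground-truth box is at least 0.5. A minimal IoU sketch for axis-aligned boxes (a generic illustration, not the paper's evaluation code):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2).
    mAP@0.5 treats a detection as a true positive when IoU >= 0.5."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # half-overlapping boxes give 1/3
```

Average precision is then computed over the precision-recall curve of detections matched at this threshold, and mAP averages that over classes.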
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7020119
Authors: Karima Khettabi Zineddine Kouahla Brahim Farou Hamid Seridi Mohamed Ferrag
Internet of Things (IoT) systems include many smart devices that continuously generate massive spatio-temporal data, which can be difficult to process. These continuous data streams need to be stored smartly so that query searches are efficient. In this work, we propose an efficient method, within a fog-cloud computing architecture, to index continuous and heterogeneous data streams in metric space. The method divides the fog layer into three levels: clustering, cluster processing, and indexing. The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is used to group the data from each stream into homogeneous clusters at the clustering fog level. Each cluster in the first data stream is stored at the cluster processing fog level and indexed directly at the indexing fog level in a Binary tree with Hyperplane (BH tree). The indexing of clusters in subsequent data streams is determined by the coefficient of variation (CV) of the union of each new cluster with the existing clusters at the cluster processing fog level. An analysis and comparison of our experimental results with others in the literature demonstrated the effectiveness of the CV method in reducing energy consumption during BH tree construction, as well as in reducing the search time and energy consumption during a parallel k-Nearest-Neighbor (kNN) query search.
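The coefficient of variation (CV) used in the merge decision above is the standard deviation divided by the mean, a scale-free measure of dispersion. The sketch below illustrates the idea of merging an incoming cluster into an existing one only when their union stays homogeneous; the 0.15 threshold is an assumption for illustration, not a value from the paper:

```python
import math

def coefficient_of_variation(values):
    """CV = standard deviation / mean, a scale-free dispersion measure."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(var) / mean if mean else float("inf")

def should_merge(existing, incoming, threshold=0.15):
    """Merge an incoming cluster into an existing one only if the union
    remains homogeneous, i.e. its CV stays below the threshold.  The
    threshold value here is illustrative."""
    return coefficient_of_variation(existing + incoming) <= threshold

hot = [29.8, 30.1, 30.4, 29.9]          # e.g. temperature readings
print(should_merge(hot, [30.0, 30.2]))  # homogeneous union
print(should_merge(hot, [3.0, 2.5]))    # heterogeneous union
```

A low CV after the union means the new data fits the existing cluster, so it can reuse that cluster's index entry instead of triggering a new BH tree insertion.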
]]>Big Data and Cognitive Computing doi: 10.3390/bdcc7020118
Authors: Pejman Ebrahimi Hakimeh Dustmohammadloo Hosna Kabiri Parisa Bouzari Mária Fekete-Farkas
For many years, entrepreneurs have been considered the change agents of their societies: they use their initiative and innovative minds to solve problems and create value. In the aftermath of the digital transformation era, a new group of entrepreneurs has emerged, called transformational entrepreneurs, who use various digital platforms to create value. Surprisingly, despite their importance, they have not been sufficiently investigated. Therefore, this research scrutinizes the elements affecting transformational entrepreneurship on digital platforms. To do so, the authors adopted a two-phase method. First, interpretive structural modeling (ISM) and Matrice d’Impacts Croisés Multiplication Appliquée à un Classement (MICMAC) are used to suggest a model; ISM is a qualitative method for deriving a visualized hierarchical structure. Then, four unsupervised machine learning algorithms are used to assess the accuracy of the proposed model. The findings reveal that transformational leadership could mediate the relationship between the entrepreneurial mindset and thinking on the one hand and digital transformation, interdisciplinary approaches, value creation logic, and technology diffusion on the other. The GMM with the full covariance type has the best accuracy among the various covariance types, at 0.895. From a practical point of view, this paper provides important insights to help practitioners, entrepreneurs, and public actors develop transformational entrepreneurship skills. The results could also serve as a guideline for companies on how to manage the consequences of a crisis such as a pandemic, and they provide significant insight for higher education policymakers.
]]>