Big Data and Cognitive Computing

24 pages, 429 KiB

Open Access Article

Cancer Detection Using a New Hybrid Method Based on Pattern Recognition in MicroRNAs Combining Particle Swarm Optimization Algorithm and Artificial Neural Network

by Sepideh Molaei, Stefano Cirillo and Giandomenico Solimando

Big Data Cogn. Comput. 2024, 8(3), 33; https://doi.org/10.3390/bdcc8030033 - 19 Mar 2024

Cited by 3 | Viewed by 2879

Abstract

MicroRNAs (miRNAs) play a crucial role in cancer development, but not all miRNAs are equally significant in cancer detection. Traditional methods face challenges in effectively identifying cancer-associated miRNAs due to data complexity and volume. This study introduces a novel, feature-based technique for detecting attributes related to cancer-affecting microRNAs. It aims to enhance cancer diagnosis accuracy by identifying the most relevant miRNAs for various cancer types using a hybrid approach. In particular, we used a combination of particle swarm optimization (PSO) and artificial neural networks (ANNs) for this purpose. PSO was employed for feature selection, focusing on identifying the most informative miRNAs, while ANNs were used for recognizing patterns within the miRNA data. This hybrid method aims to overcome limitations in traditional miRNA analysis by reducing data redundancy and focusing on key genetic markers. The application of this method showed a significant improvement in the detection accuracy for various cancers, including breast and lung cancer and melanoma. Our approach demonstrated a higher precision in identifying relevant miRNAs compared to existing methods, as evidenced by the analysis of different datasets. The study concludes that the integration of PSO and ANNs provides a more efficient, cost-effective, and accurate method for cancer detection via miRNA analysis. This method can serve as a supplementary tool for cancer diagnosis and potentially aid in developing personalized cancer treatments. Full article
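
The abstract above does not say which PSO variant or fitness function the authors used. As a minimal sketch only, the following shows a binary PSO (sigmoid transfer function) selecting a feature mask. The names `binary_pso` and `toy_fitness`, the inertia and acceleration constants, and the toy scoring rule are all assumptions for illustration; in a setting like the paper's, the fitness would presumably be the validation accuracy of an ANN trained on the selected miRNAs.

```python
import math
import random


def toy_fitness(mask):
    """Hypothetical score: reward features 0 and 3, penalise mask size."""
    informative = {0, 3}
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in informative)
    return hits - 0.1 * sum(mask)


def binary_pso(fitness, n_features, n_particles=10, n_iters=30, seed=0):
    """Maximise `fitness` over 0/1 feature masks with a binary PSO."""
    rng = random.Random(seed)
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration (assumed values)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                # Velocity is pulled toward the personal and global bests.
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # Sigmoid maps velocity to the probability of selecting the feature.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


best_mask, best_score = binary_pso(toy_fitness, n_features=6)
```

The returned mask marks which feature columns would be passed on to the classifier.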

(This article belongs to the Special Issue Big Data and Information Science Technology)

26 pages, 9647 KiB

Open Access Article

AI-Generated Text Detector for Arabic Language Using Encoder-Based Transformer Architecture

by Hamed Alshammari, Ahmed El-Sayed and Khaled Elleithy

Big Data Cogn. Comput. 2024, 8(3), 32; https://doi.org/10.3390/bdcc8030032 - 18 Mar 2024

Cited by 11 | Viewed by 6112

Abstract

The effectiveness of existing AI detectors is notably hampered when processing Arabic texts. This study introduces a novel AI text classifier designed specifically for Arabic, tackling the distinct challenges inherent in processing this language. A particular focus is placed on accurately recognizing human-written texts (HWTs), an area where existing AI detectors have demonstrated significant limitations. To achieve this goal, this paper utilized and fine-tuned two Transformer-based models, AraELECTRA and XLM-R, by training them on two distinct datasets: a large dataset comprising 43,958 examples and a custom dataset with 3078 examples that contain HWT and AI-generated texts (AIGTs) from various sources, including ChatGPT 3.5, ChatGPT-4, and BARD. The proposed architecture is adaptable to any language, but this work evaluates these models’ efficiency in recognizing HWTs versus AIGTs in Arabic as an example of Semitic languages. The performance of the proposed models has been compared against the two prominent existing AI detectors, GPTZero and OpenAI Text Classifier, particularly on the AIRABIC benchmark dataset. The results reveal that the proposed classifiers outperform both GPTZero and OpenAI Text Classifier with 81% accuracy compared to 63% and 50% for GPTZero and OpenAI Text Classifier, respectively. Furthermore, integrating a Dediacritization Layer prior to the classification model demonstrated a significant enhancement in the detection accuracy of both HWTs and AIGTs. This Dediacritization step markedly improved the classification accuracy, elevating it from 81% to as high as 99% and, in some instances, even achieving 100%. Full article
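
The abstract does not describe how the Dediacritization Layer is implemented. A minimal sketch of the general idea, assuming it amounts to stripping Arabic diacritical marks before classification, could look like this; the function name `dediacritize` and the exact set of code points removed are assumptions.

```python
import re

# Arabic harakat (U+064B to U+0652) plus the superscript alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def dediacritize(text: str) -> str:
    """Return `text` with Arabic diacritical marks removed."""
    return _DIACRITICS.sub("", text)


# A fully vowelled word (written with escapes) and its bare form.
vowelled = "\u0645\u064E\u0631\u0652\u062D\u064E\u0628\u064B\u0627"
bare = dediacritize(vowelled)
```

Text without diacritics passes through unchanged, so the step can be applied to every input before tokenization.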

28 pages, 1665 KiB

Open Access Article

Machine Learning Approaches for Predicting Risk of Cardiometabolic Disease among University Students

by Dhiaa Musleh, Ali Alkhwaja, Ibrahim Alkhwaja, Mohammed Alghamdi, Hussam Abahussain, Mohammed Albugami, Faisal Alfawaz, Said El-Ashker and Mohammed Al-Hariri

Big Data Cogn. Comput. 2024, 8(3), 31; https://doi.org/10.3390/bdcc8030031 - 13 Mar 2024

Cited by 7 | Viewed by 4053

Abstract

Obesity is increasingly becoming a prevalent health concern among adolescents, leading to significant risks like cardiometabolic diseases (CMDs). The early discovery and diagnosis of CMD is essential for better outcomes. This study aims to build a reliable artificial intelligence model that can predict CMD using various machine learning techniques. Five robust classifiers are compared in this study: Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Logistic Regression (LR), Random Forest (RF), and Gradient Boosting. A novel, previously unused “risk level” feature, derived by applying fuzzy logic to the Conicity Index, is introduced to enhance the interpretability and discriminatory properties of the proposed models. As the Conicity Index scores indicate CMD risk, two separate models are developed to address each gender individually. The performance of the proposed models is assessed using two datasets obtained from 295 records of undergraduate students in Saudi Arabia. The dataset comprises 121 male and 174 female students with diverse risk levels. Notably, Logistic Regression emerges as the top performer among males, achieving an accuracy score of 91%, while Gradient Boosting lags with a score of 72%. Among females, both Support Vector Machine and Logistic Regression lead with an accuracy score of 87%, while Random Forest performs worst with a score of 80%. Full article
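
The abstract does not restate how the Conicity Index is computed. The sketch below uses the commonly cited definition (waist circumference in metres divided by 0.109 times the square root of weight in kilograms over height in metres); that this is the exact form used in the paper is an assumption, and the fuzzy-logic step that turns the index into a “risk level” is not shown because its membership functions are not given here.

```python
import math


def conicity_index(waist_m: float, weight_kg: float, height_m: float) -> float:
    """Conicity Index = waist / (0.109 * sqrt(weight / height))."""
    if waist_m <= 0 or weight_kg <= 0 or height_m <= 0:
        raise ValueError("all measurements must be positive")
    return waist_m / (0.109 * math.sqrt(weight_kg / height_m))


# Hypothetical measurements, for illustration only.
ci = conicity_index(waist_m=0.90, weight_kg=80.0, height_m=1.75)
```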

(This article belongs to the Special Issue Revolutionizing Healthcare: Exploring the Latest Advances in Digital Health Technology)

13 pages, 11349 KiB

Open Access Article

Proposal of a Service Model for Blockchain-Based Security Tokens

by Keundug Park and Heung-Youl Youm

Big Data Cogn. Comput. 2024, 8(3), 30; https://doi.org/10.3390/bdcc8030030 - 12 Mar 2024

Cited by 1 | Viewed by 3110

Abstract

The volume of the asset investment and trading market can be expanded through the issuance and management of blockchain-based security tokens that logically divide the value of assets and guarantee ownership. This paper proposes a service model to solve a problem with the existing investment service model, identifies security threats to the service model, and specifies security requirements countering the identified security threats for privacy protection and anti-money laundering (AML) involving security tokens. The identified security threats and specified security requirements should be taken into consideration when implementing the proposed service model. The proposed service model allows users to invest in tokenized tangible and intangible assets and trade in blockchain-based security tokens. This paper discusses considerations to prevent excessive regulation and market monopoly in the issuance of and trading in security tokens when implementing the proposed service model and concludes with future works. Full article

(This article belongs to the Special Issue Blockchain Meets IoT for Big Data)

13 pages, 9470 KiB

Open Access Article

The Distribution and Accessibility of Elements of Tourism in Historic and Cultural Cities

by Wei-Ling Hsu, Yi-Jheng Chang, Lin Mou, Juan-Wen Huang and Hsin-Lung Liu

Big Data Cogn. Comput. 2024, 8(3), 29; https://doi.org/10.3390/bdcc8030029 - 11 Mar 2024

Cited by 4 | Viewed by 3261

Abstract

Historic urban areas are the foundations of urban development. Due to rapid urbanization, the sustainable development of historic urban areas has become challenging for many cities. Elements of tourism and tourism service facilities play an important role in the sustainable development of historic areas. This study analyzed policies related to tourism in Panguifang and Meixian districts in Meizhou, Guangdong, China. Kernel density estimation was used to study the clustering characteristics of tourism elements through point of interest (POI) data, while space syntax was used to study the accessibility of roads. In addition, the Pearson correlation coefficient and regression were used to analyze the correlation between the elements and accessibility. The results show the following: (1) the overall number of tourism elements was high on the western side of the districts and low on the eastern one, and the elements were predominantly distributed along the main transportation arteries; (2) according to the integration degree and depth value, the western side was easier to access than the eastern one; and (3) the depth value of the area negatively correlated with kernel density, while the degree of integration positively correlated with it. Based on the results, the study put forward measures for optimizing the elements of tourism in Meizhou’s historic urban area to improve cultural tourism and emphasize the importance of the elements. Full article
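
To make the correlation step concrete, here is a minimal sketch of the Pearson correlation coefficient applied to two per-zone measurements. The function name `pearson_r` and the sample values for kernel density and depth are hypothetical and are not taken from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-zone values: POI kernel density and space-syntax depth.
kernel_density = [0.8, 0.6, 0.5, 0.3, 0.1]
depth_value = [2.0, 2.5, 3.1, 3.8, 4.6]
r = pearson_r(kernel_density, depth_value)
```

A negative `r` for depth and a positive `r` for integration would match the direction of the relationships reported in result (3).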

(This article belongs to the Special Issue Big Data Analytics for Cultural Heritage 2nd Edition)

22 pages, 2151 KiB

Open Access Article

Enhancing Supervised Model Performance in Credit Risk Classification Using Sampling Strategies and Feature Ranking

by Niwan Wattanakitrungroj, Pimchanok Wijitkajee, Saichon Jaiyen, Sunisa Sathapornvajana and Sasiporn Tongman

Big Data Cogn. Comput. 2024, 8(3), 28; https://doi.org/10.3390/bdcc8030028 - 6 Mar 2024

Viewed by 2853

Abstract

For the financial health of lenders and institutions, credit risk assessment, which means correctly deciding whether or not a borrower will fail to repay a loan, is an important task. It not only helps in the approval or denial of loan applications but also aids in managing the non-performing loan (NPL) trend. This study experimented with a dataset provided by the LendingClub company, based in San Francisco, CA, USA, covering 2007 to 2020 and consisting of 2,925,492 records and 141 attributes. The loan status was categorized as “Good” or “Risk”. Credit risk prediction experiments were performed using three widely adopted supervised machine learning techniques: logistic regression, random forest, and gradient boosting. In addition, to solve the imbalanced data problem, three sampling algorithms, including under-sampling, over-sampling, and combined sampling, were employed. The results show that the gradient boosting technique achieves nearly perfect Accuracy, Precision, Recall, and F1 score values, which are better than 99.92%, while its MCC values are greater than 99.77%. The three imbalanced data handling approaches can enhance the performance of models trained by the three algorithms. Moreover, the experiment of reducing the number of features based on mutual information calculation revealed slightly decreasing performance for 50 data features, with Accuracy values greater than 99.86%. For 25 data features, which is the smallest size, the random forest supervised model yielded 99.15% Accuracy. Both sampling strategies and feature selection help to improve the supervised model for accurately predicting credit risk, which may be beneficial in the lending business. Full article
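
As an illustration of the simplest of the sampling strategies mentioned, the sketch below performs random under-sampling: every class is reduced to the size of the smallest one. The abstract does not name the specific algorithms used, so the function `random_undersample` and its seed handling are assumptions rather than the authors' procedure.

```python
import random
from collections import defaultdict


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows so that every class has as many rows as the rarest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    n_min = min(len(members) for members in by_class.values())

    sampled_rows, sampled_labels = [], []
    for label, members in by_class.items():
        # Draw n_min rows without replacement from each class.
        for row in rng.sample(members, n_min):
            sampled_rows.append(row)
            sampled_labels.append(label)
    return sampled_rows, sampled_labels


# Hypothetical toy data: 8 "Good" loans and 2 "Risk" loans.
rows = list(range(10))
labels = ["Good"] * 8 + ["Risk"] * 2
balanced_rows, balanced_labels = random_undersample(rows, labels)
```

Under-sampling discards majority-class records, which is why it is often compared against over-sampling and combined schemes.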

(This article belongs to the Topic Big Data and Artificial Intelligence, 2nd Volume)

22 pages, 9109 KiB

Open Access Article

Temporal Dynamics of Citizen-Reported Urban Challenges: A Comprehensive Time Series Analysis

by Andreas F. Gkontzis, Sotiris Kotsiantis, Georgios Feretzakis and Vassilios S. Verykios

Big Data Cogn. Comput. 2024, 8(3), 27; https://doi.org/10.3390/bdcc8030027 - 4 Mar 2024

Cited by 1 | Viewed by 2337

Abstract

In an epoch characterized by the swift pace of digitalization and urbanization, the essence of community well-being hinges on the efficacy of urban management. As cities burgeon and transform, the need for astute strategies to navigate the complexities of urban life becomes increasingly paramount. This study employs time series analysis to scrutinize citizen interactions with the coordinate-based problem mapping platform in the Municipality of Patras in Greece. The research explores the temporal dynamics of reported urban issues, with a specific focus on identifying recurring patterns through the lens of seasonality. The analysis, employing the seasonal decomposition technique, dissects time series data to expose trends in reported issues and areas of the city that might be obscured in raw big data. It accentuates a distinct seasonal pattern, with concentrations peaking during the summer months. The study extends its approach to forecasting, providing insights into the anticipated evolution of urban issues over time. Projections for the coming years show a consistent upward trend in both overall city issues and those reported in specific areas, with distinct seasonal variations. This comprehensive exploration of time series analysis and seasonality provides valuable insights for city stakeholders, enabling informed decision-making and predictions regarding future urban challenges. Full article
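
The abstract names seasonal decomposition without giving details. The following is a minimal sketch of a classical additive decomposition (centered moving-average trend, then per-position seasonal means), which is one common way to separate trend from seasonality. Whether the authors used this exact variant is an assumption, and the function name `additive_decompose` is hypothetical.

```python
def additive_decompose(series, period):
    """Split `series` into a centered moving-average trend and a seasonal part.

    Returns (trend, seasonal). Trend entries near the edges, where the
    moving-average window does not fit, are None.
    """
    n = len(series)
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2 == 0:
            # Even period: half weight on both end points (2 x period average).
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period

    # Average the detrended values at each position in the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    raw = [sum(b) / len(b) for b in buckets]
    offset = sum(raw) / period  # centre the seasonal component on zero
    seasonal = [raw[t % period] - offset for t in range(n)]
    return trend, seasonal


# Hypothetical quarterly counts of reported issues: rising trend plus a cycle.
pattern = [1.0, -1.0, 2.0, -2.0]
series = [10 + 0.5 * t + pattern[t % 4] for t in range(16)]
trend, seasonal = additive_decompose(series, period=4)
```

For monthly report counts, `period=12` would expose a summer peak of the kind the study describes as the largest seasonal values.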

(This article belongs to the Special Issue Big Data and Information Science Technology)

21 pages, 350 KiB

Open Access Article

Democratic Erosion of Data-Opolies: Decentralized Web3 Technological Paradigm Shift Amidst AI Disruption

by Igor Calzada

Big Data Cogn. Comput. 2024, 8(3), 26; https://doi.org/10.3390/bdcc8030026 - 26 Feb 2024

Cited by 7 | Viewed by 7389

Abstract

This article investigates the intricate dynamics of data monopolies, referred to as “data-opolies”, and their implications for democratic erosion. Data-opolies, typically embodied by large technology corporations, accumulate extensive datasets, affording them significant influence. The sustainability of such data practices is critically examined within the context of decentralized Web3 technologies amidst Artificial Intelligence (AI) disruption. Additionally, the article explores emancipatory datafication strategies to counterbalance the dominance of data-opolies. It presents an in-depth analysis of two emergent phenomena within the decentralized Web3 emerging landscape: People-Centered Smart Cities and Datafied Network States. The article investigates a paradigm shift in data governance and advocates for joint efforts to establish equitable data ecosystems, with an emphasis on prioritizing data sovereignty and achieving digital self-governance. It elucidates the remarkable roles of (i) blockchain, (ii) decentralized autonomous organizations (DAOs), and (iii) data cooperatives in empowering citizens to have control over their personal data. In conclusion, the article introduces a forward-looking examination of Web3 decentralized technologies, outlining a timely path toward a more transparent, inclusive, and emancipatory data-driven democracy. This approach challenges the prevailing dominance of data-opolies and offers a framework for regenerating datafied democracies through decentralized and emerging Web3 technologies. Full article

17 pages, 33366 KiB

Open Access Article

Sign-to-Text Translation from Panamanian Sign Language to Spanish in Continuous Capture Mode with Deep Neural Networks

by Alvaro A. Teran-Quezada, Victor Lopez-Cabrera, Jose Carlos Rangel and Javier E. Sanchez-Galan

Big Data Cogn. Comput. 2024, 8(3), 25; https://doi.org/10.3390/bdcc8030025 - 26 Feb 2024

Cited by 6 | Viewed by 2800

Abstract

Convolutional neural networks (CNNs) have provided great advances for the task of sign language recognition (SLR). However, recurrent neural networks (RNNs) in the form of long short-term memory (LSTM) have become a means for providing solutions to problems involving sequential data. This research proposes the development of a sign language translation system that converts Panamanian Sign Language (PSL) signs into text in Spanish using an LSTM model that, among other things, makes it possible to work with non-static signs (as sequential data). The deep learning model presented focuses on action detection, in this case, the execution of the signs. This involves precisely processing the frames in which a sign language gesture is made. The proposal is a holistic solution that considers, in addition to locating the hands of the speaker, the face and pose determinants. These were added because, when communicating through sign languages, other visual characteristics matter beyond hand gestures. For the training of this system, a data set of 330 videos (of 30 frames each) for five possible classes (different signs considered) was created. In testing, the model achieved an accuracy of 98.8%, making this a valuable base system for effective communication between PSL users and Spanish speakers. In conclusion, this work provides an improvement of the state of the art for PSL–Spanish translation by using the possibilities of translatable signs via deep learning. Full article

(This article belongs to the Special Issue Advances and Applications of Deep Learning Methods and Image Processing)

12 pages, 1867 KiB

Open Access Article

Experimental Evaluation: Can Humans Recognise Social Media Bots?

by Maxim Kolomeets, Olga Tushkanova, Vasily Desnitsky, Lidia Vitkova and Andrey Chechulin

Big Data Cogn. Comput. 2024, 8(3), 24; https://doi.org/10.3390/bdcc8030024 - 26 Feb 2024

Cited by 5 | Viewed by 4131

Abstract

This paper aims to test the hypothesis that social media bot detection systems based on supervised machine learning may not be as accurate as researchers claim, given that bots have become increasingly sophisticated, making it difficult for human annotators to detect them better than random selection. As a result, obtaining a ground-truth dataset with human annotation is not possible, which leads to supervised machine-learning models inheriting annotation errors. To test this hypothesis, we conducted an experiment where humans were tasked with recognizing malicious bots on the VKontakte social network. We then compared the “human” answers with the “ground-truth” bot labels (‘a bot’/‘not a bot’). Based on the experiment, we evaluated the bot detection efficiency of annotators in three scenarios typical for cybersecurity but differing in their detection difficulty as follows: (1) detection among random accounts, (2) detection among accounts of a social network ‘community’, and (3) detection among verified accounts. The study showed that humans could only detect simple bots in all three scenarios but could not detect more sophisticated ones (p-value = 0.05). The study also evaluates the limits of hypothetical and existing bot detection systems that leverage non-expert-labelled datasets as follows: the balanced accuracy of such systems can drop to 0.5 and lower, depending on bot complexity and detection scenario. The paper also describes the experiment design, collected datasets, statistical evaluation, and machine learning accuracy measures applied to support the results. In the discussion, we raise the question of using human labelling in bot detection systems and its potential cybersecurity issues. We also provide open access to the datasets used, experiment results, and software code for evaluating statistical and machine learning accuracy metrics used in this paper on GitHub. Full article
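
For readers unfamiliar with the metric, balanced accuracy is the mean of the true positive rate and the true negative rate, so a detector that labels every account the same way scores exactly 0.5. The sketch below is a generic implementation for binary labels (1 = bot, 0 = not a bot); it is not the authors' published code, and the name `balanced_accuracy` is an assumption.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = bot)."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present in y_true")
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (true_pos / positives + true_neg / negatives)


# Hypothetical labels: an annotator who never flags anyone as a bot.
truth = [1, 1, 0, 0, 0, 0]
never_flags = [0, 0, 0, 0, 0, 0]
score = balanced_accuracy(truth, never_flags)
```

Unlike plain accuracy, this value does not improve when the "not a bot" class simply happens to be larger.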

(This article belongs to the Special Issue Security, Privacy, and Trust in Artificial Intelligence Applications)

16 pages, 12168 KiB

Open Access Article

Solar and Wind Data Recognition: Fourier Regression for Robust Recovery

by Abdullah F. Al-Aboosi, Aldo Jonathan Muñoz Vazquez, Fadhil Y. Al-Aboosi, Mahmoud El-Halwagi and Wei Zhan

Big Data Cogn. Comput. 2024, 8(3), 23; https://doi.org/10.3390/bdcc8030023 - 24 Feb 2024

Cited by 2 | Viewed by 2803

Abstract

Accurate prediction of renewable energy output is essential for integrating sustainable energy sources into the grid, facilitating a transition towards a more resilient energy infrastructure. Novel applications of machine learning and artificial intelligence are being leveraged to enhance forecasting methodologies, enabling more accurate predictions and optimized decision-making capabilities. Integrating these novel paradigms improves forecasting accuracy, fostering a more efficient and reliable energy grid. These advancements allow better demand management, optimize resource allocation, and improve robustness to potential disruptions. The data collected from solar intensity and wind speed is often recorded through sensor-equipped instruments, which may encounter intermittent or permanent faults. Hence, this paper proposes a novel Fourier network regression model to process solar irradiance and wind speed data. The proposed approach enables accurate prediction of the underlying smooth components, facilitating effective reconstruction of missing data and enhancing the overall forecasting performance. The present study focuses on Midland, Texas, as a case study to assess direct normal irradiance (DNI), diffuse horizontal irradiance (DHI), and wind speed. Remarkably, the model exhibits a correlation of 1 with a minimal RMSE (root mean square error) of 0.0007555. This study leverages Fourier analysis for renewable energy applications, with the aim of establishing a methodology that can be applied to a novel geographic context. Full article

17 pages, 1200 KiB

Open Access Article

Comparison of Bagging and Sparcity Methods for Connectivity Reduction in Spiking Neural Networks with Memristive Plasticity

by Roman Rybka, Yury Davydov, Danila Vlasov, Alexey Serenko, Alexander Sboev and Vyacheslav Ilyin

Big Data Cogn. Comput. 2024, 8(3), 22; https://doi.org/10.3390/bdcc8030022 - 23 Feb 2024

Cited by 1 | Viewed by 2277

Abstract

Developing a spiking neural network architecture that could prospectively be trained on energy-efficient neuromorphic hardware to solve various data analysis tasks requires satisfying the limitations of prospective analog or digital hardware, i.e., local learning and limited numbers of connections, respectively. In this work, we compare two methods of connectivity reduction that are applicable to spiking networks with local plasticity; instead of a large fully-connected network (which is used as the baseline for comparison), we employ either an ensemble of independent small networks or a network with probabilistic sparse connectivity. We evaluate both of these methods with a three-layer spiking neural network, which is applied to handwritten and spoken digit classification tasks using two memristive plasticity models and the classical spike-timing-dependent plasticity (STDP) rule. Both methods achieve an F1-score of 0.93–0.95 on the handwritten digits recognition task and 0.85–0.93 on the spoken digits recognition task. Applying a combination of both methods made it possible to obtain highly accurate models while reducing the number of connections by a factor of more than three compared to the basic model. Full article

(This article belongs to the Special Issue Computational Intelligence: Spiking Neural Networks)

15 pages, 2801 KiB

Open Access Article

Anomaly Detection of IoT Cyberattacks in Smart Cities Using Federated Learning and Split Learning

by Ishaani Priyadarshini

Big Data Cogn. Comput. 2024, 8(3), 21; https://doi.org/10.3390/bdcc8030021 - 22 Feb 2024

Cited by 20 | Viewed by 5994

Abstract

The swift proliferation of the Internet of Things (IoT) devices in smart city infrastructures has created an urgent demand for robust cybersecurity measures. These devices are susceptible to various cyberattacks that can jeopardize the security and functionality of urban systems. This research presents an innovative approach to identifying anomalies caused by IoT cyberattacks in smart cities. The proposed method harnesses federated and split learning and addresses the dual challenge of enhancing IoT network security while preserving data privacy. This study conducts extensive experiments using authentic datasets from smart cities. To compare the performance of classical machine learning algorithms and deep learning models for detecting anomalies, model effectiveness is assessed using precision, recall, F-1 score, accuracy, and training/deployment time. The findings demonstrate that federated learning and split learning have the potential to balance data privacy concerns with competitive performance, providing robust solutions for detecting IoT cyberattacks. This study contributes to the ongoing discussion about securing IoT deployments in urban settings. It lays the groundwork for scalable and privacy-conscious cybersecurity strategies. The results underscore the vital role of these techniques in fortifying smart cities and promoting the development of adaptable and resilient cybersecurity measures in the IoT era. Full article
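
The abstract does not state which aggregation rule the federated setup uses. As a sketch under the assumption of standard federated averaging (FedAvg), a server could combine client model parameters weighted by each client's local sample count as shown below; the function name `federated_average` and the flat-list parameter layout are illustrative choices, and split learning is not covered here.

```python
def federated_average(client_weights, client_sizes):
    """Sample-size-weighted average of per-client parameter vectors."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    n_params = len(client_weights[0])
    if any(len(w) != n_params for w in client_weights):
        raise ValueError("all clients must share the same parameter layout")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(n_params)
    ]


# Hypothetical example: two IoT gateways holding 1 and 3 local samples.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Only parameters leave each device in this scheme, which is what lets raw sensor data stay local.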

(This article belongs to the Special Issue Deep Network Learning and Its Applications)

24 pages, 2186 KiB

Open Access Article

A Machine Learning-Based Pipeline for the Extraction of Insights from Customer Reviews

by Róbert Lakatos, Gergő Bogacsovics, Balázs Harangi, István Lakatos, Attila Tiba, János Tóth, Marianna Szabó and András Hajdu

Big Data Cogn. Comput. 2024, 8(3), 20; https://doi.org/10.3390/bdcc8030020 - 22 Feb 2024

Cited by 1 | Viewed by 2983

Abstract

The efficiency of natural language processing has improved dramatically with the advent of machine learning models, particularly neural network-based solutions. However, some tasks are still challenging, especially when considering specific domains. This paper presents a model that can extract insights from customer reviews using machine learning methods integrated into a pipeline. For topic modeling, our composite model uses transformer-based neural networks designed for natural language processing, vector-embedding-based keyword extraction, and clustering. The elements of our model have been integrated and tailored to better meet the requirements of efficient information extraction and topic modeling of the extracted information for opinion mining. Our approach was validated and compared with other state-of-the-art methods using publicly available benchmark datasets. The results show that our system performs better than existing topic modeling and keyword extraction methods in this task. Full article

(This article belongs to the Special Issue Artificial Intelligence and Natural Language Processing)

20 pages, 1280 KiB

Open Access Article

A Novel Algorithm for Multi-Criteria Ontology Merging through Iterative Update of RDF Graph

by Mohammed Suleiman Mohammed Rudwan and Jean Vincent Fonou-Dombeu

Big Data Cogn. Comput. 2024, 8(3), 19; https://doi.org/10.3390/bdcc8030019 - 21 Feb 2024

Cited by 1 | Viewed by 2306

Abstract

Ontology merging is an important task in ontology engineering to date. However, despite the efforts devoted to ontology merging, the incorporation of relevant features of ontologies such as axioms, individuals and annotations in the output ontologies remains challenging. Consequently, existing ontology-merging solutions produce new ontologies that do not include all the relevant semantic features from the candidate ontologies. To address these limitations, this paper proposes a novel algorithm for multi-criteria ontology merging that automatically builds a new ontology from candidate ontologies by iteratively updating an RDF graph in the memory. The proposed algorithm leverages state-of-the-art Natural Language Processing tools as well as a Machine Learning-based framework to assess the similarities and merge various criteria into the resulting output ontology. The key contribution of the proposed algorithm lies in its ability to merge relevant features from the candidate ontologies to build a more accurate, integrated and cohesive output ontology. The proposed algorithm is tested with five ontologies of different computing domains and evaluated in terms of its asymptotic behavior, quality and computational performance. The experimental results indicate that the proposed algorithm produces output ontologies that meet the integrity, accuracy and cohesion quality criteria better than related studies. This performance demonstrates the effectiveness and superior capabilities of the proposed algorithm. Furthermore, the proposed algorithm enables iterative in-memory update and building of the RDF graph of the resulting output ontology, which enhances the processing speed and improves the computational efficiency, making it an ideal solution for big data applications. Full article

Big Data Cogn. Comput., Volume 8, Issue 3 (March 2024) – 15 articles
