Machine Learning, Statistics and Big Data

A special issue of Mathematics (ISSN 2227-7390). This special issue belongs to the section "Probability and Statistics".

Deadline for manuscript submissions: 30 June 2024 | Viewed by 13551

Special Issue Editors


Guest Editor
Dep. of Statistics-Forecasts-Mathematics, Faculty of Economics and Business Administration & the Interdisciplinary Centre for Data Science, Babeș-Bolyai University, Cluj-Napoca, Romania
Interests: spatial econometrics; economic forecasting; econometrics; statistics

Guest Editor
Department of Finance and Accounting, Faculty of Economics, University of Oradea, Oradea, Romania
Interests: machine learning; sentiment analysis; AI in finance; behavioural finance; statistics

Special Issue Information

Dear Colleagues, 

Intense technological progress has led to a significant increase in data production and in the importance of evaluating these data. Algorithms have been constructed to analyze and predict data for decision-making purposes. Classical econometrics is increasingly being compared to, or even replaced by, machine learning methods for data analysis. Special analytical procedures are being developed for big data situations, which can be found in all fields of human activity, from finance to transportation. As the goal of the European Commission is to sustain innovations in machine learning and artificial intelligence techniques in different sectors, the main goal of this Special Issue is to gather researchers in the fields of statistics, econometrics, machine learning and big data. Contributions in the form of theoretical developments, procedure constructions, or applications of such methods are welcome. Submissions on FinTech and artificial intelligence methods applied in finance are encouraged.

This Special Issue is developed under the auspices of COST CA 19130 “Fintech and Artificial Intelligence in Finance”, supported by COST (European Cooperation in Science and Technology): www.cost.eu, https://fin-ai.eu/

Dr. Codruta Mare
Dr. Ioana Florina Coita
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Mathematics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • statistics
  • econometrics
  • machine learning
  • big data
  • financial econometrics
  • spatial econometrics
  • spatial machine learning
  • sentiment analysis
  • FinTech
  • digital finance
  • artificial intelligence
  • supervised vs. unsupervised learning
  • forecasting methods
  • IoT
  • cloud
  • blockchain
  • architecture for big data
  • big data analytics
  • data mining
  • cyberspace

Published Papers (9 papers)


Research

17 pages, 1903 KiB  
Article
Spillover Effect of Network Public Opinion on Market Prices of Small-Scale Agricultural Products
by Xingchen Lv, Weijun Lin, Jun Meng and Linan Mo
Mathematics 2024, 12(4), 539; https://doi.org/10.3390/math12040539 - 8 Feb 2024
Viewed by 499
Abstract
Network public opinion plays a crucial role in the behavior and decision making of various stakeholders, including farmers, middlemen, and consumers. It also affects the price fluctuations of small-scale agricultural products. Understanding the transmission path and spillover effect of network public opinion on the price fluctuations of these products is essential for ensuring their sustainable development and price stability. This paper selects the monthly data of network public opinion and related market prices of small-scale agricultural products from January 2014 to December 2021, constructs a network public opinion value through the sentiment classification results of deep learning models, and uses the trivariate VAR-BEKK-GARCH(1,1) model and spillover index model to study the spillover effect and spillover index of network public opinion on the market prices of small-scale agricultural products (national average price and origin price). The results show that: (1) There is a bidirectional volatility spillover effect between public opinion sentiment and the market prices of small-scale agricultural products. Additionally, this two-way volatility spillover effect is also evident between the average market prices and the origin prices of these commodities. (2) The influence of network public opinion on the market prices of small-scale agricultural products is substantial, with the spillover index being more pronounced for origin prices than for national average prices and reaching its zenith earlier. Consequently, based on these results, recommendations are provided to adapt planting and inventory strategies, enhance vigilance towards price risk transmission amongst small-scale agricultural product markets, and improve the comprehensive information platform encompassing the entire industry chain.
(This article belongs to the Special Issue Machine Learning, Statistics and Big Data)

22 pages, 1183 KiB  
Article
Mastery of “Monthly Effects”: Big Data Insights into Contrarian Strategies for DJI 30 and NDX 100 Stocks over a Two-Decade Period
by Chien-Liang Chiu, Paoyu Huang, Min-Yuh Day, Yensen Ni and Yuhsin Chen
Mathematics 2024, 12(2), 356; https://doi.org/10.3390/math12020356 - 22 Jan 2024
Viewed by 710
Abstract
Monthly performance patterns such as the January effect (i.e., better stock price performance in January than in other months) have been studied extensively. This study asks a different question: do investors obtain better subsequent performance when technical trading signals are emitted in a specific month? The question matters because, from the investment perspective, investors purchasing stocks now do not know their performance until later. We contend that this question is critical for steering investment decisions and enhancing profitability; nonetheless, it appears to be overlooked in the relevant literature. Using big data on the constituent stocks of the DJI 30 and NDX 100 indices from 2003 to 2022 (i.e., two decades of data), this study investigates whether trading these stocks when signals are emitted under contrarian rules based on stochastic oscillator indicators (SOIs) and the relative strength index (RSI) in specific months would result in superior subsequent performance (hereafter referred to as “monthly effects”). The study finds that the oversold signals generated by these two contrarian rules in March were associated with higher subsequent performance over holding periods of 100 to 250 trading days (roughly one year) than signals in other months. These findings highlight the importance of the trading time and the superiority of the RSI over SOIs in generating profits. This study sheds light on the significance of oversold trading signals and suggests that the “monthly effect” is crucial for achieving higher returns.
(This article belongs to the Special Issue Machine Learning, Statistics and Big Data)

26 pages, 5669 KiB  
Article
A Natural-Language-Processing-Based Method for the Clustering and Analysis of Movie Reviews and Classification by Genre
by Fernando González, Miguel Torres-Ruiz, Guadalupe Rivera-Torruco, Liliana Chonona-Hernández and Rolando Quintero
Mathematics 2023, 11(23), 4735; https://doi.org/10.3390/math11234735 - 22 Nov 2023
Viewed by 1356
Abstract
Reclassifying massive datasets acquired through approaches such as web scraping is a major challenge when demonstrating the effectiveness of a machine learning model. Notably, the quality of the dataset used for training strongly influences those models. Thus, we propose a threshold algorithm as an efficient method to remove stopwords. This method employs an unsupervised classification technique, such as K-means, to accurately categorize user reviews from the IMDb dataset into their most suitable categories, generating a well-balanced dataset. Analysis of the performance of the algorithm revealed that the text vectorization method notably influences cluster generation when assessing various preprocessing approaches. Moreover, the algorithm demonstrated that the word embedding technique and the removal of stopwords to retrieve the clustered text significantly impacted the categorization. The proposed method involves confirming the presence of a suggested stopword within each review across various genres. Upon satisfying this condition, the method assesses if the word’s frequency exceeds a predefined threshold. The threshold algorithm yielded a genre-mapping success above 80% compared to precompiled lists and a Zipf’s law-based method. In addition, we employed the mini-batch K-means method to form clusters for each differently preprocessed dataset. This approach enabled us to reclassify reviews more coherently. Summing up, our methodology categorizes sparsely labeled data into meaningful clusters, in particular, by using a combination of the proposed stopword removal method and TF-IDF. The reclassified and balanced datasets showed a significant improvement, achieving 94% accuracy compared to the original dataset.
(This article belongs to the Special Issue Machine Learning, Statistics and Big Data)

11 pages, 436 KiB  
Article
Efficient Estimation and Validation of Shrinkage Estimators in Big Data Analytics
by Salomi du Plessis, Mohammad Arashi, Gaonyalelwe Maribe and Salomon M. Millard
Mathematics 2023, 11(22), 4632; https://doi.org/10.3390/math11224632 - 13 Nov 2023
Viewed by 740
Abstract
Shrinkage estimators are often used to mitigate the consequences of multicollinearity in linear regression models. Despite the ease with which these techniques can be applied to small- or moderate-size datasets, they encounter significant challenges in the big data domain. Some of these challenges are that the volume of data often exceeds the storage capacity of a single computer and that the time required to obtain results becomes infeasible due to the computational burden of a high volume of data. We propose an algorithm for the efficient model estimation and validation of various well-known shrinkage estimators to be used in scenarios where the volume of the data is large. Our proposed algorithm utilises sufficient statistics that can be computed and updated at the row level, thus minimizing access to the entire dataset. A simulation study, as well as an application on a real-world dataset, illustrates the efficiency of the proposed approach.
(This article belongs to the Special Issue Machine Learning, Statistics and Big Data)
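
As a rough illustration of the row-level sufficient statistics mentioned in the abstract above, the sketch below accumulates X'X and X'y one observation at a time and then solves the ridge system (X'X + kI)b = X'y for different values of k without reading the data again. It is a minimal sketch under stated assumptions, not the authors' algorithm: the function names, the toy data, and the choice of plain ridge regression as the shrinkage estimator are all hypothetical and chosen only for illustration.

```python
def accumulate(rows, p):
    """Build X'X (p x p) and X'y (length p) in a single pass over (x, y) rows."""
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for x, y in rows:
        for i in range(p):
            xty[i] += x[i] * y
            for j in range(p):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]  # augmented copy; inputs stay untouched
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_from_stats(xtx, xty, k):
    """Ridge coefficients from the sufficient statistics only (k is the ridge penalty)."""
    p = len(xty)
    a = [[xtx[i][j] + (k if i == j else 0.0) for j in range(p)] for i in range(p)]
    return solve(a, xty)


# Toy data (hypothetical): y = 2*x1 + 3*x2 exactly.
rows = [([1.0, 0.0], 2.0), ([0.0, 1.0], 3.0), ([1.0, 1.0], 5.0)]
xtx, xty = accumulate(rows, 2)

# The same statistics serve every candidate penalty; the data are read only once.
b_ols = ridge_from_stats(xtx, xty, 0.0)
b_ridge = ridge_from_stats(xtx, xty, 1.0)
```

Because X'X and X'y are sums over rows, statistics computed on separate chunks or machines can simply be added together, which is what makes this kind of approach attractive when the data do not fit on a single computer.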

20 pages, 2643 KiB  
Article
JQPro: Join Query Processing in a Distributed System for Big RDF Data Using the Hash-Merge Join Technique
by Nahla Mohammed Elzein, Mazlina Abdul Majid, Ibrahim Abaker Targio Hashem, Ashraf Osman Ibrahim, Anas W. Abulfaraj and Faisal Binzagr
Mathematics 2023, 11(5), 1275; https://doi.org/10.3390/math11051275 - 6 Mar 2023
Cited by 1 | Viewed by 1614
Abstract
In the last decade, the volume of semantic data has increased exponentially, with Resource Description Framework (RDF) datasets exceeding trillions of triples in RDF repositories. Hence, the size of RDF datasets continues to grow. However, with the increasing number of RDF triples, the demand for complex multiple RDF queries is growing significantly. Sometimes, such complex queries produce many common sub-expressions in a single query or over multiple queries running as a batch. In addition, it is difficult to minimize the number of RDF queries and the processing time for a large amount of related data in a typical distributed environment. To address this complication, we introduce a join query processing model for big RDF data, called JQPro. By adopting a MapReduce framework in JQPro, we developed three new algorithms for join query processing of RDF data: hash-join, sort-merge, and enhanced MapReduce-join. In our experiment, the JQPro model outperformed two popular algorithms, gStore and RDF-3X, with respect to the average execution time. Furthermore, the JQPro model was also tested against RDF-3X, RDFox, and PARJs using the LUBM benchmark. The result showed that the JQPro model had better performance in comparison with the other models. In conclusion, the findings showed that JQPro achieved improved performance of 87.77% in terms of execution time and performs better than the selected models.
(This article belongs to the Special Issue Machine Learning, Statistics and Big Data)

19 pages, 3590 KiB  
Article
Blockchain-Based Distributed Federated Learning in Smart Grid
by Marcel Antal, Vlad Mihailescu, Tudor Cioara and Ionut Anghel
Mathematics 2022, 10(23), 4499; https://doi.org/10.3390/math10234499 - 29 Nov 2022
Cited by 6 | Viewed by 2050
Abstract
The participation of prosumers in demand-response programs is essential for the success of demand-side management in renewable-powered energy grids. Unfortunately, engagement is still low due to concerns related to the privacy of their energy data used in the prediction processes. In this paper, we propose a blockchain-based distributed federated learning (FL) technique for energy-demand prediction that combines FL with blockchain to provide data privacy and trust features for energy prosumers. The privacy-sensitive energy data are stored locally at edge prosumer nodes without revealing them to third parties, with only the learned local model weights being shared using a blockchain network. The global federated model is not centralized but distributed and replicated over the blockchain overlay, ensuring model immutability and the provenance of parameter updates. We propose smart contracts to deal with the integration of local machine-learning prediction models with the blockchain, defining functions for the model parameters’ scaling and reduction of blockchain overhead. The centralized, local-edge, and blockchain-integrated models are comparatively evaluated for prediction of energy demand 24 h ahead using a multi-layer perceptron model and the monitored energy data of several prosumers. The results show only a slight decrease in prediction accuracy in the case of blockchain-based distributed FL with reliable data privacy support compared with the centralized learning solution.
(This article belongs to the Special Issue Machine Learning, Statistics and Big Data)

13 pages, 936 KiB  
Article
Machine Learning Models for Predicting Romanian Farmers’ Purchase of Crop Insurance
by Codruţa Mare, Daniela Manaţe, Gabriela-Mihaela Mureşan, Simona Laura Dragoş, Cristian Mihai Dragoş and Alexandra-Anca Purcel
Mathematics 2022, 10(19), 3625; https://doi.org/10.3390/math10193625 - 3 Oct 2022
Cited by 3 | Viewed by 1844
Abstract
Considering the large size of the agricultural sector in Romania, increasing the crop insurance adoption rate and identifying the factors that drive adoption are of real interest for the Romanian market. The main objective of this research was to assess the performance of machine learning (ML) models in predicting Romanian farmers’ purchase of crop insurance based on crop-level and farmer-level characteristics. The data set used contains 721 responses to a survey administered to Romanian farmers in September 2021, and includes both characteristics related to the crop as well as farmer-level socio-demographic attributes, perception about risk, perception about insurers and knowledge about agricultural insurance. Various ML algorithms have been implemented, and among the approaches developed, the Multi-Layer Perceptron Classifier (MLP) and the Linear Support Vector Classifier (SVC) outperform the other algorithms in terms of overall accuracy. Tree-based ensembles were used to identify the most prominent features, which included the farmer’s general perception of risk, their likelihood of engaging in risky behaviour, as well as their level of knowledge about crop insurance. The models implemented in this study could be a useful tool for insurers and policymakers for predicting potential crop insurance ownership.
(This article belongs to the Special Issue Machine Learning, Statistics and Big Data)

27 pages, 420 KiB  
Article
Ridge Regression and the Elastic Net: How Do They Do as Finders of True Regressors and Their Coefficients?
by Rajaram Gana
Mathematics 2022, 10(17), 3057; https://doi.org/10.3390/math10173057 - 24 Aug 2022
Cited by 1 | Viewed by 1323
Abstract
For the linear model Y=Xb+error, where the number of regressors (p) exceeds the number of observations (n), the Elastic Net (EN) was proposed, in 2005, to estimate b. The EN uses both the Lasso, proposed in 1996, and ordinary Ridge Regression (RR), proposed in 1970, to estimate b. However, when p>n, using only RR to estimate b has not been considered in the literature thus far. Because RR is based on the least-squares framework, only using RR to estimate b is computationally much simpler than using the EN. We propose a generalized ridge regression (GRR) algorithm, a superior alternative to the EN, for estimating b as follows: partition X from left to right so that every partition, but the last one, has 3 observations per regressor; for each partition, we estimate Y with the regressors in that partition using ordinary RR; retain the regressors with statistically significant t-ratios and the corresponding RR tuning parameter k, by partition; use the retained regressors and k values to re-estimate Y by GRR across all partitions, which yields b. Algorithmic efficacy is compared using 4 metrics by simulation, because the algorithm is mathematically intractable. Three metrics, with their probabilities of RR’s superiority over EN in parentheses, are: the proportion of true regressors discovered (99%); the squared distance, from the true coefficients, of the significant coefficients (86%); and the squared distance, from the true coefficients, of estimated coefficients that are both significant and true (74%). The fourth metric is the probability that none of the regressors discovered are true, which for RR and EN is 4% and 25%, respectively. This indicates the additional advantage RR has over the EN in terms of discovering causal regressors.
(This article belongs to the Special Issue Machine Learning, Statistics and Big Data)

22 pages, 483 KiB  
Article
Efficient Mining Support-Confidence Based Framework Generalized Association Rules
by Amira Mouakher, Fahima Hajjej and Sarra Ayouni
Mathematics 2022, 10(7), 1163; https://doi.org/10.3390/math10071163 - 3 Apr 2022
Cited by 1 | Viewed by 1642
Abstract
Mining association rules is one of the most critical data mining problems and has been intensively studied since its inception. Several approaches have been proposed in the literature to extend the basic association rule framework to extract more general rules, including the negation operator. This extension is expected to bring valuable knowledge about an examined dataset to the user. However, the efficient extraction of such rules is challenging, especially for sparse datasets. This paper focuses on the extraction of literalsets, i.e., sets of present and absent items. Consequently, generalized association rules can be straightforwardly derived from these literalsets. To this end, we introduce and prove the soundness of a theorem that paves the way to speed up the costly computation of the support of a literalset. Furthermore, we introduce FasterIE, an efficient algorithm that puts the proved theorem to work to efficiently extract the whole set of frequent literalsets. Thus, the FasterIE algorithm is shown to devise very efficient strategies, which minimize as far as possible the number of node visits in the explored search space. Finally, we have carried out experiments on benchmark datasets to back the effectiveness claim of the proposed algorithm versus its competitors.
(This article belongs to the Special Issue Machine Learning, Statistics and Big Data)
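
The abstract above does not state the theorem itself. As general background only, the support of a literalset (items that must be present plus items that must be absent) can always be derived from the supports of ordinary itemsets by inclusion-exclusion, at a cost that grows exponentially with the number of absent items. The sketch below checks that identity against a direct count. It is an illustration under stated assumptions, not the FasterIE algorithm: the function names and the toy transactions are hypothetical.

```python
from itertools import combinations


def support(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)


def literalset_support_direct(transactions, present, absent):
    """Count transactions containing all `present` items and none of the `absent` ones."""
    present, absent = set(present), set(absent)
    return sum(1 for t in transactions if present <= t and not (absent & t))


def literalset_support_ie(transactions, present, absent):
    """Same count via inclusion-exclusion over the subsets Z of the absent items:
    sum over Z of (-1)^|Z| * support(present ∪ Z)."""
    present, absent = set(present), sorted(absent)
    total = 0
    for size in range(len(absent) + 1):
        sign = -1 if size % 2 else 1
        for subset in combinations(absent, size):
            total += sign * support(transactions, present | set(subset))
    return total


# Toy transaction database (hypothetical).
transactions = [{"a", "b"}, {"a"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]

# Literalset "a present, b and c absent": both routes give the same count.
direct = literalset_support_direct(transactions, {"a"}, {"b", "c"})
via_ie = literalset_support_ie(transactions, {"a"}, {"b", "c"})
```

The inclusion-exclusion route needs one itemset support per subset of the absent items, so 2^m lookups for m absent items, which suggests why reducing this cost matters for mining frequent literalsets.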