Supervised and Unsupervised Classification Algorithms (2nd Edition)

A special issue of Algorithms (ISSN 1999-4893). This special issue belongs to the section "Algorithms for Multidisciplinary Applications".

Deadline for manuscript submissions: 31 August 2024 | Viewed by 11167

Special Issue Editors


Dr. Mario Rosario Guarracino
Guest Editor
Department of Economics and Law, University of Cassino and Southern Lazio, 03043 Cassino, Italy
Interests: data science; statistical network analysis; supervised classification

Special Issue Information

Dear Colleagues,

Supervised and unsupervised classification algorithms are the two main branches of machine learning methods. Supervised classification refers to the task of training a system on labeled data divided into classes and assigning new data to those existing classes. The process consists of computing a model from a set of labeled training data and then applying the model to predict the class labels of incoming unlabeled data. It is called supervised learning because the labeled training data set supervises the learning process. Supervised learning algorithms are divided into two categories: classification and regression.

In unsupervised classification, the data being processed are unlabeled; in the absence of prior knowledge, the algorithm searches for similarities in the data to form clusters and assign classes. Unsupervised learning algorithms are divided into three categories: clustering, density estimation, and dimensionality reduction.
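
To make the distinction concrete, the following minimal sketch (using scikit-learn) fits a supervised classifier on labeled data and then runs a clustering algorithm on the same data with the labels withheld.

```python
# Minimal illustration of the two settings: a supervised classifier is
# trained with labels, while a clustering algorithm must discover the
# class structure on its own.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised classification: the labeled training set supervises the fit.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("predicted class for first sample:", clf.predict(X[:1]))

# Unsupervised classification: only X is given; clusters stand in for classes.
km = KMeans(n_clusters=3, n_init=10).fit(X)
print("cluster assignments of first five samples:", km.labels_[:5])
```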

Applications range from object detection in biomedical images and disease prediction to natural language understanding and generation.

Submissions are welcome for both traditional classification problems and new applications. Potential topics include, but are not limited to, image classification, data integration, clustering approaches, and feature extraction.

Dr. Mario Rosario Guarracino
Dr. Laura Antonelli
Dr. Pietro Hiram Guzzi
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Algorithms is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss francs). Submitted papers should be well formatted and written in good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • supervised classification algorithms
  • clustering algorithms
  • network analysis
  • community extraction
  • data science
  • biological knowledge extraction


Published Papers (8 papers)


Research

17 pages, 368 KiB  
Article
Smooth Information Criterion for Regularized Estimation of Item Response Models
by Alexander Robitzsch
Algorithms 2024, 17(4), 153; https://doi.org/10.3390/a17040153 - 06 Apr 2024
Viewed by 666
Abstract
Item response theory (IRT) models are frequently used to analyze multivariate categorical data from questionnaires or cognitive tests. To reduce the model complexity of item response models, regularized estimation is now widely applied, adding a nondifferentiable penalty function such as the LASSO or SCAD penalty to the log-likelihood function in the optimization. In most applications, regularized estimation repeatedly estimates the IRT model on a grid of regularization parameters λ. The final model is selected for the parameter that minimizes the Akaike or Bayesian information criterion (AIC or BIC). In recent work, it has been proposed to directly minimize a smooth approximation of the AIC or the BIC for regularized estimation. This approach circumvents the repeated estimation of the IRT model, so the computation time is substantially reduced. The adequacy of the new approach is demonstrated by three simulation studies focusing on regularized estimation for IRT models with differential item functioning, multidimensional IRT models with cross-loadings, and the mixed Rasch/two-parameter logistic IRT model. The simulation studies found that the computationally less demanding direct optimization based on the smooth variants of the AIC and BIC had comparable or improved performance relative to the ordinarily employed repeated regularized estimation based on the AIC or BIC.
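
As a rough illustration of the idea, the sketch below (not the author's IRT implementation) replaces the nonsmooth parameter count in the BIC of a plain linear model with a differentiable surrogate and minimizes the criterion directly; the smoothing constant eps is a hypothetical choice.

```python
# Toy sketch: minimize a smooth BIC directly instead of refitting on a
# grid of regularization parameters. The nonsmooth "number of nonzero
# parameters" is replaced by the differentiable surrogate b^2/(b^2 + eps).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.5] + [0.0] * (p - 2))
y = X @ beta_true + rng.normal(size=n)

def smooth_bic(beta, eps=1e-3):
    rss = np.sum((y - X @ beta) ** 2)
    k = np.sum(beta**2 / (beta**2 + eps))   # smooth surrogate parameter count
    return n * np.log(rss / n) + np.log(n) * k

res = minimize(smooth_bic, x0=np.zeros(p), method="L-BFGS-B")
print(np.round(res.x, 2))   # coefficients of the directly selected model
```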

20 pages, 6112 KiB  
Article
Multi-Augmentation-Based Contrastive Learning for Semi-Supervised Learning
by Jie Wang, Jie Yang, Jiafan He and Dongliang Peng
Algorithms 2024, 17(3), 91; https://doi.org/10.3390/a17030091 - 20 Feb 2024
Viewed by 1005
Abstract
Semi-supervised learning has proven effective in utilizing unlabeled samples to mitigate the problem of limited labeled data. Traditional semi-supervised learning methods generate pseudo-labels for unlabeled samples and train the classifier using both labeled and pseudo-labeled samples. However, in data-scarce scenarios, reliance on labeled samples for initial classifier generation can degrade performance. Methods based on consistency regularization have shown promising results by encouraging consistent outputs for different semantic variations of the same sample obtained through diverse augmentation techniques. However, existing methods typically utilize only weak and strong augmentation variants, limiting information extraction. Therefore, a multi-augmentation contrastive semi-supervised learning method (MAC-SSL) is proposed. MAC-SSL introduces moderate augmentation, combining outputs from moderately and weakly augmented unlabeled images to generate pseudo-labels. A cross-entropy loss ensures consistency between strongly augmented image outputs and the pseudo-labels. Furthermore, MixUp is adopted to blend outputs from labeled and unlabeled images, enhancing consistency between re-augmented outputs and new pseudo-labels. The proposed method achieves state-of-the-art accuracy in extensive experiments conducted on multiple datasets with varying numbers of labeled samples. Ablation studies further investigate each component's significance.
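
A schematic PyTorch-style sketch of the pseudo-labeling step, as we read it from the abstract, is given below; model, the three augmentation functions, and the confidence threshold tau are hypothetical placeholders rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def mac_ssl_step(model, x_u, weak_aug, moderate_aug, strong_aug, tau=0.95):
    """One unlabeled-batch step of the scheme described in the abstract."""
    with torch.no_grad():
        # Average weakly and moderately augmented predictions into a pseudo-label.
        p_weak = torch.softmax(model(weak_aug(x_u)), dim=1)
        p_mod = torch.softmax(model(moderate_aug(x_u)), dim=1)
        conf, pseudo = ((p_weak + p_mod) / 2).max(dim=1)
        mask = (conf >= tau).float()          # keep only confident pseudo-labels
    # Consistency loss: strongly augmented outputs should match the pseudo-labels.
    logits_strong = model(strong_aug(x_u))
    loss_u = (F.cross_entropy(logits_strong, pseudo, reduction="none") * mask).mean()
    return loss_u
```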

23 pages, 4215 KiB  
Article
Frequent Errors in Modeling by Machine Learning: A Prototype Case of Predicting the Timely Evolution of COVID-19 Pandemic
by Károly Héberger
Algorithms 2024, 17(1), 43; https://doi.org/10.3390/a17010043 - 19 Jan 2024
Viewed by 1634
Abstract
Background: The development and application of machine learning (ML) methods have become so fast that almost nobody can follow their developments in every detail. It is no wonder that numerous errors and inconsistencies in their usage have spread with similar speed, independently of the task (regression or classification). This work summarizes frequent errors committed by certain authors with the aim of helping scientists avoid them. Methods: The principle of parsimony governs the train of thought. Fair method comparison can be completed with multicriteria decision-making techniques, preferably by the sum of ranking differences (SRD). Its coupling with analysis of variance (ANOVA) decomposes the effects of several factors. Earlier findings are summarized in a review-like manner: the abuse of the correlation coefficient and proper practices for model discrimination are also outlined. Results: Using an illustrative example, the correct practice and methodology are summarized as guidelines for model discrimination and for minimizing prediction errors. The following factors are all prerequisites for successful modeling: proper data preprocessing, statistical tests, suitable performance parameters, appropriate degrees of freedom, fair comparison of models, and outlier detection, to name a few. A checklist is provided, in a tutorial manner, on how to present ML modeling properly. The advocated practices are reviewed briefly in the discussion. Conclusions: Many of the errors can easily be filtered out with careful reviewing. It is every author's responsibility to adhere to the rules of modeling and validation. A representative sampling of the recent literature outlines correct practices and emphasizes that no error-free publication exists.
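
For readers unfamiliar with SRD, the sketch below illustrates the principle only (not the validated SRD implementation): rank each method's results over the test cases, compare with the ranking of a reference column, and sum the absolute rank differences.

```python
# Sum of ranking differences (SRD), illustrative version: the reference
# is taken as the row-wise average, a common consensus choice.
# Smaller SRD = closer agreement with the reference ranking.
import numpy as np
from scipy.stats import rankdata

def srd(results, reference=None):
    """results: array of shape (n_cases, n_methods)."""
    if reference is None:
        reference = results.mean(axis=1)    # consensus reference column
    ref_rank = rankdata(reference)
    return np.array([np.abs(rankdata(results[:, j]) - ref_rank).sum()
                     for j in range(results.shape[1])])

rng = np.random.default_rng(1)
scores = rng.normal(size=(20, 4))           # 20 test cases, 4 methods
print(srd(scores))                          # one SRD value per method
```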

24 pages, 5543 KiB  
Article
A Multi-Class Deep Learning Approach for Early Detection of Depressive and Anxiety Disorders Using Twitter Data
by Lamia Bendebane, Zakaria Laboudi, Asma Saighi, Hassan Al-Tarawneh, Adel Ouannas and Giuseppe Grassi
Algorithms 2023, 16(12), 543; https://doi.org/10.3390/a16120543 - 27 Nov 2023
Viewed by 1616
Abstract
Social media occupies an important place in people's daily lives, where users share content such as thoughts, experiences, events, and feelings. The massive use of social media has led to the generation of huge volumes of data. These data constitute a treasure trove, allowing the extraction of large amounts of relevant information, particularly when deep learning techniques are involved. In this context, various research studies have investigated the detection of mental disorders, notably depression and anxiety, through the analysis of data extracted from the Twitter platform. However, although these studies achieved very satisfactory results, they relied mainly on binary classification models, treating each mental disorder separately. It would be better to develop systems capable of dealing with several mental disorders at the same time. To address this point, we propose a well-defined methodology involving the use of deep learning to develop effective multi-class models for detecting both depressive and anxiety disorders through the analysis of tweets. The idea is to test a large number of deep learning models, ranging from simple to hybrid variants, to examine their strengths and weaknesses. Moreover, we use grid search to find suitable values for the learning-rate hyperparameter, given its importance in training models. Our work is validated through several experiments and comparisons considering various datasets and other binary classification models, with the aim of showing the effectiveness of both the assumptions used to collect the data and the use of multi-class rather than binary-class models. Overall, the results obtained are satisfactory and very competitive compared with related works.
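
As a runnable stand-in for the learning-rate grid search described above, the sketch below (scikit-learn, synthetic data) loops over candidate rates and keeps the one with the best validation accuracy; the authors' actual models are deep networks trained on tweets.

```python
# Grid search over the learning-rate hyperparameter of a small MLP;
# illustrates the selection loop, not the authors' architectures or data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

best_lr, best_acc = None, 0.0
for lr in [1e-4, 1e-3, 1e-2, 1e-1]:
    clf = MLPClassifier(hidden_layer_sizes=(64,), learning_rate_init=lr,
                        max_iter=300, random_state=0).fit(X_tr, y_tr)
    acc = clf.score(X_val, y_val)       # validation accuracy for this rate
    if acc > best_acc:
        best_lr, best_acc = lr, acc
print(f"best learning rate: {best_lr} (validation accuracy {best_acc:.3f})")
```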

27 pages, 9031 KiB  
Article
Supervised Methods for Modeling Spatiotemporal Glacier Variations by Quantification of the Area and Terminus of Mountain Glaciers Using Remote Sensing
by Edmund Robbins, Thu Thu Hlaing, Jonathan Webb and Nezamoddin N. Kachouie
Algorithms 2023, 16(10), 486; https://doi.org/10.3390/a16100486 - 19 Oct 2023
Viewed by 1150
Abstract
Glaciers are important indicators of climate change, as changes in their physical features, such as area, respond to measurable fluctuations in climate factors such as temperature, precipitation, and CO2. Although a general retreat of mountain glacier systems has been identified in relation to centennial trends toward warmer temperatures, a great deal more information on regional climate variations can potentially be extracted by mapping the time history of the terminus position or surface area of glaciers. The remote nature of glaciers renders direct measurement impractical on anything other than a local scale. Considering the sheer number of mountain glaciers around the globe, ground measurements of terminus position are available for only a small percentage of glaciers, and ground measurements of glacier area are rare. In this project, changes in the terminus and area of the Franz Josef and Gorner glaciers were quantified in response to climate factors using satellite imagery taken by Landsat at regular intervals. Two supervised learning methods, a parametric method (multiple regression) and a nonparametric method (generalized additive model), were implemented to identify climate factors that impact glacier changes. Local temperature, CO2, and precipitation were identified as significant factors for predicting changes in both the Franz Josef and Gorner glaciers. Spatiotemporal quantification of glacier change is an essential task for modeling glacier variations in response to global and local climate factors. This work provides valuable insights into the quantification of glacier surface area from satellite imagery, with potential implementation as a generic approach.
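
The two supervised approaches named in the abstract can be sketched as follows on synthetic stand-ins for the predictors (temperature, CO2, precipitation) and the response (glacier area); this assumes the statsmodels and pygam packages and is not the authors' pipeline.

```python
# Parametric multiple regression (statsmodels) and a nonparametric
# generalized additive model (pygam) fit to synthetic climate-like data.
import numpy as np
import statsmodels.api as sm
from pygam import LinearGAM, s

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))               # temperature, CO2, precipitation
area = (5 - 1.2 * X[:, 0] - 0.8 * X[:, 1] + 0.3 * X[:, 2]
        + rng.normal(scale=0.5, size=120))  # synthetic glacier area

ols = sm.OLS(area, sm.add_constant(X)).fit()      # multiple regression
print(ols.pvalues)                                 # per-factor significance

gam = LinearGAM(s(0) + s(1) + s(2)).fit(X, area)   # one smooth term per factor
gam.summary()
```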

26 pages, 35873 KiB  
Article
Quantitative and Qualitative Comparison of Decision-Map Techniques for Explaining Classification Models
by Yu Wang, Alister Machado and Alexandru Telea
Algorithms 2023, 16(9), 438; https://doi.org/10.3390/a16090438 - 11 Sep 2023
Viewed by 1160
Abstract
Visualization techniques for understanding and explaining machine learning models have gained significant attention. One such technique is the decision map, which creates a 2D depiction of the decision behavior of classifiers trained on high-dimensional data. While several decision-map techniques have been proposed recently, such as Decision Boundary Maps (DBMs), Supervised Decision Boundary Maps (SDBMs), and DeepView (DV), there is no framework for comprehensively evaluating and comparing them. In this paper, we propose such a framework by combining quantitative metrics and qualitative assessment. We apply our framework to DBM, SDBM, and DV using a range of synthetic and real-world classification techniques and datasets. Our results show that none of the evaluated decision-map techniques consistently outperforms the others in all measured aspects. Separately, our analysis exposes several previously unknown properties and limitations of decision-map techniques. To support practitioners, we also propose a workflow for selecting the most appropriate decision-map technique for given datasets, classifiers, and requirements of the application at hand.
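
A much-simplified sketch of the decision-map idea is shown below: project the data to 2D and color a dense grid by classifier decisions. The techniques compared in the paper (DBM, SDBM, DeepView) classify grid points mapped back to the original high-dimensional space via an inverse projection; for a self-contained toy, this sketch instead trains the classifier in the projected space.

```python
# Toy decision map: color a 2D grid over a PCA projection by the
# decisions of a classifier trained in the projected space.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X2 = PCA(n_components=2).fit_transform(X)        # 2D projection of the data
clf = KNeighborsClassifier().fit(X2, y)

# Classify every point of a dense grid covering the projection.
xs = np.linspace(X2[:, 0].min(), X2[:, 0].max(), 300)
ys = np.linspace(X2[:, 1].min(), X2[:, 1].max(), 300)
gx, gy = np.meshgrid(xs, ys)
zz = clf.predict(np.c_[gx.ravel(), gy.ravel()]).reshape(gx.shape)

plt.contourf(gx, gy, zz, levels=10, alpha=0.4)   # the decision map
plt.scatter(X2[:, 0], X2[:, 1], c=y, s=5)
plt.show()
```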

20 pages, 447 KiB  
Article
Optimized Tensor Decomposition and Principal Component Analysis Outperforming State-of-the-Art Methods When Analyzing Histone Modification Chromatin Immunoprecipitation Profiles
by Turki Turki, Sanjiban Sekhar Roy and Y.-H. Taguchi
Algorithms 2023, 16(9), 401; https://doi.org/10.3390/a16090401 - 23 Aug 2023
Cited by 1 | Viewed by 1689
Abstract
It is difficult to identify histone modifications from high-throughput sequencing datasets. Although multiple methods have been developed to identify histone modification, most are not specific to histone modification but are general methods that aim to identify protein binding to the genome. In this study, tensor decomposition (TD)- and principal component analysis (PCA)-based unsupervised feature extraction with optimized standard deviation, previously applied successfully to gene expression and DNA methylation, was used to identify histone modification. Histone modification along the genome is binned within regions of length L. Considering the principal components (PCs) or singular value vectors (SVVs) that PCA or TD attributes to samples, we can select PCs or SVVs attributed to regions. The selected PCs and SVVs further attribute p-values to regions, and adjusted p-values are used to select regions. The proposed method identified various histone modifications successfully and outperformed various state-of-the-art methods. It is expected to serve as a de facto standard method for identifying histone modification. For reproducibility, and to ensure that the systematic analysis of our study is applicable to datasets from different gene expression experiments, we have made our tools publicly available for download from GitHub.
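
A rough sketch of the general recipe, as we understand it, is given below: decompose a regions-by-samples signal matrix, attribute p-values to regions from a selected component under a Gaussian assumption, and keep regions passing a Benjamini-Hochberg adjustment. The optimized standard deviation, which is central to the paper, is omitted here.

```python
# Illustrative PCA/SVD-based unsupervised feature extraction over binned
# genomic signal; synthetic data stand in for histone-modification counts.
import numpy as np
from scipy.stats import chi2
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
signal = rng.poisson(5.0, size=(2000, 6)).astype(float)  # regions x samples
signal[:50] += 20.0                                       # enriched regions

u, sv, vt = np.linalg.svd(signal - signal.mean(axis=0), full_matrices=False)
score = u[:, 0]                                  # component attributed to regions
p = chi2.sf((score / score.std()) ** 2, df=1)    # per-region p-values
reject, p_adj, _, _ = multipletests(p, alpha=0.05, method="fdr_bh")
print("regions selected:", int(reject.sum()))
```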

17 pages, 599 KiB  
Article
Two Medoid-Based Algorithms for Clustering Sets
by Libero Nigro and Pasi Fränti
Algorithms 2023, 16(7), 349; https://doi.org/10.3390/a16070349 - 20 Jul 2023
Cited by 1 | Viewed by 1078
Abstract
This paper proposes two algorithms for clustering data that are variable-sized sets of elementary items. An example of such data occurs in the analysis of medical diagnoses, where the goal is to detect human subjects who share common diseases, so as to possibly predict future illnesses from previous medical history. The first proposed algorithm is based on K-medoids, and the second extends the random swap algorithm, which has proven capable of efficient and careful clustering; both algorithms depend on a distance function over data objects (sets), which can use application-sensitive weights or priorities. The proposed distance function makes it possible to exploit several seeding methods that can improve clustering accuracy. A key factor in the two algorithms is their parallel implementation in Java, based on functional programming using streams and lambda expressions. The use of parallelism smooths out the O(N²) computational cost behind K-medoids and clustering indexes such as the Silhouette index, and allows for the handling of non-trivial datasets. This paper applies the algorithms to several benchmark case studies of sets and demonstrates how accurate and time-efficient clustering solutions can be achieved.
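
The paper's implementation is parallel Java; the compact sequential Python sketch below illustrates K-medoids over variable-sized sets with a plain Jaccard distance, omitting the weighted distance and the seeding methods discussed in the abstract.

```python
# K-medoids over sets: assign each set to its nearest medoid under
# Jaccard distance, then re-pick each medoid as the cluster member
# minimizing total intra-cluster distance.
import random

def jaccard(a: set, b: set) -> float:
    return 1.0 - len(a & b) / len(a | b) if (a or b) else 0.0

def k_medoids(data, k, iters=20, seed=0):
    random.seed(seed)
    medoids = random.sample(range(len(data)), k)
    for _ in range(iters):
        # Assign every set to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, s in enumerate(data):
            clusters[min(medoids, key=lambda m: jaccard(s, data[m]))].append(i)
        # Re-pick each medoid as the member minimizing intra-cluster distance.
        medoids = [min(idx, key=lambda c: sum(jaccard(data[c], data[j]) for j in idx))
                   for idx in clusters.values() if idx]
    # Final assignment of every set to its nearest medoid.
    return medoids, {i: min(medoids, key=lambda m: jaccard(s, data[m]))
                     for i, s in enumerate(data)}

records = [{"flu", "asthma"}, {"flu", "cough"}, {"diabetes"}, {"diabetes", "obesity"}]
print(k_medoids(records, k=2))
```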
