1. Introduction
With rapid technological advancement, producing, storing, and accessing data have become increasingly easy. However, this has led to the accumulation of massive amounts of data, making it necessary to extract meaningful information from them. Data mining, the process of making sense of data, involves a variety of operations that transform raw data into information [1]. Data mining is a comprehensive, interdisciplinary field that integrates methods from multiple scientific domains, including statistics, machine learning, artificial intelligence, algorithm design, optimization, database management, and visualization [2]. It has numerous applications, such as quality control, bioinformatics, machine vision, and customer analytics. Machine learning algorithms, including neural networks, regression, instance-based methods, Bayesian methods, and decision tree-based approaches, are commonly used in the literature for classification or clustering. Classification aims to predict the labels of new instances using supervised learning methods. Clustering, in contrast, groups similar instances within a dataset into subgroups, which can make large-scale problems more tractable.
Fault diagnosis is the process of detecting faults in systems and is used in many different industries. For instance, in the automotive industry [3], it is used to identify malfunctions in vehicles. In computer science, it is utilized to find and correct software bugs [4,5]. In the aerospace sector [6], it serves to detect potential faults in aircraft and spacecraft. Overall, fault diagnosis is employed to enhance system efficiency, minimize disruptions, and prevent potential failures.
In this study, we focused on the fault detection of steel plates. The Steel Plates Faults dataset, provided by the Semeion Research Center in Rome, categorizes steel plate faults into seven types and is therefore extensively used in machine learning research on automatic pattern recognition. The dataset supports the identification of inferior-quality products in manufacturing industries where steel plates are a crucial raw material. The proposed algorithm is applied to this dataset, and the results are compared with those reported in the literature.
Accordingly, our work aims to (1) refine clustering methods for identifying the most effective dataset segmentation, (2) investigate which feature subsets best enhance classification performance, and (3) compare the proposed ETSCCM approach with established algorithms in terms of accuracy and efficiency. To this end, we employed the Steel Plates Faults dataset—previously utilized in various benchmarking studies—to maximize classification performance. After standard preprocessing and normalization, we pursued two key objectives simultaneously: (i) determining the optimal number of clusters and their centers via a hybrid approach that combines K-means and hierarchical clustering, and (ii) identifying the best classifier among the random forest (RF), K-nearest neighbors (KNN), and support vector machine (SVM) models. RF achieved the highest accuracy and was thus chosen for subsequent feature selection experiments, which generated multiple feature subsets. We then refined RF’s performance across these subsets and cluster configurations through parameter optimization and compared the results with findings from the literature. Overall, this process yielded a 6–18% improvement in classification accuracy, underscoring the core contribution of this study.
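To make this workflow concrete, the sketch below illustrates the two objectives in scikit-learn-style Python. It is a minimal illustration rather than the implementation used in this study: synthetic data produced by make_classification stands in for the Steel Plates Faults dataset, and the cluster count k = 3 is an assumed placeholder rather than the value selected by the method.

```python
# Minimal sketch of the workflow outlined above (not the authors' code);
# synthetic data stands in for the Steel Plates Faults dataset.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1941, n_features=27, n_informative=10,
                           n_classes=7, random_state=0)  # stand-in data
X = StandardScaler().fit_transform(X)  # preprocessing/normalization step

# (i) a Ward hierarchy proposes a cluster count and centers; K-means refines them
k = 3  # assumed placeholder; selected from the hierarchy in the actual method
ward_labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
centers = np.vstack([X[ward_labels == c].mean(axis=0) for c in range(k)])
clusters = KMeans(n_clusters=k, init=centers, n_init=1).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))

# (ii) pick the strongest base classifier by cross-validated accuracy
for name, clf in [("RF", RandomForestClassifier(random_state=0)),
                  ("KNN", KNeighborsClassifier()),
                  ("SVM", SVC())]:
    print(name, cross_val_score(clf, X, y, cv=5).mean().round(3))
```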
In the following sections, we briefly review key research on data mining, classification, feature selection, and clustering approaches. We then present examples of studies where these methods are used in a hybrid manner. This discussion clarifies the current state of machine learning-based methods for fault detection in steel plates and highlights existing research gaps, thereby illustrating how the proposed ETSCCM approach makes a concrete contribution to the literature.
Data mining is the process of extracting valuable information from large databases using a variety of algorithms [7]. This process, which began in the 1960s, has evolved and gained dynamism with the development of databases and data warehouses [8]. The data mining process includes steps such as data preprocessing, mining, and evaluation and interpretation [9]. Preprocessing is a critical step in preparing the dataset for mining and encompasses tasks like data cleaning, reduction, and transformation. This step typically takes around 50–70% of the total process and is crucial for ensuring a reliable dataset [10].
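As a simple illustration of these preprocessing tasks, the hedged sketch below applies cleaning, reduction, and transformation to a toy table with pandas and scikit-learn; the column names and values are invented placeholders, not fields of the actual dataset.

```python
# Illustrative preprocessing: cleaning, reduction, and transformation.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data standing in for a raw fault table (columns are hypothetical).
df = pd.DataFrame({
    "thickness": [40, 40, None, 70, 300],
    "luminosity": [0.8, 0.8, 0.5, 0.9, 0.1],
    "constant_col": [1, 1, 1, 1, 1],
    "fault": ["bumps", "bumps", "stains", "scratch", "other"],
})

df = df.drop_duplicates()                     # cleaning: remove duplicate rows
df = df.fillna(df.median(numeric_only=True))  # cleaning: impute missing values
df = df.loc[:, df.nunique() > 1]              # reduction: drop constant columns
X = MinMaxScaler().fit_transform(df.drop(columns=["fault"]))  # transformation
```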
Classification algorithms assign data instances to predetermined categories [11]. Ensemble learning algorithms are often used in classification because of their strong performance and robust learning capabilities, which have made them classic and popular models in the literature. Random forest, a widely used model formed by combining decision trees, produces highly accurate results [12]. Other decision tree-based algorithms known for high performance include Bagging, AdaBoost, and the Classification and Regression Tree (CART). Bagging, developed by Breiman [13], handles situations where minor variations in the learning set can lead to major prediction changes by drawing repeated samples and building a combined estimator. AdaBoost [14] aims to turn weak classifiers into a strong classifier; it adjusts weight coefficients on the training data, increasing the weight of incorrectly classified examples while decreasing the weight of correctly classified ones [15]. CART presents the classification process through a simple and intuitive tree structure. It can be applied to both discrete and continuous datasets, selects the optimal splitting feature at each node, does not assume linearity between independent variables, and handles outliers effectively [16]. SVM is a classification algorithm that finds the optimal boundary, or hyperplane, separating two classes while maximizing the margin between them [17]. The KNN algorithm, introduced by Fix and Hodges [18], is a classic classification algorithm that is widely studied and used in many fields because of its simplicity and effectiveness [19]. These methods adapt to various data types and scenarios: random forests and Bagging excel with high-dimensional data; AdaBoost suits complex classification challenges; CART is preferred for its model clarity; SVM is chosen when high accuracy is needed; and KNN works well for straightforward classification tasks. This variety allows classification algorithms to meet diverse application needs in data analysis and machine learning. Classification algorithms, employed across diverse applications from simple tasks to complex structural analyses, also play a key role in modern image processing. Accordingly, a recent study [20] highlights advanced classification and segmentation methods for automated microstructure analysis in steels, stressing the importance of high-quality data and correlative microscopy.
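The sketch below, using scikit-learn as one common implementation, illustrates how the tree-based ensembles above build on CART-style base learners, alongside SVM and KNN. The Iris dataset is only a small stand-in, and all settings are illustrative defaults rather than tuned values.

```python
# Illustrative comparison of the classifiers discussed above.
from sklearn.datasets import load_iris
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # small stand-in dataset

models = {
    "CART": DecisionTreeClassifier(),                 # single tree
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50),
    "AdaBoost": AdaBoostClassifier(n_estimators=50),  # reweights hard examples
    "RF": RandomForestClassifier(),                   # bagging + random feature subsets
    "SVM": SVC(),                                     # maximum-margin hyperplane
    "KNN": KNeighborsClassifier(),                    # majority vote of neighbors
}
for name, model in models.items():
    print(f"{name}: {cross_val_score(model, X, y, cv=5).mean():.3f}")
```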
Feature selection is employed in high-dimensional data to enhance model performance. Identifying the ideal feature set can lead to higher accuracy and improved model interpretability [21,22]. Hyperparameter optimization helps determine the best parameter values before training the model, enhancing its performance. Grid and random search are popular methods used for this purpose [23].
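For illustration, the sketch below pairs a univariate feature selection step with grid and random search in scikit-learn; the Iris data and the parameter ranges are placeholders, not the configuration used in this study.

```python
# Illustrative feature selection plus hyperparameter search (grid and random).
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
X_sel = SelectKBest(f_classif, k=2).fit_transform(X, y)  # keep 2 strongest features

# exhaustive search over a small, hand-picked grid
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
                    cv=5).fit(X_sel, y)
print("grid search best:", grid.best_params_)

# random sampling from wider distributions, with a fixed trial budget
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                          {"n_estimators": randint(50, 500),
                           "max_depth": randint(2, 20)},
                          n_iter=10, cv=5, random_state=0).fit(X_sel, y)
print("random search best:", rand.best_params_)
```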
The implementation of data mining requires measuring the model’s accuracy and evaluating its performance. Metrics such as accuracy, error rate, and the F-measure are commonly used performance indicators to assess the efficacy of classifiers [24].
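As a brief example, the snippet below computes these indicators from a toy set of predictions with scikit-learn; the labels are invented for illustration.

```python
# Computing the metrics named above from toy predictions (values invented).
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2]

acc = accuracy_score(y_true, y_pred)
print("accuracy:", round(acc, 3))        # share of correct predictions
print("error rate:", round(1 - acc, 3))  # complement of accuracy
print("macro F-measure:", round(f1_score(y_true, y_pred, average="macro"), 3))
```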
Clustering is a frequently used unsupervised learning technique in data mining that groups similar items in unlabeled data [25]. It aims to increase intra-cluster similarity and decrease inter-cluster similarity, thereby ensuring high cluster consistency. K-means is among the most widely applied clustering algorithms; it partitions items into a predefined number of clusters (k) and iteratively refines the assignments by allocating each object to the closest cluster centroid [26]. Choosing the number of clusters is a critical step, and various methods for doing so have been proposed [27]. Hierarchical clustering organizes clusters into nested subdivisions and has two main approaches: agglomerative (merging) and divisive (splitting). Its advantage is that the hierarchy can be cut at any desired level of detail [28]. The Ward method forms clusters by minimizing the within-cluster variance based on the sum-of-squares criterion [29]. Hybrid clustering methods aim to achieve better results by combining the advantages of multiple clustering techniques; for example, the number and centers of clusters determined by hierarchical clustering can be used as starting points for K-means [30].
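As an illustration of this seeding strategy, the sketch below cuts a Ward hierarchy into clusters and uses the resulting cluster means to initialize K-means, using SciPy and scikit-learn; the blob data are synthetic, and the cut level k = 4 is an assumed placeholder that would normally be chosen from the dendrogram.

```python
# Ward hierarchy seeding K-means, as described above (synthetic data;
# k is an assumed placeholder normally read from the dendrogram).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

Z = linkage(X, method="ward")                    # agglomerative merge tree
k = 4
labels = fcluster(Z, t=k, criterion="maxclust")  # cut the tree into k clusters

# cluster means from the hierarchy become the K-means starting centroids
centers = np.vstack([X[labels == c].mean(axis=0) for c in range(1, k + 1)])
km = KMeans(n_clusters=k, init=centers, n_init=1).fit(X)
print("refined within-cluster sum of squares:", round(km.inertia_, 2))
```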
Studies on the combined use of clustering and classification techniques have shown that such hybrid approaches can improve accuracy rates. In a study by Wu & Lin [31], both classification and clustering algorithms are utilized for churn analysis to develop an effective marketing strategy. A real-time dataset containing 23 attributes and 5000 data points is used in the case study. Following churn prediction, the most likely churners are clustered into three groups: low, medium, and high risk. K-means is used for clustering, while decision trees, artificial neural networks, and support vector machines are employed for classification. Yang et al. [32] combined clustering with long short-term memory (LSTM)-based deep learning on a mobile-application dataset. New customers are clustered into six groups using K-means based on single and multiple features, and a predictive model is then applied to each cluster. Another study [33] presents a method for brain tumor segmentation and classification from MRI data using a deep learning approach. It involves preprocessing, segmentation with K-means clustering, and classification into benign or malignant tumors using a fine-tuned VGG19 model. Synthetic data augmentation is employed to enhance classification accuracy. Evaluated on the BraTS 2015 datasets, this approach demonstrated superior accuracy compared to previous methods.
In this paper, the hybrid approach is applied to the Steel Plates Faults dataset to achieve better classification performance. The dataset, notable for its multi-class structure, has frequently been used to compare algorithms proposed by researchers; some of these studies can be summarized as follows. Shu et al. [34] propose an information gain-based semi-supervised feature selection algorithm for partially labeled hybrid data. The effectiveness and efficiency of the proposed extended decision labeled annotation (ELA) algorithm are evaluated through experiments on ten datasets from the UCI machine learning repository, where it outperforms existing feature selection methods. Another study [35] introduces a principal component analysis-based decision tree forest (PDTF) method to increase diversity among the base classifiers in a forest of decision trees; the trees in the forest have minimal correlation, leading to improved classification accuracy. A further study proposes a feature selection algorithm that extracts important features from the original data and combines them with long short-term memory (LSTM) networks for enhanced accuracy in classifying heterogeneous data streams, with experiments on various datasets and Indian National Stock Exchange data feeds. Hengrong et al. [36] proposed PN-DSJG, a personalized-neighbor attribute reduction approach based on a Dempster-Shafer (D-S) evidence theory-based justifiable granule selection method. For the experimental evaluation, fifteen benchmark datasets, including six high-dimensional microarray datasets, are chosen to demonstrate the performance of the method and algorithm. Zhang et al. [37] developed a bi-selection approach for data reduction and conducted numerical experiments to assess its performance based on fuzzy rough sets.
The proposed ETSCCM method stands apart from existing approaches by unifying clustering and feature selection within a single framework, thereby boosting classification accuracy for fault detection. Unlike traditional methods, which typically emphasize either clustering or classification in isolation, ETSCCM orchestrates both processes comprehensively. This synergy addresses a notable gap in the literature, where advanced feature selection and refined clustering are rarely combined, and capitalizes on sophisticated selection techniques to optimize classification performance. By meeting this need, ETSCCM demonstrates clear potential for substantially improving both the accuracy and efficiency of classification tasks, offering a more holistic and effective solution than existing methods.