1. Introduction
With rapid technological advancement, producing, storing, and accessing data have become increasingly easy. However, this has led to the accumulation of massive amounts of data, making it necessary to extract meaningful information from them. Data mining, the process of making sense of data, involves a variety of operations that transform raw data into information [1]. Data mining is a comprehensive, interdisciplinary field that integrates methods from multiple scientific domains, including statistics, machine learning, artificial intelligence, algorithm design, optimization, database management, and visualization [2]. It has numerous applications, such as quality control, bioinformatics, machine vision, and customer analytics. Machine learning algorithms, including neural networks, regression, instance-based methods, Bayesian methods, and decision tree-based approaches, are commonly used in the literature for classification or clustering. Classification aims to predict the labels of new instances using supervised learning methods. Clustering, in contrast, groups similar instances within a dataset into subgroups, which can make large-scale problems more tractable.
Fault diagnosis is the process of detecting faults in systems and is used in many different industries. For instance, in the automotive industry [3], it is used to identify malfunctions in vehicles. In computer science, it is utilized to find and correct software bugs [4,5]. In the aerospace sector [6], it serves to detect potential faults in aircraft and spacecraft. Overall, fault diagnosis is employed to enhance system efficiency, minimize disruptions, and prevent potential failures.
In this study, we focused on the fault detection of steel plates. The Steel Plates Faults dataset, provided by the Semeion Research Center in Rome, categorizes steel plate faults into seven types and is therefore extensively used in machine learning research on automatic pattern recognition. The dataset supports the identification of inferior-quality products in manufacturing industries where steel plates are a crucial raw material. The proposed algorithm is applied to this dataset, and the results are compared with those reported in the literature.
Accordingly, our work aims to (1) refine clustering methods for identifying the most effective dataset segmentation, (2) investigate which feature subsets best enhance classification performance, and (3) compare the proposed ETSCCM approach with established algorithms in terms of accuracy and efficiency. To this end, we employed the Steel Plates Faults dataset—previously utilized in various benchmarking studies—to maximize classification performance. After standard preprocessing and normalization, we pursued two key objectives simultaneously: (i) determining the optimal number of clusters and their centers via a hybrid approach that combines K-means and hierarchical clustering, and (ii) identifying the best classifier among the random forest (RF), K-nearest neighbors (KNN), and support vector machine (SVM) models. RF achieved the highest accuracy and was thus chosen for subsequent feature selection experiments, which generated multiple feature subsets. We then refined RF’s performance across these subsets and cluster configurations through parameter optimization and compared the results with findings from the literature. Overall, this process yielded a 6–18% improvement in classification accuracy, underscoring the core contribution of this study.
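To make this workflow concrete, the sketch below illustrates the two objectives in scikit-learn-style Python. It is a minimal illustration rather than the implementation used in this study: synthetic data produced by make_classification stands in for the Steel Plates Faults dataset, and the cluster count k = 3 is an assumed placeholder rather than the value selected by the method.

```python
# Minimal sketch of the workflow outlined above (not the authors' code);
# synthetic data stands in for the Steel Plates Faults dataset.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1941, n_features=27, n_informative=10,
                           n_classes=7, random_state=0)  # stand-in data
X = StandardScaler().fit_transform(X)  # preprocessing/normalization step

# (i) a Ward hierarchy proposes a cluster count and centers; K-means refines them
k = 3  # assumed placeholder; selected from the hierarchy in the actual method
ward_labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
centers = np.vstack([X[ward_labels == c].mean(axis=0) for c in range(k)])
clusters = KMeans(n_clusters=k, init=centers, n_init=1).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))

# (ii) pick the strongest base classifier by cross-validated accuracy
for name, clf in [("RF", RandomForestClassifier(random_state=0)),
                  ("KNN", KNeighborsClassifier()),
                  ("SVM", SVC())]:
    print(name, cross_val_score(clf, X, y, cv=5).mean().round(3))
```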
In the following sections, we briefly review key research on data mining, classification, feature selection, and clustering approaches. We then present examples of studies where these methods are used in a hybrid manner. This discussion clarifies the current state of machine learning-based methods for fault detection in steel plates and highlights existing research gaps, thereby illustrating how the proposed ETSCCM approach makes a concrete contribution to the literature.
Data mining is the process of extracting valuable information from large databases using a variety of algorithms [7]. This process, which began in the 1960s, has evolved and gained dynamism with the development of databases and data warehouses [8]. The data mining process includes steps such as data preprocessing, mining, and evaluation and interpretation [9]. Preprocessing is a critical step in preparing the dataset for mining and encompasses tasks like data cleaning, reduction, and transformation. This step typically takes around 50–70% of the total process and is crucial for ensuring a reliable dataset [10].
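As a simple illustration of these preprocessing tasks, the hedged sketch below applies cleaning, reduction, and transformation to a toy table with pandas and scikit-learn; the column names and values are invented placeholders, not fields of the actual dataset.

```python
# Illustrative preprocessing: cleaning, reduction, and transformation.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data standing in for a raw fault table (columns are hypothetical).
df = pd.DataFrame({
    "thickness": [40, 40, None, 70, 300],
    "luminosity": [0.8, 0.8, 0.5, 0.9, 0.1],
    "constant_col": [1, 1, 1, 1, 1],
    "fault": ["bumps", "bumps", "stains", "scratch", "other"],
})

df = df.drop_duplicates()                     # cleaning: remove duplicate rows
df = df.fillna(df.median(numeric_only=True))  # cleaning: impute missing values
df = df.loc[:, df.nunique() > 1]              # reduction: drop constant columns
X = MinMaxScaler().fit_transform(df.drop(columns=["fault"]))  # transformation
```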
Classification algorithms assign data instances to predetermined categories [11]. Ensemble learning algorithms are often used in classification because of their strong performance and robust learning capabilities, which have made them classic and popular models in the literature. Random forest, a widely used model formed by combining decision trees, produces highly accurate results [12]. Other decision tree-based algorithms known for high performance include Bagging, AdaBoost, and the Classification and Regression Tree (CART). Bagging, developed by Breiman [13], handles situations where minor variations in the learning set can lead to major prediction changes by drawing repeated samples and building a combined estimator. AdaBoost [14] aims to turn weak classifiers into a strong classifier; it adjusts weight coefficients on the training data, increasing the weight of incorrectly classified examples while decreasing the weight of correctly classified ones [15]. CART presents the classification process through a simple and intuitive tree structure. It can be applied to both discrete and continuous datasets, selects the optimal splitting feature at each node, does not assume linearity between independent variables, and handles outliers effectively [16]. SVM is a classification algorithm that finds the optimal boundary, or hyperplane, separating two classes while maximizing the margin between them [17]. The KNN algorithm, introduced by Fix and Hodges [18], is a classic classification algorithm that is widely studied and used in many fields because of its simplicity and effectiveness [19]. These methods adapt to various data types and scenarios: random forests and Bagging excel with high-dimensional data; AdaBoost suits complex classification challenges; CART is preferred for its model clarity; SVM is chosen when high accuracy is needed; and KNN works well for straightforward classification tasks. This variety allows classification algorithms to meet diverse application needs in data analysis and machine learning. Classification algorithms, employed across diverse applications from simple tasks to complex structural analyses, also play a key role in modern image processing. Accordingly, a recent study [20] highlights advanced classification and segmentation methods for automated microstructure analysis in steels, stressing the importance of high-quality data and correlative microscopy.
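The sketch below, using scikit-learn as one common implementation, illustrates how the tree-based ensembles above build on CART-style base learners, alongside SVM and KNN. The Iris dataset is only a small stand-in, and all settings are illustrative defaults rather than tuned values.

```python
# Illustrative comparison of the classifiers discussed above.
from sklearn.datasets import load_iris
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # small stand-in dataset

models = {
    "CART": DecisionTreeClassifier(),                 # single tree
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50),
    "AdaBoost": AdaBoostClassifier(n_estimators=50),  # reweights hard examples
    "RF": RandomForestClassifier(),                   # bagging + random feature subsets
    "SVM": SVC(),                                     # maximum-margin hyperplane
    "KNN": KNeighborsClassifier(),                    # majority vote of neighbors
}
for name, model in models.items():
    print(f"{name}: {cross_val_score(model, X, y, cv=5).mean():.3f}")
```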
Feature selection is employed in high-dimensional data to enhance model performance. Identifying the ideal feature set can lead to higher accuracy and improved model interpretability [21,22]. Hyperparameter optimization helps determine the best parameter values before training the model, enhancing its performance. Grid and random search are popular methods used for this purpose [23].
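For illustration, the sketch below pairs a univariate feature selection step with grid and random search in scikit-learn; the Iris data and the parameter ranges are placeholders, not the configuration used in this study.

```python
# Illustrative feature selection plus hyperparameter search (grid and random).
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
X_sel = SelectKBest(f_classif, k=2).fit_transform(X, y)  # keep 2 strongest features

# exhaustive search over a small, hand-picked grid
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
                    cv=5).fit(X_sel, y)
print("grid search best:", grid.best_params_)

# random sampling from wider distributions, with a fixed trial budget
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                          {"n_estimators": randint(50, 500),
                           "max_depth": randint(2, 20)},
                          n_iter=10, cv=5, random_state=0).fit(X_sel, y)
print("random search best:", rand.best_params_)
```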
The implementation of data mining requires measuring the model’s accuracy and evaluating its performance. Metrics such as accuracy, error rate, and the F-measure are commonly used performance indicators to assess the efficacy of classifiers [24].
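As a brief example, the snippet below computes these indicators from a toy set of predictions with scikit-learn; the labels are invented for illustration.

```python
# Computing the metrics named above from toy predictions (values invented).
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2]

acc = accuracy_score(y_true, y_pred)
print("accuracy:", round(acc, 3))        # share of correct predictions
print("error rate:", round(1 - acc, 3))  # complement of accuracy
print("macro F-measure:", round(f1_score(y_true, y_pred, average="macro"), 3))
```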
Clustering is a frequently used unsupervised learning technique in data mining that groups similar items in unlabeled data [25]. It aims to increase intra-cluster similarity and decrease inter-cluster similarity, thereby ensuring high cluster consistency. K-means is among the most widely applied clustering algorithms; it partitions items into a predefined number of clusters (k) and iteratively refines the assignments by allocating each object to the closest cluster centroid [26]. Choosing the number of clusters is a critical step, and various methods for doing so have been proposed [27]. Hierarchical clustering organizes clusters into nested subdivisions and has two main approaches: agglomerative (merging) and divisive (splitting). Its advantage is that the hierarchy can be cut at any desired level of detail [28]. The Ward method forms clusters by minimizing the within-cluster variance based on the sum-of-squares criterion [29]. Hybrid clustering methods aim to achieve better results by combining the advantages of multiple clustering techniques; for example, the number and centers of clusters determined by hierarchical clustering can be used as starting points for K-means [30].
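As an illustration of this seeding strategy, the sketch below cuts a Ward hierarchy into clusters and uses the resulting cluster means to initialize K-means, using SciPy and scikit-learn; the blob data are synthetic, and the cut level k = 4 is an assumed placeholder that would normally be chosen from the dendrogram.

```python
# Ward hierarchy seeding K-means, as described above (synthetic data;
# k is an assumed placeholder normally read from the dendrogram).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

Z = linkage(X, method="ward")                    # agglomerative merge tree
k = 4
labels = fcluster(Z, t=k, criterion="maxclust")  # cut the tree into k clusters

# cluster means from the hierarchy become the K-means starting centroids
centers = np.vstack([X[labels == c].mean(axis=0) for c in range(1, k + 1)])
km = KMeans(n_clusters=k, init=centers, n_init=1).fit(X)
print("refined within-cluster sum of squares:", round(km.inertia_, 2))
```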
Studies on the combined use of clustering and classification techniques have shown that such hybrid approaches can improve accuracy rates. In a study by Wu & Lin [31], both classification and clustering algorithms are utilized for churn analysis to develop an effective marketing strategy. A real-time dataset containing 23 attributes and 5000 data points is used in the case study. Following churn prediction, the most likely churners are clustered into three groups: low, medium, and high risk. K-means is used for clustering, while decision trees, artificial neural networks, and support vector machines are employed for classification. Yang et al. [32] combined clustering with long short-term memory (LSTM)-based deep learning on a mobile-application dataset. New customers are clustered into six groups using K-means based on single and multiple features, and a predictive model is then applied to each cluster. Another study [33] presents a method for brain tumor segmentation and classification from MRI data using a deep learning approach. It involves preprocessing, segmentation with K-means clustering, and classification into benign or malignant tumors using a fine-tuned VGG19 model. Synthetic data augmentation is employed to enhance classification accuracy. Evaluated on the BraTS 2015 datasets, this approach demonstrated superior accuracy compared to previous methods.
In this paper, the hybrid approach is applied to the Steel Plates Faults dataset to achieve better classification performance. The dataset, notable for its multi-class structure, has frequently been used to compare algorithms proposed by researchers; some of these studies can be summarized as follows. Shu et al. [34] propose an information gain-based semi-supervised feature selection algorithm for partially labeled hybrid data. The effectiveness and efficiency of the proposed extended decision labeled annotation (ELA) algorithm are evaluated through experiments on ten datasets from the UCI machine learning repository, where it outperforms existing feature selection methods. Another study [35] introduces a principal component analysis-based decision tree forest (PDTF) method to increase diversity among the base classifiers in a forest of decision trees; the trees in the forest have minimal correlation, leading to improved classification accuracy. A further study proposes a feature selection algorithm that extracts important features from the original data and combines them with long short-term memory (LSTM) networks for enhanced accuracy in classifying heterogeneous data streams, with experiments on various datasets and Indian National Stock Exchange data feeds. Hengrong et al. [36] proposed PN-DSJG, a personalized-neighbor attribute reduction approach based on a Dempster-Shafer (D-S) evidence theory-based justifiable granule selection method. For the experimental evaluation, fifteen benchmark datasets, including six high-dimensional microarray datasets, are chosen to demonstrate the performance of the method and algorithm. Zhang et al. [37] developed a bi-selection approach for data reduction and conducted numerical experiments to assess its performance based on fuzzy rough sets.
The proposed ETSCCM method stands apart from existing approaches by unifying clustering and feature selection within a single framework, thereby boosting classification accuracy for fault detection. Unlike traditional methods, which typically emphasize either clustering or classification in isolation, ETSCCM orchestrates both processes comprehensively. This synergy addresses a notable gap in the literature, where advanced feature selection and refined clustering are rarely combined, and capitalizes on sophisticated selection techniques to optimize classification performance. By meeting this need, ETSCCM demonstrates clear potential for substantially improving both the accuracy and efficiency of classification tasks, offering a more holistic and effective solution than existing methods.