1. Introduction
Over the years, graph theory has advanced considerably and found applications in various fields, such as chemistry, biology, and computer science [1,2,3]. Likewise, in machine learning, many problems can be modeled as a graph, where nodes represent pixels or regions, and edges describe relationships between nodes. Graph-based methods can capture and exploit an image’s spatial values and relational structures, offering a rich and flexible framework for image analysis and classification tasks [4]. Graph theory allows us to represent any graph in matrix form. The Laplacian matrix is one of the standard matrix forms used in graph representation. It conveniently represents a graph’s local and global properties. The Laplacian matrix can be formed in several ways; the most conventional is to compute the adjacency matrix and its corresponding degree matrix and to subtract the former from the latter. Note that the Laplacian matrix grows with the size of the image, which can increase the computational time of postprocessing algorithms. Therefore, feature or dimensionality reduction is often a critical step when working with a large dataset. Additionally, it is vitally important that the feature extraction algorithm itself consumes little computational time.
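As a minimal sketch of the construction described above, the following Python snippet builds the adjacency matrix of a small 4-connected 2D grid graph (one node per pixel) and forms the Laplacian as the degree matrix minus the adjacency matrix. The function names `grid_adjacency` and `laplacian` and the unit edge weights are illustrative assumptions, not taken from this paper.

```python
import numpy as np

def grid_adjacency(rows, cols):
    """Adjacency matrix of a 4-connected 2D grid graph (one node per pixel)."""
    n = rows * cols
    A = np.zeros((n, n), dtype=float)
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:              # edge to the right neighbour
                A[i, i + 1] = A[i + 1, i] = 1.0
            if r + 1 < rows:              # edge to the neighbour below
                A[i, i + cols] = A[i + cols, i] = 1.0
    return A

def laplacian(A):
    """Laplacian L = D - A, where D is the diagonal degree matrix."""
    D = np.diag(A.sum(axis=1))
    return D - A

A = grid_adjacency(2, 2)   # a 2 x 2 "image" with 4 pixels
L = laplacian(A)
print(L)
```

For an image with n pixels the Laplacian has n × n entries, so a 28 × 28 image already yields a 784 × 784 matrix.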
In the past, the Laplacian Eigenmap (LE) was the most utilized nonlinear feature extraction method for the Laplacian matrix [5]. In LE, Belkin and Niyogi first compute the eigenvalues and eigenvectors of the Laplacian matrix of a graph and then select the eigenvectors corresponding to the smallest non-zero eigenvalues. In contrast to the LE method, He and Niyogi [6] proposed an algorithm called Locality Preserving Projections (LPP) that learns a linear mapping of the data rather than a nonlinear one. Note that LPP might not perform well on data with nonlinear structure.
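The eigenvector selection in LE can be sketched as follows. This is a simplified illustration that assumes a connected graph with a symmetric, non-negative weight matrix `W`; the helper name `laplacian_eigenmap` and the tolerance `tol` are hypothetical. The generalized problem L y = λ D y is solved through the symmetrically normalized Laplacian.

```python
import numpy as np

def laplacian_eigenmap(W, n_components=2, tol=1e-9):
    """Embed graph nodes with the eigenvectors of the smallest non-zero eigenvalues.

    Solves L y = lambda * D y via the symmetric form D^{-1/2} L D^{-1/2}.
    Assumes a connected graph, so every node has a positive degree.
    """
    d = W.sum(axis=1)
    L = np.diag(d) - W
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)          # eigenvalues in ascending order
    keep = np.where(eigvals > tol)[0][:n_components]  # skip the (near-)zero eigenvalue
    Y = d_inv_sqrt[:, None] * eigvecs[:, keep]        # map back to generalized eigenvectors
    return eigvals[keep], Y

# Toy example: a path graph on 5 nodes with unit weights.
W = np.zeros((5, 5))
for i in range(4):
    W[i, i + 1] = W[i + 1, i] = 1.0

vals, Y = laplacian_eigenmap(W, n_components=2)
print(Y.shape)
```

Each row of `Y` is the low-dimensional coordinate of one graph node.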
Besides calculating simple eigenvalues for feature extraction, Roweis and Saul [7] introduced Locally Linear Embedding (LLE), a manifold learning algorithm that projects high-dimensional data into a low-dimensional space. The fundamental principle of LLE involves selecting a predetermined number of nearest neighbors for each data point, typically referred to as the “k-number”. After identifying these neighbors, LLE captures the local geometric structure by determining the best linear combination of these k neighbors to reconstruct each data point. When transforming to a low-dimensional space, LLE ensures that these data points maintain their original proximities, staying as close together (or as far apart) as they were initially, preserving their relative distances and relationships. The drawback of LLE is that the user must specify the number of nearest neighbors k, which is not ideal for unsupervised operation. Moreover, LLE is sensitive to noisy data and outliers.
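The reconstruction step of LLE can be illustrated with a short sketch. It covers only the computation of the local weights, not the final embedding; the function name `lle_weights` and the regularization constant `reg` are assumptions made for this example.

```python
import numpy as np

def lle_weights(X, k, reg=1e-3):
    """Weights that reconstruct each row of X from its k nearest neighbours."""
    n = X.shape[0]
    W = np.zeros((n, n))
    # Pairwise squared Euclidean distances.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]       # index 0 is the point itself
        Z = X[nbrs] - X[i]                      # neighbours centred on x_i
        C = Z @ Z.T                             # local Gram matrix (k x k)
        C += reg * np.trace(C) * np.eye(k)      # regularise: C may be singular
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbrs] = w / w.sum()                # weights for each point sum to one
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
W = lle_weights(X, k=5)
print(W.sum(axis=1))
```

The choice of `k` is exactly the user-defined parameter mentioned above: different values give different neighbourhoods and therefore different weights.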
The Isometric Feature Mapping (Isomap) [
8] proposed by Tenenbaum et al. is another significant feature extraction method that finds the path with the shortest distance (also called geodesic distance) between all data point pairs in the local neighborhood. The geodesic distances help to capture the intrinsic manifold structure within the data. Similarly, He et al. [
9] presented the “Laplacian score”, where the initial nearest neighbor graph is constructed and converted into a weighted Laplacian matrix. After that, the Laplacian score is calculated by deducting one from the feature variance and dividing by its degree, i.e., the number of connected nodes. Both LPP and LLE require a particular nearest neighbor graph and the Laplacian matrix.
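The geodesic distances used by Isomap can be sketched as shortest paths on a nearest-neighbor graph. The snippet below is an illustration only: the function name `geodesic_distances` is hypothetical, and the Floyd-Warshall algorithm is one of several shortest-path methods that could be used.

```python
import numpy as np

def geodesic_distances(X, k):
    """Shortest-path (geodesic) distances on a symmetric k-nearest-neighbour graph."""
    n = X.shape[0]
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    G = np.full((n, n), np.inf)          # inf marks "no direct edge"
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nbrs = np.argsort(d[i])[1:k + 1]  # index 0 is the point itself
        G[i, nbrs] = d[i, nbrs]
        G[nbrs, i] = d[i, nbrs]           # keep the graph symmetric
    for m in range(n):                    # Floyd-Warshall relaxation through node m
        G = np.minimum(G, G[:, [m]] + G[[m], :])
    return G

# Points on a half circle: the geodesic between the two ends follows the arc.
theta = np.linspace(0.0, np.pi, 9)
X = np.column_stack([np.cos(theta), np.sin(theta)])
G = geodesic_distances(X, k=2)
print(G[0, -1], np.linalg.norm(X[0] - X[-1]))
```

For the two end points, the straight-line distance is 2, whereas the graph distance is longer because it follows the curve, which is how the geodesic distance reflects the manifold structure.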
Besides the feature extraction methods mentioned so far, several other methods have been proposed that can be applied directly to the Laplacian matrix. For instance, Principal Component Analysis (PCA) [10] is the most commonly used linear feature extraction method in machine learning. In PCA, the data points are transformed orthogonally, and a new set of coordinates, known as principal components, is generated. The user selects the number of principal components according to the share of the data’s total variance that they capture. However, increasing the number of data points increases the computation time for feature extraction. A variant of PCA, called “kernel PCA”, first maps the data points into a higher-dimensional space using a “kernel function” and then computes the principal components. As in standard PCA, the user selects the number of principal components according to the variance of the data. Different types of kernel functions can be used for kernel PCA, such as the Radial Basis Function (RBF) [11] or the polynomial kernel function [12]. Note that kernel PCA requires more computational time than traditional PCA. One way to reduce computational time is to take a small subset of the dataset and reduce it to lower dimensions; a “dot-product” with the rest of the dataset is then used to reduce its features. However, this might result in low classification performance. Another way to reduce computational time is to process the dataset in smaller batches, as in Incremental PCA (I-PCA) [13], and then apply feature extraction techniques. However, determining the batch size that best balances computational efficiency against classification performance remains a critical step.
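The subset-then-dot-product idea can be sketched with plain PCA. This is one possible reading of the procedure, not necessarily the exact setup used in this study: the components are fitted on a small subset and every sample is then reduced by a dot product with those components. The helper names `fit_pca` and `project`, the subset size, and the random data are assumptions.

```python
import numpy as np

def fit_pca(X, n_components):
    """Principal components of X from the SVD of the mean-centred data."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]        # rows are orthonormal directions

def project(X, mean, components):
    """Reduce X by a dot product with previously fitted components."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))           # stand-in for flattened features

mean, comps = fit_pca(X[:100], n_components=5)   # fit on a small subset only
Z = project(X, mean, comps)                      # cheap projection of all samples
print(Z.shape)
```

Only the 100-sample subset enters the decomposition, which is where the time saving comes from; the trade-off is that the components may describe the full dataset less well.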
Additionally, computational efficiency in processing high-dimensional matrices remains a considerable challenge in developing feature extraction algorithms. For the feature reduction methods reviewed above, computational demands grow with dataset size and dimensionality, for example, for a dataset comprising 100,000 images, each with a resolution of 150 × 150 pixels. Motivated by this issue, the current study introduces an approach that mitigates the ‘curse of dimensionality’ and lowers computational time without significantly compromising classification accuracy. This paper presents the development and application of a novel dimensionality reduction algorithm that surpasses various established feature extraction techniques in classification accuracy while also noticeably decreasing computational time. Furthermore, this research shows how the performance of these feature extraction algorithms is influenced by variations in image patch sizes.
The proposed algorithm utilizes the Gershgorin circle (GC) theorem for dimensionality reduction or feature extraction. The GC theorem was developed by the mathematician S. A. Gershgorin [14] in 1931. It gives an eigenvalue inclusion region for a given square matrix, i.e., a set of discs in the complex plane that together contain all of its eigenvalues. The GC theorem has been used in several diverse applications, such as stability analysis of nonlinear systems [15], graph sampling in graph theory [16], and evaluating the stability of power grids [17]. Over time, several extensions of the GC theorem have provided tighter estimates of the eigenvalue inclusion of matrices [18,19]. The GC theorem is more time-efficient in computation than other eigenvalue inclusion methods [19]. However, none of the inclusion methods have been used for feature extraction tasks.
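For reference, the GC theorem states that every eigenvalue of a square matrix A with entries a_ij lies in at least one disc centered at a diagonal entry a_ii with radius equal to the sum of the absolute values of the off-diagonal entries in row i. The short check below illustrates this on an arbitrary example matrix; the function name `gershgorin_discs` and the matrix values are chosen only for illustration.

```python
import numpy as np

def gershgorin_discs(A):
    """Return the centres (diagonal entries) and radii (off-diagonal row sums)."""
    A = np.asarray(A)
    centers = np.diag(A)
    radii = np.abs(A).sum(axis=1) - np.abs(centers)
    return centers, radii

A = np.array([[4.0,  1.0, 0.5],
              [0.2, -3.0, 0.3],
              [0.1,  0.4, 1.0]])

centers, radii = gershgorin_discs(A)
eigvals = np.linalg.eigvals(A)

# Each eigenvalue must fall inside at least one disc |z - center_i| <= radius_i.
inside = [bool(np.any(np.abs(lam - centers) <= radii + 1e-12)) for lam in eigvals]
print(centers, radii, inside)
```

Computing the discs needs only one pass over the matrix entries, with no eigendecomposition, which is the source of the low computational cost.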
Once features are extracted through any method, the subsequent pivotal step is to classify them by selecting an appropriate classification algorithm. The extracted features help not only to reduce computation time but also to reduce the number of training parameters required by the classification algorithm. In the fields of machine learning (ML) and deep learning (DL), many algorithms have been developed that provide state-of-the-art performance. In ML, algorithms like Support Vector Machines (SVM) [20] and Decision Trees [21] are most commonly used, while in DL, artificial neural networks (ANN) [22] and convolutional neural networks (CNN) [23] are among the most commonly used.
This paper introduces a novel feature extraction method for the graph-weighted Laplacian matrix by utilizing a mathematical theorem known as the Gershgorin circle theorem.
Figure 1 shows an overview of the proposed GCFE algorithm. The proposed algorithm modifies the weighted Laplacian matrix by converting it into a strictly diagonally dominant matrix termed a modified weighted Laplacian (MWL) matrix. Then, by applying the GC theorem, the P × N × N matrix features are reduced to P × N × 2 features, where P is the number of patches and N is the number of nodes, or total pixel size. Finally, the reduced features are fed into the classification algorithm. For performance comparison, two classification algorithms, 1D-CNN and 2D-CNN, were utilized in this study. Detailed explanations of the proposed method, along with descriptions of the datasets used, are provided in Section 2. Section 3 discusses the results of the proposed method, focusing on GCFE’s computational efficiency and accuracy compared to other feature extraction methods. This paper concludes with a summary of the findings and their implications in Section 4.
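The shape reduction from P × N × N to P × N × 2 can be sketched as follows. This is not the GCFE implementation: the snippet assumes that the two values kept per node are the Gershgorin disc center and radius of the corresponding row, and it uses randomly generated strictly diagonally dominant matrices as a stand-in for the MWL matrices. The function name `gc_reduce` is hypothetical.

```python
import numpy as np

def gc_reduce(mwl):
    """Map a stack of matrices (P, N, N) to per-row disc features (P, N, 2).

    Assumption: the two features per node are the Gershgorin centre and radius.
    """
    M = np.asarray(mwl, dtype=float)
    centers = np.diagonal(M, axis1=1, axis2=2)           # (P, N)
    radii = np.abs(M).sum(axis=2) - np.abs(centers)      # (P, N)
    return np.stack([centers, radii], axis=-1)           # (P, N, 2)

# Stand-in data: P symmetric matrices with non-positive off-diagonal weights.
rng = np.random.default_rng(0)
P, N = 4, 16
off = -rng.random((P, N, N))
off = (off + off.transpose(0, 2, 1)) / 2
idx = np.arange(N)
off[:, idx, idx] = 0.0

M = off.copy()
# Make every row strictly diagonally dominant: diagonal > sum of |off-diagonals|.
M[:, idx, idx] = np.abs(off).sum(axis=2) + 1.0

features = gc_reduce(M)
print(M.shape, "->", features.shape)
```

Under this assumption the number of values per patch drops from N × N to 2N, and no eigendecomposition is needed.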
3. Results and Discussion
This study compares the proposed method with seven feature reduction methods and one non-feature reduction algorithm, all using an identical CNN classification architecture. To evaluate the true performance of the proposed method, the same environment was kept throughout the experiments. All the experiments were executed on a university supercomputer server configured with 24 cores and 24 GB of memory per core. Cross-validation was used to validate the model’s performance. Table 2 displays a comparative analysis of the proposed method in three different ways with different datasets, graph types, classification architectures, and assessment metrics.
In the first approach, the GCFE performance was examined on different image patch sizes using a 2D-CNN, as shown in Figure 4. All the GCFE experiments in Figure 4 were based on a 2D grid graph. In this experiment, the datasets were split into training, validation, and testing sets, with ratios of 70%, 15%, and 15%, respectively. The CNN models were trained for 10 epochs. The EMNIST datasets were tested with patch sizes of 2, 4, 7, 14, and 28, while the MC and CVD datasets were tested with patch sizes of 2, 4, 5, 10, and 20. Figure 4 shows that the GCFE performance remains almost consistent across image patch sizes, with an average standard deviation of accuracy of ±0.4475. The average GCFE accuracy with standard deviation (SD) for each dataset is 84.53 ± 0.714, 85.01 ± 0.281, 88.30 ± 0.269, 98.86 ± 0.154, 91.18 ± 0.473, 98.12 ± 0.124, 94.54 ± 0.404, and 69.62 ± 0.157 for E_Balanced, E_ByClass, E_ByMerge, E_Digits, E_Letter, E_MNIST, MC, and CVD, respectively (from Figure 4). Besides each model’s accuracy, other evaluation metrics, such as the F1 score, Recall, and Precision, were also computed and are illustrated in Figure 4. Notably, the CVD dataset demonstrated lower performance, which can be attributed to its inherent characteristics, specifically the significant presence of extraneous objects in the background as compared to the target foreground objects (cats or dogs). Further analysis revealed that in approach 2, the CVD dataset consistently showed lower accuracy across all feature extraction methods compared to other datasets.
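The patch sizes used in this approach can be illustrated with a small sketch. It assumes that a patch size p denotes the side length of non-overlapping square patches and that the image side is divisible by p; the function name `split_into_patches` is hypothetical.

```python
import numpy as np

def split_into_patches(image, p):
    """Split an (H, W) image into non-overlapping p x p patches, row by row."""
    h, w = image.shape
    if h % p or w % p:
        raise ValueError("patch size must divide both image dimensions")
    # Index order after swapaxes: (patch row, patch column, row in patch, column in patch).
    patches = image.reshape(h // p, p, w // p, p).swapaxes(1, 2)
    return patches.reshape(-1, p, p)

image = np.arange(28 * 28).reshape(28, 28)   # stand-in for one 28 x 28 image
for p in (2, 4, 7, 14, 28):
    patches = split_into_patches(image, p)
    print(p, patches.shape)
```

Under this assumption, a patch size of 28 treats the whole 28 × 28 image as a single patch, while a patch size of 2 yields 196 patches of 4 pixels each.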
In the second approach, additional experiments were conducted to compare the feature extraction algorithms across different graph types; their accuracy and computational time are presented in Table 3. All the experiments in Table 3 were performed on datasets with balanced class distributions, namely the E_Balanced, MC, and CVD datasets. The number of epochs and the dataset split ratios were kept the same as in approach 1. However, due to memory resource limitations, only smaller image patch sizes were selected. For image-to-graph transformation, GCFE and the other experiments utilize two different graph types: 2D-grid and pairwise graphs. Each graph type had seven different experiments for comparison: GCFE (2D-CNN); Laplacian (2D-CNN); GCFE (1D-CNN); I-PCA (1D-CNN); kernel-PCA (1D-CNN); spectral embedding (1D-CNN); and Raw Image (2D-CNN). In Table 3, the letter “P” in the dataset name represents the patch size. For instance, “E_balanced_P2” means the E_Balanced dataset with patch size 2. The experiment titled “Raw Image” in Table 3 was conducted to provide an approximate performance reference for each dataset and patch size.
Furthermore, similar to the findings in Figure 4 regarding GCFE performance across different image patch sizes, Table 3 also indicates similar accuracy trends for the GCFE method using the 2D-grid graph and the 1D-CNN model. For the E_Balanced patch size experiments, the accuracy deviated by merely ±0.4497 SD. In contrast, the accuracy of other feature extraction methods like I-PCA, kernel-PCA, and spectral embedding increased with increasing patch size. For instance, the accuracy increased from 76.468% to 78.223% with SD ± 1.0044 for I-PCA, from 78.138% to 80.580% with SD ± 1.2467 for kernel-PCA, and from 75.787% to 77.755% with SD ± 1.0166 for spectral embedding as the patch size increased from 2 to 4 to 7. Similar trends can be observed for the pairwise graph type on the E_Balanced dataset.
In the “Laplacian” experiment, the standard weighted Laplacian matrix was constructed and either used as input data for the various feature extraction methods or fed directly into the classification model. The feature extraction algorithms, such as I-PCA, kernel-PCA, and spectral embedding, were configured to produce the same number of features as the GCFE output. This configuration ensures a fair performance comparison between the methods. For a detailed examination, configurations and associated code for all methods are available in the Supplementary Materials [28].
Figure 5 shows each feature extraction method’s mean accuracy (ACC) for both graph types with the 1D-CNN. It also displays the average Z-score between GCFE and each of the other feature extraction methods. Both the average ACC and the average Z-score were computed over the E_Balanced patches P2, P4, and P7. As can be seen in Figure 5a,b, GCFE outperforms the other methods across both graph types, as indicated by its predominantly positive Z-score values. The only exception is spectral embedding with the pairwise graph, which has a Z-score of −0.01 in Figure 5b. Additionally, the ACC of the GCFE method on CVD_P2 and MC_P2 is also much higher than that of the other feature extraction methods for both graph types. For instance, for the 2D-grid graph type, the percentage difference between GCFE (1D-CNN) and I-PCA for CVD_P2 and MC_P2 was 5.16% and 37.04%, respectively.
Besides accuracy, the computational time of the feature extraction algorithm is an important criterion. Table 3 also presents the computation time, in seconds, needed to perform feature extraction on all datasets. Note that Table 3 has two types of time notation: “t” and “t*”. Here, “t” represents the time to compute all dataset instances simultaneously, while “t*” indicates the time taken when a smaller batch of the dataset is processed to obtain its reduced features and the dot product is subsequently applied to the remaining instances. The GCFE computed 131,600 E_Balanced instances for a 2D-grid graph in approximately 6 s (actual 6.044 s), 6 s (actual 5.868 s), and 16 s with patch sizes 2, 4, and 7, respectively, and took 32 s and 36 s for 27,558 MC and 25,000 CVD instances with patch size 2. For the pairwise graph, it processed the same E_Balanced instances in approximately 5 s (actual 5.765 s), 5 s (actual 5.714 s), and 8 s and took 27 s and 33 s for the MC and CVD instances with patch size 2. Thus, the computational time of GCFE for both graph types is much lower than that of other methods, such as I-PCA, kernel-PCA, and spectral embedding. However, with the small-batch and “dot-product” approach to feature extraction, the computation time of spectral embedding for patch size 2 (both graph types) was lower than that of GCFE. Still, as the image patch size increased, the computational time of the small-batch and dot-product approach exceeded that of the GCFE method; see, for instance, the spectral embedding results with the 2D-grid and pairwise graphs for E_Balanced_P4, E_Balanced_P7, CVD_P2, and MC_P2 in Table 3.
In the third comparison approach, the GCFE was compared with additional feature reduction methods, which included Isomap, LLE, Modified LLE (MLLE) [29], and Hessian Eigenmap [30]. In this approach, the feature reduction methods were compared by their accuracy and total computational time (from graph generation through feature reduction), as shown in Figure 6. The K-NN graph type was utilized to convert images to graphs. During these experiments, the “K” value for the graph was selected to match the image’s patch size for LLE, MLLE, and Isomap, while for the Hessian Eigenmap, K was set to 300. A total of 300 components (number of reduced features) were chosen for the LLE, MLLE, and Isomap methods and 20 components for the Hessian Eigenmap method. As in approach 2, the LLE, MLLE, Isomap, and Hessian Eigenmap methods were applied to a small subset of the dataset comprising 1000 samples (100 samples for each E_MNIST class). The rest of the dataset was then transformed into reduced features by the dot product between the reduced features of the small subset and the entire dataset. Figure 6 shows trends similar to those in approach 2: the accuracy of LLE, MLLE, and Isomap decreased as the image patch size decreased, while the GCFE and Hessian Eigenmap did not show a major variation in accuracy. Moreover, the GCFE outperformed the LLE, MLLE, and Isomap in classification accuracy. The GCFE and Hessian Eigenmap methods showed only minor differences in accuracy. However, Figure 6 indicates that the Hessian Eigenmap had a higher computational time than GCFE. Additionally, LLE and MLLE had lower computational times than GCFE due to the smaller dataset subset selected for feature reduction.
Figure 7 illustrates the number of training parameters of the 2D-CNN model for different patch sizes of the E_Balanced dataset—standard Laplacian (2D-CNN) features and GCFE (2D-CNN) in approach 2. In
Figure 7, each circle represents the number of training parameters, which are scaled down to
. The number of required training parameters for all standard Laplacian features is 3.51 for patch 2, 3.54 for patch 4, 9.93 for patch 7, 38.84 for patch 14, and 157.65 for patch 28. Comparatively, the GCFE has only 3.5 training parameters with an average percentage difference of only 0.684% (for 2D-grid type) and 0.952% (for pairwise type) compared to the standard weighted Laplacian method. Note that the number of training parameters for GCFE will remain the same for different patch sizes.
Additionally, the results demonstrate that the GCFE method offers robust and reliable feature extraction, with minimal variability in performance as indicated by its low SD and higher ACC across different datasets and graph types. This consistency is crucial for applications in computer vision, where the precision of feature extraction can significantly impact the accuracy of subsequent tasks such as image classification. In this study, the GCFE method exhibited an average SD of 0.3202 using the 2D-grid graph type across all datasets and an SD of 0.305 using the K-NN graph type on the E_MNIST dataset. These SD results demonstrate the method’s consistent ACC performance across different image patch sizes, reducing uncertainty in the GCFE method’s performance.