1. Introduction
Image classification is a pivotal component of computer vision [1,2,3], a field dedicated to extracting meaningful information from images, transforming this information into features, and encoding it for computer processing. This process allows computers to train, learn, and categorise images into various groups. In the era of the internet, characterised by widespread image accessibility, the diversity and volume of image categories have grown substantially. Consequently, the efficient organisation, analysis, and accurate classification and prediction of large volumes of image data have become essential research pursuits in computer vision. Image classification is an interdisciplinary technology with diverse applications in sectors such as medical engineering [4,5,6], environmental monitoring [7,8], industrial manufacturing [9,10], and autonomous driving [11,12]. For instance, in medical engineering, it can be employed to identify and categorise cells and tissue structures in biomedical images, thereby contributing to medical research and treatment. In environmental monitoring, image classification aids in analysing image data related to the atmosphere, water bodies, land, and other environmental elements. In industrial manufacturing, it is used for product quality control and defect detection on production lines, enhancing the level of automation in manufacturing processes. In autonomous driving, image classification is applied to recognise and understand elements such as traffic signs, pedestrians, and vehicles on roads, facilitating vehicle decision-making and operations. Furthermore, within maritime traffic [13,14,15], image classification can be used for vessel identification, buoy recognition, maritime event monitoring, maritime boundary surveillance, and meteorological monitoring.
In recent years, deep learning technologies [16,17,18], particularly Convolutional Neural Networks (CNNs) [19], have achieved remarkable advancements in image classification tasks. Deep learning models, by learning to extract high-level features from raw pixel data, have led to groundbreaking progress in the accuracy and performance of image classification. Yu et al. took a comprehensive approach by integrating spectral–spatial features and extracting valuable information independently through two separate dense CNNs [20]. They introduced a spatial–spectral dense CNN framework with a feedback attention mechanism, specially tailored for hyperspectral image classification. Ozkaraca et al. developed a new modular deep learning model to preserve the existing advantages of established transfer learning methods, including DenseNet, VGG16, and basic CNN architectures, while eliminating their limitations in the classification of Magnetic Resonance (MR) images [21]. Shamshad et al. investigated the applications of transformers in various medical image tasks such as segmentation, detection, classification, restoration, synthesis, registration, and clinical report generation [22]. They developed taxonomies for each application, identified challenges specific to each, provided insights into solutions, and highlighted emerging trends. Building upon the attention mechanism of the transformer, Roy et al. introduced a new morphological transformer (morphFormer) [23]. This approach integrates learnable spectral and spatial morphological networks, enhancing the interaction between structural and shape information in the hyperspectral image token and the CLS token. Zhou et al. proposed a novel Feature Learning network based on Transformer (FL-Tran), aiming to learn salient features and excavate potentially useful features [24]. Overall, these advancements in deep learning and attention mechanisms have significantly improved image classification methods and their applications across various domains.
Deep learning has thus achieved significant success in image classification tasks, but it also has some drawbacks. Firstly, deep learning models often require a large amount of annotated data for training, as well as substantial computational resources during the training process. This limitation hinders their application in certain industries [25]. Secondly, the performance of deep learning models is often highly sensitive to the choice of hyperparameters and model tuning. Adjusting these parameters requires some level of expertise and computational resources, sometimes involving extensive experimentation [26]. Finally, deep learning models are often considered black-box models, making it challenging to interpret their internal decision-making processes. This may pose a problem in applications where interpretability and explainability are crucial [27].
Given these drawbacks, traditional sparse coding models [28,29,30] have some advantages in image classification. Firstly, sparse coding represents images sparsely, emphasising important local features in the images. This helps extract crucial information from the images and reduces redundancy. Secondly, sparse coding performs relatively well on small-sample data because it learns features by encoding training samples, demonstrating robustness with relatively few samples. Additionally, sparse coding can be used to reduce the dimensionality of images, extracting essential information and thereby reducing the complexity of the feature space. Finally, the sparse representations generated by sparse coding are relatively easy to interpret. The sparse coding for each image can be viewed as the weight allocation to a set of bases, aiding in understanding how the model learns discriminative features for images. Therefore, many scholars have conducted various studies on sparse coding models. Yang et al. proposed a sparse coding (SC) algorithm based on Spatial Pyramid Matching (SPM), which effectively reduces quantization error [31]. Nevertheless, traditional SC exhibits instability in the encoding process, where similar features might be mapped to different codewords. Thus, Gao et al. introduced the Laplacian matrix to preserve the consistency of encoding similar local features, proposing the Laplacian Sparse Coding (LSC) algorithm to extract spatial geometric information from images, making the encoding process no longer independent [32]. Considering the locality between features, Wang et al. proposed the Locality-constrained Linear Coding (LLC) image classification algorithm, ensuring that similar features receive similar encodings [33]. Min et al. introduced the Laplacian matrix into LLC to maintain the consistency of encoding similar features [34]. While LLC utilises K-nearest neighbour encoding, the absolute difference between certain positive and negative elements in the encoding increases as K increases. To address this issue, Liu et al. introduced non-negativity constraints and proposed the non-negative LLC image classification algorithm [35]. In addition, since combinatorial optimisation problems involve a mixture of addition and subtraction operations, subtraction might cause features to cancel each other out. To solve this problem, Lee et al. introduced non-negativity and employed Non-negative Matrix Factorization (NMF) to learn partial representations of objects, proposing a corresponding model [36]. To improve the robustness of NMF, a novel algorithm named Robust NMF (RNMF) was proposed in [37]. Hoyer combined NMF with SC to propose non-negative sparse coding [38]. Cai et al. proposed a graph-regularised NMF based on data representation [39]. Furthermore, Han et al. presented the SC method based on non-negativity and dependency constraints (Lap-NMF-SPM), which utilises NMF and Laplacian operators to preserve the relationships between local features [40].
Among the mentioned encoding methods, Euclidean distance is commonly used to measure the similarity between features and the dictionary. However, the local features of images are based on histograms of statistical variables. Therefore, Euclidean distance may not effectively measure the relationship between them. Wu et al. proposed a method that goes beyond Euclidean distance, called the Histogram Intersection Kernel (HIK), which more effectively measures the similarity between features and codebooks [41]. Chen et al. introduced a histogram intersection-based LLC algorithm for scene image classification [42]. Wan et al. incorporated histogram intersection and the Elastic Net model into the optimisation problem, resulting in an Elastic Net and Histogram Intersection-based Non-negative Local Sparse Coding (EH-NLSC) method [43].
In addition, the image feature representation and classification in these methods are two relatively independent processes. The feature quantization methods involved in encoding ignore potential semantic information, which can affect the effectiveness of image classification. To overcome these issues, the concept of semantic representation based on image representation [44] has been introduced. Based on generative models, Rasiwasia et al. utilised a low-dimensional semantic space generated by Gaussian mixture models for scene classification and image retrieval [45,46]. On the other hand, based on discriminative models, Zhang et al. constructed the semantic space of images using a discriminative model to retain more semantic information [47]. They combined the SC model to propose a joint image representation and classification algorithm in Random Semantic Space (RSS). Shen et al. used global image features and considered the semantic information of labels to propose a method that combines image segmentation and classification in a joint framework [48].
Although previous studies have demonstrated the effectiveness of sparse coding in image classification under standard conditions, these methods often face three challenges, as follows:
- (1) Sparse coding is highly sensitive to feature variations, leading to coding instability where similar features are encoded into different codewords. Previous studies take into account only two of the three main properties in the optimisation problem: non-negativity, locality, and Laplacian regularisation.
- (2) Euclidean distance cannot effectively measure the relationship between feature vectors and codebooks.
- (3) The processes of image representation and classification are relatively independent. The feature quantization methods involved in coding neglect the potential contextual information in local regions, resulting in the loss of visual and semantic information in images, thus impeding the effectiveness of image classification.
To enhance the extraction of comprehensive and effective information from images, and subsequently improve image classification accuracy, this paper integrates histogram intersection and semantic information. The specific research contributions are outlined as follows:
- (1) Incorporate non-negativity and locality into the LSC model, constructing the NLLSC method. This method preserves local information among features and spatial geometric information, significantly reducing the instability of encoding.
- (2) Introduce histogram intersection to redefine the distance between feature vectors and the dictionary in the locality constraint of the sparse coding model. This redefinition provides a more accurate measurement of their similarity, ensuring that similar features can share their local bases.
- (3) After fusing locality and non-negativity into Laplacian Sparse Coding, integrate image representation and classification by incorporating semantic information, which preserves the contextual relationship between image features. This approach more comprehensively and effectively captures the essence of the image.
- (4) Conduct comparative experiments against six other state-of-the-art methods on four standard image datasets and three maritime datasets to validate the performance of the proposed methodology.
The remainder of this paper is structured as follows:
Section 2 provides a review of related work on SC, LSC, and LLC;
Section 3 introduces the proposed coding method, Histogram intersection and Semantic information-based Non-negativity Local Laplacian Sparse Coding (HS-NLLSC);
Section 4 presents experimental results on several datasets; and
Section 5 offers the conclusions.
2. Preliminary Methods
The proper encoding of local features is crucial for image classification, as it not only faithfully represents images but also improves the accuracy of image classification. Recently, numerous scholars have proposed various encoding methods, and these methods have demonstrated promising classification results. This section primarily introduces three typical encoding models: sparse coding, Laplacian Sparse Coding, and Locality-constrained Linear Coding.
Let the feature matrix of an image be denoted as X, the dictionary be denoted as U, and the corresponding sparse coding be represented as V.
2.1. Sparse Coding
In light of the quantization errors arising from vector quantization methods and the potential lack of semantic information in the C-means method, the SC method has been introduced. The central challenge it addresses is the learning of an over-complete dictionary U in an M-dimensional space (namely, the number of base vectors significantly exceeds its dimension). The objective is to choose as few base vectors as possible to represent the feature vector. The particular optimisation problem is articulated below:
Here, λ represents the regularisation parameter, which balances the trade-off between reconstruction error and the sparsity of the coding, and u_j represents the j-th column vector of dictionary U.
The general solution for Equation (1) is to alternately fix U (or V) and optimise V (or U) until the value of the objective function achieves the specified extreme value.
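As a rough illustration of the coding half of this alternation, the sketch below holds the dictionary U fixed and updates a single code vector by iterative soft-thresholding (ISTA), a standard solver for L1-regularised least squares. It is not taken from the paper: the function names, the step size `step`, and the objective written in the docstring (the usual form ||x − Uv||² + λ||v||₁) are assumptions made for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major: U[d][m] is entry (d, m)


def soft_threshold(value: float, threshold: float) -> float:
    """Proximal operator of threshold * |v|: shrink value towards zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sparse_code_ista(x: Vector, U: Matrix, lam: float,
                     step: float, n_iter: int = 100) -> Vector:
    """Minimise ||x - U v||^2 + lam * ||v||_1 over v with U held fixed."""
    n_rows, n_atoms = len(U), len(U[0])
    v = [0.0] * n_atoms
    for _ in range(n_iter):
        # Residual r = U v - x.
        r = [sum(U[d][m] * v[m] for m in range(n_atoms)) - x[d]
             for d in range(n_rows)]
        # Gradient of the quadratic term: 2 * U^T r.
        grad = [2.0 * sum(U[d][m] * r[d] for d in range(n_rows))
                for m in range(n_atoms)]
        # Gradient step followed by shrinkage (the L1 proximal step).
        v = [soft_threshold(v[m] - step * grad[m], step * lam)
             for m in range(n_atoms)]
    return v


# Toy usage with an identity dictionary (hypothetical data).
code = sparse_code_ista([3.0, -0.2], [[1.0, 0.0], [0.0, 1.0]], lam=1.0, step=0.5)
```

With U fixed the problem is convex in the codes, which is what makes the alternation practical. For this solver the step size should not exceed 1/(2σ²), where σ is the largest singular value of U.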
2.2. Laplacian Sparse Coding
To address the encoding instability in traditional SC, where similar features might be encoded into different codewords, LSC was introduced in [32]. LSC incorporates the Laplacian matrix to maintain the stability of encoding similar local features, thus eliminating the independence of the encoding process. The specific optimisation problem is presented as follows:
where the Laplacian regularisation term is used to extract the spatial geometric information of the image and reduce quantization errors, and L represents the Laplacian matrix.
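To make the role of the Laplacian matrix concrete, the following sketch builds L = D − W from a symmetric similarity matrix W and evaluates the smoothness penalty that a Laplacian regulariser of the form tr(V L Vᵀ) imposes on the codes. The function names and the toy data are hypothetical, and the trace form is the standard graph-regularisation term rather than a formula quoted from this paper.

```python
from typing import List

Matrix = List[List[float]]


def graph_laplacian(W: Matrix) -> Matrix:
    """Return L = D - W, where D is the diagonal degree matrix of W."""
    n = len(W)
    degrees = [sum(row) for row in W]
    return [[(degrees[i] if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


def laplacian_penalty(codes: List[List[float]], W: Matrix) -> float:
    """Return 0.5 * sum_ij W[i][j] * ||v_i - v_j||^2 for codes v_i.

    For symmetric W this equals tr(V L V^T) with the codes as columns of V.
    """
    total = 0.0
    for i, v_i in enumerate(codes):
        for j, v_j in enumerate(codes):
            sq_dist = sum((a - b) ** 2 for a, b in zip(v_i, v_j))
            total += W[i][j] * sq_dist
    return 0.5 * total


# Toy chain graph: feature 1 is a neighbour of features 0 and 2.
W = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
L = graph_laplacian(W)
penalty = laplacian_penalty([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], W)
```

The penalty is zero only when neighbouring features receive identical codes, which is how the regulariser couples the encodings of similar features.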
2.3. Locality-Constrained Linear Coding
The LLC method was introduced in [33], highlighting that local non-zero coefficients are frequently assigned to bases in proximity to the coding feature data. This approach utilises multiple codewords from the codebook to enhance the accuracy of representing a feature descriptor. Additionally, similar features utilise similar coding patterns by sharing their local codewords, effectively addressing the instability issue present in SC. The specific optimisation problem is presented as follows:
where ⊙ represents element-wise multiplication (for column vectors), the coefficient of the locality term is a regularisation parameter, and the local adaptor is defined as follows:
where the distance is the Euclidean distance between a feature and the dictionary, and the scale parameter is used to adjust weight decay. The constraint condition of the problem indicates the translation invariance of the LLC method.
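The local adaptor can be sketched as follows, assuming the form commonly used for LLC: an exponential of the Euclidean distance between the feature and each codeword, divided by a decay parameter, with the largest distance subtracted so that every weight lies in (0, 1]. The function name, the argument names, and the toy codebook are hypothetical.

```python
import math
from typing import List


def locality_adaptor(x: List[float], codebook: List[List[float]],
                     sigma: float) -> List[float]:
    """Return one weight per codeword; distant codewords get larger weights."""
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(x, atom)))
             for atom in codebook]
    # Subtracting the largest distance keeps every weight in (0, 1].
    d_max = max(dists)
    return [math.exp((d - d_max) / sigma) for d in dists]


# Toy usage: the first codeword coincides with the feature.
weights = locality_adaptor([0.0, 0.0], [[0.0, 0.0], [3.0, 4.0]], sigma=5.0)
```

Because the adaptor multiplies the code element-wise inside the penalty, a large weight discourages the use of a distant codeword, so non-zero coefficients concentrate on nearby bases.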
To facilitate a more intuitive comparison of these three methods, Table 1 provides an overview of their advantages, disadvantages, and applications.
3. Methodology
3.1. The Proposed Framework
The proposed framework comprises four primary components, as depicted in Figure 1. The first involves the extraction of common SIFT features from images, while the second involves image representation using NLLSC based on Histogram Intersection (HI-NLLSC). In the third part, semantic information is integrated between image representation and image classification based on HI-NLLSC to acquire the final HS-NLLSC with its updated features. Finally, the SVM classifier is utilised to classify these images within the semantic spaces of the third part.
3.2. The Proposed HS-NLLSC Algorithm
In this paper, the NLLSC method is introduced, which incorporates non-negativity and locality constraints based on the LSC model. Additionally, histogram intersection is integrated into the locality constraint of the optimisation problem. Moreover, the HS-NLLSC method is proposed by considering both image representation and classification.
Firstly, the HI-NLLSC method is employed to encode the local features of the images, utilising Max Pooling (MP) to derive the original image representations.
Secondly, a subset of image representations is randomly selected to construct a semantic information-based space. Within this space, all training images are projected using a trained classifier, resulting in projected image feature representations that serve as the final image representations.
Finally, an SVM classifier is utilised for both training and classification, with the output providing class information. This comprehensive framework integrates non-negativity and locality constraints, histogram intersection, and semantic information to enhance image representation and classification within the HS-NLLSC method.
3.3. Image Representation Using HI-NLLSC
This section primarily outlines the process of deriving the original image representations through the HI-NLLSC method. By integrating histogram intersection into the optimisation problem of NLLSC, the distance between features and the dictionary is redefined, effectively quantifying their similarity.
3.3.1. Train Dictionary and Corresponding Coding
Due to the extensive number of extracted local features, constructing the local adaptors [30] and the Laplacian matrix incurs high computational complexity. Thus, template features, randomly selected from all local features, are employed to train the dictionary and the corresponding coding. Firstly, the initial formulation of the HI-NLLSC method is presented. Given X ∈ ℝ^(D×N) as the input non-negative feature matrix, B ∈ ℝ^(D×M) as the non-negative dictionary, and S ∈ ℝ^(M×N) as the corresponding non-negative sparse coding, the optimisation problem obtained by incorporating locality and non-negativity into LSC is provided as follows:
where the coefficients represent specified constants, and the sparseness of a vector is defined based on the relationship between its L1-norm and L2-norm, which is represented as follows:
where D is the dimensionality of the vector.
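A sparseness measure built from the ratio of the L1-norm to the L2-norm is usually written in the form introduced by Hoyer, (√D − ‖s‖₁/‖s‖₂)/(√D − 1). The sketch below assumes that form; the function name is hypothetical and the exact expression used in this paper may differ.

```python
import math
from typing import List


def sparseness(s: List[float]) -> float:
    """Hoyer-style sparseness in [0, 1] from the L1/L2 norm ratio."""
    dim = len(s)
    l1 = sum(abs(value) for value in s)
    l2 = math.sqrt(sum(value * value for value in s))
    if dim < 2 or l2 == 0.0:
        raise ValueError("need a non-zero vector with at least two entries")
    return (math.sqrt(dim) - l1 / l2) / (math.sqrt(dim) - 1.0)


# A single active entry is maximally sparse; equal entries are not sparse at all.
one_hot = sparseness([0.0, 0.0, 5.0, 0.0])
uniform = sparseness([1.0, 1.0, 1.0, 1.0])
```

Under this definition a vector with a single non-zero entry scores 1 and a vector with all entries equal scores 0, which gives a scale-invariant target for the desired sparsity level.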
In this paper, the Euclidean distance used in the LLC model to measure the distance between features and the dictionary is replaced by a similarity measure based on histogram intersection. The local adaptor is defined as follows:
where the decay parameter is used to adjust weight decay, and the distance between a feature and a dictionary base is measured using histogram intersection. The calculation method is defined as follows:
where M is the dimensionality of the two histograms (size of the dictionary), and the compared terms are the k-th elements of the two histograms.
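Histogram intersection is conventionally computed as the sum of the element-wise minima of two histograms. The sketch below shows that basic form; the function name is hypothetical, and any further scaling applied to it inside the local adaptor of this paper is not reproduced here.

```python
from typing import List


def histogram_intersection(p: List[float], q: List[float]) -> float:
    """Sum of element-wise minima of two non-negative histograms."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same length")
    return sum(min(a, b) for a, b in zip(p, q))


# Two normalised three-bin histograms (toy values).
overlap = histogram_intersection([0.5, 0.5, 0.0], [0.25, 0.25, 0.5])
```

For histograms that each sum to one, the value ranges from 0 (no overlap) to 1 (identical), so a larger value means greater similarity, which is the opposite orientation to a Euclidean distance.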
Equation (5) is solved by alternating optimisation: B is fixed while S is optimised, and then S is fixed while B is optimised. Firstly, with X and B fixed, S is optimised, and the following optimisation problem is obtained:
where ⊙ represents element-wise multiplication (for matrices).
For Equation (9), the objective function is first transformed into a trace form of matrices. Then, utilising the Lagrange Multiplier Method (LMM) and Karush–Kuhn–Tucker (KKT) conditions, the update rule for S can be obtained as follows:
where
is a diagonal matrix with
as its diagonal elements,
is an
N-dimensional row vector,
, and
.
Next, B is optimised by fixing X and S. The optimisation problem is as follows:
For Equation (11), the Lagrangian diagonal matrix can be obtained using the Lagrange dual problem and the conjugate gradient method. Once this matrix has been solved for, it is substituted into the following equation to obtain B, namely:
3.3.2. HI-NLLSC Based on New Features
Some template features X are randomly selected to train B and S in Section 3.3.1. When a new feature matrix H of local features appears, the HI-NLLSC method encodes it using the learned B and S. The optimisation problem can be written as follows:
where s_i represents the i-th column vector of S, and V represents the sparse coding of the new feature matrix H. The elements of the similarity matrix W are obtained by calculating the K-nearest neighbour relationship between the new feature and the template feature, where the K-nearest neighbour relationship is measured using histogram intersection. If the template feature
and the new feature
have a K-nearest neighbour relationship,
; otherwise,
. The metric function is the same as in Equation (8).
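A minimal sketch of how such a similarity matrix might be assembled is given below. It ranks the template features by their histogram intersection with each new feature and marks the K most similar ones. The binary weights (1.0 for a neighbour, 0.0 otherwise), the function names, and the toy data are assumptions for illustration.

```python
from typing import List


def hi_similarity(p: List[float], q: List[float]) -> float:
    """Histogram intersection: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(p, q))


def knn_similarity_matrix(templates: List[List[float]],
                          new_features: List[List[float]],
                          k: int) -> List[List[float]]:
    """W[i][j] = 1.0 if template i is among the k most similar to new feature j."""
    W = [[0.0] * len(new_features) for _ in templates]
    for j, h in enumerate(new_features):
        scores = [hi_similarity(x, h) for x in templates]
        # A larger intersection means more similar, so rank in descending order.
        ranked = sorted(range(len(templates)),
                        key=lambda i: scores[i], reverse=True)
        for i in ranked[:k]:
            W[i][j] = 1.0  # assumed binary weight
    return W


# Three template histograms and one new feature (toy values).
W = knn_similarity_matrix([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
                          [[0.9, 0.1]], k=2)
```

Each column of W then selects the template codes that the code of the corresponding new feature is encouraged to resemble.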
The update rule for V can be obtained using the LMM and KKT conditions as follows:
where
A is the diagonal weight matrix with its diagonal elements being
,
is the same as the definition in Equation (10), and
. After obtaining
B and
S from the template features, Equation (14) can be utilised to perform HI-NLLSC on the new features.
For the feature fusion stage, this paper employs the MP method. The specific approach is as follows:
where the pooled value is taken over the l-th elements of the sparse codes in the region and is stored as the l-th element of the vector z. Thus, the image of a single spatial pyramid region can be described by an M-dimensional vector z.
After obtaining the image representation for each region, the final image representation is obtained using the SPM method. In the image classification stage, this paper utilises a multi-class linear SVM.
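The pooling step for one spatial pyramid region can be sketched as below: each entry of the pooled vector is the largest value of the corresponding code element over all local features in the region. The function name and toy codes are hypothetical; since the codes here are non-negative, no absolute value is taken.

```python
from typing import List


def max_pool(codes: List[List[float]]) -> List[float]:
    """Pool the codes of one region: z[l] = max over i of codes[i][l]."""
    if not codes:
        raise ValueError("a region needs at least one code")
    return [max(column) for column in zip(*codes)]


# Two local features in one region, dictionary of size three (toy values).
z = max_pool([[0.0, 0.2, 0.5], [0.3, 0.0, 0.1]])
```

Concatenating the pooled vectors of all regions across the pyramid levels then yields the SPM representation of the whole image.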
3.4. Image Representation Based on Semantic Information
To overcome the challenges posed by the relatively independent nature of the HI-NLLSC method and the semantic gap between visual features and human understanding, as well as the oversight of semantic information in local regions during feature quantization, we propose the HS-NLLSC image classification algorithm. This algorithm seeks to comprehensively integrate visual and semantic information in images, capturing the relationships between semantic objects and their surrounding environments. Initially, the HI-NLLSC method is utilised to encode the local features of images, producing the original image representation. Subsequently, a semantic space is constructed to generate the final representation of images. Finally, SVM is employed to classify the obtained image representations within the semantic space.
3.4.1. Construct Semantic Space
The semantic space is defined as the collective space obtained from all image representations during classifier training, serving the purpose of image representation. Each distinct semantic space is created by training the classifier on randomly selected images.
Assume that the original image representations of the P training images have been obtained using the HI-NLLSC method, with a total of C classes and their corresponding class labels. From these representations, L images are randomly selected to construct a semantic space, and this selection process is repeated T times. For the t-th random selection of images, the SVM classifier is utilised to construct the corresponding semantic space, namely:
Then, the corresponding optimisation problem using the Hinge loss function is constructed, namely:
By solving Equation (17), the corresponding classifier parameters are obtained.
Each dimension of the semantic space corresponds to a classifier trained using randomly selected samples. As there are C classes of images, the generated semantic space is C-dimensional.
3.4.2. Project Images into Semantic Space
After training the SVM classifier, all the training images are projected into the aforementioned semantic space, namely:
where the superscript ‘ss’ represents ‘semantic space’.
3.4.3. Concatenate All Semantic Spaces
Once all T semantic spaces have been learned, the training images are projected into each of them. Concatenating the projected image features across these spaces forms the final image representation, namely:
Following the acquisition of the final image representation using Equation (19), a multi-class SVM is employed for image classification.
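The three steps of Section 3.4 (construct, project, concatenate) can be sketched end to end as follows. This is only an illustration under stated assumptions: the paper trains SVM classifiers, whereas the stand-in `train_hinge` below fits a linear classifier by plain subgradient descent on a regularised hinge loss, and all function names, hyperparameters, and toy data are hypothetical.

```python
import random
from typing import List, Tuple

Vector = List[float]
Classifier = Tuple[Vector, float]  # (weights, bias)


def train_hinge(samples: List[Vector], targets: List[int], epochs: int = 50,
                lr: float = 0.1, reg: float = 0.01) -> Classifier:
    """Stand-in linear classifier: subgradient descent on the hinge loss."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, targets):  # y is +1 or -1
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1.0 - lr * reg) for wi in w]  # weight decay
            if margin < 1.0:  # sample violates the margin
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def build_semantic_space(reps: List[Vector], labels: List[int], n_classes: int,
                         subset_size: int, rng: random.Random) -> List[Classifier]:
    """One one-vs-rest classifier per class, trained on a random image subset."""
    chosen = rng.sample(range(len(reps)), subset_size)
    subset = [reps[i] for i in chosen]
    return [train_hinge(subset, [1 if labels[i] == c else -1 for i in chosen])
            for c in range(n_classes)]


def project(rep: Vector, space: List[Classifier]) -> Vector:
    """Map an image to its C classifier responses in one semantic space."""
    return [sum(wi * xi for wi, xi in zip(w, rep)) + b for w, b in space]


def final_representation(rep: Vector, spaces: List[List[Classifier]]) -> Vector:
    """Concatenate the projections from all T semantic spaces."""
    out: Vector = []
    for space in spaces:
        out.extend(project(rep, space))
    return out


# Toy usage: P = 4 images, C = 2 classes, L = 3 images per draw, T = 4 draws.
reps = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
rng = random.Random(0)
spaces = [build_semantic_space(reps, labels, 2, 3, rng) for _ in range(4)]
z = final_representation(reps[0], spaces)
```

The final vector has T × C entries, one classifier response per class for each random draw, and this is the representation passed to the multi-class SVM.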
3.5. Description of Three Algorithms
After obtaining expressions for matrices B and S, the dictionary and associated sparse codes for template features are acquired through the systematic use of the following algorithms.
Algorithm 1 is designed for the iterative update of the Lagrangian diagonal matrix to obtain the representation of B. Subsequently, Algorithm 2 is employed to identify the optimal approximation that ensures the appropriate sparsity of B; in this process, B is replaced by this approximation.
Algorithm 1 (Iteratively update to calculate B) |
Input: non-negative matrix X; sparse coding V; precision ; |
Output: diagonal matrix ; dictionary B |
1. Initiate and let ; |
2. while convergence is not achieved, do |
3. Let , if then |
4. is the desired extremum; |
5. else |
6. ; |
7. end if |
8. Set , if or then |
9. is the required extreme value; |
10. else |
11. , ; |
12. end if |
13. Determine the optimal step size using an approximate one-dimensional search, namely: ; |
14. ; |
15. Set ; return Step 8; |
16. end while |
17. Return , and obtain dictionary B according to Equation (12). |
Algorithm 2 (The optimal approximation for proper sparseness of B) |
Input: a random column vector b of matrix B |
Output: the nearest non-negative vector of |
1. Compute the sparseness of the column vector with Equation , and , where D represents the dimensionality of vector u or ; |
2. Map vector b into the constraint space, for namely make ; |
3. Let to be an initial negative element set; |
4. while not iteratively finding the closest non-negative vector and meeting do; |
5. Set midpoint in constraint space; |
6. Get the non-negative solution by solving a quadratic equation , and replace with to update the vector ; |
7. if all elements of are non-negative then |
8. Return ; |
9. else |
10. for each do |
11. Let all negative elements be zero by and set ; |
12. Recompute the projection, keep invariant in constraint space, namely: ; |
13. Go to Step 5; |
14. end for |
15. end if |
16. end while |
Following this, Algorithm 3 is applied to learn the dictionary of HS-NLLSC and its corresponding coding. This entails iterative updates of both B and S until the established stop criterion is satisfied. The implementation of Algorithm 3 integrates the functionalities of both Algorithms 1 and 2.
In summary, the algorithmic process can be outlined as follows:
Algorithm 3 (HS-NLLSC) |
Input: non-negative feature matrix X, original dictionary B, original sparse coding S, Laplacian matrix L, parameter , , , , number of training images P, number of training iterations T |
Output: dictionary B, sparse coding S, class labels |
1. Preprocessing: , ; |
2. While convergence is not achieved, do |
3. Update sparse coding S with Equation (10); |
4. Normalize B and S according to the following equations:, ; |
5. Update Lagrange dual matrix using Algorithm 1; |
6. Project each column vector of matrix B using Algorithm 2 to obtain , and let , thereby obtaining the optimal dictionary B and corresponding sparse coding S; |
7. Set ; |
8. If Step 8 in Algorithm 1 is satisfied then |
9. Return B and S; |
10. else |
11. Return Step 3; |
12. end if |
13. end while |
14. After obtaining B and S for the template features, calculate the sparse coding V for the new features according to Equation (14); |
15. Use SPM with Equation (15) to perform MP on the obtained coding and obtain the original image representation; |
16. After obtaining the image representation using HI-NLLSC, construct the semantic space according to Equation (16), compute and according to Equation (17); |
17. Project all training images into the semantic space according to Equation (18); |
18. Connect all semantic spaces and generate the final image representation according to Equation (19); |
19. Use multi-class linear SVM to classify the image in the semantic space. |
4. Experiments
This part mainly presents three experiments to validate the feasibility of the HS-NLLSC algorithm. The first subsection provides information about the datasets used in the experiments. The second subsection describes the parameter settings of the experiments. The third subsection mainly introduces the design and results analysis of the three experiments. Following this, the fourth subsection provides an analysis of algorithm stability. Finally, the last subsection discusses the complexity analysis of the algorithm.
4.1. Experimental Datasets
In this section, detailed descriptions of four standard datasets are provided, namely, the Corel-10, Scene-15, Caltech-101, and Caltech-256 datasets. The specific information is presented in Table 2. Additionally, partial images from the Caltech-101 dataset are displayed in Figure 2.
Furthermore, three maritime datasets are discussed, namely, the Singapore Maritime Dataset (SMD), the Open Seaship dataset, and the Marine Image Dataset (MID). The SMD is divided into three parts, comprising on-shore videos, on-board videos, and near-infrared (NIR) videos. The distribution of the SMD is outlined in Table 3. As for the Open Seaship dataset, it currently contains 31,455 images covering seven common ship types (i.e., ore carriers, bulk carriers, general cargo ships, container ships, fishing vessels, passenger ships, and mixed types). The specific information is detailed in Table 4.
Moreover, the MID consists of eight video sequences for marine obstacle detection. It comprises 2655 labelled images with a resolution of
pixels captured from our Jinghai VIII USV. Partial images from the MID are shown in
Figure 3.
4.2. Experimental Settings
For the four standard datasets, different training and testing samples are selected for the experiments. Specifically, for the Corel-10 and Scene-15 datasets, 50 and 100 images from each category, respectively, are randomly selected as training samples, while the remaining images in each category are regarded as testing samples.
Regarding the Caltech-101 dataset, 15 or 30 images from each category are randomly chosen as training samples, and the remaining images in each category are used as test samples. For the Caltech-256 dataset, 15, 30, 45, or 60 images are randomly selected from each category to be used as training images, while the remaining images in each category are taken as test images.
For the three maritime datasets, 50, 100, 150, 200, or 250 images are randomly chosen from each category to be used as training images, and the remaining images in each category are taken as test images.
During the feature extraction stage, a step size of 8 and a window of
are used to extract SIFT features for each image. Each local feature descriptor is 128-dimensional, namely,
. Regarding the process of dictionary learning, the dictionary size is set to 1024. In the optimisation problem, there are four key parameters, namely,
, and
. As for
and
, in the SC algorithm,
. For instance, in the LSC algorithm,
and
are set for the Corel-10 and Scene-15 datasets, while for the Caltech-101 and Caltech-256 datasets,
and
are adopted. Then, it can be concluded that
and
. In the proposed method, after comparing several different values,
and
are ultimately set, as presented in
Section 4.3.3. Additionally,
and
are determined according to References [
33,
40]. In the generation of the semantic space,
and
are configured. Detailed information can be found in
Table 5.
4.3. Experimental Design and Result Analysis
This section comprises three experiments. Experiment 1 involves the visualisation of learned dictionaries for SC, LSC, and the proposed HS-NLLSC method. In Experiment 2, each dataset is randomly divided into 10 subsets, and a 10-fold cross-validation approach is utilised to determine the average classification accuracy and standard deviation of the proposed HS-NLLSC method. Experiment 3 investigates the influence of two key parameters on the classification performance across the four standard datasets.
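The evaluation protocol of Experiment 2 can be sketched as follows: the sample indices are shuffled once, split into ten folds, and each fold serves once as the test set, after which the per-fold accuracies are summarised by their mean and standard deviation. The function names, the seed, and the use of the sample standard deviation are assumptions; the classifier itself is omitted.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle indices once and return (train, test) index lists per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def summarise(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of the per-fold accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# 25 samples split into 10 folds; each sample is tested exactly once.
splits = k_fold_indices(25, 10)
```

Each sample appears in exactly one test fold, so the reported average accuracy covers every image in the dataset once.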
4.3.1. Visualisation of Learned Dictionaries
In this subsection,
Figure 4 illustrates the dictionaries learned using the SC, LSC, and HS-NLLSC methods. These images are displayed in grayscale format to effectively highlight the original features’ attributes, specifically non-negativity, locality, bandpass characteristics, and directionality.
(1) Non-negativity
Non-negativity ensures that the pixel values in the dictionary images are non-negative. Dictionaries with strong non-negativity typically appear brighter, with minimal dark regions. This characteristic is essential for ensuring that the dictionaries accurately represent features in a physically interpretable manner.
(2) Locality
Locality refers to the concentration of dictionary atoms in specific regions, appearing as localised patches rather than being distributed across the entire image. Dictionaries with good locality effectively capture local patterns, which are critical for tasks such as object recognition and texture analysis.
(3) Bandpass characteristics
Bandpass characteristics represent a balance between high-frequency and low-frequency features in the dictionaries. High-frequency features, such as fine textures, coexist with low-frequency features, such as smooth regions. This balance ensures that the dictionaries can capture both detailed and broader structural elements in the data.
(4) Directionality
Directionality reflects the ability of dictionary atoms to capture specific directional patterns, such as horizontal, vertical, or diagonal edges. Dictionaries with strong directionality exhibit clear streaks or gradients, indicating their sensitivity to directional features in the input data. This characteristic is particularly valuable for applications involving edge detection and orientation analysis.
These four characteristics are critical for evaluating the quality of dictionaries learned by different methods. As shown in
Figure 4, the dictionaries obtained by the HS-NLLSC method demonstrate a superior representation of these characteristics compared to the SC and LSC methods, highlighting its ability to effectively capture complex patterns in the data.
From
Figure 4, it can be seen that the dictionaries generated by the SC and LSC methods (as shown in
Figure 4a,b) exhibit common attributes such as locality, bandpass characteristics, and directionality. However, due to the differential operations used in their optimisation processes, negative bases may exist in these dictionaries, leading to a lack of non-negativity. In contrast, as shown in
Figure 4c, the dictionary obtained using the HS-NLLSC method exhibits more discernible characteristics, encompassing locality, non-negativity, bandpass characteristics, and directionality. Furthermore, considering the sparseness of the NLLSC dictionary, it can be concluded that the sparser the dictionary, the weaker its directionality and bandpass characteristics (see [
40]). Therefore, in
Figure 4c, the appropriate sparseness of the dictionary
is 0.4, which results in better performance in characterising the features of such images compared to other methods.
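The text does not state how dictionary sparseness is measured. One common choice in the non-negative factorisation literature is Hoyer's sparseness measure, which maps a vector to the interval [0, 1] using the ratio of its L1 and L2 norms. The sketch below assumes that measure purely for illustration; it should not be read as the definition used in this paper or in Reference [40].

```python
import math


def hoyer_sparseness(vector):
    """Hoyer's sparseness: 1 for a one-hot vector, 0 for a uniform vector.

    Assumed measure for illustration; the paper does not specify its definition.
    """
    n = len(vector)
    l1 = sum(abs(v) for v in vector)
    l2 = math.sqrt(sum(v * v for v in vector))
    if n < 2 or l2 == 0:
        raise ValueError("need at least two entries and a non-zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


one_hot = hoyer_sparseness([1.0, 0.0, 0.0, 0.0])   # maximally sparse atom
uniform = hoyer_sparseness([1.0, 1.0, 1.0, 1.0])   # maximally dense atom
```

Under this assumed measure, a target value such as 0.4 would correspond to atoms that are moderately concentrated, lying between the two extremes above.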
Additionally, to showcase the performance of HS-NLLSC, the code
V obtained on Scene-15 is generated using the different methods and then visualised, as depicted in
Figure 5. The representations of SC and EH-NLSC are shown in
Figure 5a and
Figure 5b, respectively, while that of HS-NLLSC is shown in
Figure 5c.
In
Figure 5, the non-zero elements in
V are depicted as white pixels. The distribution of
V depicted in
Figure 5c demonstrates a more uniform pattern, indicating the incorporation of locality, sparsity, and semantic information. This reflects the consideration of both group effect and topology information.
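The visualisation convention described above, in which non-zero entries of V are drawn as white pixels, can be sketched as follows. The helper names `support_mask` and `density`, the tolerance, and the toy code matrix are assumptions made for illustration.

```python
def support_mask(codes, tol=1e-12):
    """Map each entry of a code matrix to 255 (white) if non-zero, else 0 (black)."""
    return [[255 if abs(v) > tol else 0 for v in row] for row in codes]


def density(codes, tol=1e-12):
    """Fraction of non-zero entries; lower values indicate sparser codes."""
    total = sum(len(row) for row in codes)
    nonzero = sum(1 for row in codes for v in row if abs(v) > tol)
    return nonzero / total


# Toy code matrix (hypothetical values, not taken from the paper).
V = [[0.0, 0.5, 0.0],
     [0.2, 0.0, 0.0]]
mask = support_mask(V)
frac = density(V)
```

The resulting mask can be written out with any image library as a greyscale image, and the density gives a single number for comparing how sparse the codes of different methods are.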
4.3.2. Comparison of Average Classification Accuracy
Table 6 presents a comparison of the HS-NLLSC method with six state-of-the-art sparse coding methods, namely SC, LSC, LLC, Lap-NMF-SPM, RSS, and EH-NLSC, across four standard datasets: Corel-10, Scene-15, Caltech-101, and Caltech-256. The comparative results indicate that the HS-NLLSC approach outperforms these methods on all four datasets. This superiority can be attributed to several factors. Firstly, the HI-NLLSC method integrates non-negativity, locality, Laplacian regularisation, and histogram intersection, ensuring the accurate encoding of similar local features and the precise measurement of feature-dictionary similarity, thereby reducing coding instability. Secondly, in generating the semantic space, the proposed method addresses image representation and classification jointly, making fuller use of semantic information for a more effective representation. In contrast, previous methods such as SC, LSC, and LLC may suffer from feature cancellation due to the use of addition and subtraction in their optimisation problems, while Lap-NMF-SPM may lack local information, leading to inaccurate representations. Lastly, EH-NLSC, despite using histogram intersection, lacks Laplacian regularisation, which limits its ability to extract spatial geometric information and means that each feature is encoded independently.
Moreover, these five methods treat image representation and classification largely independently and therefore neglect the semantic information of the images. Although the RSS method utilises semantic information to construct the semantic space during the image representation process, it still relies on the traditional SC method in the encoding stage, which leads to instability and a less discriminative representation of the original image. In contrast, the proposed method incorporates non-negativity, locality, Laplacian regularisation, histogram intersection, and semantic information. This comprehensive approach preserves more features and ensures consistency in encoding among similar features, ultimately enhancing the performance of image classification. Overall, the HS-NLLSC method significantly improves the accuracy of image classification.
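As a brief illustration of the similarity measure discussed above, histogram intersection between two non-negative vectors is the sum of their element-wise minima. The sketch below shows this measure together with a hypothetical helper, `nearest_atoms`, that ranks dictionary atoms by it. Using the ranking to select the k most similar atoms is an assumption made here to illustrate locality, not a description of the authors' implementation, and the toy vectors are invented.

```python
def histogram_intersection(x, y):
    """Histogram intersection: sum of element-wise minima of two non-negative vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(min(a, b) for a, b in zip(x, y))


def nearest_atoms(feature, dictionary, k):
    """Indices of the k atoms most similar to the feature (hypothetical helper)."""
    scores = [histogram_intersection(feature, atom) for atom in dictionary]
    ranked = sorted(range(len(dictionary)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Toy feature and three toy atoms (made-up values).
feature = [0.6, 0.4, 0.0]
dictionary = [[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0]]
similarity = histogram_intersection(feature, dictionary[1])
neighbours = nearest_atoms(feature, dictionary, k=2)
```

Because the measure only uses minima of non-negative entries, it involves no subtraction, which is consistent with the argument above about avoiding feature cancellation.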
As shown in
Table 6, the average classification accuracy of the HS-NLLSC method is higher than that of the other methods, indicating that this algorithm performs well overall on these four datasets. Furthermore, except for slightly higher variances on the Caltech-256 (45) and Caltech-101 (30) datasets, the variances of classification accuracy are generally low, demonstrating that the proposed algorithm is relatively stable and robust.
For a more intuitive comparison, the classification results have been transformed into
Figure 6.
Figure 6 displays the classification results (average value ± standard deviation) for the seven methods across the four standard image datasets. The HS-NLLSC method improves the classification accuracy by about 5% to 19% compared with the other methods.
For ship classification, the analysis is centred on the Seaship and Singapore Maritime datasets, as outlined in
Table 7, which reports the average classification accuracy for these maritime datasets. To offer a clearer visual representation of the classification outcomes, the data from
Table 7 have been plotted, as shown in
Figure 7.
From
Table 7 and
Figure 7, it is apparent that the HS-NLLSC method yields favourable classification results across the three maritime datasets. In general, classification accuracy improves as the number of training images increases, with gains of approximately 1% to 16%. Notably, for the MID, the classification accuracy remains at 100% regardless of the number of training images.
4.3.3. Sensitivity Analysis of Different Parameters
In this experiment, the influence of various parameter settings on the classification accuracy across four standard datasets is examined. Specifically, different values are assigned to the parameters
and
, namely, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, and 0.4, and the corresponding classification accuracies are illustrated in
Figure 8. From
Figure 8, we can see that the classification accuracy is highest when
and
.
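The sensitivity analysis amounts to a grid search over the two parameters using the seven candidate values listed above. A minimal sketch is given below. The function `grid_search` and the callback `evaluate` are hypothetical names, and `toy_accuracy` is an arbitrary placeholder objective whose peak was chosen only so that the example runs; it does not reproduce the values reported in Figure 8.

```python
from itertools import product

# Candidate values taken from the text.
VALUES = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4]


def grid_search(evaluate, values=VALUES):
    """Evaluate every pair of parameter values and return the best pair and all scores."""
    results = {(a, b): evaluate(a, b) for a, b in product(values, repeat=2)}
    best = max(results, key=results.get)
    return best, results


def toy_accuracy(a, b):
    # Placeholder objective with an arbitrary peak at (0.3, 0.2);
    # a real run would train and test the classifier for each pair instead.
    return 1.0 - (a - 0.3) ** 2 - (b - 0.2) ** 2


best_pair, all_scores = grid_search(toy_accuracy)
```

With 7 values per parameter, the search evaluates 49 configurations per dataset, so the cost of the analysis scales with the cost of one full training and testing run.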
4.4. Algorithm Stability Analysis
The image representations in Caltech-256 of the NMF, SC, and LLC methods are compared with HS-NLLSC, as shown in
Figure 9.
As depicted in
Figure 9, the target data are represented by black circles, while data from three distinct image categories (watermelon, cake, and tomato) in the Caltech-256 dataset are denoted by red circles, blue squares, and green triangles, respectively.
Figure 9a illustrates the effect of non-negative constraints applied to both the dictionary and encoding in NMF, resulting in non-zero coefficients appearing only in specific regions. However, these coefficients may lack sparsity within the same region. In
Figure 9b, the image representation of the SC method is displayed, where only a few coefficients are non-zero for a given target data point, leading to sparsity in the coefficient vector. In contrast,
Figure 9c depicts the representation generated by the EH-NLSC method, which lacks semantic information and tends to select codewords near the input feature matrix for encoding. Finally,
Figure 9d showcases the image representation produced by the HS-NLLSC method, which incorporates non-negativity, locality, and semantic information. This approach ensures similarity between the input data and neighbouring codewords, enhancing the stability and consistency of the encoding process. By addressing the limitations observed in other methods, the HS-NLLSC method offers a more robust and comprehensive image representation.
4.5. Algorithm Complexity Analysis
Given the number of local features in an image (N), the number of template features (N1), and the size of the dictionary (M), the total complexity of similarity calculation between all local features and template features is . The complexity of the feature sign search algorithm is . Therefore, the overall complexity of the coding stage in LSC is .
After adding histogram intersection and local constraints, the complexity of the HI-NLLSC coding stage is . The computational complexity of the non-negativity constraint is . Thus, the total complexity of the HI-NLLSC encoding stage is . In the MP stage, since the SPM process involves the number of pyramid levels (pLevels) and the number of histogram bins (nBins), the complexity of this process is . Hence, the total complexity of the HI-NLLSC stage is .
Regarding the semantic information stage, given the number of cross-validation folds (nRounds) and the number of categories in the image dataset (C) for SVM classification, the complexity of this stage is . In summary, the overall computational complexity of the proposed HS-NLLSC algorithm is .
The complexity of the above-mentioned different stages is listed in
Table 8.
5. Conclusions
This study presents HS-NLLSC, an innovative approach to image classification that addresses key limitations of traditional sparse coding methods, such as their inability to effectively link image representation with classification and to fully capture the relationships between features and dictionaries. By integrating non-negativity, locality, and Laplacian regularisation, HS-NLLSC improves feature retention and ensures the coherence and interdependence of the coding process. A major advancement in HS-NLLSC is its use of histogram intersection to accurately measure the similarity between feature vectors and codebooks, enabling it to construct a semantic space that bridges the gap between image representation and classification. This comprehensive strategy allows for a contextual and semantic representation of images aligned closely with classification objectives.
The key findings of this study demonstrate that HS-NLLSC provides a more precise and comprehensive depiction of original images. By leveraging the similarity and interdependence of local features, the method enhances classification accuracy. Its effectiveness is validated across four benchmark image datasets, where it outperforms existing methods. Additionally, its robust classification capabilities are confirmed through its application to three maritime datasets, underscoring its versatility and practical utility.
Despite its promising results, HS-NLLSC has some limitations. It can be sensitive to noise and outliers, which may affect stability and performance. Additionally, its computational demands and storage requirements pose challenges for large datasets. To address these challenges and expand its applicability, future research could focus on the following areas: (1) developing strategies to improve the robustness of sparse coding to noise and outliers, thus enhancing stability and performance; and (2) exploring methods to integrate richer semantic information into the classification process, enabling more accurate and meaningful image representation. These future directions aim to overcome current limitations, broaden the application scope of HS-NLLSC, and enhance its performance in various domains. By addressing these areas, HS-NLLSC has the potential to become a cornerstone in the field of sparse coding and image classification, contributing to advancements in both academic research and practical applications.