1. Introduction
In computer vision and object recognition applications, the extraction of texture features plays a significant role [1]. A machine learning algorithm is trained to recognize objects using texture features extracted from the image. Texture analysis is important in numerous applications, including the analysis of satellite or aerial imagery, facial recognition, biometric object recognition, texture enhancement, robot vision for unmanned aerial vehicles, texture synthesis for computer graphics, and image compression [2]. Numerous texture extraction techniques have been proposed since 1960 [3]. These methods convert an image's texture into a feature vector that describes its characteristics; this feature vector can then be applied to later tasks such as texture classification. Because an image's texture carries spatial context information, analyzing the pixel neighborhood is necessary to capture that context. Most methods turn an image into a collection of small-scale local features and then aggregate them into a global representation using operations such as sum, max, and min.
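This aggregation step can be sketched in a few lines (a toy illustration; the helper name and feature values are not from the paper):

```python
def global_pool(local_feats, op=max):
    """Aggregate per-patch feature vectors into one global descriptor
    by applying op (sum, max, min, ...) to each dimension independently."""
    return [op(dim) for dim in zip(*local_feats)]

# Three local 4-D texture features pooled into a single global descriptor.
patches = [[0.2, 0.9, 0.1, 0.4],
           [0.7, 0.3, 0.5, 0.4],
           [0.1, 0.6, 0.8, 0.2]]
max_pooled = global_pool(patches, max)   # -> [0.7, 0.9, 0.8, 0.4]
sum_pooled = global_pool(patches, sum)
```

The choice of pooling operator trades invariance against information loss: max pooling keeps only the strongest local response per dimension, while sum pooling preserves aggregate magnitude.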
Existing texture feature extraction algorithms can be classified into two categories: traditional and learning-based [4]. Traditional feature extraction algorithms extract various statistical, structural, spectral, and model-based features from the image, but these features are not adaptive to fine-level fabric texture differences and lack specialization to the regions that particular applications care about. Learning-based approaches, which have recently been proposed for texture feature extraction, can be categorized as vocabulary learning, extreme learning, and deep learning approaches, and they offer higher capability than traditional feature extraction algorithms. Among them, deep learning-based approaches have recently gained importance due to their ability to learn intricate features without the extensive handcrafting required by traditional feature extraction algorithms. However, although deep learning approaches avoid handcrafting, they still suffer from overly generalized learning and lack selective, intricate learning focused on the specialization regions desired by applications. This work proposes a solution to this problem using attention-based deep learning feature extraction. The proposed solution identifies the specialization regions in the image through frequency domain analysis, and an LBP-based convolutional kernel is designed to extract more intricate features at salient regions than at other regions. The novel contributions of this work are as follows.
(i) A novel specialization region selection algorithm based on frequency domain analysis using Quaternion wavelet transform;
(ii) A novel LBP convolutional kernel to extract more intricate features at the specialization regions.
The discriminative ability of features for object recognition applications improves with specialization and more intricate features at the specialization region.
This paper is organized as follows: Section 2 details the proposed attention-based deep-learning texture feature extraction technique; Section 3 provides the results of the discriminative ability of the proposed deep learning feature for different applications; Section 4 presents the conclusion and scope for further research.
2. Materials and Methods
Andrearczyk et al. [5] introduced a deep learning-based technique for texture feature extraction, replacing traditional filters. They modified the CNN architecture to reduce shape emphasis, but this led to higher-dimensional features and lacked compactness. They also used a modified AlexNet model for image texture extraction [6].
Lin et al. [7] studied CNN models for texture feature extraction, finding bilinear models superior but computationally complex. Li et al. [8] combined deep learning with Gabor wavelets to extract rotation-invariant texture features without specific weights. Simple et al. [9] proposed a novel texture feature as a Fisher vector, but lacked details on mapping features to specific regions. Liu et al. [10] presented a CNN-based method that lacked compactness and region mapping. Dixit et al. [11] combined deep learning with the whale optimization algorithm.
Kociołek et al. [12] used a CNN for texture directionality detection, but found it lacked granularity and compaction support. Sabine et al. [13] explored deep CNN models, but did not target specific regions. Zhang et al. [14] combined convolutional and encoding layers, experimenting with various encoders, but did not focus on extracting features from specific regions. Sabino et al. [15] proposed a multilayer network for texture feature extraction, but faced computational issues. Barburiceanu et al. [16] extracted textures from deep learning models, but lacked region specification. Anwer et al. [17] combined LBP with a deep learning model to create TEX-Nets, but the fusion did not consider specific image regions.
Jia et al. [18] used a two-stage recurrent neural network to extract shape and texture features, while Kasthuri et al. [19] combined deep learning with Gabor filters for face recognition. Simon et al. [20] combined deep architecture features with luminance information. Bello et al. [21] utilized a CNN to extract color texture features, revealing superior discriminative ability compared to hand-crafted descriptors.
This survey of deep learning methods for extracting texture features found that none are region-specific. However, for applications like fabric defect detection and hairline-defect detection in manufacturing, it is crucial not to extract features uniformly across the entire image, but to focus on areas where defects are more likely to occur. Such specialized feature extraction can significantly improve the accuracy of defect classification.
The proposed solution uses frequency domain analysis to generate salient regions from an image, an unsupervised clustering algorithm to build the attention matrix, and a modified CNN with an LBP convolutional kernel for texture feature extraction, as shown in Figure 1.
2.1. Proposal Generation
The HSI transform is applied to the input RGB image before performing frequency domain analysis (FDA). The HSI transform of an RGB color image is performed as follows:

$$I = \frac{R+G+B}{3}, \qquad S = 1 - \frac{3\min(R,G,B)}{R+G+B}, \qquad H = \begin{cases} \theta, & B \le G \\ 360^\circ - \theta, & B > G \end{cases}$$

where $\theta$ is given as

$$\theta = \cos^{-1}\!\left(\frac{\frac{1}{2}\left[(R-G)+(R-B)\right]}{\sqrt{(R-G)^2+(R-B)(G-B)}}\right)$$

where $R$, $G$, $B$ are the pixel values for the red, green and blue components of the image.
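A per-pixel sketch of this RGB-to-HSI conversion (pure Python; it assumes RGB components normalised to [0, 1], and the function name is illustrative):

```python
import math

def rgb_to_hsi(r, g, b):
    """Convert one RGB pixel (components in [0, 1]) to (H, S, I).
    H is in degrees; S and I are in [0, 1]."""
    i = (r + g + b) / 3.0
    s = 0.0 if i == 0 else 1.0 - min(r, g, b) / i
    # Hue via the arccos formulation; undefined (set to 0) for grey pixels.
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    if den == 0:
        h = 0.0
    else:
        theta = math.degrees(math.acos(max(-1.0, min(1.0, num / den))))
        h = theta if b <= g else 360.0 - theta
    return h, s, i
```

For example, a pure red pixel maps to hue 0°, full saturation, and intensity 1/3; a grey pixel has zero saturation and an undefined hue.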
The HSI image can be represented in pure quaternion form as

$$f(n,m) = H(n,m)\,i + S(n,m)\,j + I(n,m)\,k$$

where $(n,m)$ is the location of the pixel and $H(n,m)$, $S(n,m)$, $I(n,m)$ are the hue, saturation, and intensity of the pixel at $(n,m)$. The value of $\mu$ below is selected as a unit pure quaternion satisfying $\mu^2 = -1$; a common choice is $\mu = (i+j+k)/\sqrt{3}$.

The quaternion Fourier transform for frequency $(u,v)$ is performed as

$$F(u,v) = \frac{1}{\sqrt{MN}} \sum_{n=0}^{M-1} \sum_{m=0}^{N-1} e^{-\mu 2\pi \left(\frac{nu}{M} + \frac{mv}{N}\right)}\, f(n,m)$$
The two-dimensional Gaussian quaternion high-pass filter attenuates low-frequency components while retaining the high-frequency content that carries edge and texture detail. The Gaussian quaternion high-pass filter function is defined as

$$H_{hp}(u,v) = 1 - e^{-D^2(u,v)/2\sigma^2}$$

where $\sigma$ is a measure of the Gaussian spread and $D(u,v)$ is the distance of the point $(u,v)$ from the centre of the frequency rectangle $(M/2, N/2)$. It is calculated as

$$D(u,v) = \sqrt{\left(u - \frac{M}{2}\right)^2 + \left(v - \frac{N}{2}\right)^2}$$
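A minimal sketch of building this high-pass mask in the centred frequency domain (the function name and grid size are illustrative, not from the paper):

```python
import math

def gaussian_highpass(M, N, sigma):
    """M x N Gaussian high-pass mask: H(u, v) = 1 - exp(-D^2(u, v) / (2 sigma^2)),
    where D(u, v) is the distance from the centre (M/2, N/2) of the frequency
    rectangle. The mask is 0 at the centre and approaches 1 far from it."""
    mask = [[0.0] * N for _ in range(M)]
    for u in range(M):
        for v in range(N):
            d2 = (u - M / 2) ** 2 + (v - N / 2) ** 2
            mask[u][v] = 1.0 - math.exp(-d2 / (2.0 * sigma * sigma))
    return mask
```

The mask is applied by element-wise multiplication with the (centred) transform coefficients before the inverse transform.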
Compared to an ideal filter, the Gaussian high-pass filter offers a smoother transition without ringing artifacts. The filtered coefficients are then processed using the inverse quaternion Fourier transform, given as

$$f(n,m) = \frac{1}{\sqrt{MN}} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} e^{\mu 2\pi \left(\frac{nu}{M} + \frac{mv}{N}\right)}\, F(u,v)$$

The inverse-transformed image is evaluated for salient regions based on color contrast, and regions with sharp color contrast are marked as candidate proposal regions. The probability of a salient region being a proposal region is computed from this contrast, weighted by a factor $W$ calculated from the area of the superpixel segment and its spatial similarity. When this probability is greater than the threshold, the salient region is selected as a proposal region.
2.2. Attention Matrix Generation
The attention matrix is used to drive feature learning with different intensities across regions. A dataset of images is collected, and each image is split into an $n \times n$ grid. A binary matrix representing each grid is created, with a cell set to 1 if it falls inside a proposal region and 0 otherwise. K-means clustering is performed on the binary matrices, and the attention matrix is generated from the binary matrices of the higher-density cluster.
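The clustering step might be sketched as follows (a toy k-means over flattened 0/1 grid vectors; the initialisation and helper names are assumptions, not the paper's implementation):

```python
def kmeans_binary(grids, k=2, iters=10):
    """Toy k-means over flattened 0/1 grid vectors. Centroids are initialised
    with the first k vectors (assumed distinct); returns cluster labels."""
    cents = [list(v) for v in grids[:k]]
    assign = [0] * len(grids)
    for _ in range(iters):
        for idx, v in enumerate(grids):
            assign[idx] = min(range(k),
                              key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(v, cents[c])))
        for c in range(k):
            members = [v for idx, v in enumerate(grids) if assign[idx] == c]
            if members:
                cents[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def attention_from_cluster(grids, assign, label):
    """Element-wise majority vote over the grids of the chosen cluster."""
    members = [v for idx, v in enumerate(grids) if assign[idx] == label]
    return [1 if sum(col) >= len(members) / 2 else 0 for col in zip(*members)]
```

In practice, the higher-density cluster can be picked as the label that occurs most often in `assign`, and its majority vote becomes the attention matrix.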
2.3. Modified CNN with LBP Kernel
This study proposes a modified CNN whose convolutional kernel combines a Gaussian response with LBP, convolving pixel regions with a new kernel matrix to learn intricate features. For an image region marked by the attention matrix, a multi-scale representation is formed by applying a Laplacian of Gaussian (LoG) operator, calculated as

$$\nabla^2 G(x,y) = -\frac{1}{\pi\sigma^4}\left[1 - \frac{x^2+y^2}{2\sigma^2}\right] e^{-\frac{x^2+y^2}{2\sigma^2}}$$

The LoG response is usually 0 for a uniform image region and becomes positive or negative depending on the darkness level of the image. At each scale, the LBP is calculated as

$$LBP_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^p, \qquad s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$$

where $g_c$ is the centre pixel and $g_p$ are its $P$ neighbours at radius $R$. The LBP map of each scale is logically ANDed with the region mask to generate the feature map.
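A pure-Python sketch of a basic 3 × 3 LBP code and the region masking (helper names are illustrative; the paper's multi-scale variant would repeat this per scale):

```python
def lbp_code(patch):
    """8-neighbour LBP code of the centre pixel of a 3 x 3 patch;
    neighbours are visited clockwise starting at the top-left."""
    c = patch[1][1]
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (row, col) in enumerate(order):
        if patch[row][col] >= c:
            code |= 1 << bit
    return code

def lbp_and_region(lbp_map, region_mask):
    """Keep LBP responses only where the attention/region mask is 1."""
    return [[v if m else 0 for v, m in zip(vr, mr)]
            for vr, mr in zip(lbp_map, region_mask)]
```

A uniform patch yields the all-ones code 255 (every neighbour ties with the centre), while a bright centre surrounded by darker pixels yields 0.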
The effect of the LoG is approximated using a discrete convolutional kernel, as shown in Figure 2a. The 2-D LoG for different values of $\sigma$ is shown in Figure 2b.
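One common discrete approximation samples the 2-D LoG on a small grid and shifts it to zero sum, so that a perfectly uniform region produces zero response (a sketch; not necessarily the exact kernel of Figure 2a):

```python
import math

def log_kernel(size, sigma):
    """Sample the 2-D Laplacian of Gaussian on a size x size grid, then shift
    it to zero sum so a perfectly uniform region produces zero response."""
    half = size // 2
    k = [[0.0] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            r2 = (x - half) ** 2 + (y - half) ** 2
            k[y][x] = (-(1.0 / (math.pi * sigma ** 4))
                       * (1.0 - r2 / (2.0 * sigma ** 2))
                       * math.exp(-r2 / (2.0 * sigma ** 2)))
    mean = sum(map(sum, k)) / size ** 2
    return [[v - mean for v in row] for row in k]
```

Because the kernel sums to zero, convolving it with a constant patch gives exactly zero, consistent with the behaviour described above.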
The modified CNN uses attention matrix-based convolution to extract texture features from an input image, resulting in a 1024-dimensional texture feature vector. The architecture applies the default convolution in unmarked regions and the LBP-based convolution in marked regions, as shown in Figure 3. The convolution flow discussed so far is summarized in Figure 4.
3. Results
The attention-based deep learning texture feature and SVM classifier were tested on two datasets: the Outex_TC_00013 texture database [22] and the plant village dataset. The texture database offers 68 RGB images in 68 categories, while the plant village dataset contains numerous plant species in both healthy and diseased states. The proposed solution was compared to CNN, transfer learning, a deep convolutional neural network, and Deep Lumina. Texture features were classified using an SVM classifier, and performance was measured in terms of accuracy, precision, recall, and the F1-score [23,24]. Results are presented in Table 1 and Table 2.
The proposed solution has an average accuracy of 97.41%, which is at least 4.78% higher than previous works. As shown in Table 3, the proposed solution consistently outperformed existing works for all nine plant cases.
The proposed features' performance was compared with other deep learning models such as AlexNet, VGG16, and ResNet for the texture dataset, as shown in Table 4. The proposed attention-based deep feature extraction model outperforms these deep learning models by at least 9% for the texture dataset; the corresponding comparison for the plant village dataset is given in Table 5.
As shown in Figure 5, the proposed attention-based deep feature extraction achieved at least 7% higher accuracy than the other deep learning models on the texture dataset. The proposed solution also reduces feature extraction time by 48% compared to the other deep learning models, primarily because the modified CNN uses fewer layers.
The accuracy of the SVM classifier with the proposed feature extraction was tested for different SVM kernels, and the results are given in Table 6. The proposed texture features performed better with the radial-basis-function (RBF) kernel than with the linear and polynomial kernels. The performance of the proposed features with the RBF kernel was measured for various values of C, and the result is given in Figure 6.
The proposed solution reached its peak accuracy at C = 0.1, with no significant increase in accuracy for larger values of C, as shown in Figure 7. The ROC curve plot in Figure 8 reveals a highest ROC area of 90.7%, indicating higher sensitivity. The modified CNN reaches its peak accuracy at epoch 10, indicating fast convergence.