Article

A Sustainable Deep Learning Framework for Object Recognition Using Multi-Layers Deep Features Fusion and Selection

1 Department of Computer Science, HITEC University Taxila, Taxila 47080, Pakistan
2 College of Computer Science and Engineering, University of Ha’il, Ha’il 55211, Saudi Arabia
3 School of Architecture Building and Civil Engineering, Loughborough University, Loughborough LE11 3TU, UK
4 Department of EE, COMSATS University Islamabad, Wah Campus, Wah 47040, Pakistan
5 College of Computer and Information Sciences, Prince Sultan University, Riyadh 11586, Saudi Arabia
* Authors to whom correspondence should be addressed.
Sustainability 2020, 12(12), 5037; https://doi.org/10.3390/su12125037
Submission received: 18 April 2020 / Revised: 5 June 2020 / Accepted: 17 June 2020 / Published: 19 June 2020
(This article belongs to the Special Issue Research on Sustainability and Artificial Intelligence)

Abstract

With the rapidly growing demand for autonomous systems, especially in applications related to intelligent robotics and visual surveillance, come stringent accuracy requirements for complex object recognition. A system that maintains its performance against a change in the object’s nature is said to be sustainable, and building such systems has become a major area of research for the computer vision community in the past few years. In this work, we present a sustainable deep learning architecture, which utilizes multi-layer deep features fusion and selection, for accurate object classification. The proposed approach comprises three steps: (1) features are extracted through transfer learning from two deep learning architectures, VGG19 (Very Deep Convolutional Networks for Large-Scale Image Recognition) and Inception V3; (2) all the extracted feature vectors are fused by means of a parallel maximum covariance approach; and (3) the best features are selected using the Multi Logistic Regression controlled Entropy-Variances method. To verify the robustness of the selected features, the ensemble learning method named Subspace Discriminant Analysis is utilized as a fitness function. The experimental process is conducted using four publicly available datasets, namely Caltech-101, the Birds database, the Butterflies database, and CIFAR-100, with a ten-fold validation process, and yields best accuracies of 95.5%, 100%, 98%, and 68.80% for the datasets, respectively. Based on a detailed statistical analysis and comparison with existing methods, the proposed selection method achieves higher accuracy. Moreover, its lower computational time makes it better suited to real-time implementation.

1. Introduction

Object recognition is currently one of the most active areas of research due to its emerging applications in intelligent robotics and visual surveillance [1,2]. Researchers, however, still face problems in this domain, such as recognizing an object’s shape and spotting minor differences among several objects. Therefore, a sustainable system, one that maintains its performance against a change in the object’s nature, is required for the correct recognition of complex objects [3]. Object classification is the key to a sustainable visual surveillance system [4]. Beyond visual surveillance, object classification finds application in numerous domains, including intelligent robotics, face and action recognition, video watermarking, pedestrian tracking, autonomous vehicles, semantic scene analysis, content-based image retrieval, and many more. We believe that a genuinely sustainable object recognition system still has to overcome numerous challenges, including complex backgrounds, different shapes and the same color for different objects, continuously moving objects, different viewing angles, and many more, since the conventionally used, unsustainable systems have not proved to work well for complex object classification [5].
Many techniques have been introduced in computer vision to overcome these challenges. Most of them aimed at a single optimal method that would perform equally well across many types of problems, which proved to be a considerable challenge. Conventional approaches based on Hand-Crafted Features (HCF) were used over the past few decades, but as objects and their backgrounds became more complex, their usefulness became restricted. Handcrafted features include Histogram of Oriented Gradients (HOG) [6], geometric features [7], Scale-Invariant Feature Transform (SIFT) [8], Difference of Gaussians (DoG) [9], Speeded-Up Robust Features (SURF) [10], and texture features (Haralick) [11]. More recent techniques, in contrast, exploit a hybrid set of features to get a better representation of an object [12]. Unfortunately, those techniques were also unable to cope with the growing complexity of objects and images.
In the face of these challenges, deep learning has recently been introduced in this context and has shown improved performance with reduced computational time. A large number of pre-trained convolutional neural network (CNN) models have been proposed, including AlexNet [13], VGG (VGG-16, VGG-19) [14], GoogLeNet [14], ResNet (ResNet-50, ResNet-101, and ResNet-152) [15], and Inception [16]; all these models are trained on the ImageNet dataset. Even with these contributions, however, acceptable accuracy has been difficult to achieve. This has given rise to the concept of features fusion [15,16], a process of combining several feature populations into a single feature space, which has been adopted in various applications ranging from medical imaging to object classification [17,18,19]. Features fusion does increase classification accuracy, but only at an increased computational cost. In addition, some recent works have shown that the fusion process may add irrelevant features that are not important for the classification task [17,18]. We believe that if the irrelevant features are identified and removed from the fused vector, the computational time can be reduced while the accuracy increases.
Feature selection methods fall into three categories: filter-based, wrapper-based, and embedded. Filter-based methods select the features from subsets independently. Wrapper-based methods start from an assumed set of features and then select among them based on predictive power. Embedded methods perform the selection in the training phase and thereby enjoy the advantages of both filter-based and wrapper-based methods [19]. Well-known feature selection techniques include Principal Component Analysis (PCA) [20], Linear Discriminant Analysis (LDA) [21], Pearson Correlation Coefficient (PCC) [22], Independent Component Analysis (ICA) [22], Entropy Controlled [23], Genetic Algorithm-based (GA) [24], and many more.
In this work, an entire sustainable framework based on a deep learning architecture is proposed. While we summarize our challenges and highlight our contributions in response to those in Section 3, the details on the proposed framework are explicitly given in Section 4. Section 5 presents the simulation results before we conclude the manuscript in Section 6. In what follows, however, we review some of the existing related works, in Section 2.

2. Related Work

Many strategies for image classification have been investigated in computer vision and machine learning. Object categorization is one of the fastest-growing fields of computer vision because of its many applications in video surveillance, auto-assisted vehicle frameworks, pedestrian analysis, automatic target recognition, and so on. In the literature, very few fusion-based techniques are presented for the classification of complex objects. Features fusion is the process of combining two or more feature spaces into a single matrix. Fusion offers the chance of a more accurate vector that has the properties of multiple feature spaces. Roshan et al. [25] presented a new technique for object classification. They applied the presented algorithm to the VGG-16 architecture and performed training from scratch. Additionally, they used transfer learning (TL) on the top layers. They utilized the Caltech-101 dataset and achieved an accuracy of 91.66%. Jongbin et al. [26] introduced a new DFT-based technique for feature building by discarding the pooling layers between the fully connected and convolutional layers. Two modules were implemented in this technique: the first module, known as DFT, replaced max-pooling in the architecture with pooling of a user-defined size; the second module, known as DFT+, fused multiple layers to get the best classification accuracy. They achieved 93.2% classification accuracy on the Caltech-101 dataset using the VGG-16 CNN network, and 93.6% accuracy on the same dataset using the ResNet-50 model. Qun et al. [27] used a pre-trained network with associative memory banks for feature extraction. They extracted the features using ResNet-50 and VGG-16. Later, K-Means clustering was used on the memory banks to perform unsupervised clustering. Qing et al. [28] presented a fused framework for object classification. They extracted the CNN features and applied three different types of coding techniques to the fused vector. Two pre-trained models, namely VGG-M and VGG-16, were used for feature extraction from the 5-Conv-Layer. Subsequently, PCA-based reduction was applied, and features were fused into a final vector using the proposed coding techniques. Results showed an improved accuracy of 92.54% on the Caltech-101 database.
Xueliang et al. [29] presented a late fusion-based technique for object recognition. Three pre-trained networks, namely AlexNet, VGGNet, and ResNet-50, were used for the purpose. They first established that the middle-level layers of the CNN architecture contained more robust information for visual representation, and then extracted features from these layers. Fusing the features from these three models improved the result, with a reported accuracy of 92.2% on the Caltech-101 dataset. Hamayun et al. [30] showed that the most robust features were extracted from the fully connected layer-6 (FC-6) instead of the FC-8. In the presented approach, they exploited the CNN output and modified it at a middle-level layer instead of the deepest layer. VGG-16 and VGG-19 pre-trained models were used to illustrate the proposed technique. They extracted 4096 features from the FC-6 layer and then applied reduction using PCA. For the experimental process, they used the Caltech-101 dataset and attained an accuracy of 91.35% using the reduced features from the layer FC-6. Mahmood et al. [31] proposed an approach for object detection and classification using pre-trained networks (ResNet-50 and ResNet-152). After feature extraction, they performed features reduction using PCA. The Caltech-101 database was selected for evaluation, and an accuracy of 92.6% was achieved. Emine et al. [32] used convolutional architecture for fast feature embedding (Caffe) for object recognition. About 300 images from the Caltech-101 dataset were used to test the proposed technique. Results showed that 260 images were correctly classified, and 40 were misclassified.
Chunjie et al. [33] introduced a new technique, called Contextual Exemplar, to handle the drawbacks caused by the local features. The method comprised three phases: in the first, they combined the regions-based image; in the second, they constructed the relationship between those regions; and in the third, they used the connection of those regions for semantic representation. They selected 1000 features and achieved an accuracy of 86.14%. Rashid et al. [8] focused on fusing multiple features and selecting the best of them for efficient object classification. They used VGG and AlexNet pre-trained models for CNN feature extraction and SIFT for point feature extraction. Both types of features were fused by a simple concatenation approach. Moreover, they introduced an entropy-based selection approach within their framework, which achieved an accuracy of 89.7% for the Caltech-101 dataset. Nazar et al. [34] fused HOG and Inception V3 CNN features and improved the existing accuracy up to 90.1% for the Caltech-101 dataset.

3. Challenges and Contributions

The computer vision research community is still facing various challenges for object classification, and most of them are due to the complex nature of objects. We do realize that it is not an easy task to classify objects into their relevant categories efficiently. To be able to tackle the challenges facing the community and achieve the required accuracies, in this work, we propose a deep learning architecture-based framework for object classification with improved accuracy. The highlights of the framework are as follows:
  • It uses two pre-trained deep learning architectures, namely VGG19 and Inception V3, and performs TL to retrain them on the selected datasets. The FC7 and Average Pool layers of the CNNs are utilized for feature extraction.
  • A parallel maximum covariance (PMC) technique is proposed for the fusion of both deep learning feature vectors.
  • While the Multi Logistic Regression controlled Entropy-Variances (MRcEV) method is employed for selecting the robust features, the Ensemble Subspace Discriminant (ESD) classifier is used as a fitness function.
  • A detailed statistical analysis of the proposed method is conducted and compared with recent techniques to examine the stability of the proposed architecture.

4. Materials and Methods

The proposed object classification architecture is presented in this section with detailed mathematical formulation and visible results. As shown in Figure 1, the proposed architecture consists of three core steps: Deep learning feature extraction using TL, fusion of various model features, and selection of the robust features for final classification. In the classification step, the ESD classifier is used, and the performance is compared with other learning algorithms. The details of each step, depicted in this figure, are discussed below.

4.1. Deep Learning Features Extraction

Over the past two decades, deep learning has proven itself to be the best approach for image recognition and classification [8,35,36,37]. A CNN is a deep learning method built from a series of layers. A simple CNN model consists of convolution and pooling layers. Other layers include the activation layer, named ReLU, and the feature layer, called fully connected (FC). The first layer of a CNN is the input layer. This layer takes images as input, and the convolutional layer computes the neurons’ responses, each calculated as the dot product of the weights and a small region of the input. While the ReLU layer applies the activation function, the pooling layer between convolution layers removes the inactive neurons for the next phase. Finally, the high-level features are computed using the FC layers and classified through Softmax [8]. In this work, we use two pre-trained CNN models, namely VGG19 and Inception V3, for feature extraction. In what follows, we present a brief description of each model.
VGG19: VGG-19 [38] has 19 layers with learnable weights, which are utilized for transfer learning: 16 convolutional layers and 3 FC layers, followed by an output layer. This model is already trained on the ImageNet dataset. The input size for this model is 224 × 224 × 3, as given in Table A1 (Appendix A Section). The learnable weights and bias of the first convolution layer are 3 × 3 × 3 × 64 and 1 × 1 × 64, giving 1792 learnable parameters in total at this layer. For the second convolution layer, the total is 36,928. This layer extracts the local features of an image:
$$V_i^{(M)} = B_i^{(M)} + \sum_{k=1}^{n_1^{(M-1)}} \psi_{i,k}^{(M)} \times h_k^{(M-1)}$$
where $V_i^{(M)}$ is the output of layer $M$, $B_i^{(M)}$ is the base value, $\psi_{i,k}^{(M)}$ denotes the filter mapping the $k$-th feature value, and $h_k^{(M-1)}$ is the output of layer $M-1$. The learnable weights and bias of the first FC layer are 4096 × 25,088 and 4096 × 1. A dropout layer with a dropout rate of 50% is added between the FC layers. For FC layer 7, the total number of learnable parameters is 16,781,312, and the learnable weights are 4096 × 4096. For the last FC layer, the total is 4,097,000, and the learnable weights are 1000 × 4096. Hence, when the activation is applied, it returns a feature map vector of dimension 1 × 1 × 1000. For fully connected layers 1 and 2, the feature map vector dimension is 1 × 1 × 4096.
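The learnable-parameter totals quoted for VGG19 can be checked with a few lines of arithmetic. The sketch below assumes 3 × 3 convolution kernels and one bias per filter or output unit (which is what the stated total of 1792 for the first layer implies); the helper names `conv_params` and `fc_params` are illustrative and not part of the original work.

```python
def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter for a 2-D convolution layer."""
    return kernel_h * kernel_w * in_channels * filters + filters


def fc_params(inputs, outputs):
    """Weights plus one bias per output unit for a fully connected layer."""
    return inputs * outputs + outputs


# Assumed layer shapes, taken from the dimensions quoted in the text.
conv1 = conv_params(3, 3, 3, 64)    # first convolution layer
conv2 = conv_params(3, 3, 64, 64)   # second convolution layer
fc7 = fc_params(4096, 4096)         # FC layer 7
fc8 = fc_params(4096, 1000)         # last FC layer

print(conv1, conv2, fc7, fc8)
```

Under these assumptions the four values agree with the totals reported above for the first two convolution layers, FC7, and the last FC layer.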
Inception V3: This is an advanced pre-trained CNN model. It consists of 316 layers and 350 connections. It has 94 convolution layers with different filter sizes, and its input layer is of size 299 × 299 × 3. A brief description of this model is given in Table A2 (Appendix A Section), which shows that a scaling layer is added after the input layer. Applying the activation on the first convolution layer yields a matrix of dimension 149 × 149 × 32, where 32 denotes the number of filters. Batch normalization and ReLU activation layers follow. Mathematically, the ReLU layer is defined as:
$$Re_i^{(l)} = \max\left(hv,\; hv_i^{(l-1)}\right)$$
Between the convolution layers, a pooling layer is also added to get active neurons. In the first max-pooling layer, the filter size is 2 × 2 . Mathematically, the max-pooling is defined as:
$$mx_1^{(q)} = mx_1^{(q-1)}$$
$$mx_2^{(q)} = \frac{mx_2^{(q-1)} - F^{(q)}}{S^{q}} + 1$$
$$mx_3^{(q)} = \frac{mx_3^{(q-1)} - F^{(q)}}{S^{q}} + 1$$
where $S^{q}$ denotes the stride, and $mx_1^{(q)}$, $mx_2^{(q)}$, and $mx_3^{(q)}$ are the defined filters for feature set maps such as 2 × 2 and 3 × 3. A few other layers are also added in this architecture, such as addition and concatenation layers. In the end, an average pool layer is added. Applying the activation at this layer yields a feature map of dimension 1 × 1 × 2048. The last layer is FC; its learnable weight matrix is 1000 × 2048, and the ensuing feature matrix is 1 × 1 × 1000. Mathematically, the FC layer is defined as follows:
$$Fc_i^{(l)} = f\left(z_i^{(l)}\right) \quad \text{with} \quad z_i^{(l)} = \sum_{j=1}^{n_1^{(l-1)}} \sum_{r=1}^{n_2^{(l-1)}} \sum_{s=1}^{n_3^{(l-1)}} w_{i,j,r,s}^{(l)} \left(Fc_i^{(l-1)}\right)_{r,s}$$
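The spatial output-size relation used in the pooling equations above (previous size minus filter size, divided by the stride, plus one) can be illustrated numerically. This is a minimal sketch assuming no padding and integer division; the helper name `output_size`, the 3 × 3 filter with stride 2 for the first Inception V3 convolution, and the stride of 2 for a 2 × 2 pooling window are assumptions made for illustration, not values stated in the text.

```python
def output_size(input_size, filter_size, stride):
    """Spatial size after a filter of the given size and stride (no padding)."""
    return (input_size - filter_size) // stride + 1


# Assumed: a 3 x 3 filter with stride 2 applied to the 299 x 299 input.
first_conv = output_size(299, 3, 2)

# Assumed: a 2 x 2 max-pooling window with stride 2 applied to a 224 x 224 map.
pooled = output_size(224, 2, 2)

print(first_conv, pooled)
```

With these assumed settings, the first value matches the 149 × 149 spatial size reported for the first convolution layer of Inception V3.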
Feature Extraction using TL: In the feature extraction step, we employ TL to retrain both CNN models (VGG19 and Inception V3) on the selected datasets. For training, we use a 60:40 split of the labeled data. As preprocessing, we resize the images to match the input layer of each model. We then select the layers that serve as the input and output of the feature mapping. For VGG19, we choose the first convolutional layer as the input and FC7 as the output. The CNN activation is then performed, and we obtain the training and testing vectors. On the feature layer FC7, the resultant feature vector has dimension 1 × 4096, is denoted by $\varphi(k_1)$, and is utilized in the next process. The modified architecture of VGG19 is shown in Figure 2. For Inception V3, we select the first convolutional layer as the input and the average pool layer as the feature map. As with VGG19, we perform TL, retrain this model on the selected datasets, and apply the CNN activation on the average pool layer. On this layer, we obtain a feature vector of dimension 1 × 2048, denoted by $\varphi(k_2)$. Both training and testing vectors proceed to the features fusion process. The modified architecture of Inception V3 is shown in Figure 3, which shows that the last three layers are removed before the model is retrained on the selected datasets.

4.2. Features Fusion

The fusion of multiple features into one matrix is a recent research area in pattern recognition. The primary purpose of features fusion is to obtain a stronger feature vector for classification. Recent research shows that the fusion process improves the overall accuracy, but its main disadvantage is a high computational time. However, our priority is to improve the classification accuracy. For this purpose, we implement a new Parallel Maximum Covariance (PMC) approach for features fusion. In this approach, we first equalize the lengths of both extracted feature vectors. We then find the maximum covariance for fusion into a single matrix.
Consider two deep learning feature vectors $\varphi(k_1)$ and $\varphi(k_2)$ of dimensions $n \times m$ and $n \times q$, where $n$ denotes the number of images, $m = 4096$ is the length of the VGG19 feature vector, and $q = 2048$ is the length of the Inception V3 feature vector. To make the lengths of the vectors equal, we first find the longer vector and then pad the shorter one with an average value, which is calculated from the longer vector. Let $\varphi_1$ be an arbitrary unit column vector representing a pattern in the $\varphi(k_1)$ field, and let $\varphi_2$ be an arbitrary unit column vector representing a pattern in the $\varphi(k_2)$ field. The time series projections on the row vectors are defined as follows:
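$$x_1 = \varphi_1^{T} \varphi(k_1)$$
$$x_2 = \varphi_2^{T} \varphi(k_2)$$
For optimal solutions $\varphi_1$ and $\varphi_2$, maximize their covariance as follows:
$$\tilde{c} = \mathrm{Cov}[x_1, x_2]$$
$$\tilde{c} = \mathrm{Cov}\left[\varphi_1^{T} \varphi(k_1),\; \varphi_2^{T} \varphi(k_2)\right]$$
$$\tilde{c} = \frac{1}{n-1} \left[\varphi_1^{T} \varphi(k_1) \left(\varphi_2^{T} \varphi(k_2)\right)^{T}\right]$$
$$\tilde{c} = \varphi_1^{T} \left(C_{\varphi_1 \varphi_2}\right) \varphi_2$$
$$C_{\varphi_1 \varphi_2} = \frac{1}{n-1} \left(\varphi(k_1)\, \varphi(k_2)^{T}\right)$$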
where $C_{\varphi_1 \varphi_2}$ is the covariance between $\varphi_1$ and $\varphi_2$, whose $i$-th and $j$-th features are $\varphi_i(t)$ and $\varphi_j(t)$. Hence, the feature pair $i$ and $j$ of maximum covariance $C_{\varphi_1 \varphi_2}$ is saved in the final fused vector. It is possible, however, that a few of the feature pairs are redundant. This process continues until all pairs have been compared with each other. In the end, a fused vector $\varphi^{(fu)}$ of dimensions $N \times K$ is obtained, where $K$ denotes the feature length, which varies according to the selected features. In this work, the fused feature length is $N \times 3294$ for the Caltech-101 dataset, $N \times 2981$ for the Birds dataset, and $N \times 3089$ for the Butterflies dataset.
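Two ingredients of the fusion step described above, average-value padding and the search for the feature pair with maximum covariance, can be sketched on toy data. This is only an illustrative sketch under stated assumptions, not the authors' full PMC procedure: the function names, the toy feature columns, and the choice to return a single best pair are all hypothetical, and the sample covariance divides by $n - 1$ as in the equations above.

```python
def pad_with_mean(short, long):
    """Pad `short` to the length of `long` using the mean of `long`."""
    fill = sum(long) / len(long)
    return list(short) + [fill] * (len(long) - len(short))


def sample_cov(x, y):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def cross_covariance(cols_1, cols_2):
    """Covariance of every feature column in cols_1 with every one in cols_2."""
    return [[sample_cov(f, g) for g in cols_2] for f in cols_1]


def max_covariance_pair(cov):
    """Indices (i, j) of the largest entry in the cross-covariance matrix."""
    pairs = [(i, j) for i in range(len(cov)) for j in range(len(cov[0]))]
    return max(pairs, key=lambda ij: cov[ij[0]][ij[1]])


# Toy data: each inner list is one feature observed over n = 4 images.
set_1 = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
set_2 = [[2.0, 4.0, 6.0, 8.0], [1.0, 1.0, 2.0, 1.0]]

cov = cross_covariance(set_1, set_2)
best = max_covariance_pair(cov)
padded = pad_with_mean([1.0, 2.0], [2.0, 4.0, 6.0, 8.0])
print(best, padded)
```

In the toy data, the first feature of each set varies together most strongly, so that pair is the one a maximum-covariance criterion would keep.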

4.3. Feature Selection

Feature selection is an exciting research topic in machine learning (ML) nowadays, and shows significant improvement in the classification accuracy. In this work, we propose a new technique for feature selection, namely, Multi Logistic Regression controlled Entropy-Variances (MRcEV). It exploits a partial derivative-based activation function to remove the irrelevant features, and the remaining robust features are passed to the entropy-variances function. Through the latter, a new vector is obtained, which only contains positive values. Finally, this vector is presented to the ESD fitness function, and the validity of the proposed technique is determined. Mathematically, the formulation is given as:
For a given dataset, the fused vector is represented as $\Delta = \{\varphi^{(fu)}, y^{(fu)}\}_{fu=1}^{N}$ with $N$ sample images, where $\varphi^{(fu)} \in R^{p}$ denotes the fused feature vector, which is utilized as the input, and $y^{(fu)} \in R$ indicates the corresponding label. The probability of $\varphi^{(fu)}$ belonging to class $i$ is then computed as follows:
$$p\left(y^{(fu)} \mid \varphi^{(fu)}\right) = \frac{\exp\{r_i^{(fu)}\}}{\sum_{j=1}^{q} \exp\{r_j^{(fu)}\}}$$
$$r_i^{(fu)} = \sum_{j=1}^{p} \beta_{ij}\, \varphi_j^{(fu)}$$
The logistic regression parameters $r_i = (r_0, r_1, \ldots, r_p)$ are obtained by minimizing the negative log-likelihood of the features. If the features are independent, then for a multinomial distribution it is computed as follows:
$$E_{\Delta} = -\sum_{fu=1}^{n} \sum_{i=1}^{q} y_i^{(fu)} \log p\left(y^{(fu)} \mid \varphi^{(fu)}\right)$$
To get a sparse model, a regularization parameter $\tilde{\beta}$ is added to the negative log-likelihood. The modified MLR criterion for the active features is defined as follows:
$$M = E_{\Delta} + \tilde{\beta} E_r$$
$$E_r = \sum_{i=1}^{p} |r_i|$$
where $E_r$ is the regularization term, the sum of the absolute values of the parameters $r_i$.
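The penalized criterion above combines a softmax-based negative log-likelihood with an absolute-value penalty on the parameters. The following is a minimal sketch of that combination on a single toy sample, assuming one-hot labels and precomputed class scores; the function names and toy values are illustrative assumptions, and no optimization is performed.

```python
import math


def softmax(scores):
    """Class probabilities from raw scores (shifted for numerical stability)."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(one_hot_labels, class_scores):
    """Sum over samples and classes of -y * log p."""
    loss = 0.0
    for labels, scores in zip(one_hot_labels, class_scores):
        probs = softmax(scores)
        loss -= sum(y * math.log(p) for y, p in zip(labels, probs))
    return loss


def penalized_objective(one_hot_labels, class_scores, params, beta):
    """Negative log-likelihood plus beta times the sum of |parameter|."""
    penalty = sum(abs(r) for r in params)
    return negative_log_likelihood(one_hot_labels, class_scores) + beta * penalty


# Toy example: one sample, two classes with equal scores, two parameters.
value = penalized_objective([[1, 0]], [[0.0, 0.0]], [1.0, -2.0], beta=0.1)
print(value)
```

Increasing `beta` raises the cost of keeping non-zero parameters, which is the mechanism that drives some of them to exactly zero in a sparse model.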
At the minimum value of $M$, the partial derivative with respect to $r_i$ satisfies:
$$\begin{cases} \left|\dfrac{\partial E_{\Delta}}{\partial r_i}\right| = \tilde{\beta} & \text{if } |r_i| > 0 \\ \left|\dfrac{\partial E_{\Delta}}{\partial r_i}\right| < \tilde{\beta} & \text{if } |r_i| = 0 \end{cases}$$
This expression shows that if the partial derivative of $E_{\Delta}$ with respect to $r_i$ is less than $\tilde{\beta}$, then that feature value is set to zero and removed from the final vector. Afterwards, an entropy-variances-based function is applied to obtain a more robust vector. Mathematically, this function is formulated as:
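$$H(\tilde{\beta}) = -\sum_{i=0}^{N-1} p_i(\tilde{\beta}) \log p_i(\tilde{\beta})$$
$$\sigma^2(\tilde{\beta}) = \frac{\sum \left(\tilde{\beta}_i - \bar{\tilde{\beta}}\right)^2}{n-1}$$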
$$Ent(FV) = \left(\frac{\ln\left(H(\tilde{\beta})_i + 1\right) + \sigma^2(\tilde{\beta})}{\ln\left(H(\tilde{\beta})_i + \sigma^2(\tilde{\beta})\right) + \ln\left(H(\tilde{\beta})_i - \sigma^2(\tilde{\beta})\right)}\right)$$
where $H(\tilde{\beta})$ is the entropy function, $\sigma^2(\tilde{\beta})$ denotes the variance of the selected vector, and $Ent(FV)$ represents the final entropy-variances function. The selected features are passed to this function to obtain a clear separation among the features with respect to the classification classes. The proposed selection technique retains roughly 50% to 60% of the features in the fused feature vector as robust features. The selected features are finally verified through the ESD classifier [39], an ensemble learning classifier that uses the subspace discriminant method. The proposed system’s predicted results are shown in Figure 4, Figure 5 and Figure 6.
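The two building blocks of the entropy-variances function, an entropy term and a variance term, can be illustrated with their standard definitions. The sketch below uses the Shannon entropy with natural logarithms and the usual sample variance (squared deviations divided by $n - 1$); these standard forms, the function names, and the toy inputs are assumptions for illustration, and the combined $Ent(FV)$ expression itself is not implemented here.

```python
import math


def shannon_entropy(probs):
    """-sum(p * ln p) over the non-zero probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


entropy = shannon_entropy([0.5, 0.5])
variance = sample_variance([1.0, 2.0, 3.0, 4.0])
print(entropy, variance)
```

A feature whose values are spread evenly yields a higher entropy, while a larger variance indicates values that differ more from their mean; both quantities feed the final selection function described above.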

5. Results

This section presents the simulation results with detailed numerical analysis and visual plots. As stated above, we utilize four publicly available datasets to evaluate the proposed framework: Caltech-101, the Birds database, the Butterflies database, and CIFAR-100 [40]. A brief description of the selected datasets is given in Table 1, which lists the total number of images, the number of classes (categories), and the number of images in each class. Caltech-101 and CIFAR-100 are the more challenging datasets for object classification. For validation, a 60:40 split is employed along with ten-fold cross-validation. We used various classifiers in the experiments, such as Ensemble learning, SVM, KNN, and Linear Discriminant classifiers. The performance of each classifier is evaluated using three measures: accuracy, false negative rate (FNR), and computational time. All the simulations are conducted in MATLAB 2019a on a 2.4 GHz Core i7 processor with 16 GB of RAM, 128 SSD, and a Radeon R7 graphics card.
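The validation protocol mentioned above, a 60:40 split together with ten-fold cross-validation, can be sketched with index lists. The text does not state exactly how the two are combined, so the sketch below assumes the folds are formed on the training portion; the function names, the fixed seed, and the sample count of 100 are hypothetical.

```python
import random


def split_indices(n, train_fraction=0.6, seed=0):
    """Shuffle the indices 0..n-1 and split them into train and test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(round(n * train_fraction))
    return indices[:cut], indices[cut:]


def k_folds(indices, k=10):
    """Partition the given indices into k folds of near-equal size."""
    return [indices[i::k] for i in range(k)]


train, test = split_indices(100)   # assumed: 100 labeled samples
folds = k_folds(train)             # assumed: folds built on the training part

print(len(train), len(test), [len(fold) for fold in folds])
```

Each fold would serve once as the held-out part while the remaining nine are used for fitting, and the reported accuracy would be averaged over the ten runs.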

5.1. Caltech-101 Dataset Results

The results on the Caltech-101 dataset are presented for three methods. In the first method (M1), the VGG19- and InceptionV3-based deep features are fused using a serial-based method, and the classification is performed without feature selection. In the second method (P-Fusion), the deep features are fused using the proposed fusion approach presented in Section 4.2. In the third method (P-Selection), feature selection is performed on the proposed fused vector, followed by classification. The results are shown in Table 2, where it is evident that the ESD classifier yields the best results for each method. A large difference exists, however, between the accuracies achieved using M1 and those of the other methods. For example, with the ESD classifier, the accuracy rises from 79% to 90.8% with the proposed fusion method and further to 95.5% once the proposed selection method is applied. Additionally, the computational time drops by around 74% between M1 and the P-Selection method, making the latter superior to the other two methods. The accuracy of the P-Selection method may also be verified through Figure 7. The effectiveness of the proposed P-Fusion and P-Selection methods with other classifiers is also evident in Table 2: the best accuracies are provided by the P-Selection method irrespective of the classifier, while P-Fusion stands second, both in terms of accuracy and computational time. Overall, the proposed selection method performs best with the ESD classifier on the Caltech-101 dataset.

5.2. Birds Dataset Results

The classification results for the Birds dataset are presented in this section. As before, three methods are evaluated, and the trends observed previously hold in this case as well. Table 3 summarizes these results and verifies that the ESD classifier yields the best results for all three methods when compared with the other classifiers. Irrespective of the classifier used, the proposed fusion method outperforms M1 in terms of both accuracy and computational time, while the proposed selection method surpasses the fusion method in both metrics. Its accuracy is also confirmed by Figure 8. Because the dataset is simpler, the accuracies achieved by the three methods are relatively close, unlike in the case of Caltech-101, where the proposed methods outperformed M1 by a considerable margin. The computational time, however, gives the proposed methods a substantial edge over the equivalent techniques.

5.3. Butterflies Dataset

The results for the Butterflies dataset are given in Table 4. The ESD classifier gives better outcomes than the other classifiers for all three feature methods. For M1, the ESD classifier achieves an accuracy of 95.1%, which improves to 95.6% with the P-Fusion method. The computational time of M1 is 46.05 s, which P-Fusion reduces to 31.95 s. The P-Selection method achieves an accuracy of 98%, better than both M1 and P-Fusion, and its computational time of 19.53 s is also the lowest of the three. The performance of the ESD classifier for the P-Selection method may also be verified through Figure 9. The ESD classifier is also compared with a few other well-known techniques such as SVM, KNN, and LDA, as given in Table 4. The results clearly show that all the classifiers provide better accuracy with the P-Selection method. Moreover, W-KNN performs better in terms of computational time.
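The relative savings in computational time implied by the figures above can be worked out directly. The sketch below uses only the three ESD timings quoted for the Butterflies dataset; the helper name `percent_reduction` is illustrative.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


m1_time = 46.05           # seconds, serial fusion (M1)
fusion_time = 31.95       # seconds, P-Fusion
selection_time = 19.53    # seconds, P-Selection

fusion_saving = percent_reduction(m1_time, fusion_time)
selection_saving = percent_reduction(m1_time, selection_time)
print(round(fusion_saving, 1), round(selection_saving, 1))
```

On these numbers, P-Fusion saves roughly 31% of the M1 time and P-Selection roughly 58%.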

5.4. CIFAR-100 Dataset

This dataset consists of 100 object classes such as bus, chair, table, train, and bed, and each class consists of 100 samples, making this dataset more challenging. There are 50,000 images available for training and 10,000 images for testing. In this work, we utilize this dataset to evaluate the proposed technique. The results are given in Table 5 and Table 6. Table 5 provides the training results, which show a maximum accuracy of 69.76% and an error rate of 30.24%. For the simple fusion method (M1), the accuracy is 51.34% and the computation time is 608 min. The proposed fusion takes 524 min to execute and achieves an improved accuracy of 63.97%. The proposed P-Selection method further improves the accuracy to 69.76% while reducing the execution time to 374 min. The testing results are given in Table 6. The maximum accuracy achieved in testing is 68.80%, using the P-Selection method and the ESD classifier. This accuracy is modest, but in view of the dataset’s complexity, it is acceptable. The accuracy of the ESD classifier using the P-Selection method can be further verified through Figure 10 (confusion matrix).

5.5. Analysis and Comparison with Existing Techniques

A comprehensive analysis and comparison with existing techniques are presented in this section to examine the validity of the results of the proposed method. The proposed fusion and robust feature selection methods give accuracies of 95.5%, 100%, 98%, and 68.80% with the ESD classifier on the Caltech-101, Birds, Butterflies, and CIFAR-100 datasets, respectively. The results can be seen in Table 2, Table 3, Table 4, and Table 6. However, it is essential to examine the accuracy of ESD against each classifier based on a detailed statistical analysis. For the Caltech-101 dataset, we run the proposed algorithm 500 times for each method and obtain two accuracies: average (76.3%, 87.9%, and 92.7%) and maximum (79%, 90.8%, and 95.5%). These accuracies are plotted in Figure 11a, which shows that only a minor change occurs in the accuracy after 500 iterations. For the Birds database, two accuracies are also obtained: minimum (97.2%, 98.9%, and 99.4%) and maximum (99%, 99.5%, and 100%). These values are plotted in Figure 11b, where the change in M1 is slightly higher than in P-Fusion and P-Selection. Finally, the statistical analysis is conducted for the Butterflies dataset, as shown in Figure 11c, which shows a slight change in the accuracy of each method.
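The repeated-run analysis described above reduces each method to a few summary values (minimum, average, and maximum accuracy). A minimal sketch of such a summary is given below; the function name `summarize_runs` and the five example accuracies are hypothetical and are not taken from the experiments.

```python
from statistics import mean


def summarize_runs(accuracies):
    """Return (minimum, average, maximum) accuracy over repeated runs, in percent."""
    if not accuracies:
        raise ValueError("at least one run is required")
    return (min(accuracies), round(mean(accuracies), 2), max(accuracies))


# Hypothetical accuracies from five repeated runs (illustrative values only,
# not results reported in this work).
example_runs = [92.1, 93.4, 92.7, 95.5, 91.8]
lowest, average, highest = summarize_runs(example_runs)
print(lowest, average, highest)
```

A small gap between the average and the maximum indicates that the classifier's accuracy changes little from run to run.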
To compare the classification performance of the proposed scheme, we also performed the classification using other deep neural nets, namely VGG16, AlexNet, ResNet50, and ResNet101. The results are computed from the last two layers of each network: VGG16 (FC7 and FC8), AlexNet (FC7 and FC8), and ResNet (Average Pool and FC Layer). The features extracted from these layers are fused using the proposed approach, and the selection technique is then applied. For the classification of these neural nets, we used the original classifier, Softmax. The results are given in Table 7 and Table 8 for the Caltech-101 and CIFAR-100 datasets, respectively. These tables show that the P-Fusion and P-Selection techniques perform best with the proposed scheme. A brief comparison with existing techniques is also presented in Table 9. For this table, we computed the results on different training/testing ratios and obtained a range of accuracies. The results show that increasing the training ratio reduces the error rate. For example, the accuracy on CIFAR-100 is 65.46%, 68.80%, 73.16%, and 77.28% for training/testing ratios of 50:50, 60:40, 70:30, and 80:20, respectively. The minimum error rate is 22.72% for the 80:20 ratio, whereas for the standard ratio (70:30), the error rate is 26.84%. From this table, it is evident that the proposed method gives improved accuracy.
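The error rates quoted for the different training/testing ratios are the complements of the reported accuracies. A short sketch of that calculation follows; the dictionary and function names are introduced here only for illustration.

```python
# Reported CIFAR-100 accuracies (%) of the proposed method per training/testing ratio.
ACCURACY_BY_SPLIT = {
    "50:50": 65.46,
    "60:40": 68.80,
    "70:30": 73.16,
    "80:20": 77.28,
}


def error_rates(accuracy_by_split):
    """Map each split to its error rate, taken as 100 minus the accuracy."""
    return {split: round(100.0 - acc, 2) for split, acc in accuracy_by_split.items()}


ERRORS = error_rates(ACCURACY_BY_SPLIT)
# Split with the smallest error rate.
best_split = min(ERRORS, key=ERRORS.get)
print(ERRORS, best_split)
```

The error rate falls from 34.54% at 50:50 to 22.72% at 80:20, which is the trend described in the text.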

6. Conclusions

A new multi-layer deep features fusion and selection-based method for object classification is presented in this work. The major contribution of this work lies in the fusion of deep learning models, followed by the selection of robust features for final classification. Three core steps are involved in the proposed system: feature extraction using transfer learning, fusion of the features of two different deep learning models (VGG19 and Inception V3) using PMC, and selection of the robust features using the Multi Logistic Regression controlled Entropy-Variances (MRcEV) method. An ensemble subspace discriminant (ESD) classifier is used to validate the performance of MRcEV. We utilize four datasets for the experimental process and demonstrate improved accuracy. From the results, we conclude that the proposed method is useful for large as well as small datasets. The fusion of two different deep learning features improves the classification accuracy. Additionally, the selection of robust features improves both the computational time and the classification accuracy. The main limitation of the proposed method is the quality of the features: with low-quality images, it is not possible to obtain strong features. In the future, this problem will be addressed through a contrast-stretching deep learning architecture. Moreover, to improve the experimental process, the Caltech-256 and CIFAR-100 datasets will be considered.

Author Contributions

M.R. and M.A.K. developed this idea, and they were responsible for the first draft. M.A. was responsible for mathematical formulation. S.-H.W. supervised this work. S.R.N. gave technical support for this work. T.S. and A.R. were responsible for the final proofreading. All authors have read and agreed to the published version of the manuscript.

Funding

There was no funding involved in this work.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Detailed description of VGG19 pre-trained CNN model.

| Sr No. | Name | Type | Activation | Learnable Weights | Learnable Bias | Total Learnables |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Input | Image Input | 224 × 224 × 3 | - | - | - |
| 2 | conv1_1 | Convolution | 224 × 224 × 64 | 3 × 3 × 3 × 64 | 1 × 1 × 64 | 1792 |
| 3 | relu1_1 | ReLU | 224 × 224 × 64 | - | - | - |
| 4 | conv1_2 | Convolution | 224 × 224 × 64 | 3 × 3 × 64 × 64 | 1 × 1 × 64 | 36,928 |
| 5 | relu1_2 | ReLU | 224 × 224 × 64 | - | - | - |
| 6 | pool1 | Max Pooling | 112 × 112 × 64 | - | - | - |
| 7 | conv2_1 | Convolution | 112 × 112 × 128 | 3 × 3 × 64 × 128 | 1 × 1 × 128 | 73,856 |
| 8 | relu2_1 | ReLU | 112 × 112 × 128 | - | - | - |
| 9 | conv2_2 | Convolution | 112 × 112 × 128 | 3 × 3 × 128 × 128 | 1 × 1 × 128 | 147,584 |
| 10 | relu2_2 | ReLU | 112 × 112 × 128 | - | - | - |
| 11 | pool2 | Max Pooling | 56 × 56 × 128 | - | - | - |
| 12 | conv3_1 | Convolution | 56 × 56 × 256 | 3 × 3 × 128 × 256 | 1 × 1 × 256 | 295,168 |
| 13 | relu3_1 | ReLU | 56 × 56 × 256 | - | - | - |
| 14 | conv3_2 | Convolution | 56 × 56 × 256 | 3 × 3 × 256 × 256 | 1 × 1 × 256 | 590,080 |
| 15 | relu3_2 | ReLU | 56 × 56 × 256 | - | - | - |
| 16 | conv3_3 | Convolution | 56 × 56 × 256 | 3 × 3 × 256 × 256 | 1 × 1 × 256 | 590,080 |
| 17 | relu3_3 | ReLU | 56 × 56 × 256 | - | - | - |
| 18 | conv3_4 | Convolution | 56 × 56 × 256 | 3 × 3 × 256 × 256 | 1 × 1 × 256 | 590,080 |
| 19 | relu3_4 | ReLU | 56 × 56 × 256 | - | - | - |
| 20 | pool3 | Max Pooling | 28 × 28 × 256 | - | - | - |
| 21 | conv4_1 | Convolution | 28 × 28 × 512 | 3 × 3 × 256 × 512 | 1 × 1 × 512 | 1,180,160 |
| 22 | relu4_1 | ReLU | 28 × 28 × 512 | - | - | - |
| 23 | conv4_2 | Convolution | 28 × 28 × 512 | 3 × 3 × 512 × 512 | 1 × 1 × 512 | 2,359,808 |
| 24 | relu4_2 | ReLU | 28 × 28 × 512 | - | - | - |
| 25 | conv4_3 | Convolution | 28 × 28 × 512 | 3 × 3 × 512 × 512 | 1 × 1 × 512 | 2,359,808 |
| 26 | relu4_3 | ReLU | 28 × 28 × 512 | - | - | - |
| 27 | conv4_4 | Convolution | 28 × 28 × 512 | 3 × 3 × 512 × 512 | 1 × 1 × 512 | 2,359,808 |
| 28 | relu4_4 | ReLU | 28 × 28 × 512 | - | - | - |
| 29 | pool4 | Max Pooling | 14 × 14 × 512 | - | - | - |
| 30 | conv5_1 | Convolution | 14 × 14 × 512 | 3 × 3 × 512 × 512 | 1 × 1 × 512 | 2,359,808 |
| 31 | relu5_1 | ReLU | 14 × 14 × 512 | - | - | - |
| 32 | conv5_2 | Convolution | 14 × 14 × 512 | 3 × 3 × 512 × 512 | 1 × 1 × 512 | 2,359,808 |
| 33 | relu5_2 | ReLU | 14 × 14 × 512 | - | - | - |
| 34 | conv5_3 | Convolution | 14 × 14 × 512 | 3 × 3 × 512 × 512 | 1 × 1 × 512 | 2,359,808 |
| 35 | relu5_3 | ReLU | 14 × 14 × 512 | - | - | - |
| 36 | conv5_4 | Convolution | 14 × 14 × 512 | 3 × 3 × 512 × 512 | 1 × 1 × 512 | 2,359,808 |
| 37 | relu5_4 | ReLU | 14 × 14 × 512 | - | - | - |
| 38 | pool5 | Max Pooling | 7 × 7 × 512 | - | - | - |
| 39 | fc6 | Fully Connected | 1 × 1 × 4096 | 4096 × 25,088 | 4096 × 1 | 102,764,544 |
| 40 | relu6 | ReLU | 1 × 1 × 4096 | - | - | - |
| 41 | drop6 | Dropout | 1 × 1 × 4096 | - | - | - |
| 42 | fc7 | Fully Connected | 1 × 1 × 4096 | 4096 × 4096 | 4096 × 1 | 16,781,312 |
| 43 | relu7 | ReLU | 1 × 1 × 4096 | - | - | - |
| 44 | drop7 | Dropout | 1 × 1 × 4096 | - | - | - |
| 45 | fc8 | Fully Connected | 1 × 1 × 1000 | 1000 × 4096 | 1000 × 1 | 4,097,000 |
| 46 | Prob | Softmax | 1 × 1 × 1000 | - | - | - |
| 47 | Output | Classification | - | - | - | - |
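The "Total Learnables" column of Table A1 follows directly from the weight and bias shapes: a convolution with a k × k kernel has k · k · C_in · C_out weights plus C_out biases, and a fully connected layer has N_in · N_out weights plus N_out biases. The sketch below reproduces the column from the layer shapes in the table; the helper names are illustrative only.

```python
def conv_learnables(kernel, in_channels, out_channels):
    """Weights (kernel x kernel x in x out) plus one bias per output channel."""
    return kernel * kernel * in_channels * out_channels + out_channels


def fc_learnables(in_features, out_features):
    """Weights (out x in) plus one bias per output unit."""
    return out_features * in_features + out_features


# (input channels, output channels) of the sixteen 3 x 3 convolutions in Table A1
CONV_CHANNELS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512), (512, 512),
]
# (input features, output features) of fc6, fc7, and fc8
FC_SIZES = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

conv_total = sum(conv_learnables(3, cin, cout) for cin, cout in CONV_CHANNELS)
fc_total = sum(fc_learnables(fin, fout) for fin, fout in FC_SIZES)
total_learnables = conv_total + fc_total
print(total_learnables)
```

Summing all layers gives 143,667,240 learnable parameters, of which the three fully connected layers account for 123,642,856.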
Table A2. Detailed description of Inception V3 pre-trained CNN model.

| S/N | Name | Type | Activation | Weights | Bias | Offset | Scale |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | input_1 | Image Input | 299 × 299 × 3 | - | - | - | - |
| 2 | scaling | Scaling | 299 × 299 × 3 | - | - | - | - |
| 3 | conv2d_1 | Convolution | 149 × 149 × 32 | [3,3,3,32] | [1,1,32] | - | - |
| 4 | batch_normalization_1 | Batch Normalization | 149 × 149 × 32 | - | - | [1,1,32] | [1,1,32] |
| 5 | activation_1_relu | ReLU | 149 × 149 × 32 | - | - | - | - |
| 6 | conv2d_2 | Convolution | 147 × 147 × 32 | [3,3,32,32] | [1,1,32] | - | - |
| 7 | batch_normalization_2 | Batch Normalization | 147 × 147 × 32 | - | - | [1,1,32] | [1,1,32] |
| 8 | activation_2_relu | ReLU | 147 × 147 × 32 | - | - | - | - |
| 9 | conv2d_3 | Convolution | 147 × 147 × 64 | [3,3,32,64] | [1,1,64] | - | - |
| 10 | batch_normalization_3 | Batch Normalization | 147 × 147 × 64 | - | - | [1,1,64] | [1,1,64] |
| 11 | activation_3_relu | ReLU | 147 × 147 × 64 | - | - | - | - |
| 12 | max_pooling2d_1 | Max Pooling | 73 × 73 × 64 | - | - | - | - |
| 13 | conv2d_4 | Convolution | 73 × 73 × 80 | [1,1,64,80] | [1,1,80] | - | - |
| 14 | batch_normalization_4 | Batch Normalization | 73 × 73 × 80 | - | - | [1,1,80] | [1,1,80] |
| 15 | activation_4_relu | ReLU | 73 × 73 × 80 | - | - | - | - |
| 16 | conv2d_5 | Convolution | 71 × 71 × 192 | [3,3,80,192] | [1,1,192] | - | - |
| 17 | batch_normalization_5 | Batch Normalization | 71 × 71 × 192 | - | - | [1,1,192] | [1,1,192] |
| 18 | activation_5_relu | ReLU | 71 × 71 × 192 | - | - | - | - |
| 19 | max_pooling2d_2 | Max Pooling | 35 × 35 × 192 | - | - | - | - |
| 20 | conv2d_9 | Convolution | 35 × 35 × 64 | [1,1,192,64] | [1,1,64] | - | - |
| 21 | batch_normalization_9 | Batch Normalization | 35 × 35 × 64 | - | - | [1,1,64] | [1,1,64] |
| 22 | activation_9_relu | ReLU | 35 × 35 × 64 | - | - | - | - |
| 23 | conv2d_7 | Convolution | 35 × 35 × 48 | [1,1,192,48] | [1,1,48] | - | - |
| 24 | conv2d_10 | Convolution | 35 × 35 × 96 | [3,3,64,96] | [1,1,96] | - | - |
| 25 | batch_normalization_7 | Batch Normalization | 35 × 35 × 48 | - | - | [1,1,48] | [1,1,48] |
| 26 | batch_normalization_10 | Batch Normalization | 35 × 35 × 96 | - | - | [1,1,96] | [1,1,96] |
| 27 | activation_7_relu | ReLU | 35 × 35 × 48 | - | - | - | - |
| 28 | activation_10_relu | ReLU | 35 × 35 × 96 | - | - | - | - |
| 29 | average_pooling2d_1 | Avg Pooling | 35 × 35 × 192 | - | - | - | - |
| 30 | conv2d_6 | Convolution | 35 × 35 × 64 | [1,1,192,64] | [1,1,64] | - | - |
| 31 | conv2d_8 | Convolution | 35 × 35 × 64 | [5,5,48,64] | [1,1,64] | - | - |
| 32 | conv2d_11 | Convolution | 35 × 35 × 96 | [3,3,96,96] | [1,1,96] | - | - |
| 33 | conv2d_12 | Convolution | 35 × 35 × 32 | [1,1,192,32] | [1,1,32] | - | - |
| 34 | batch_normalization_6 | Batch Normalization | 35 × 35 × 64 | - | - | [1,1,64] | [1,1,64] |
| 35 | batch_normalization_8 | Batch Normalization | 35 × 35 × 64 | - | - | [1,1,64] | [1,1,64] |
| 36 | batch_normalization_11 | Batch Normalization | 35 × 35 × 96 | - | - | [1,1,96] | [1,1,96] |
| 37 | batch_normalization_12 | Batch Normalization | 35 × 35 × 32 | - | - | [1,1,32] | [1,1,32] |
| 38 | activation_6_relu | ReLU | 35 × 35 × 64 | - | - | - | - |
| 39 | activation_8_relu | ReLU | 35 × 35 × 64 | - | - | - | - |
| 40 | activation_11_relu | ReLU | 35 × 35 × 96 | - | - | - | - |
| 41 | activation_12_relu | ReLU | 35 × 35 × 32 | - | - | - | - |
| 42 | mixed0 | Depth Concat | 35 × 35 × 256 | - | - | - | - |
| 43 | conv2d_16 | Convolution | 35 × 35 × 64 | [1,1,256,64] | [1,1,64] | - | - |
| 44 | batch_normalization_16 | Batch Normalization | 35 × 35 × 64 | - | - | [1,1,64] | [1,1,64] |
| 45 | activation_16_relu | ReLU | 35 × 35 × 64 | - | - | - | - |
| 46 | conv2d_14 | Convolution | 35 × 35 × 48 | [1,1,256,48] | [1,1,48] | - | - |
| 47 | conv2d_17 | Convolution | 35 × 35 × 96 | [3,3,64,96] | [1,1,96] | - | - |
| … | … | … | … | … | … | … | … |
| 307 | batch_normalization_94 | Batch Normalization | 8 × 8 × 192 | - | - | [1,1,192] | [1,1,192] |
| 308 | activation_86_relu | ReLU | 8 × 8 × 320 | - | - | - | - |
| 309 | mixed9_1 | Depth Concat | 8 × 8 × 768 | - | - | - | - |
| 310 | concatenate_2 | Depth Concat | 8 × 8 × 768 | - | - | - | - |
| 311 | activation_94_relu | ReLU | 8 × 8 × 192 | - | - | - | - |
| 312 | mixed10 | Depth Concat | 8 × 8 × 2048 | - | - | - | - |
| 313 | avg_pool | Avg Pooling | 1 × 1 × 2048 | - | - | - | - |
| 314 | predictions | Fully Connected | 1 × 1 × 1000 | 1000 × 2048 | 1000 × 1 | - | - |
| 315 | predictions_softmax | Softmax | 1 × 1 × 1000 | - | - | - | - |
| 316 | classification layer_predictions | Classification Output | - | - | - | - | - |
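The spatial sizes in the Activation column of Table A2 can be checked with the usual output-size rule for convolution and pooling layers, out = floor((n + 2p − k) / s) + 1. The sketch below traces the first layers of the network under stated assumptions: the kernel sizes of the convolutions come from the weight shapes in Table A2, whereas the strides, paddings, and the 3 × 3 pooling windows are not listed in the table and are assumed here from the standard Inception V3 stem.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


# (layer name, kernel, stride, padding)
# Strides, paddings, and pooling windows are assumptions (standard Inception V3 stem);
# they are not given in Table A2.
STEM = [
    ("conv2d_1", 3, 2, 0),
    ("conv2d_2", 3, 1, 0),
    ("conv2d_3", 3, 1, 1),
    ("max_pooling2d_1", 3, 2, 0),
    ("conv2d_4", 1, 1, 0),
    ("conv2d_5", 3, 1, 0),
    ("max_pooling2d_2", 3, 2, 0),
]


def trace(size, layers):
    """Propagate an input size through the layers and record each output size."""
    sizes = []
    for name, kernel, stride, padding in layers:
        size = out_size(size, kernel, stride, padding)
        sizes.append((name, size))
    return sizes


for name, size in trace(299, STEM):
    print(name, size)
```

Starting from the 299 × 299 input, these settings reproduce the sizes listed in the table for the same layers: 149, 147, 147, 73, 73, 71, and 35.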

References

  1. Ly, H.-B.; Le, T.-T.; Vu, H.-L.T.; Tran, V.Q.; Le, L.M.; Pham, B.T. Computational hybrid machine learning based prediction of shear capacity for steel fiber reinforced concrete beams. Sustainability 2020, 12, 2709.
  2. Cioffi, R.; Travaglioni, M.; Piscitelli, G.; Petrillo, A.; De Felice, F. Artificial intelligence and machine learning applications in smart production: Progress, trends, and directions. Sustainability 2020, 12, 492.
  3. Lin, F.; Zhang, D.; Huang, Y.; Wang, X.; Chen, X. Detection of corn and weed species by the combination of spectral, shape and textural features. Sustainability 2017, 9, 1335.
  4. Zhou, C.; Gu, Z.; Gao, Y.; Wang, J. An improved style transfer algorithm using feedforward neural network for real-time image conversion. Sustainability 2019, 11, 5673.
  5. Amini, M.H.; Arasteh, H.; Siano, P. Sustainable smart cities through the lens of complex interdependent infrastructures: Panorama and state-of-the-art. In Sustainable Interdependent Networks II; Springer: Berlin, Germany, 2019; pp. 45–68.
  6. Gupta, V.; Singh, J. Study and analysis of back-propagation approach in artificial neural network using HOG descriptor for real-time object classification. In Soft Computing: Theories and Applications; Springer: Berlin, Germany, 2019; pp. 45–52.
  7. Sharif, M.; Khan, M.A.; Rashid, M.; Yasmin, M.; Afza, F.; Tanik, U.J. Deep CNN and geometric features-based gastrointestinal tract diseases detection and classification from wireless capsule endoscopy images. J. Exp. Theor. Artif. Intell. 2019, 1–23.
  8. Rashid, M.; Khan, M.A.; Sharif, M.; Raza, M.; Sarfraz, M.M.; Afza, F. Object detection and classification: A joint selection and fusion strategy of deep convolutional neural network and SIFT point features. In Multimedia Tools and Applications; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2018; pp. 1–27.
  9. Wang, S.; Li, W.; Wang, Y.; Jiang, Y.; Jiang, S.; Zhao, R. An improved difference of gaussian filter in face recognition. J. Multimed. 2012, 7, 429–433.
  10. He, Q.; He, B.; Zhang, Y.; Fang, H. Multimedia based fast face recognition algorithm of speed up robust features. Multimed. Tools Appl. 2019, 78, 1–11.
  11. Suhas, M.; Swathi, B. Significance of haralick features in bone tumor classification using support vector machine. In Engineering Vibration, Communication and Information Processing; Springer: Berlin, Germany, 2019; pp. 349–361.
  12. Khan, M.A.; Akram, T.; Sharif, M.; Saba, T.; Javed, K.; Lali, I.U.; Tanik, U.J.; Rehman, A. Construction of saliency map and hybrid set of features for efficient segmentation and classification of skin lesion. Microsc. Res. Tech. 2019, 82, 741–763.
  13. Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. arXiv 2013, arXiv:1312.6199.
  14. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
  16. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017.
  17. Arshad, H.; Khan, M.A.; Sharif, M.I.; Yasmin, M.; Tavares, J.M.R.; Zhang, Y.D.; Satapathy, S.C. A multilevel paradigm for deep convolutional neural network features selection with an application to human gait recognition. Expert Syst. 2020, e12541.
  18. Majid, A.; Khan, M.A.; Yasmin, M.; Rehman, A.; Yousafzai, A.; Tariq, U. Classification of stomach infections: A paradigm of convolutional neural network along with classical features fusion and selection. Microsc. Res. Tech. 2020, 83, 562–576.
  19. Jiang, B.; Li, C.; Rijke, M.D.; Yao, X.; Chen, H. Probabilistic feature selection and classification vector machine. ACM Trans. Knowl. Discov. Data (TKDD) 2019, 13, 21.
  20. Xiao, X.; Qiang, Z.; Zhao, J.; Qiang, Y.; Wang, P.; Han, P. A feature extraction method for lung nodules based on a multichannel principal component analysis network (PCANet). Multimed. Tools Appl. 2019, 8, 1–19.
  21. Wen, J.; Fang, X.; Cui, J.; Fei, L.; Yan, K.; Chen, Y.; Xu, Y. Robust sparse linear discriminant analysis. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 390–403.
  22. Mwangi, B.; Tian, T.S.; Soares, J.C. A review of feature reduction techniques in neuroimaging. Neuroinformatics 2014, 12, 229–244.
  23. Khan, M.A.; Akram, T.; Sharif, M.; Shahzad, A.; Aurangzeb, K.; Alhussein, M.; Haider, S.I.; Altamrah, A. An implementation of normal distribution based segmentation and entropy controlled features selection for skin lesion detection and classification. BMC Cancer 2018, 18, 638.
  24. Afza, F.; Khan, M.A.; Sharif, M.; Rehman, A. Microscopic skin laceration segmentation and classification: A framework of statistical normal distribution and optimal feature selection. Microsc. Res. Tech. 2019, 82, 1471–1488.
  25. Gopalakrishnan, R.; Chua, Y.; Iyer, L.R. Classifying neuromorphic data using a deep learning framework for image classification. arXiv 2018, arXiv:1807.00578.
  26. Ryu, J.; Yang, M.-H.; Lim, J. DFT-based transformation invariant pooling layer for visual classification. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 84–99.
  27. Liu, Q.; Mukhopadhyay, S. Unsupervised learning using pretrained CNN and associative memory bank. arXiv 2018, arXiv:1805.01033.
  28. Li, Q.; Peng, Q.; Yan, C. Multiple VLAD encoding of CNNs for image classification. Comput. Sci. Eng. 2018, 20, 52–63.
  29. Liu, X.; Zhang, R.; Meng, Z.; Hong, R.; Liu, G. On fusing the latent deep CNN feature for image classification. World Wide Web 2019, 22, 423–436.
  30. Khan, H.A. DM-L based feature extraction and classifier ensemble for object recognition. J. Signal Inf. Process. 2018, 9, 92.
  31. Mahmood, A.; Bennamoun, M.; An, S.; Sohel, F. Resfeats: Residual network based features for image classification. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 1597–1601.
  32. Cengil, E.; Çınar, A.; Özbay, E. Image classification with caffe deep learning framework. In Proceedings of the 2017 International Conference on Computer Science and Engineering (UBMK), Antalya, Turkey, 5–7 October 2017; pp. 440–444.
  33. Zhang, C.; Huang, Q.; Tian, Q. Contextual exemplar classifier-based image representation for classification. IEEE Trans. Circuits Syst. Video Technol. 2017, 27, 1691–1699.
  34. Hussain, N.; Khan, M.A.; Sharif, M.; Khan, S.A.; Albesher, A.A.; Saba, T.; Armaghan, A. A deep neural network and classical features based scheme for objects recognition: An application for machine inspection. Multimed. Tools Appl. 2020.
  35. Khan, M.A.; Akram, T.; Sharif, M.; Javed, M.Y.; Muhammad, N.; Yasmin, M. An implementation of optimized framework for action classification using multilayers neural network on selected fused features. Pattern Anal. Appl. 2019, 22, 1377–1397.
  36. Liaqat, A.; Khan, M.A.; Shah, J.H.; Sharif, M.; Yasmin, M.; Fernandes, S.L. Automated ulcer and bleeding classification from WCE images using multiple features fusion and selection. J. Mech. Med. Biol. 2018, 18, 1850038.
  37. Rauf, H.T.; Saleem, B.A.; Lali, M.I.U.; Khan, M.A.; Sharif, M.; Bukhari, S.A.C. A citrus fruits and leaves dataset for detection and classification of citrus diseases through machine learning. Data Brief 2019, 26, 104340.
  38. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  39. Gomes, H.M.; Barddal, J.P.; Enembreck, F.; Bifet, A. A survey on ensemble learning for data stream classification. ACM Comput. Surv. (CSUR) 2017, 50, 1–36.
  40. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; University of Toronto: Toronto, ON, Canada, 2009.
  41. Fei-Fei, L.; Fergus, R.; Perona, P. One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 594–611.
  42. Lazebnik, S.; Schmid, C.; Ponce, J. A maximum entropy framework for part-based texture and object recognition. In Proceedings of the ICCV 2005 Tenth IEEE International Conference on Computer Vision, Beijing, China, 17–20 October 2005; pp. 832–838.
  43. Lazebnik, S.; Schmid, C.; Ponce, J. Semi-local affine parts for object recognition. In Proceedings of the British Machine Vision Conference (BMVC’04), Kingston, UK, 7–9 September 2004; pp. 779–788.
  44. Ma, B.; Li, X.; Xia, Y.; Zhang, Y. Autonomous deep learning: A genetic DCNN designer for image classification. Neurocomputing 2020, 379, 152–161.
  45. Alom, M.Z.; Hasan, M.; Yakopcic, C.; Taha, T.M.; Asari, V.K. Improved inception-residual convolutional neural network for object recognition. Neural Comput. Appl. 2018, 32, 1–15.
Figure 1. Proposed deep learning architecture for object classification.
Figure 2. Modified VGG-19 architecture for features extraction.
Figure 3. Modified Inceptionv3 architecture for features extraction.
Figure 4. Proposed system’s predicted labeled output for the Caltech-101 dataset.
Figure 5. Proposed system’s predicted labeled output for the Birds dataset.
Figure 6. Proposed system’s predicted labeled output for the Butterflies dataset.
Figure 7. Confusion matrix of the proposed selection accuracy on ESD classifier.
Figure 8. Confusion matrix for Birds dataset using proposed selection method on ESD classifier.
Figure 9. Confusion matrix for Butterflies dataset.
Figure 10. Confusion matrix of CIFAR-100 dataset for proposed P-Selection method.
Figure 11. Statistical analysis of the ESD classifier using all three methods, where (a) represents the M1 method, (b) the P-Fusion method, and (c) the P-Selection method.
Table 1. Numerical description of selected datasets.

| Image Database | Sample Classes | Total Samples | Samples per Class (Min~Max) |
| --- | --- | --- | --- |
| Caltech [41] | 101 | 9144 | 31~800 |
| Birds [42] | 6 | 600 | 100~100 |
| Butterflies [43] | 7 | 619 | 42~134 |
| CIFAR-100 [40] | 100 | 10,000 (Testing); 50,000 (Training) | 100 |
Table 2. Proposed classification results using the Caltech-101 dataset. M1 represents simple serial-based fusion and classification, P-Fusion represents the proposed fusion approach, and P-Selection represents the proposed selection method results. Here, ESD denotes ensemble subspace discriminant, LDA linear discriminant analysis, L-SVM linear support vector machine, Q-SVM quadratic SVM, and Co-KNN cosine K-nearest neighbor.

| Classifier | M1 | P-Fusion | P-Selection | Accuracy (%) | FNR (%) | Time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| ESD | ✓ | - | - | 79.0 | 21.0 | 180.00 |
| ESD | - | ✓ | - | 90.8 | 9.2 | 93.70 |
| ESD | - | - | ✓ | 95.5 | 4.5 | 47.00 |
| ES-KNN | ✓ | - | - | 75.8 | 24.2 | 665.80 |
| ES-KNN | - | ✓ | - | 80.1 | 19.9 | 286.45 |
| ES-KNN | - | - | ✓ | 85.3 | 14.7 | 191.27 |
| LDA | ✓ | - | - | 75.0 | 25.0 | 597.84 |
| LDA | - | ✓ | - | 81.8 | 18.2 | 127.83 |
| LDA | - | - | ✓ | 94.4 | 5.5 | 106.57 |
| L-SVM | ✓ | - | - | 76.0 | 24.0 | 9723.70 |
| L-SVM | - | ✓ | - | 88.0 | 12.0 | 3154.70 |
| L-SVM | - | - | ✓ | 91.6 | 8.6 | 2045.00 |
| Q-SVM | ✓ | - | - | 77.2 | 22.8 | 1896.00 |
| Q-SVM | - | ✓ | - | 87.6 | 12.4 | 1341.00 |
| Q-SVM | - | - | ✓ | 92.0 | 8.0 | 753.57 |
| Cu-SVM | ✓ | - | - | 77.9 | 22.1 | 7493.00 |
| Cu-SVM | - | ✓ | - | 87.7 | 12.3 | 3647.70 |
| Cu-SVM | - | - | ✓ | 92.3 | 7.7 | 1889.50 |
| F-KNN | ✓ | - | - | 75.7 | 24.3 | 152.06 |
| F-KNN | - | ✓ | - | 84.9 | 15.1 | 96.96 |
| F-KNN | - | - | ✓ | 89.9 | 10.1 | 71.57 |
| M-KNN | ✓ | - | - | 74.8 | 25.2 | 57.95 |
| M-KNN | - | ✓ | - | 84.5 | 15.5 | 47.44 |
| M-KNN | - | - | ✓ | 89.6 | 10.4 | 33.90 |
| W-KNN | ✓ | - | - | 76.8 | 23.2 | 228.19 |
| W-KNN | - | ✓ | - | 85.7 | 14.3 | 187.50 |
| W-KNN | - | - | ✓ | 90.5 | 9.5 | 105.87 |
| Co-KNN | ✓ | - | - | 52.4 | 21.0 | 61.35 |
| Co-KNN | - | ✓ | - | 87.6 | 12.4 | 48.76 |
| Co-KNN | - | - | ✓ | 92.8 | 7.2 | 23.83 |
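Table 3. Proposed classification results using the Birds dataset. M1 represents simple serial-based fusion and classification, P-Fusion represents the proposed fusion approach, and P-Selection represents the proposed selection method results.

| Classifier | M1 | P-Fusion | P-Selection | Accuracy (%) | FNR (%) | Time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| ESD | ✓ | - | - | 99.0 | 1.0 | 85.09 |
| ESD | - | ✓ | - | 99.5 | 0.5 | 68.31 |
| ESD | - | - | ✓ | 100.0 | 0.0 | 42.45 |
| E-S-KNN | ✓ | - | - | 96.7 | 3.3 | 45.09 |
| E-S-KNN | - | ✓ | - | 97.6 | 2.4 | 38.31 |
| E-S-KNN | - | - | ✓ | 97.4 | 2.6 | 25.54 |
| LD | ✓ | - | - | 98.0 | 2.0 | 48.39 |
| LD | - | ✓ | - | 99.0 | 1.0 | 31.11 |
| LD | - | - | ✓ | 100.0 | 0.0 | 23.92 |
| L-SVM | ✓ | - | - | 97.9 | 2.1 | 45.36 |
| L-SVM | - | ✓ | - | 99.0 | 0.5 | 20.00 |
| L-SVM | - | - | ✓ | 100.0 | 0.0 | 17.66 |
| Q-SVM | ✓ | - | - | 84.5 | 1.0 | 51.03 |
| Q-SVM | - | ✓ | - | 99.3 | 0.7 | 24.06 |
| Q-SVM | - | - | ✓ | 100.0 | 0.0 | 15.25 |
| Cub-SVM | ✓ | - | - | 99.0 | 1.0 | 54.59 |
| Cub-SVM | - | ✓ | - | 99.5 | 0.5 | 43.32 |
| Cub-SVM | - | - | ✓ | 100.0 | 0.0 | 21.29 |
| F-KNN | ✓ | - | - | 96.2 | 3.8 | 41.47 |
| F-KNN | - | ✓ | - | 97.4 | 2.6 | 19.58 |
| F-KNN | - | - | ✓ | 99.5 | 0.5 | 14.89 |
| M-KNN | ✓ | - | - | 97.6 | 2.4 | 32.30 |
| M-KNN | - | ✓ | - | 98.8 | 1.2 | 17.31 |
| M-KNN | - | - | ✓ | 100.0 | 0.0 | 15.82 |
| W-KNN | ✓ | - | - | 97.9 | 2.1 | 23.96 |
| W-KNN | - | ✓ | - | 99.3 | 0.7 | 13.10 |
| W-KNN | - | - | ✓ | 100.0 | 0.0 | 9.16 |
| Cos-KNN | ✓ | - | - | 95.7 | 4.3 | 31.08 |
| Cos-KNN | - | ✓ | - | 99.0 | 1.0 | 22.00 |
| Cos-KNN | - | - | ✓ | 99.8 | 0.2 | 16.11 |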
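Table 4. Proposed classification results using the Butterflies dataset. M1 represents simple serial-based fusion and classification, P-Fusion represents the proposed fusion approach, and P-Selection represents the proposed selection method results.

| Classifier | M1 | P-Fusion | P-Selection | Accuracy (%) | FNR (%) | Time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| ESD | ✓ | - | - | 95.1 | 4.9 | 46.05 |
| ESD | - | ✓ | - | 95.6 | 4.4 | 31.95 |
| ESD | - | - | ✓ | 98.0 | 2.0 | 19.53 |
| E-S-KNN | ✓ | - | - | 85.7 | 14.3 | 28.56 |
| E-S-KNN | - | ✓ | - | 87.7 | 12.3 | 18.27 |
| E-S-KNN | - | - | ✓ | 88.7 | 11.3 | 13.08 |
| LD | ✓ | - | - | 70.9 | 29.1 | 48.44 |
| LD | - | ✓ | - | 94.1 | 4.6 | 22.42 |
| LD | - | - | ✓ | 96.6 | 3.4 | 17.01 |
| L-SVM | ✓ | - | - | 91.6 | 8.4 | 40.02 |
| L-SVM | - | ✓ | - | 94.6 | 5.4 | 29.65 |
| L-SVM | - | - | ✓ | 96.6 | 3.4 | 16.72 |
| Q-SVM | ✓ | - | - | 94.1 | 5.9 | 39.46 |
| Q-SVM | - | ✓ | - | 94.1 | 5.9 | 24.58 |
| Q-SVM | - | - | ✓ | 96.6 | 3.4 | 18.80 |
| Cub-SVM | ✓ | - | - | 90.6 | 4.9 | 44.23 |
| Cub-SVM | - | ✓ | - | 93.6 | 6.4 | 29.41 |
| Cub-SVM | - | - | ✓ | 97.0 | 3.0 | 21.51 |
| F-KNN | ✓ | - | - | 85.7 | 14.3 | 30.82 |
| F-KNN | - | ✓ | - | 89.2 | 10.8 | 18.70 |
| F-KNN | - | - | ✓ | 94.1 | 5.9 | 13.79 |
| M-KNN | ✓ | - | - | 82.3 | 19.7 | 29.29 |
| M-KNN | - | ✓ | - | 85.2 | 14.8 | 18.30 |
| M-KNN | - | - | ✓ | 92.1 | 7.9 | 10.83 |
| W-KNN | ✓ | - | - | 85.2 | 14.8 | 15.06 |
| W-KNN | - | ✓ | - | 87.2 | 12.8 | 14.26 |
| W-KNN | - | - | ✓ | 94.6 | 5.4 | 10.12 |
| Cos-KNN | ✓ | - | - | 81.8 | 18.2 | 16.02 |
| Cos-KNN | - | ✓ | - | 85.7 | 14.3 | 14.54 |
| Cos-KNN | - | - | ✓ | 94.1 | 5.9 | 10.55 |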
Table 5. Proposed training results on CIFAR-100 dataset.

| Classifier | M1 | P-Fusion | P-Selection | Accuracy (%) | FNR (%) | Time (min) |
| --- | --- | --- | --- | --- | --- | --- |
| ESD | ✓ | - | - | 51.34 | 48.66 | 608 |
| ESD | - | ✓ | - | 63.97 | 36.03 | 524 |
| ESD | - | - | ✓ | 69.76 | 30.24 | 374 |
Table 6. Proposed testing results on CIFAR-100 dataset.

| Classifier | M1 | P-Fusion | P-Selection | Accuracy (%) | FNR (%) | Time (min) |
| --- | --- | --- | --- | --- | --- | --- |
| ESD | ✓ | - | - | 47.84 | 52.16 | 258 |
| ESD | - | ✓ | - | 62.34 | 37.66 | 204 |
| ESD | - | - | ✓ | 68.80 | 31.20 | 111 |
Table 7. Classification results on Caltech-101 dataset using different neural nets.

| Method | P-Fusion | P-Selection | Accuracy (%) | FNR (%) |
| --- | --- | --- | --- | --- |
| AlexNet | ✓ | - | 86.70 | 13.30 |
| AlexNet | - | ✓ | 90.24 | 9.76 |
| Vgg16 | ✓ | - | 85.16 | 14.84 |
| Vgg16 | - | ✓ | 89.24 | 10.76 |
| ResNet50 | ✓ | - | 88.57 | 11.43 |
| ResNet50 | - | ✓ | 92.36 | 7.64 |
| ResNet101 | ✓ | - | 89.96 | 10.04 |
| ResNet101 | - | ✓ | 92.83 | 7.17 |
| Proposed | ✓ | - | 90.80 | 9.20 |
| Proposed | - | ✓ | 95.50 | 4.50 |
Table 8. Classification results on CIFAR-100 dataset using different neural nets.

| Method | P-Fusion | P-Selection | Accuracy (%) | FNR (%) |
| --- | --- | --- | --- | --- |
| AlexNet | ✓ | - | 61.29 | 38.71 |
| AlexNet | - | ✓ | 65.82 | 34.18 |
| Vgg16 | ✓ | - | 60.90 | 39.10 |
| Vgg16 | - | ✓ | 64.06 | 35.94 |
| ResNet50 | ✓ | - | 61.82 | 38.18 |
| ResNet50 | - | ✓ | 65.71 | 34.29 |
| ResNet101 | ✓ | - | 61.98 | 38.02 |
| ResNet101 | - | ✓ | 66.25 | 33.75 |
| Proposed | ✓ | - | 62.34 | 37.66 |
| Proposed | - | ✓ | 68.80 | 31.20 |
Table 9. Comparison of proposed accuracy with recent techniques. MLFFS = Multi-Layers Features Fusion and Selection.

| Reference | Technique | Dataset | Accuracy (%) |
| --- | --- | --- | --- |
| Roshan et al. [25] | Fine-tuning on top layers | Caltech-101 | 91.66 |
| Jongbin et al. [26] | Discrete Fourier transform | Caltech-101 | 93.60 |
| Qun et al. [27] | Memory banks-based unsupervised learning | Caltech-101 | 91.00 |
| Qing et al. [28] | PCA-based reduction on fused features | Caltech-101 | 92.54 |
| Xueliang et al. [29] | A fusion of mid-level layers-based features | Caltech-101 | 92.20 |
| Rashid et al. [8] | Fusion of SIFT and CNN features | Caltech-101 | 89.70 |
| Svetlana [43] | Local affine parts-based approach | Butterflies | 90.40 |
| Ma et al. [44] | Genetic CNN designer approach (70:30) | CIFAR-100 | 66.77 |
| Alom et al. [45] | IRRCNN (70:30) | CIFAR-100 | 72.78 |
| Alom et al. [45] | IRCNN (70:30) | CIFAR-100 | 71.76 |
| Alom et al. [45] | EIN (70:30) | CIFAR-100 | 68.29 |
| Alom et al. [45] | EIRN (70:30) | CIFAR-100 | 69.22 |
| Proposed | MLFFS | Butterflies | 98.00 |
| Proposed | MLFFS | Birds | 100.00 |
| Proposed | MLFFS | Caltech-101 | 95.50 |
| Proposed | MLFFS (50:50) | CIFAR-100 | 65.46 |
| Proposed | MLFFS (60:40) | CIFAR-100 | 68.80 |
| Proposed | MLFFS (70:30) | CIFAR-100 | 73.16 |
| Proposed | MLFFS (80:20) | CIFAR-100 | 77.28 |

Share and Cite

MDPI and ACS Style

Rashid, M.; Khan, M.A.; Alhaisoni, M.; Wang, S.-H.; Naqvi, S.R.; Rehman, A.; Saba, T. A Sustainable Deep Learning Framework for Object Recognition Using Multi-Layers Deep Features Fusion and Selection. Sustainability 2020, 12, 5037. https://doi.org/10.3390/su12125037
