1. Introduction
The effect of pattern classification is mainly determined by the classification features and the classifier. For an image classification task, the model must extract image features that are easy to distinguish and then pair them with a suitable classifier. Traditional hand-crafted feature extraction methods have achieved good results, such as the local binary pattern (LBP) [1,2], speeded-up robust features (SURF) [3,4] and the scale-invariant feature transform (SIFT) [5,6]. However, these are single features without any hierarchical structure, and they sometimes lack good generalization ability. In recent years, the convolutional neural network (CNN) has attracted extensive attention from scholars in many fields [7,8,9,10], because CNNs can extract both shallow and deep features of images.
Building on the multilayer perceptron (MLP) training algorithm, LeCun et al. [11] designed a convolutional neural network called LeNet-5, an early classic CNN model that was mainly used for handwritten digit recognition on checks. As the number of convolutional layers gradually increases, a CNN's feature extraction and classification abilities become stronger. Similarly, graph convolutional networks (Graph CNNs) have good feature extraction ability [12,13,14], because a Graph CNN is also effective for non-Euclidean data, although it is relatively more complex; in this paper, we only discuss CNNs. Krizhevsky et al. [15] proposed the deep convolutional neural network AlexNet, which used the ReLU activation function instead of the saturating nonlinearity tanh and applied the dropout technique to avoid overfitting during training. In 2014, Google proposed GoogLeNet [16], a convolutional neural network with more than 20 layers; to mitigate the vanishing gradient problem, two auxiliary loss functions were placed at different depths of the network. The VGG [17] network model proposed by Oxford University adopted a layer-by-layer training method, and the trained VGG network showed a good ability to extract image features. The residual network ResNet proposed in [18] allows the original input to be passed directly to subsequent layers through shortcut connections; ResNet can train hundreds or even thousands of layers while achieving good classification accuracy. In general, the above convolutional neural networks have achieved good results in different scenarios, mainly because the convolutional layers of a CNN can extract abstract features with good separability. However, because the distribution of these abstract features is unpredictable, designing a suitable classifier for them is a problem worthy of further study.
The above CNNs can extract all kinds of image features that are useful for image classification, but the MLP classifiers in CNNs sometimes fall into local optima [19,20,21,22,23,24,25,26,27]. To solve this problem, Niu et al. [19] proposed using a pretrained CNN as a feature extractor to obtain the hierarchical features of an image and then using an SVM as the classifier. This method originally focused on binary classification; when applied to multiclass problems, it relied on voting among multiple binary classifiers. If a sample receives the same highest number of votes in two different categories, the SVM may assign the sample to either category, which can lead to classification errors. Duan et al. [20] proposed using a pretrained CNN as a feature extractor for facial features and an extreme learning machine (ELM) as the classifier; however, the number of hidden nodes generally has to be reset when ELM classifiers are used on different datasets. Tien et al. [26] merged deep features and hand-crafted features into new hybrid features and then used an SVM as the classifier; the hybrid features combined the characteristics of both feature types and had high separability. Guo et al. [27] proposed a multinetwork feature fusion structure that classifies with merged features, which carry more information than a single feature and thus benefit classification tasks. Although these studies have shown that convolutional neural networks can extract image features well and classify well, training a deep convolutional neural network takes a long time and requires a large quantity of data, and the network must be retrained whenever a new dataset needs to be classified.
The above literature review shows that a convolutional layer acting as a CFM, fully trained on a multicategory source dataset, can extract image features that are beneficial to classification, but no studies have examined the relationship between the generalization ability of the CFM-extracted features and the numbers of source and target dataset categories. In this paper, a CNN trained on a multiclass source dataset does not need retraining: it directly extracts image features for new target tasks and is then cascaded with an appropriate classifier. This not only exploits the ability of the CNN convolutional layers to extract image features but also makes use of the advantages of different classifiers.
The main contributions of this paper include the following three points. First, the general rule relating the generalization ability of the CNN convolutional layers, used as a CFM, to the numbers of categories in the source and target datasets is summarized. Second, the interpretability of the generalization ability of the CFM is given. Third, a new GHP-CNN is proposed to solve the target image classification problem. The GHP-CNN mainly comprises a CFM that extracts the features of the target set and a classifier called the structure-optimized probabilistic neural network (SOPNN). The SOPNN classifier is not restricted by the distribution of the target features. The GHP-CNN avoids the tendency of MLP classifiers in CNNs to fall into local optima and improves the classification ability.
The rest of this paper is organized as follows. Section 2 studies the generalization ability of the CFM in extracting target image features. Section 3 designs a structure-optimized probabilistic neural network classifier. Section 4 proposes the GHP-CNN and analyzes its principle and characteristics. Section 5 presents the analytical and experimental results, and the last section summarizes the conclusions of the study. Some important symbols used in this article are listed in Table 1.
2. Research on the Generalization Ability of the CNN Feature Mapper (CFM) to Extract Image Features
In this paper, the dataset $D_s$ used for training the CNN is called the source dataset, and the dataset $D_t$ used for testing the CFM's generalization ability is called the target dataset. The number of sample categories in $D_s$ is $s$, and the number of sample categories in $D_t$ is $t$. In general, the sample categories in the source dataset $D_s$ and the target dataset $D_t$ are not the same. This paper comprehensively studies the relationship between category reduction, unknown category expansion and the generalization ability of CNNs in extracting target image features. Here, category reduction refers to the source dataset $D_s$ containing the target dataset $D_t$ ($D_t \subset D_s$). An unknown category extension means that no sample category is common to the source and target datasets ($D_s \cap D_t = \varnothing$). The network used in this section was VGG-16 [17] (stride = 1, padding = 1, max pooling), and the dataset used was CIFAR-100 [28].
The convolutional neural network trained on the source dataset $D_s$ can be divided into two parts: the convolutional layers used to extract features and the MLP layers that act as the classifier. The network is fully trained on the source dataset until it converges, and the convolutional layers are then migrated as the CFM of the target set. As shown in Figure 1, after the source dataset $D_s$ is trained by the convolutional network and the MLP layer, the corresponding classification accuracy $A_s$ is obtained; the feature mapper corresponding to the $s$-class source dataset $D_s$ is denoted $\mathrm{CFM}_s$. The $t$-class target dataset $D_t$ is passed through $\mathrm{CFM}_s$ to extract features, which are then used to train an MLP classifier, and the corresponding accuracy $\bar{A}_{s,t}$ is obtained.
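For concreteness, the following minimal PyTorch sketch illustrates this split between a frozen $\mathrm{CFM}_s$ and a trainable classifier head. The helper name make_cfm, the hidden width 256 and the omitted training loop are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of the CFM_s/classifier split described above.
import torch
import torch.nn as nn
from torchvision.models import vgg16

def make_cfm(num_source_classes: int) -> nn.Module:
    """Train VGG-16 on the s-class source set (loop omitted), then keep
    only the convolutional layers as the frozen feature mapper CFM_s."""
    net = vgg16(num_classes=num_source_classes)
    # ... fully train `net` on the source dataset D_s until convergence ...
    cfm = nn.Sequential(net.features, nn.Flatten())
    for p in cfm.parameters():
        p.requires_grad = False          # CFM_s is migrated, not retrained
    return cfm

cfm = make_cfm(num_source_classes=50)    # e.g., an s = 50 source dataset
# For 32x32 CIFAR images, VGG-16's conv stack outputs 512 x 1 x 1 features.
mlp = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

x = torch.randn(8, 3, 32, 32)            # a batch of t-class target images
with torch.no_grad():
    feats = cfm(x)                        # features extracted by CFM_s
logits = mlp(feats)                       # only the MLP classifier is trained
```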
To cover the different affiliations between the source dataset $D_s$ and the target dataset $D_t$, and to make the experiment statistically valid, when training multiple $\mathrm{CFM}_s$, $s$ different categories of the source dataset and $t$ categories of the target dataset were selected, where the samples of each of the $s$ and $t$ categories were uniformly distributed in the CIFAR-100 dataset. Multiple source datasets $D_s$ with different values of $s$ were selected; for each $s$, multiple target datasets $D_t$ with different values of $t$ were selected.
For each $t$-class target dataset, there is an average classification accuracy $\bar{A}_{s,t}$, together with the classification accuracy $A_s$ of the $s$-class source dataset corresponding to the $\mathrm{CFM}_s$. Comparing the magnitudes of $\bar{A}_{s,t}$ and $A_s$ shows the generalization ability of the $\mathrm{CFM}_s$ on the $t$-class target dataset.
The size of $\bar{A}_{s,t} - A_s$ measures the strength of the $\mathrm{CFM}_s$'s generalization ability: the greater the value of $\bar{A}_{s,t} - A_s$, the stronger the generalization ability of the $\mathrm{CFM}_s$. When $\bar{A}_{s,t} > A_s$, the $\mathrm{CFM}_s$ can extract features with better separability for the $t$-class targets and can obtain a higher accuracy than $A_s$ under subsequent classifiers; that is, a $\mathrm{CFM}_s$ trained on the $s$-class source dataset has a good generalization ability for a $t$-class target dataset $D_t$. When $\bar{A}_{s,t} < A_s$, the $\mathrm{CFM}_s$ fails to extract features with better separability from the $t$-class target dataset, which results in a lower accuracy than the original $A_s$ under subsequent classifiers. The corresponding classification accuracies can be obtained by changing the membership relationship between $D_s$ and $D_t$, and the generalization ability of the $\mathrm{CFM}_s$ under different conditions is then reflected by the increase or decrease in classification accuracy relative to the original accuracy.
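Stated compactly, with $\Delta A_{s,t}$ an auxiliary symbol introduced here only for readability:

$$
\Delta A_{s,t} = \bar{A}_{s,t} - A_s,
\qquad
\begin{cases}
\Delta A_{s,t} > 0, & \mathrm{CFM}_s \text{ generalizes well to } D_t,\\
\Delta A_{s,t} < 0, & \mathrm{CFM}_s \text{ generalizes poorly to } D_t.
\end{cases}
$$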
2.1. Analysis of the Generalization Ability of the Feature Mapper to the Target Set after Category Reduction ($D_t \subset D_s$)
When $D_t \subset D_s$, the $\mathrm{CFM}_s$ trained on the multiclass source dataset $D_s$ extracts the features of the target dataset $D_t$, and the classification is trained with an MLP classifier. In the CIFAR-100 dataset, we selected source datasets $D_s$ with different numbers of categories $s$ and fully trained them with VGG-16, obtaining the different original accuracies $A_s$ and the corresponding 15 feature mappers $\mathrm{CFM}_s$; we then randomly extracted target datasets $D_t$ with $t$ categories from the source dataset $D_s$. The features of these target datasets were extracted by the $\mathrm{CFM}_s$, and the classification was then trained with the MLP classifier. To make the results more universal, the target datasets for each category count were randomly extracted eight times, the classification accuracy was recorded each time, and the final result was the average of the multiple classification results, where $t$ represents the number of categories in the target dataset $D_t$ and $s$ represents the number of categories in the source dataset $D_s$.
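A short Python sketch of this sampling protocol follows; train_and_eval_mlp is a hypothetical helper that trains the MLP on $\mathrm{CFM}_s$ features for the sampled classes and returns the test accuracy.

```python
# A sketch of the category-reduction protocol (D_t subset of D_s): for each
# trained CFM_s, draw t of the s source classes eight times and average the
# accuracy of an MLP trained on the extracted features.
# `train_and_eval_mlp` is a hypothetical helper, not from the paper.
import random

def avg_accuracy_reduction(source_classes, t, cfm, num_draws=8):
    accs = []
    for _ in range(num_draws):
        target_classes = random.sample(source_classes, t)  # D_t from D_s
        accs.append(train_and_eval_mlp(cfm, target_classes))
    return sum(accs) / num_draws
```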
As shown in Figure 2, for a target dataset $D_t$ with a fixed number of categories $t$, the value of $\bar{A}_{s,t}$ was positively correlated with $s$: as the $s$ corresponding to the $\mathrm{CFM}_s$ increased, $\bar{A}_{s,t}$ increased monotonically. That is, for a given target dataset with $t$ categories, the larger the value of $s$, the better the generalization ability of the $\mathrm{CFM}_s$ trained on the source dataset $D_s$. For a $\mathrm{CFM}_s$ trained on a source dataset $D_s$ with a fixed $s$, the value of $\bar{A}_{s,t}$ was negatively correlated with the number of categories $t$ in the target dataset; $\bar{A}_{s,t}$ decreased monotonically as $t$ increased. That is, for a $\mathrm{CFM}_s$ with a fixed number of source categories, the generalization ability was better for a target dataset $D_t$ with a smaller $t$ than for one with a larger $t$. Moreover, when $\bar{A}_{s,t} - A_s$ was fixed, that is, when the magnitude of the classification improvement after feature extraction by the $\mathrm{CFM}_s$ was the same, the number of source categories $s$ required differed for different values of $t$.
2.2. Analysis of the Generalization Ability of the Feature Mapper to Unknown Target Categories ($D_s \cap D_t = \varnothing$)
When $D_s \cap D_t = \varnothing$, the $\mathrm{CFM}_s$ trained on a multiclass source dataset $D_s$ extracts the features of the target dataset $D_t$, and the features are used to train the MLP classifier. In the CIFAR-100 dataset, 90 categories were randomly selected as the candidate source dataset, and the remaining 10 categories were used as the candidate target dataset. Sample sets with different numbers of categories $s$ were selected from the 90 candidate source categories and fully trained with VGG-16, yielding the best classification results and the corresponding 14 feature mappers $\mathrm{CFM}_s$. Unknown target datasets $D_t$ with different numbers of categories $t$ were then randomly selected from the 10 candidate categories, and their features were extracted by the CFM and used to train the MLP classifier. To make the results more universal, each number of unknown sample subsets was extracted three times, the classification results were recorded each time, and the final result was the average of the three results.
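The disjoint split can be sketched in the same way; again, train_and_eval_mlp is an assumed helper, and the class partition below mirrors the 90/10 split described above.

```python
# A sketch of the unknown-category protocol (D_s and D_t disjoint) on
# CIFAR-100: 90 classes form the candidate source pool and the remaining 10
# the candidate targets, so source and target categories never overlap.
# `train_and_eval_mlp` is again a hypothetical helper.
import random

all_classes = list(range(100))           # CIFAR-100 class indices
random.shuffle(all_classes)
candidate_source = all_classes[:90]      # classes available for D_s
candidate_target = all_classes[90:]      # classes reserved for D_t

def avg_accuracy_unknown(cfm, t, num_draws=3):
    accs = []
    for _ in range(num_draws):
        target_classes = random.sample(candidate_target, t)  # disjoint from D_s
        accs.append(train_and_eval_mlp(cfm, target_classes))
    return sum(accs) / num_draws
```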
As shown in Figure 3, for a target dataset $D_t$ with a fixed number of categories $t$, the value of $\bar{A}_{s,t}$ was positively correlated with $s$: as the $s$ corresponding to the $\mathrm{CFM}_s$ increased, $\bar{A}_{s,t}$ increased monotonically. That is, for a given $t$-category target dataset, the larger $s$ was, the better the generalization ability of the $\mathrm{CFM}_s$ trained on the source dataset $D_s$. For a $\mathrm{CFM}_s$ trained on an $s$-category source dataset $D_s$, the value of $\bar{A}_{s,t}$ was negatively correlated with the number of categories $t$ in the target dataset $D_t$; as $t$ increased, $\bar{A}_{s,t}$ decreased monotonically. That is, for a $\mathrm{CFM}_s$ with a fixed number of source categories, the generalization ability was better for a target dataset with a smaller $t$. Again, when $\bar{A}_{s,t} - A_s$ was fixed, that is, when the magnitude of the classification improvement after feature extraction by the $\mathrm{CFM}_s$ was the same for different $t$-category target datasets, the required number of source categories $s$ differed.
2.3. Analysis of the CFM Generalization Ability
Through the above experiments, which approached the question from different angles, the generalization ability of the CFM was analyzed extensively. The analysis shows that the generalization ability of a trained CNN used as a CFM is related to the number of categories in the source dataset ($s$): the more categories the CFM was trained on, the better its generalization ability and the closer it was to having global optimum characteristics.
In the above experiments, the features extracted by the CFM were input to the MLP for further training, which can be regarded as fine-tuning the MLP classifier after parameter migration. Although the network with the fine-tuned MLP classifier also achieved good classification accuracy, it might still fall into a local optimum, and its internal function mapping relationship is difficult to explain. Since the image features extracted by the $\mathrm{CFM}_s$ are generally abstract, their distribution is unknown. A classifier based on the Bayesian probability density is not restricted by the distribution of the target features, and its final classification surface is close to the optimal classification surface under the Bayesian criterion. We design such an optimized Bayesian classifier in the next section.
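For intuition about such density-based classifiers, the following sketch shows a classical Parzen-window (PNN-style) decision rule. It is not the SOPNN of Section 3, only the underlying Bayesian-density idea, and the smoothing parameter sigma is an illustrative assumption.

```python
# A minimal Parzen-window (classical PNN-style) classifier: it estimates each
# class-conditional density nonparametrically with Gaussian kernels, so no
# parametric form of the feature distribution is assumed. This is NOT the
# SOPNN of Section 3; sigma is an illustrative smoothing parameter.
import numpy as np

def pnn_predict(x, train_feats, train_labels, sigma=1.0):
    """Assign x to the class whose Parzen density estimate at x is largest."""
    scores = {}
    for c in np.unique(train_labels):
        diffs = train_feats[train_labels == c] - x   # kernels centered on class-c samples
        d2 = np.sum(diffs ** 2, axis=1)
        scores[c] = np.mean(np.exp(-d2 / (2 * sigma ** 2)))
    return max(scores, key=scores.get)
```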