1. Introduction
Plants such as soybeans, maples, grasses and vegetables can be seen everywhere in our daily life. We obtain abundant resources from plants, for instance, vitamins, energy, medicines, protein, fiber and oxygen. It is estimated that there are about 400,000 plant species [1] in the world, and a certain percentage of them are still not adequately recognized. According to recent research [2], human beings cause more than two plant species to disappear from the earth every year on average. Hence, it is urgent to study and protect plants, which benefits plant rediscovery, rare plant conservation and environmental protection. The most critical step in protecting plants is identifying them. Additionally, once accurate plant recognition software is available on our mobile phones [3,4], we can recognize and learn more about plants anytime and anywhere; this could further raise public awareness of plant protection. Consequently, plant recognition [5,6] is a significant research topic.
It is noteworthy that the most popular trait used for plant recognition is the leaf; however, many works use other traits for plant identification and perception. Kritsis established a new dataset for Greek vascular plant recognition [7], in which the image traits include leaf, flower, fruit and stem, and the plant images are collected from trees, herbs and ferns. Xu constructed a minirhizotron image dataset for plant root architecture understanding [8]. Because the leaf is the most basic and significant appearance feature of plants, we follow common plant recognition works [5] and use leaf images as our research object.
According to our summary and analysis, the difficulty of plant recognition comes from three aspects. Firstly, biological categorization is hierarchical from coarse to fine (kingdom, phylum, class, order, family, genus and species), so leaves of species from the same genus share a similar appearance and their visual differences are small. Secondly, plant leaf images are often degraded by viewpoint, illumination, occlusion, resolution and other factors in the collection and imaging process; as a result, the similarity between images of the same species decreases and the similarity between images of different species increases. Thirdly, the semantic gap between the visual content of a leaf image and the corresponding category label is huge for computers. These difficulties render plant recognition an open and challenging research direction. How to learn an effective plant leaf representation is a long-standing pattern recognition problem.
Extracting discriminative features from leaf images has been recognized as an indispensable way to reduce the semantic gap. Over the past two decades, considerable efforts have been dedicated to learning effective leaf representations. In the literature, the existing plant recognition methods can be grouped into two categories: handcrafted feature methods and learning feature methods. In the early stage, researchers mainly designed plant features manually from the raw pixels, shape and texture of leaf images. Representative handcrafted methods are height description [3], shape context [9,10], local binary pattern and texture [11], triangle-based representation [12,13] and Fourier description [14]. Such features have the advantage of simple calculation and perform well on ordinary plant datasets collected under controlled conditions; however, their accuracy decreases dramatically on plant leaf datasets collected in uncontrolled, wild environments.
Learning features in a data-driven fashion is gradually becoming the mainstream way of improving feature generalization ability. Wang et al. proposed bag of fragment to discern middle-level shape features in the bag of visual words framework [15]. Zeng et al. presented a robust plant leaf identification method via locality constrained dictionary learning and sparse representation [16].
In recent years, with the renaissance of artificial intelligence technology, deep convolutional neural networks (CNNs) have become one of the most important and popular technologies for computer vision tasks. CNN models have also been introduced in plant recognition [17,18,19,20], achieving great progress in classification and retrieval accuracy. It should be pointed out that the obvious disadvantage of neural networks is that they require a large amount of training data, a powerful computing platform and sophisticated training skills.
1.1. Motivations
Deep neural networks (DNNs) have been recognized as a feasible way to overcome the drawbacks of handcrafted methods and enhance feature discrimination. However, when the number of samples in an image dataset is limited (for instance, there are only 1125 and 1907 leaf images in the Swedish and Flavia datasets, respectively), training a deep convolutional network from scratch is not a preferred choice, because overfitting is very likely to occur. Fortunately, transfer learning offers a way to mitigate overfitting; however, the retraining or fine-tuning (FT) process also incurs a large computational burden and requires considerable transfer experience and tricks.
Moreover, the past four years have witnessed the prosperity of the attention mechanism [21] in computer vision, which learns an attention map, that is, a weight matrix or tensor, to reweight the feature map, aiming to improve the model's capability. The squeeze-and-excitation network learns global channel attention weights to obtain more informative features [22] and has become a plug-and-play module. We can observe that reweighting the features in a feature map via attention weights may be a plausible way to enhance discrimination ability.
In addition, different CNNs learn different features, and as the network goes deeper, the level of the learned features gradually rises from low-level to mid-level to high-level. Generally speaking, features at different levels contain complementary elements; therefore, fusing them could improve the representation power of leaf image features.
Motivated by the analysis above, in this paper, we propose a novel method to learn features for plant leaf images, where the feature maps from multiple pretrained CNNs and multiple layers are adopted directly without parameter fine-tuning, and spatial weighting and channel weighting are used to recalibrate the features without any parameter training. The feature distribution in two-dimensional space for 600 leaves from 9 species of the same genus Acer is shown in Figure 1. Although these species belong to the same genus and have a similar appearance, the features learned via our method for leaves from the same species are clustered closely, and the learned features for leaves from different species are clearly separated. This indicates that our learned leaf features have strong discriminative power.
1.2. Contributions
The major contributions of this paper are summarized as follows:
We present a novel feature learning method for plant leaf representation, which can exploit pretrained neural network features without time-consuming end-to-end retraining.
We recalibrate each leaf feature map by using spatial weighting and channel weighting, which captures salient information and suppresses non-salient information.
We propose to leverage the feature maps of multiple pretrained CNNs and multiple layers; this strategy not only combines the features from different networks but also exploits the complementary elements between low-level and high-level features from different layers.
Extensive plant leaf recognition and retrieval experiments are conducted on eight popular and complicated datasets; the mean accuracies and mean average precisions corroborate the effectiveness and feature discrimination of our method.
The remainder of this paper is arranged as follows. Four kinds of related works are presented in Section 2. The three procedures of our proposed method are detailed in Section 3. Plant recognition experiments on eight datasets, as well as plant leaf retrieval experiments, are provided in Section 4. Finally, the conclusions are presented in Section 5.
3. Proposed Method
In this section, we present our method in detail (our source code will be released at https://github.com/dxtyut/plantleaf). It is composed of three main steps: feature map extraction via multiple CNNs and layers, feature recalibration in the spatial and channel dimensions, and feature representation and classification. The feature extraction and recognition procedures of our proposed method are shown in Figure 2.
3.1. Multiple CNNs and Layers
Since the well-known residual network was proposed, the input size of CNN- and Transformer-based models has commonly been fixed to 224 × 224 × 3 for the purpose of fair comparison. Let $I$ be a color plant leaf image; following [20], we first resize $I$ so that its shortest edge equals 224 and then crop the center patch of size 224 × 224 × 3, which is also denoted as $I$ in this work. We feed it into a CNN model pretrained on the large-scale image dataset ImageNet [53] to extract the activated feature maps from several layers. Because there exist abundant complementary elements between low-, middle- and high-level features, we explore combining the feature maps at different layers. In addition, different CNN models distinguish plant images by learning different dominant features, so three pretrained CNN models, VGG16, VGG19 [54] and ResNet50 [55], are used as feature extractors with the hope of learning more discriminative cues for leaves. The feature map at a specific layer $L$ is obtained as follows:
$$X^{L} = \mathrm{CNN}^{L}(I)$$
where $\mathrm{CNN}^{L}(\cdot)$ denotes the output of layer $L$ of the pretrained network.
When the networks are completely unfolded, VGG16, VGG19 and ResNet50 contain 37, 43 and 175 layers, respectively, including the fully connected layers and the softmax probability layer; moreover, the convolutional features of adjacent layers are strongly correlated. Therefore, in order to extract more complementary information, the activated layers {9, 16, 23, 30} are adopted for VGG16, the activated layers {9, 18, 27, 36} are employed for VGG19 and the activated layers {90, 100, 110, 120, 130, 140, 152} are utilized for ResNet50.
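The resize-and-crop preprocessing described above can be illustrated with a small geometry helper. The following is a minimal Python sketch, not the authors' MATLAB implementation; the function name `resize_and_center_crop_box` and the rounding of the longer edge are assumptions made for illustration.

```python
def resize_and_center_crop_box(width, height, target=224):
    """Return the resized (width, height) and the centre-crop box.

    The shortest edge is scaled to `target`; the crop box is given as
    (left, top, right, bottom) in the coordinates of the resized image.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = target / min(width, height)
    # The shorter edge becomes exactly `target`; the longer edge is rounded
    # (assumed rounding rule) and never allowed to fall below `target`.
    if width <= height:
        new_w, new_h = target, max(target, round(height * scale))
    else:
        new_w, new_h = max(target, round(width * scale)), target
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


if __name__ == "__main__":
    # A 4000 x 6000 image is resized to 224 x 336 and cropped vertically.
    print(resize_and_center_crop_box(4000, 6000))
```

The returned box can be passed to any image library's crop routine; only the geometry is shown here so that the sketch has no external dependencies.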
3.2. Feature Recalibration
To simulate visual cognition in mammals, the attention mechanism has been acknowledged as an indispensable module in CNN models because it can assign large weights to important features and small weights to trivial features. As shown in Figure 3, there are 24 feature maps at one layer of ResNet for a leaf image. It is apparent that the activation responses differ across channels and also across spatial positions. Therefore, recalibrating the features via attention weights is capable of highlighting discriminative features and suppressing redundant features. Most traditional attention weights are learned jointly with the parameters of the neural network, for example, in the convolutional block attention module [56] and the squeeze-and-excitation module [22]. For the sake of simplicity, we follow [57] in adopting a non-parametric scheme to compute the attention weights in the spatial and channel dimensions.
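Assuming the tensor shape of the feature map $X^{L}$ at layer $L$ is $H \times W \times C$, we compute the sum matrix $S'$ of $X^{L}$ along the channel dimension as follows, where $X^{L}_{c}$ represents the $c$-th channel:
$$S' = \sum_{c=1}^{C} X^{L}_{c}$$
Then the sum matrix $S'$ is power normalized with parameters $\alpha$ and $\beta$. The spatial attention weight is calculated with the following formula:
$$S_{ij} = \left( \frac{S'_{ij}}{\left( \sum_{m,n} {S'_{mn}}^{\beta} \right)^{1/\beta}} \right)^{\alpha}$$
where $(i, j)$ denotes the coordinate on $S'$; the parameters $\alpha$ and $\beta$ are set to 0.5 and 2, respectively, as indicated in [57]. Finally, the weight matrix $S$ is multiplied with each channel of $X^{L}$ in an element-wise manner and the resultant values are summed. Thereby, we obtain the spatially recalibrated features via the element-wise multiplication operation ⊙:
$$F^{L}_{\mathrm{sp}}(c) = \sum_{i,j} \left( S \odot X^{L}_{c} \right)_{ij}, \quad c = 1, \ldots, C$$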
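The spatial weighting step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the feature map is assumed to be an array of shape (H, W, C) holding non-negative (post-ReLU) activations, the two power-normalization parameters are named `alpha` and `beta` here with the values 0.5 and 2 given in the text, and the function names are hypothetical.

```python
import numpy as np


def spatial_weight(X, alpha=0.5, beta=2.0):
    """Spatial attention weights for a non-negative feature map X of shape (H, W, C)."""
    s_prime = X.sum(axis=2)                          # (H, W): sum over channels
    norm = (s_prime ** beta).sum() ** (1.0 / beta)   # beta-norm of the sum matrix
    if norm == 0:
        return np.zeros_like(s_prime)                # all-zero map: no salient position
    return (s_prime / norm) ** alpha                 # power normalization


def spatial_recalibrated(X, alpha=0.5, beta=2.0):
    """Weight every channel by the spatial map and sum over positions -> (C,) vector."""
    S = spatial_weight(X, alpha, beta)
    return (X * S[:, :, None]).sum(axis=(0, 1))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.maximum(rng.normal(size=(7, 7, 16)), 0.0)  # toy post-ReLU feature map
    print(spatial_recalibrated(X).shape)
```

Each layer thus contributes one vector whose length equals its number of channels, independent of the spatial resolution of the feature map.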
To calculate the channel weight for the $H \times W \times C$ feature map $X^{L}$, we first compute the proportion $Q_{c}$ of positive entries in each channel feature map $X^{L}_{c}$ as follows:
$$Q_{c} = \frac{1}{HW} \sum_{i,j} \mathbb{1}\left[ X^{L}_{c}(i,j) > 0 \right]$$
where $\mathbb{1}[\cdot]$ is an indicator function; it returns 1 when the assertion is true and 0 otherwise. The authors in [57] found that images from the same class have correlated channel sparsity $1 - Q_{c}$; that is to say, channel sparsity offers discriminative information, which can be utilized to reveal the significance of infrequently occurring features. Accordingly, the channel attention weight can be calculated with the following formula:
$$w_{c} = \log \left( \frac{C\epsilon + \sum_{k=1}^{C} Q_{k}}{\epsilon + Q_{c}} \right)$$
where $\epsilon$ is a constant close to zero that prevents the denominator from being zero. Finally, the weight vector $w = (w_{1}, \ldots, w_{C})$ is multiplied with the feature vector at each position of $X^{L}$ in an element-wise way and all the resulting vectors are summed, so the channel recalibrated features can be obtained as follows:
$$F^{L}_{\mathrm{ch}} = \sum_{i,j} w \odot X^{L}(i,j)$$
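The channel weighting step can be sketched in the same style. The code below is a hedged illustration, not the authors' implementation: it assumes the sparsity-based log weighting of the non-parametric scheme in [57], a feature map of shape (H, W, C), and a hypothetical small constant `eps`; the function names are illustrative.

```python
import numpy as np


def channel_weight(X, eps=1e-6):
    """Channel attention weights from the proportion of positive activations."""
    H, W, C = X.shape
    Q = (X > 0).sum(axis=(0, 1)) / float(H * W)     # (C,): share of positive entries
    # Sparse channels (small Q) receive large weights, dense channels small ones.
    return np.log((C * eps + Q.sum()) / (eps + Q))


def channel_recalibrated(X, eps=1e-6):
    """Weight the feature vector at every position and sum over positions -> (C,)."""
    w = channel_weight(X, eps)
    return (X * w[None, None, :]).sum(axis=(0, 1))


if __name__ == "__main__":
    X = np.zeros((2, 2, 2))
    X[:, :, 0] = 1.0          # dense channel: all four positions are positive
    X[0, :, 1] = 1.0          # sparser channel: two of four positions are positive
    print(channel_weight(X))  # the second weight is larger than the first
```

The logarithm keeps the weights in a moderate range, so rarely firing channels are emphasized without dominating the descriptor.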
3.3. Feature Representation and Classifier
Having obtained the spatially recalibrated features $F^{L}_{\mathrm{sp}}$ and the channel recalibrated features $F^{L}_{\mathrm{ch}}$, we multiply them in an element-wise fashion, instead of concatenating them, as follows:
$$F^{L} = F^{L}_{\mathrm{sp}} \odot F^{L}_{\mathrm{ch}}$$
One feature representation $F^{L}$ is thus obtained for each feature map. For the feature maps at the different layers of the three CNNs, the corresponding recalibrated features are first computed and then concatenated to form the final representation of 11008 dimensions for the leaf image, which is followed by normalization. Afterwards, in order to remove redundant features, whitened principal component analysis (PCA) is used to reduce the dimensionality of the feature vector. In the recognition phase, a linear support vector machine (SVM) classifier is employed, whose parameters are set to s = 1 and c = 10 without fine-tuning for the sake of simplicity.
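The fusion, concatenation, normalization and PCA whitening steps can be sketched as follows. This is a minimal NumPy sketch under assumptions the text does not confirm: L2 normalization is assumed for the normalization step, the whitening is implemented through an SVD of the centred training descriptors, and all function names are hypothetical.

```python
import numpy as np


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit Euclidean length (assumed normalization)."""
    return v / (np.linalg.norm(v) + eps)


def image_descriptor(spatial_feats, channel_feats):
    """Fuse per-layer vectors by element-wise product, then concatenate all layers."""
    fused = [s * c for s, c in zip(spatial_feats, channel_feats)]
    return l2_normalize(np.concatenate(fused))


def fit_pca_whitening(train, n_components, eps=1e-8):
    """Learn the mean and a whitening projection from training descriptors (rows)."""
    mean = train.mean(axis=0)
    _, sing, vt = np.linalg.svd(train - mean, full_matrices=False)
    var = sing ** 2 / max(len(train) - 1, 1)         # variance along each direction
    proj = vt[:n_components] / np.sqrt(var[:n_components] + eps)[:, None]
    return mean, proj


def apply_pca_whitening(x, mean, proj):
    """Project descriptors onto the whitened principal directions."""
    return (x - mean) @ proj.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(50, 8))                 # toy descriptors, one per row
    mean, proj = fit_pca_whitening(train, n_components=3)
    print(apply_pca_whitening(train, mean, proj).shape)
```

After whitening, the retained components of the training descriptors are decorrelated and have unit variance, which is the property that makes a fixed classifier setting easier to reuse across feature dimensions.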
In summary, our method proceeds as follows: (1) Randomly split a plant leaf dataset into a training set and a testing set. (2) Download the pretrained CNN models from the vlfeat website. (3) Extract CNN features from multiple CNNs and layers for each leaf image using the MATCONVNET toolbox, where network fine-tuning is not required. (4) Call the Liblinear library to train an SVM classifier with the parameter setting [-s 1 -c 10 -q]. (5) Predict the label for each test leaf image. It is worth noting that the parameters of the linear SVM are not tuned; we did not try any other values of c and s. We believe that the performance of our method may be further improved if optimal parameters are selected.
4. Experiment
In this section, the necessity of multiple CNNs, multiple layers and feature recalibration is analyzed first. Extensive plant recognition experiments are then conducted on eight simple and complex plant databases in order to evaluate the effectiveness of our learned features thoroughly; the datasets are Swedish, Flavia, MEW2012, ICL, ICL compound, CVIP100, Leafsnap and Turkey Plant. Dimensionality reduction is not performed on Leafsnap and Turkey Plant. The number of species, the training ratio and the numbers of training and testing images are summarized in Table 1. In addition, plant leaf image retrieval experiments with the leave-one-out test scheme are carried out. The evaluation metrics are accuracy and mean average precision. Each experiment is repeated for five rounds with different random selections of training samples; the average results and standard deviations are reported.
The performance of our method is compared with seven handcrafted methods and nine neural network-based methods. The handcrafted methods include SC+DP [9], IDSC+DP [10], MDM [24], MTCD [13], MFD [14], MTD+LBP-HF [27] and MMNLBP [28]. The deep learning-related methods are AlexNet+relu5 [58], VGG16+relu5_2 [54], Dual-Path CNN [19], Deep Plant [18], HGO-CNN [20], MaskCOV [35], KernelPool [36], SWP-LeafNET [37] and IMTD+relu5_2 [38]. The results of all the other comparative methods are quoted from the original articles or from the references [3,27,38].
The software, framework and hardware configuration of our implementation are MATLAB 2020a, MATCONVNET and an Intel(R) Xeon(R) Silver 4210R CPU at 2.4 GHz with 64 GB of memory. The feature extraction time for one image, the training time of the linear SVM classifier and the prediction time for one image are 8.89, 2.35 and 0.001 s, respectively. Because the leaf feature extraction processes of our method for different CNNs and different layers are independent, we believe the feature extraction time can be greatly reduced if parallel computing is applied.
4.1. Parameter Analysis
4.1.1. Is Feature Recalibration Necessary?
One natural question is whether feature recalibration improves the discriminative capability of our learned features. In order to verify the effectiveness of feature recalibration, we conduct several experiments on the Flavia dataset with and without feature recalibration, where the nine activated layers {16, 26, 36, 48, 58, 68, 78, 90, 100} of ResNet are used. The comparison results are shown in Figure 4; as expected, in all nine cases, our method obtains higher recognition accuracy when feature recalibration is applied. Therefore, we conclude that feature recalibration is a significant module in our method that produces more informative features for leaf images.
4.1.2. Are Multiple CNNs Necessary?
In what follows, we study the effect of multiple CNNs on the performance of our method on the Flavia dataset. The recognition results of VGG16, VGG19, ResNet and their combination are reported in Figure 5. The adopted layers for the three CNNs are {9, 16, 23, 30}, {9, 18, 27, 36} and {90, 100, 110, 120, 130, 140, 152}, respectively; each experiment is repeated five times. It can be seen that the combined model outperforms the three single models and has the smallest standard deviation. This indicates that combining multiple CNNs is an effective strategy to boost both the accuracy and the stability of plant recognition.
4.1.3. Are Multiple Layers Necessary?
In this subsection, we examine the necessity of multiple layers; 19 layer configurations are studied, as shown in Table 2, which include various types of combinations: low–low–low, low–middle, middle–high, low–middle–high, etc. The number of layers ranges from 3 to 16. It should be noted again that there are 175 layers of tensors for the network ResNet50 when it is unrolled, including the prediction probability layer of size 1 × 1 × 1000. The recognition accuracy and feature length are shown in Figure 6. In general, the greater the feature length, the higher the recognition rate, because more features are utilized; this suggests that using more layers tends to lead to better recognition performance. Although configurations 9, 12, 14 and 15 obtain good accuracy, configurations 17 and 19 obtain higher recognition rates. Considering the trade-off between feature length and recognition rate, we select configuration 17; in other words, the layers {90, 100, 110, 120, 130, 140, 152} of ResNet are used in this work.
4.2. Experiments on Swedish
Swedish is a classical and relatively simple plant leaf dataset [43], consisting of 1125 leaf images from 15 categories; each class has 75 leaves. Example images are shown in Figure 7. Following the popular train–test split scheme, for each class, 25 leaves are chosen as the training set and the other 50 leaves constitute the testing set. As a result, there are 375 and 750 images in total in the training and testing sets, respectively. The comparison results are enumerated in Table 3. Our method offers the highest classification rate of 99.97%, which is the average value of 100%, 100%, 99.87%, 100% and 100% for the five repeated experiments. That is to say, only one image is misclassified, in the third experiment. It is evident that our learned features have strong distinguishing capability for the Swedish dataset.
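The reported Swedish average can be checked with a few lines of Python. This is only an arithmetic illustration; the helper names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator is reported.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies in percent."""
    return mean(accuracies), stdev(accuracies)


def accuracy_from_errors(n_test, n_errors):
    """Accuracy in percent when n_errors of n_test images are misclassified."""
    return 100.0 * (n_test - n_errors) / n_test


swedish_runs = [100.0, 100.0, 99.87, 100.0, 100.0]   # five runs quoted in the text
avg, sd = summarize_runs(swedish_runs)
print(f"{avg:.2f}")                                  # 99.97
print(f"{accuracy_from_errors(750, 1):.2f}")         # 99.87: one error in 750 test images
```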
4.3. Experiments on Flavia
As illustrated in Figure 7, there are 1907 leaves from 32 plant species in the Flavia [44] dataset; each species has about 50–70 images. We follow [38] in adopting the common setting: 70% of the images per species are selected as training images and the other 30% as testing images. There are 1352 and 555 leaf images in total in the training and testing sets, respectively. The recognition results of our method and the competing methods are tabulated in Table 4. Our method achieves the highest classification rate of 99.89%, which is the average value of 100%, 99.82%, 99.82%, 100% and 99.82% for the five repeated experiments. The accuracy gain of our method over the best handcrafted method, MMNLBP [28], is 0.59%. The improvement of our method over the second best neural network-based method, KernelPool [36], is 0.18%. The main reasons for the superior performance of our method are twofold: the complementary information between various CNNs and layers is exploited, and the discriminative features are highlighted via feature recalibration.
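A per-species random split such as the 70%/30% protocol above can be sketched as follows. This is a minimal illustration, not the authors' splitting code: the function name `stratified_split`, the seed handling and the rounding of the per-class training count are assumptions, so the exact set sizes quoted in the text may not be reproduced.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_ratio=0.7, seed=0):
    """Randomly split sample indices class by class into train and test lists."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)                        # a different seed per repetition
    train, test = [], []
    for label in sorted(by_class):
        idxs = by_class[label][:]
        rng.shuffle(idxs)
        n_train = round(train_ratio * len(idxs))     # assumed rounding rule
        train.extend(idxs[:n_train])
        test.extend(idxs[n_train:])
    return sorted(train), sorted(test)


if __name__ == "__main__":
    toy_labels = ["a"] * 10 + ["b"] * 20             # toy data, not a real dataset
    train_idx, test_idx = stratified_split(toy_labels, 0.7, seed=1)
    print(len(train_idx), len(test_idx))             # 21 9
```

Repeating the split with five different seeds and averaging the resulting accuracies mirrors the five-round protocol used throughout the experiments.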
4.4. Experiments on MEW2012
The objective of this experiment is to evaluate our method on a more complicated plant dataset, MEW2012 [45]. It contains 9745 leaf images from 153 species, with 50 to 99 leaves for each species; example images are shown in Figure 7. There are also intra-class differences and inter-class similarities caused by variations in image scale, viewpoint, color, illumination, etc. The biggest challenge of MEW2012 is that many species come from the same genus, as shown in Figure 8. In other words, species belonging to the same genus share a similar visual appearance, which renders MEW2012 a challenging plant leaf dataset. The comparison of our method with the other 14 methods is displayed in Table 5. Our method obtains the highest recognition rate, 99.41%, which is the average value of 99.28%, 99.38%, 99.32%, 99.50% and 99.59% for the five repeated experiments. Our approach outperforms the best handcrafted method, MTD+LBP-HF [27], by a large margin of 3.77%. The improvements of our method over the well-known deep learning methods Dual-Path CNN, Deep Plant and HGO-CNN are 4.83%, 7.25% and 5.39%, respectively. Although IMTD+relu5_2 [38] combines a multi-scale triangle descriptor with convolutional features, its accuracy is still lower than ours by a margin of 3.2%. Although KernelPool [36] obtains results close to those of our method, its feature length is larger than ours. These results demonstrate the superiority of our method in learning features for plant leaf recognition.
4.5. Experiments on ICL
To further evaluate the potential of our approach in plant recognition, in this experiment, we utilize the ICL dataset [3,48], which contains 16851 leaves from 220 classes; both the number of leaf images and the number of species are larger than in the MEW2012 dataset. The number of images per species ranges from 26 to 1078. The ICL dataset was constructed by the Intelligent Computing Laboratory at Hefei Botanical Garden, Hefei, Anhui province, China. From Figure 7, we can see that the images from the 15th, 23rd and 141st species have a very similar visual appearance; this implies that ICL is a more challenging dataset. We follow [3] in using the first 26 leaf images of each species and setting the training ratio to 50%; therefore, the training and testing sets each contain 2860 samples. The classification comparison results are shown in Table 6; our method achieves the best accuracy of 98.67%, which is the mean value of 98.71%, 98.88%, 98.64%, 98.43% and 98.71% for the five repeated experiments. Compared with handcrafted features, deep neural network-based methods obtain higher classification rates because they learn hierarchical semantic features automatically. Our method outperforms the well-known handcrafted leaf descriptor MARCH [3] by a margin of 12.64%. The accuracy of our method is 11.75%, 4.09% and 0.81% higher than that of VGG16+FT, ResNet50+FT and KernelPool, respectively, which can be attributed to the use of complementary information between the convolutional features from various layers and CNNs and to the salient features boosted by feature recalibration.
4.6. Experiments on ICL Compound
According to our statistics and analysis, the images in the 10th, 12th, 25th, 27th, 49th, 56th, 126th, 132nd, 168th, 169th and 215th species of the ICL dataset are compound leaves, where a compound leaf splits several times in the middle to form two or more leaflets. Compound leaf images bring more challenges to plant recognition. Therefore, Wang et al. [59,60] collected those leaf images to construct the ICL compound dataset, which has 11 classes and 654 leaves in total; example images are shown in Figure 9. In order to assess the effectiveness of our method on compound leaves, we conduct an experiment on the ICL compound dataset. The training ratio for each class is 70%; the remaining 30% is used for testing. As we can see from the comparison results in Table 7, our method obtains an accuracy of 100% in all five repeated experiments, which again corroborates the effectiveness of our method in learning discriminative features for plant leaves.
4.7. Experiments on CVIP100
The CVIP100 dataset [46] contains 1208 leaf images from 100 species; each class has at least 12 images. Figure 10 illustrates 24 images from 4 species; one can observe that the leaves of many species have very similar shapes, textures and visual appearance, and image rotation is also a variation factor, which makes CVIP100 a challenging plant dataset. In each category, 70% of the images are used as the training set and the rest as the testing set. The comparison results are reported in Table 8. Our proposed method achieves 99.65% recognition accuracy, which is the average value of 100%, 99.75%, 99.50%, 99.75% and 99.25% for the five repeated experiments. AlexNet+relu5 [58] and VGG16+relu5_2 [54] only utilize the features from one layer, which limits the information available to the classifier. As a result, these methods achieve lower recognition rates. Similar to the results in the previous tables, the neural network-based methods generally obtain higher results than the handcrafted methods, which reveals the advantages of convolutional features. We further combine the convolutional features from multiple CNNs and multiple layers; therefore, as expected, our approach provides higher recognition results. Our method outperforms the best state-of-the-art method, IMTD+relu5_2 [38], by a margin of 0.4%.
4.8. Experiments on Leafsnap
We further evaluate the performance of our method on a large-scale plant dataset, Leafsnap [47]. Leafsnap includes 23147 lab images and 7719 field images from 184 tree species. The number of images varies from 10 to 183 for each tree species. As shown in Figure 11, leaf images from different categories have a similar shape and visual appearance. Therefore, it is very challenging to distinguish the leaves in Leafsnap correctly. In our experiment, we use the field images; seventy percent of the images of each species are used as the training set and the other thirty percent as the testing set. All comparison results on this dataset are tabulated in Table 9. Our method achieves a 93.40% recognition rate, which is the average value of 93.55%, 92.80%, 92.80%, 94.49% and 93.36% for the five repeated experiments. Because large variations exist in the leaf images of Leafsnap, the recognition accuracies of all handcrafted methods are less than 75%. Among the neural network-based methods, our proposed method outperforms the second and third best methods by 2.11% and 3.28%, respectively. These experimental results demonstrate the effectiveness of our method.
4.9. Experiments on Turkey Plant
The Turkey Plant disease and pest dataset was established by the Agricultural Faculty of Bingol and Inonu Universities in Turkey [52]; we call it Turkey Plant in this paper for the sake of simplicity. It is designed to promote research on plant disease and pest recognition. There are 4447 images of size 4000 × 6000 from 15 categories. The minimum and maximum numbers of samples per class are 69 and 1110, respectively. Example images for each class are shown in Figure 12. The image background is very complex and the content varies within each class; for example, in the class Apple Aphis spp, some images contain only apple leaves, some contain apple fruits and leaves, some focus on a tree branch with a few leaves and some do not contain leaves at all. Therefore, it is hard to identify different plant diseases correctly; that is to say, Turkey Plant is an extremely challenging plant disease dataset. In order to test the performance of our method on Turkey Plant, we conducted an experiment to compare our method with other competing methods; the comparison results are presented in Table 10. Our proposed method achieves the highest recognition accuracy, 96.19%, which is the average value of 95.82%, 97.01%, 96.27%, 96.49% and 95.37% for the five repeated experiments. More importantly, our method is the only method with a recognition rate of more than 90%, and it outperforms the second best method by nearly 10%. Compared with the results on the previous plant datasets, the performance of the handcrafted methods decreases heavily; the reason is that those methods extract leaf shape information, which is difficult to estimate for the leaves in the Turkey Plant dataset. These results show that our learned plant features are effective and discriminative for plant leaf disease recognition even if the plant images have a complex background, scale variations and viewpoint rotation.
Moreover, we display the confusion matrix for the recognition results on Turkey Plant in Figure 13, in which $C_{i}$ means the $i$-th class. Among the 15 categories, there are 12 categories with accuracy beyond 90% and 3 categories with accuracy equal to 100%. The numbers of misclassified samples in the 15 categories are 6, 2, 10, 6, 1, 3, 0, 1, 0, 4, 1, 2, 2, 0 and 2, respectively. The confusion matrix reveals that our method is robust against class imbalance in the dataset.
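Per-class accuracy, which is the quantity read off the diagonal of a confusion matrix, can be computed as in the following Python sketch. The 3 × 3 matrix is toy data invented for illustration and is not taken from Figure 13; the helper names are hypothetical.

```python
def per_class_accuracy(confusion):
    """confusion[i][j] counts test samples of class i predicted as class j."""
    accuracies = []
    for i, row in enumerate(confusion):
        total = sum(row)
        accuracies.append(row[i] / total if total else float("nan"))
    return accuracies


def misclassified_per_class(confusion):
    """Number of off-diagonal (wrongly predicted) samples in each row."""
    return [sum(row) - row[i] for i, row in enumerate(confusion)]


def overall_accuracy(confusion):
    """Share of all test samples that lie on the diagonal."""
    correct = sum(row[i] for i, row in enumerate(confusion))
    return correct / sum(sum(row) for row in confusion)


toy = [[9, 1, 0],        # toy confusion matrix with unbalanced class sizes
       [0, 10, 0],
       [2, 0, 18]]
print(per_class_accuracy(toy))        # [0.9, 1.0, 0.9]
print(misclassified_per_class(toy))   # [1, 0, 2]
print(overall_accuracy(toy))          # 0.925
```

Reporting per-class accuracy alongside the overall rate makes it visible whether small classes are sacrificed for large ones, which is the concern in an unbalanced dataset.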
4.10. Leaf Retrieval Experiment
In this section, in order to further evaluate the discriminative ability of the features learned by our method, several leaf retrieval experiments are carried out, where the leave-one-out test scheme is applied. Suppose that there are $N$ samples and $K$ categories in a leaf dataset and let $x_{i}$ be the $i$-th leaf image, which belongs to class $k$ with $N_{k}$ samples. Firstly, we compute the Euclidean distances between $x_{i}$ and the other $N-1$ leaf images. Secondly, the average precision [61] for $x_{i}$ is formulated as follows:
$$\mathrm{AP}(x_{i}) = \frac{1}{N_{k} - 1} \sum_{n=1}^{N-1} P(n)\, \mathrm{rel}(n)$$
where $P(n)$ means the precision at cut-off $n$; $\mathrm{rel}(n)$ is equal to 1 if the $n$-th retrieved image is relevant to $x_{i}$ and 0 otherwise; and $N_{k} - 1$ is the number of images relevant to $x_{i}$ once the query itself is excluded. Finally, the retrieval evaluation metric mean average precision (MAP) can be obtained via the following equation:
$$\mathrm{MAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}(x_{i})$$
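The leave-one-out retrieval protocol can be sketched in plain Python. This is a minimal illustration rather than the authors' evaluation code; it assumes that average precision is normalized by the number of relevant images excluding the query, and the function names are hypothetical.

```python
import math


def average_precision(query_idx, features, labels):
    """Leave-one-out average precision of one query under Euclidean distance."""
    query = features[query_idx]
    others = [i for i in range(len(features)) if i != query_idx]
    others.sort(key=lambda i: math.dist(query, features[i]))   # nearest first
    n_relevant = sum(labels[i] == labels[query_idx] for i in others)
    if n_relevant == 0:
        return 0.0                        # no other image of the same class exists
    hits, total = 0, 0.0
    for rank, i in enumerate(others, start=1):
        if labels[i] == labels[query_idx]:
            hits += 1
            total += hits / rank          # precision at this cut-off
    return total / n_relevant


def mean_average_precision(features, labels):
    """Average the leave-one-out average precision over every image as a query."""
    n = len(features)
    return sum(average_precision(i, features, labels) for i in range(n)) / n


if __name__ == "__main__":
    feats = [[0.0], [1.0], [10.0], [11.0]]    # toy one-dimensional features
    labs = [0, 0, 1, 1]
    print(mean_average_precision(feats, labs))  # 1.0: every nearest image is relevant
```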
Without loss of generality, two simple leaf datasets (Swedish, Flavia) and two complicated datasets (Leafsnap, Turkey Plant) are used in our retrieval experiments. The retrieval MAP results of our method are compared with those of the newly published state-of-the-art approach IMTD+relu5_2 [38]. One can see from Figure 14 that our method obtains higher MAP scores on Swedish and Turkey Plant than IMTD+relu5_2. The MAP score of our method on Leafsnap is 49.16%, which is very close to the 49.44% of IMTD+relu5_2.
We randomly select five leaf images from the Flavia dataset and display the top 10 retrieval results for each leaf image. It can be seen from line 2 of Figure 15 that there is only one wrong retrieval result, which shares a similar appearance with the query image. Furthermore, we display the top 10 retrieval results for five leaf images from the Leafsnap dataset. The closest retrieved images for all five queries are correct, which is consistent with the identification results in Table 9. All 10 retrieval results are correct for the query images in lines 3 and 5 of Figure 16; this supports the feature representation ability of our method. Although there are several wrong retrieval results in lines 1, 2 and 4 of Figure 16, the wrongly retrieved leaves have a visual appearance and shape very similar to those of the query images, especially for the results in lines 1 and 4.
4.11. Effect of Classifier
In this section, we study the effect of different classifiers on the performance of our method, including LinearSVM with parameters c = 10 and s = 1, a ridge regression classifier (RRC) with parameter $\lambda$ equal to 0.005, a nearest neighbour classifier with cosine distance and the ensemble classifier Bagging (fitensemble(data, label, 'Bag', 100, 'tree', 'type', 'classification')). For a fair comparison, each experiment is repeated for 20 rounds here. It should be emphasised that we do not optimize the parameters of the four classifiers; the usual parameter values are applied. We believe that the performance of our method could be further improved if the parameters were tuned. Without loss of generality, the two complicated datasets Leafsnap and Turkey Plant are used. From the box plots in Figure 17, we can see that LinearSVM yields the second best results. Surprisingly, the RRC obtains slightly better performance than LinearSVM. Although the results of the nearest neighbour classifier decrease by 3–5%, they are still promising compared with the results in Table 9 and Table 10. The Bagging classifier obtains unsatisfactory results; the reason is probably that the ensemble method and the number of trees should be chosen carefully, whereas we simply use the default setting.
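Two of the classifiers compared above are simple enough to sketch in NumPy. The code is a hedged illustration, not the authors' MATLAB implementation: the ridge regression classifier is assumed to be the closed-form regression onto one-hot labels with regularization weight 0.005, the nearest neighbour classifier picks the training sample with the largest cosine similarity, and all function names are hypothetical.

```python
import numpy as np


def fit_ridge_classifier(X, y, n_classes, lam=0.005):
    """Closed-form ridge regression onto one-hot labels (assumed RRC formulation)."""
    Y = np.eye(n_classes)[y]                         # (n, K) one-hot targets
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)


def predict_ridge(W, X):
    """Assign each sample to the class with the largest regression response."""
    return np.argmax(X @ W, axis=1)


def predict_cosine_nn(train_X, train_y, X, eps=1e-12):
    """1-nearest-neighbour prediction under cosine distance."""
    a = train_X / (np.linalg.norm(train_X, axis=1, keepdims=True) + eps)
    b = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
    return train_y[np.argmax(b @ a.T, axis=1)]       # largest similarity = smallest distance


if __name__ == "__main__":
    train_X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])  # toy features
    train_y = np.array([0, 0, 1, 1])
    test_X = np.array([[0.8, 0.2], [0.2, 0.8]])
    W = fit_ridge_classifier(train_X, train_y, n_classes=2)
    print(predict_ridge(W, test_X), predict_cosine_nn(train_X, train_y, test_X))
```

Both classifiers have a single or no hyperparameter, which is consistent with the untuned comparison described in this section.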