In this section, we present a performance evaluation of the proposed method and other leading baselines on four logo datasets.
4.1. Experimental Setting
(1) Datasets. We conducted our experiments on four logo datasets of different scales. Most of the experiments were performed on the large-scale LogoDet-3K [22] dataset, which contains 113,710 images for training, 28,432 for validation, and 16,510 for testing. (The training set is the set of examples used for learning, i.e., to fit the parameters [weights] of the classifier; the validation set is used to tune its hyperparameters [i.e., architecture, not weights], for example the number of hidden units in a neural network; and the testing set is used to assess the generalization performance of the fully specified classifier [46]. Because the validation set is independent of the testing set, hyperparameter tuning introduces no bias into the final evaluation: once the network is fully trained, it is evaluated on the completely unseen testing set.) To assess the robustness of the DSFP-GA method, experiments were also performed on the LogoDet-3K-1000 [22] dataset, the middle-scale QMUL-OpenLogo [47] dataset, and the small-scale FlickrLogos-32 [48] dataset. The LogoDet-3K-1000 dataset is sampled from LogoDet-3K and consists of 53,049 images for training and 9,559 images for testing. The QMUL-OpenLogo dataset contains 27,083 images from 352 logo categories (aggregated and refined from several existing logo datasets). The FlickrLogos-32 dataset consists of 2,240 images from 32 logo categories. The detailed statistics of the four datasets are shown in Table 1.
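The split sizes quoted above can be tabulated programmatically. The sketch below simply records the statistics stated in the text; the dictionary layout and field names are our own convention, not part of the datasets' releases:

```python
# Dataset statistics as quoted in the text (counts only; the dictionary
# layout and field names are our own convention).
DATASETS = {
    "LogoDet-3K":      {"train": 113_710, "val": 28_432, "test": 16_510},
    "LogoDet-3K-1000": {"train": 53_049,  "test": 9_559},
    "QMUL-OpenLogo":   {"images": 27_083, "categories": 352},
    "FlickrLogos-32":  {"images": 2_240,  "categories": 32},
}

def total_images(name):
    """Total image count for the datasets that are given as train/val/test splits."""
    splits = DATASETS[name]
    return sum(v for k, v in splits.items() if k in ("train", "val", "test"))
```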
(2) Implementation Details. The proposed approach is implemented on a ResNet-50 backbone pre-trained on ImageNet [23]. For a fair comparison, all baseline detectors are re-implemented with the publicly available mmdetection toolbox [49] in the same codebase. All models are trained on the training set and validated on the validation set. We adopt the widely used mAP (mean Average Precision) [50] to evaluate logo detection performance. To highlight the performance of our method on logos of different sizes, we also adopt the following metrics: AP_S, the Average Precision (AP) for small logo objects (area < 32²); AP_M, the AP for medium logo objects (32² < area < 96²); and AP_L, the AP for large logo objects (area > 96²). The threshold of Intersection over Union (IoU) between the predicted bounding box and the ground-truth bounding box is 0.5. We train the detectors with an initial learning rate of 0.002, and the input images are resized to 1000 × 600. All other hyper-parameters follow the default settings of the mmdetection toolbox.
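The two evaluation conventions above, IoU matching at a 0.5 threshold and COCO-style size buckets, can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code; boxes are assumed to be (x1, y1, x2, y2) tuples in pixels:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def size_bucket(box):
    """Assign a ground-truth box to the AP_S / AP_M / AP_L size bucket."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 ** 2:
        return "small"    # counted in AP_S
    if area < 96 ** 2:
        return "medium"   # counted in AP_M
    return "large"        # counted in AP_L

def is_true_positive(pred_box, gt_box, thr=0.5):
    """A prediction matches a ground truth when IoU reaches the 0.5 threshold."""
    return iou(pred_box, gt_box) >= thr
```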
4.2. Ablation Study
In this part, we provide an empirical analysis of each component of DSFP-GA: the DSFP, the GA, and the CIoU loss. The overall ablation studies on the LogoDet-3K dataset are reported in Table 2, whose first row lists the results of Faster R-CNN with ResNet-50-FPN. The results of the ablation studies on the LogoDet-3K-1000, QMUL-OpenLogo, and FlickrLogos-32 datasets are shown in Table 3, Table 4, and Table 5, respectively. As illustrated in these tables, the results clearly indicate the effectiveness of our method in many respects across the different logo datasets.
(1) DSFP. We evaluate the effect of the DSFP by comparing it with the FPN. The proposed DSFP is mainly intended to improve the detection of small logo objects and to enhance the semantic information of the feature maps. As shown in Table 6, small and medium logo objects account for 1.8% and 29.8% of the LogoDet-3K dataset, respectively. In Table 2, the DSFP brings a 0.7% mAP improvement over Faster R-CNN on the LogoDet-3K dataset. In particular, the AP_S score increases by 7.1%, validating the effectiveness of the DSFP on small logo object detection.
Clearly, the DSFP can perform better than the FPN for logo detection tasks because it extracts more discriminative semantic features. To verify this, we visualize the heatmaps in Figure 4, which demonstrate that the DSFP is more effective at extracting discriminative semantic features for logo detection. The highlighted (red) regions of the feature maps produced by the DSFP are more accurate and carry richer semantic information than those of the FPN. It is noteworthy that the feature maps from the DSFP are more representative, since they have stronger activation values in the foreground and weaker activation values in the background.
Besides visualizing heatmaps, we also visualize the detection results on two images with small logo objects in Figure 5. Compared with DSFP-GA, Faster R-CNN misses a small logo object in the first image, which further proves the strength of DSFP-GA in small logo object detection. The second image, in Figure 6, contains small and extremely tall logo objects. Faster R-CNN lacks a good solution for this kind of logo object, and its detection result is less satisfactory. In contrast, our method is effective at detecting small and extremely tall logo objects.
Moreover, we validate the benefit of the DSFP on the other three datasets. Similar to LogoDet-3K, small and medium logo objects account for 1.7% and 33% of the LogoDet-3K-1000 dataset in Table 6. As shown in Table 3, the DSFP improves mAP by 0.6% over Faster R-CNN, and in particular the AP_S score increases by 11.3%, which shows that our DSFP enriches the discriminative semantic information of the feature maps. For the QMUL-OpenLogo dataset, more than 23.1% of the logo objects are small and over 44% are medium, as shown in Table 6; the main challenge on this dataset is therefore small logo objects. In Table 4, the DSFP shows an obvious improvement over the baseline: it increases mAP by 1.6% and yields a 7.1% AP_S improvement, demonstrating the effectiveness of the DSFP for small logo detection. For the FlickrLogos-32 dataset, less than 5.4% of the logo objects are small and about 29.3% are medium, as shown in Table 6. The DSFP still brings a 0.7% mAP improvement and a 5.6% AP_S improvement over the Faster R-CNN baseline in Table 5, which indicates that the DSFP enhances the discriminative semantic information of the feature maps.
(2) GA. We evaluate the strength of the GA on the LogoDet-3K dataset. The GA does not implicitly limit the aspect ratio or the size of the anchor boxes, thereby addressing the issue of large aspect ratio logo objects well. For the LogoDet-3K dataset, more than 35% of the logo objects have an aspect ratio greater than 3, and more than 11.8% have an aspect ratio greater than 5, as shown in Table 6; the dataset thus contains many large aspect ratio logo objects. As shown in Table 2, the GA improves mAP from 84.5% to 86.6% on LogoDet-3K, which indicates the strength of our method.
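Aspect-ratio statistics of this kind can be gathered with a simple pass over the ground-truth boxes. The sketch below is illustrative; the (x1, y1, x2, y2) box format is our assumption, and the cumulative "3+" / "5+" ranges mirror the Range (3+) / Range (5+) convention used here:

```python
def aspect_ratio(box):
    """Aspect ratio of an (x1, y1, x2, y2) box: longer side over shorter side."""
    w = box[2] - box[0]
    h = box[3] - box[1]
    return max(w, h) / min(w, h)

def ratio_stats(boxes):
    """Fractions of boxes with aspect ratio >= 3 and >= 5 (cumulative,
    matching the Range (3+) / Range (5+) statistics)."""
    ratios = [aspect_ratio(b) for b in boxes]
    n = len(ratios)
    return {"3+": sum(r >= 3 for r in ratios) / n,
            "5+": sum(r >= 5 for r in ratios) / n}
```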
We visualize the results in Figure 6 and Figure 7 to show that our method is effective in dealing with large aspect ratio logo objects, and that the detection results of DSFP-GA are better than those of Faster R-CNN. As shown in Figure 6, in the first image Faster R-CNN fails to detect the tilted logo on the right side of the image at all, whereas DSFP-GA detects it with high accuracy, which shows that DSFP-GA is robust when detecting difficult logo objects. For the logo object on the left side, the accuracy of Faster R-CNN is 16% lower than that of DSFP-GA. In the second image, the ground-truth boxes are small and extremely tall logo objects, and DSFP-GA detects both with good accuracy. As shown in Figure 7, for the first two images, DSFP-GA has more accurate detection results than Faster R-CNN. In the third image, Faster R-CNN misidentifies the logo category, and its detection accuracy on the correct box is 26% lower than that of DSFP-GA. This demonstrates the superiority of DSFP-GA in detecting both large aspect ratio logo objects and small logo objects.
The ablation studies on the LogoDet-3K-1000 dataset further validate the benefit of the GA. In Table 6, logo objects of Range (3+) account for 36.1% and logo objects of Range (5+) account for 11.9% of the LogoDet-3K-1000 dataset. The GA yields a 0.6% mAP improvement on this dataset in Table 3, which indicates the effectiveness of the GA in addressing large aspect ratio logo objects.
The ablation studies on the QMUL-OpenLogo and FlickrLogos-32 datasets also show the performance of the GA. As shown in Table 6, more than 81.5% of the logo objects on the QMUL-OpenLogo dataset have an aspect ratio between 1 and 2.9, and about 4.3% have an aspect ratio greater than 5. The GA improves mAP from 53.5% to 53.7%, as shown in Table 4. Furthermore, as shown in Table 6, approximately 95% of the logo objects on the FlickrLogos-32 dataset have an aspect ratio between 1 and 2.9, and only about 0.9% have an aspect ratio greater than 5; accordingly, the GA increases mAP by only 0.1%, as shown in Table 5. These smaller gains are consistent with the scarcity of large aspect ratio logos on these two datasets, and they support the conclusion that the GA detects large aspect ratio logo objects better than Faster R-CNN.
(3) CIoU Loss. We also evaluate the benefit of the CIoU loss on the four logo datasets. The CIoU loss obtains more accurate regression results by resolving the inconsistency in bounding box regression, thereby improving detection performance. In Table 2, the CIoU loss improves mAP from 86.6% to 87.7% on the LogoDet-3K dataset. It also improves mAP from 89.4% to 90.1% on the LogoDet-3K-1000 dataset in Table 3, from 53.7% to 54.0% on the QMUL-OpenLogo dataset in Table 4, and from 86.7% to 87.1% on the FlickrLogos-32 dataset in Table 5. These results validate the effectiveness of adopting the CIoU loss in our method. However, the AP_S and AP_M scores decrease slightly on the FlickrLogos-32 dataset. Our observation is that FlickrLogos-32 contains fewer logo images; therefore, the CIoU loss does not play a significant role in these scores on this dataset.
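For reference, the CIoU loss augments the IoU term with a normalized center-distance penalty and an aspect-ratio consistency term. The following is an illustrative NumPy-free sketch of the standard formulation, not the authors' implementation; boxes are assumed to be (x1, y1, x2, y2) tuples:

```python
import math

def ciou_loss(pred, gt, eps=1e-9):
    """Complete IoU loss: 1 - IoU + center-distance penalty + aspect term."""
    # IoU of the two boxes.
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter + eps)

    # Squared distance between box centers.
    pcx, pcy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gcx, gcy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    rho2 = (pcx - gcx) ** 2 + (pcy - gcy) ** 2

    # Squared diagonal of the smallest enclosing box.
    cx1, cy1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    cx2, cy2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    wp, hp = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(wp / hp)) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```

For identical boxes the loss is (near) zero, and it grows as the boxes drift apart in overlap, center distance, or aspect ratio.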
To further evaluate the performance of DSFP-GA, we selected two images that contain both small and large aspect ratio logo objects and visualized the detection results. As shown in Figure 8, Faster R-CNN fails to detect the small, wide logo object in the first image, on which DSFP-GA achieves better localization and classification. In the second image, Faster R-CNN mistakenly detects two logo objects, and its accuracy on another logo object is much lower than that of DSFP-GA. This demonstrates the superior performance of DSFP-GA in detecting both small and large aspect ratio logo objects.
4.3. Comparison of State-of-the-Art Frameworks
To further validate the versatility of the proposed DSFP-GA, we compared it with multiple state-of-the-art approaches. We chose several one-stage frameworks that have performed well on general detection datasets in recent years, as well as a series of standard two-stage frameworks that improve upon Faster R-CNN and represent the state of the art.
(1) Experiment on LogoDet-3K. Our DSFP-GA method achieves the best performance on the LogoDet-3K dataset. We compare DSFP-GA with state-of-the-art detection approaches on the large-scale LogoDet-3K dataset in Table 7. DSFP-GA significantly outperforms the existing two-stage baselines such as Faster R-CNN, Libra R-CNN, and Dynamic R-CNN. Our approach, built on a modified Faster R-CNN, achieves the best mAP of 87.7%, surpassing the Faster R-CNN baseline by 3.9% mAP, which indicates the effectiveness of our strategy. Compared with Dynamic R-CNN, which ranks second in mAP, our method is 0.3% mAP better. The AP_S, AP_M, and AP_L scores of Dynamic R-CNN are 53.7%, 82.0%, and 90.4%, respectively, and our method's scores are 2.5%, 1.1%, and 0.1% higher, respectively. This clearly demonstrates the effectiveness of our method. In addition, our framework improves by 4.6% mAP over PANet, which is equipped with the strong feature pyramid structure PAFPN; our DSFP thus fuses logo features more effectively than PAFPN. We also compare DSFP-GA with state-of-the-art one-stage approaches: our framework brings a 7.8% mAP improvement over ATSS [51] and a 6.5% mAP improvement over GFL [52]. This superior performance arises because the dataset contains many large aspect ratio logo objects, and our model is well equipped for this challenging issue.
The detection results of DSFP-GA in Figure 9 clearly demonstrate that our model performs well on logos of various sizes and shapes. Our model obtains better detection results on large logo objects (e.g., the categories "cherry 7up" and "waffle house"), medium logo objects (e.g., "cheez whiz" and "swiss miss"), and small logo objects (e.g., "freia" and "nioxin"). It is worth mentioning that one test image in Figure 9 contains multiple multi-scale objects, all of which our model detects accurately. These results show that our method can detect logo objects of different sizes and can handle multiple logo objects within one image.
(2) Experiment on LogoDet-3K-1000. DSFP-GA again achieves the best performance on the LogoDet-3K-1000 dataset. As shown in Table 8, DSFP-GA achieves 90.1% mAP, an increase of 1.9% mAP over Faster R-CNN, and also improves by 1.0% mAP over PANet. Dynamic R-CNN achieves 89.5% mAP, ranking second to our method, which yields a further 0.6% mAP over it. The AP_S, AP_M, and AP_L scores of Dynamic R-CNN are 49.7%, 82.2%, and 93.6%, respectively, and our method's scores are 6.3%, 1.7%, and 0.2% higher, respectively. Compared with the one-stage frameworks, our work yields 2.3% mAP over ATSS and 2.4% mAP over GFL. The LogoDet-3K-1000 dataset contains a large number of large aspect ratio logo objects, and this exceptional performance highlights the effectiveness of our model in dealing with such logo objects. The experiments on the LogoDet-3K-1000 dataset in Table 8 further confirm the superiority of the proposed DSFP-GA method over the evaluated state-of-the-art methods.
(3) Experiment on QMUL-OpenLogo. From Table 9, which also lists the experimental results of the baselines, we can see that our method achieves the best performance (54.0% mAP) on this middle-scale logo dataset. Compared with Faster R-CNN, DSFP-GA obtains a 2.1% mAP improvement, and it also improves by 1.1% mAP over PANet. This further shows that DSFP-GA handles the QMUL-OpenLogo dataset, which contains many small logo objects, better than Faster R-CNN and PANet. Compared with Cascade R-CNN, which ranks second (53.1% mAP), our method improves by 0.9% mAP. The AP_S, AP_M, and AP_L scores of Cascade R-CNN are 32.7%, 54.0%, and 67.4%, respectively, and our method's scores are 0.5%, 2.4%, and 0.9% higher, respectively, indicating the effectiveness of our method. Compared with the best-performing one-stage method, GFL, our method improves by 4.8% mAP (54.0% vs. 49.2%). These results indicate that our model is effective in dealing with the challenge of small logo objects.
(4) Experiment on FlickrLogos-32. Our framework also performs well on the small-scale FlickrLogos-32 dataset. The experimental results of the baselines and our framework are summarized in Table 10. Our method achieves 87.1% mAP, comparable to Cascade R-CNN in Table 10. Cascade R-CNN reaches 87.0% mAP by cascading multiple detection heads, whereas our framework achieves 87.1% mAP with only one detection head. The AP_S, AP_M, and AP_L scores of Cascade R-CNN are 17.3%, 80.5%, and 93.4%, respectively; in particular, the AP_S and AP_M scores of our method are 11.2% and 2.8% higher, which shows the effectiveness of our method. For the one-stage frameworks, our method improves by 1.0% mAP over ATSS and 0.9% mAP over GFL; two-stage detectors have an additional region proposal network, which can improve the detection result. Although the FlickrLogos-32 dataset is small and simple, DSFP-GA still achieves better detection results than the one-stage frameworks.