Article
Peer-Review Record

Efficient Shot Detector: Lightweight Network Based on Deep Learning Using Feature Pyramid

Appl. Sci. 2021, 11(18), 8692; https://doi.org/10.3390/app11188692
by Chansoo Park 1, Sanghun Lee 2,* and Hyunho Han 3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 20 August 2021 / Revised: 4 September 2021 / Accepted: 10 September 2021 / Published: 17 September 2021
(This article belongs to the Special Issue Advances in Small Infrared Target Detection Using Deep Learning)

Round 1

Reviewer 1 Report

This paper proposes a lightweight network for efficient object detection; the work builds on the previous work EfficientNet. In the network structure, the authors replace standard convolution operations with depthwise and pointwise convolutions to reduce the number of parameters; a prior anchor box design is also proposed to improve detection accuracy. Through extensive experiments, the authors show that the proposed method achieves a good balance between performance and efficiency. Although the proposed method achieves only a marginal improvement over EfficientNet, I do not think this is a big problem.

Even so, this paper still has some issues to be further addressed:

1) On page 1, line 4: "it is important to design a light network with detection performance." What kind of performance? Consider "with reasonable performance".

2) On page 1, line 6: "based one deep learning for small parameters". Should the authors replace "for" with "with" in this sentence?

3) On page 1, lines 19-20: could the authors add citations for autonomous driving, air traffic control, and image restoration?

4) On page 2, lines 55-56: "Therefore, this study proposes a new object detection network efficient shot detector (ESDet) using a lightweight deep learning method." Consider adding "--" to the sentence: "... a new object detection network -- efficient shot detector (ESDet)".

5) On page 4, line 122: "the correct answer data" reads oddly; the authors could use "ground truth data/label" instead.

6) In Table 4, if possible, could the authors also report the GFLOPs/GMACs of the methods? These are important metrics for deciding whether an algorithm can run efficiently on a mobile device.

7) If possible, the source code of this paper should be released after publication.

Author Response

September 4, 2021

 

Applied Sciences Special Issue "Advances in Small Infrared Target Detection Using Deep Learning"

 

To Reviewer:

 

We sincerely thank the reviewer for the time and effort spent reviewing our manuscript. We have carefully examined each point raised, analyzed the issues, and corrected the errors. We summarize the questions below and answer each in turn.

 

 

Q1.

On page 1, line 4: "it is important to design a light network with detection performance." What kind of performance? Consider "with reasonable performance".

A.

In previously proposed deep learning-based detection networks, research has focused on either improving detection accuracy or speeding up inference. However, methods that improve accuracy usually suffer a decrease in inference speed and efficiency due to the larger network size, while methods that speed up inference tend to lose some accuracy. This paper therefore proposes a network that can be used universally by reducing the trade-off between accuracy and inference speed. As the reviewer pointed out, the stated purpose of the proposed network was imprecise. Recognizing this, we changed page 1, line 4 as follows.

 

Before: Therefore, it is important to design a lightweight network with detection performance.

After: It is important to design a lightweight network that can be used in general-purpose environments such as mobile and GPU environments.

 

 

Q2.

On page 1, line 6: "based one deep learning for small parameters". Should the authors replace "for" with "with" in this sentence?

A.

We have examined this point. The proposed method is a deep learning-based detection network using a small number of parameters, so we agree that "with" should replace "for" to avoid confusing readers, and we have revised the sentence accordingly.

 

 

Q3.

On page 1, lines 19-20: could the authors add citations for autonomous driving, air traffic control, and image restoration?

A.

References for autonomous driving, air traffic control, and image restoration were indeed missing from the introduction. Following your comment, we added the following.

The rapid advancements and current level of computational power of deep learning-based methods allow them to be used in several applications, including autonomous driving systems [1], air traffic control [2], and image restoration [3], with high accuracy, demonstrating their capacity to replace existing traditional systems.

 

 

Q4.

On page 2, lines 55-56: "Therefore, this study proposes a new object detection network efficient shot detector (ESDet) using a lightweight deep learning method." Consider adding "--" to the sentence: "... a new object detection network -- efficient shot detector (ESDet)".

A.

We have revised the sentence as suggested to convey the network's name clearly to readers.

After change:

Therefore, this study proposes a new object detection network – efficient shot detector (ESDet) using a lightweight deep learning method.

 

 

Q5.

On page 4, line 122: "the correct answer data" reads oddly; the authors could use "ground truth data/label" instead.

A.

The original wording was misleading. Following your suggestion, we have unified the term as "groundtruth".

After change:

Additionally, the groundtruth for CNN-based object detection methods should be designed in advance to fit the network head during training. Furthermore, we explain the loss function that calculates the error between the prediction and the groundtruth during training.

 

 

Q6.

In Table 4, if possible, could the authors also report the GFLOPs/GMACs of the methods? These are important metrics for deciding whether an algorithm can run efficiently on a mobile device.

A.

The figures for the detection methods listed in Table 4 are cited from the reference literature. For a fair inference time measurement, the proposed detection network was measured with a batch size of 1 on an NVIDIA GeForce GTX Titan X. The measured GPU latency is 10.1 ms, which is equivalent to 99 FPS. We consider the figures quoted in Table 4 a fair comparison, as the cited inference speeds were measured in NVIDIA GeForce GTX Titan X or Titan V environments. However, since this was not stated in the main text, the following information was added at the beginning of Chapter 4.

 

In the experiments, the networks were trained with TensorFlow 2.4.1 on two NVIDIA GeForce RTX 3090 GPUs (48 GB). For all networks, we used the SGD optimizer with a momentum of 0.9. The initial learning rate was 0.005, and a weight decay of 0.0005 was applied to the weights and biases of the convolution filters. To measure inference time fairly, we set the batch size to 1 on an NVIDIA GeForce GTX Titan X 12 GB.
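For reference, a batch-size-1 latency measurement of this kind can be sketched in TensorFlow as follows. This is a minimal illustration, not our exact measurement script; MobileNetV2 is used only as a placeholder for the detector.

```python
import time
import numpy as np
import tensorflow as tf

# Placeholder model: any tf.keras.Model with a 512x512x3 input is timed the same way.
model = tf.keras.applications.MobileNetV2(
    input_shape=(512, 512, 3), include_top=False, weights=None)

dummy = tf.constant(np.random.rand(1, 512, 512, 3), dtype=tf.float32)  # batch size 1

# Warm-up runs so graph building and kernel setup do not skew the timing.
for _ in range(10):
    _ = model(dummy, training=False)

runs = 100
start = time.perf_counter()
for _ in range(runs):
    _ = model(dummy, training=False).numpy()  # .numpy() forces GPU synchronization
latency_ms = (time.perf_counter() - start) / runs * 1000
print(f"GPU latency: {latency_ms:.2f} ms -> {1000 / latency_ms:.0f} FPS")
```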

 

 

Q7.

If possible, the source code of this paper should be released after publication.

A.

We maintain a GitHub repository for the code. However, various comments in the README and in the code are written in Korean, and we are currently translating them into English. We will add the code when the paper is published.

Github address: https://github.com/chansoopark98/ESDet

 

Finally,

Thanks to your review, we believe the quality of the paper has improved further. We will complete the GitHub page before the paper is finally published.

 

The manuscript reflecting the amendments is attached as a PDF file.

Thank you.

 

Sincerely,

 

ChanSoo Park

Kwangwoon University

20 Kwangwoon-ro, Nowon-gu, Seoul 01897, Republic of Korea

Tel. No.: +82-2-940-5287

E-mail address: [email protected] (S.H. Lee) *Corresponding Author

E-mail address: [email protected] (C.S. Park)

 

Author Response File: Author Response.pdf

Reviewer 2 Report

 

The paper designs a lightweight network to improve detection performance, and the validation supports the main idea well. The problems are as follows:

  1. The contribution of the paper should be clearly presented. According to the paper, the main problem the authors want to solve is reducing the number of network parameters; however, the solution used, proposed in Eq. (3), is not a fresh idea. It would be better to explain how and why they use this operation, and the corresponding contributions, in a separate paragraph.
  2. The writing should be improved to avoid misunderstanding. For example, the following sentence is hard to follow: "The information about a large object when the feature size is compressed because it is compressed from the viewpoint of the entire image."
  3. It seems that the second equals sign in Eq. (3) does not hold.
  4. What is the meaning of "shrinkage"?
  5. Were the results in Table 4 calculated from the same data split? It seems that the input resolutions of the one-stage methods differ from each other. Did the authors reproduce the results or cite them from the references? If just citing, how can the FPS be compared fairly given the different implementation environments?

 

Comments for author File: Comments.docx

Author Response

September 4, 2021

 

Applied Sciences Special Issue "Advances in Small Infrared Target Detection Using Deep Learning"

 

To Reviewer:

 

Thank you very much for taking the time to review our manuscript despite your busy schedule. On reviewing the points you raised, we found content that could confuse readers, and we believe that correcting these points has improved the completeness of the paper. Thank you very much. We have faithfully answered each reviewed item; the changes are as follows.

 

 

Q1.

The contribution of the paper should be clearly presented. According to the paper, the main problem the authors want to solve is reducing the number of network parameters; however, the solution used, proposed in Eq. (3), is not a fresh idea. It would be better to explain how and why they use this operation, and the corresponding contributions, in a separate paragraph.

A.

As you pointed out, the explanation of the paper's main contributions was lacking. Accordingly, we amended the sentence in the abstract.

Before: Therefore, it is important to design a lightweight network with detection performance.

After: It is important to design a lightweight network that can be used in general-purpose environments such as mobile and GPU environments.

In addition, we added the main contributions of this paper to the introduction:

1. A lightweight pyramid-structured object detection network with few parameters is proposed. Although it uses fewer channels than existing pyramid structures, it can extract features efficiently through a structure that repeats the feature extraction several times. In addition, a feature refining process is added to the pyramid structure to suppress unnecessary feature information.

2. The one-stage detection method uses prior boxes because it performs detection for each feature map grid cell. In this paper, we redesigned the prior boxes to be robust to both small and large objects.

3. Based on the ESDet baseline, experiments were conducted by scaling the network up and down, proving that the proposed architecture can be used universally: it can be scaled up for tasks that require accuracy, and we report the efficiency with which it can be scaled down for mobile applications.

Q2.

The writing should be improved to avoid misunderstanding. For example, the following sentence is hard to follow: "The information about a large object when the feature size is compressed because it is compressed from the viewpoint of the entire image."

A.

We have reviewed this point and confirmed that the sentence was poorly expressed. The intention of the original sentence was that "the size of the object that can be detected differs for each scale of the feature map". Since a one-stage object detector performs detection for each feature map grid cell, large-scale feature maps (e.g., 64×64) detect small objects; conversely, small-scale feature maps (e.g., 4×4) detect large objects. The reason is that anchors are designed around the center point of each grid cell, as explained in Figure 4. We therefore modified the text as follows.

 

Before: The information about a large object when the feature size is compressed because it is compressed from the viewpoint of the entire image.

After: The size of the object that can be detected varies depending on the feature map scale. To consider both small and large objects, feature maps S1, S2, and S3 with different scales are extracted from the backbone.

 

 

Q3.

It seems that the second equals sign in Eq. (3) does not hold.

A.

The proposed method replaces standard convolution with a combination of depthwise and pointwise convolution. Equations (2) and (3) were written to compare the computational cost of the two methods. For a single standard convolution, the cost is the product of the kernel width and height (kernels are generally square) with the number of input channels and the number of output channels; in summary, kernel width × kernel height × input channels × output channels. A depthwise convolution, in contrast, applies one spatial filter to each channel separately. Taking RGB as an example, separate operations are applied to the R, G, and B channels, so each output channel is a specific filtered interpretation of one input channel. Because the operation is performed within each channel, the number of output channels equals the number of input channels.

 

Pointwise convolution is a convolutional layer with a fixed 1×1 kernel. It extracts no spatial features from the input; only per-channel computations are performed. That is, a kernel of size 1×1×C compresses the input features into one output channel, so each filter expresses a linear combination of the channels with learned coefficients. This makes it possible to change the number of channels through a per-channel linear combination, which can be understood as embedding a multi-channel input image into fewer channels. By reducing the number of output channels, we can reduce the computation and parameters of the next layer. In the linear combination over channels, unnecessary channels receive low coefficients and may be diluted in the result. Therefore, the pointwise term of Equation (3) is expressed as (C_input × C_output).

 

In the proposed method, depthwise convolution is applied first, followed by pointwise convolution. Since the number of output channels of the depthwise convolution equals the number of input channels, the total cost should be expressed as (F_w × F_h × C_input) + (C_input × C_output). The pointwise convolution term was misstated in the main text, and we have corrected it.
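To make the comparison concrete, here is a minimal TensorFlow sketch (with illustrative kernel and channel sizes, not the exact layers of ESDet) that checks the parameter counts implied by these formulas:

```python
import tensorflow as tf

F_w = F_h = 3          # kernel size
C_in, C_out = 64, 128  # illustrative input/output channel counts

inputs = tf.keras.Input(shape=(32, 32, C_in))

# Standard convolution: F_w * F_h * C_in * C_out weights.
std = tf.keras.layers.Conv2D(C_out, (F_h, F_w), padding="same", use_bias=False)
# Depthwise convolution: F_w * F_h * C_in weights (one spatial filter per channel).
dw = tf.keras.layers.DepthwiseConv2D((F_h, F_w), padding="same", use_bias=False)
# Pointwise convolution: C_in * C_out weights (1x1 kernel, channel mixing only).
pw = tf.keras.layers.Conv2D(C_out, (1, 1), use_bias=False)

std(inputs)      # build the standard branch
pw(dw(inputs))   # build the depthwise + pointwise branch

print("standard: ", std.count_params())                     # 3*3*64*128 = 73728
print("separable:", dw.count_params() + pw.count_params())  # 3*3*64 + 64*128 = 8768
```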

 

 

Q4.

What is the meaning of "shrinkage"?

A.

Shrinkage refers to the reduction factor relative to the input resolution.

However, the explanation in the text was insufficient, so we added the following to the explanation of Equation (7).

Added:

Shrinkage means the scaling factor of the current feature map relative to the input resolution.
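To make the definition concrete, a minimal sketch (the pyramid scales below are hypothetical, not the exact values in the paper):

```python
input_resolution = 512                  # hypothetical network input size
feature_map_sizes = [64, 32, 16, 8, 4]  # hypothetical pyramid scales

# Shrinkage: how much the input has been downscaled at each pyramid level.
shrinkages = [input_resolution / s for s in feature_map_sizes]
print(shrinkages)  # [8.0, 16.0, 32.0, 64.0, 128.0]
```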

 

 

Q5.

Were the results in Table 4 calculated from the same data split? It seems that the input resolutions of the one-stage methods differ from each other. Did the authors reproduce the results or cite them from the references? If just citing, how can the FPS be compared fairly given the different implementation environments?

A.

Q. Were the results in Table 4 calculated from the same data split?

A. The results in Table 4 were obtained by training the network on the PASCAL VOC07 train set and the PASCAL VOC12 train set together and then evaluating on the PASCAL VOC07 test set. Some tasks (e.g., object segmentation) train and evaluate on a single dataset, but object detection tasks commonly use both training sets together. In this experiment, both datasets were used for the quantitative evaluation.

 

Q. It seems that the input resolutions of the one-stage methods differ from each other.

A. In the case of a two-stage detector, the input resolution is often large because a region proposal network is used. The input resolution is also not fixed, because the network samples the regions where objects are most likely to exist regardless of the size of the input image.

 

On the other hand, a one-stage detector uses a fixed input resolution because it samples object regions using prior boxes, and the prior box scales are set according to the input resolution; this is the "shrinkage" mentioned earlier. In the case of SSD300, six feature map scales {38×38, 19×19, 10×10, 5×5, 3×3, 1×1} are extracted for the 300×300 input resolution. In general, one-stage detectors generate anchors for every grid cell, which increases recall, and smaller objects gain more candidate samples by which they can be detected. However, as the resolution increases, the number of proposals grows, so the number of parameters to be learned increases, which raises computation and training time. In this paper, we designed the prior boxes and network for a 512×512 input resolution to maintain accuracy while minimizing the loss of speed.
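As a rough illustration of how the grid size drives the number of prior boxes (the SSD300 feature map scales are from the SSD paper; the fixed per-cell box count is a simplification, since SSD actually varies it per level):

```python
# SSD300 feature map scales and a simplified fixed number of prior
# boxes per grid cell.
feature_maps = [38, 19, 10, 5, 3, 1]
boxes_per_cell = 6  # simplification; SSD uses 4 or 6 depending on the level

total = sum(s * s * boxes_per_cell for s in feature_maps)
print(total)  # 11640 prior boxes across all levels
```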

 

Q. Did the authors reproduce the results or cite them from the references?

A. We cite them from the references. After training on the PASCAL VOC 07+12 dataset, only results evaluated on the PASCAL VOC 07 test set were cited.

 

Q. If just citing, how can the FPS be compared fairly given the different implementation environments?

A. Speed is normally measured as GPU latency and converted to FPS (frames per second); however, the description of the FPS figures in Table 4 was insufficient. The GPU used for latency measurement differs from the equipment used for training. For a fair inference time measurement, the proposed detection network was measured with a batch size of 1 on an NVIDIA GeForce GTX Titan X. The measured GPU latency is 10.1 ms, which is equivalent to 99 FPS. We consider the figures quoted in Table 4 a fair comparison, as the cited inference speeds were measured in NVIDIA GeForce GTX Titan X or Titan V environments.

The following was added at the beginning of Chapter 4:

In the experiments, the networks were trained with TensorFlow 2.4.1 on two NVIDIA GeForce RTX 3090 GPUs (48 GB). For all networks, we used the SGD optimizer with a momentum of 0.9. The initial learning rate was 0.005, and a weight decay of 0.0005 was applied to the weights and biases of the convolution filters. To measure inference time fairly, we set the batch size to 1 on an NVIDIA GeForce GTX Titan X 12 GB.

 

Finally,

We plan to release the source code before the current paper is published. The description (README) and code comments are in Korean, and we are working on translating them into English.

 

The current working GitHub address is as follows:

 

Github address: https://github.com/chansoopark98/ESDet

 

The manuscript reflecting the amendments is attached as a PDF file.

 

Thank you again, and we hope you have a nice day.

Sincerely,

ChanSoo Park

Kwangwoon University

20 Kwangwoon-ro, Nowon-gu, Seoul 01897, Republic of Korea

Tel. No.: +82-2-940-5287

E-mail address: [email protected] (S.H. Lee) *Corresponding Author

E-mail address: [email protected] (C.S. Park)

 

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The authors have addressed my concerns. Therefore, I vote for acceptance.

Reviewer 2 Report

No more questions.
