Article
Peer-Review Record

Evaluation of Deformable Convolution: An Investigation in Image and Video Classification

Mathematics 2024, 12(16), 2448; https://doi.org/10.3390/math12162448
by Andrea Burgos Madrigal, Victor Romero Bautista, Raquel Díaz Hernández and Leopoldo Altamirano Robles *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 29 May 2024 / Revised: 25 July 2024 / Accepted: 2 August 2024 / Published: 7 August 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

In this paper, the authors present a guideline to use the DCON and contribute to understanding the impact of DCON on the robustness of CNNs. However, there are still areas in the paper that need to be explained in detail. I'll state some comments that I hope the authors find useful toward improving their manuscript.


1. As readers, we find it a little puzzling that the authors did not explore DCN V3, which uses less memory and processing power. The work appears to be shallow, consisting solely of experiment-based investigation.

2. The introduction does not provide a thorough review to summarize recent research.

3. In Section 4.2.1, the authors compare accuracy using ResNet-18 vanilla model and DCNN configurations. Do discontinuous instances (like conv3,5) need to be taken into account?

4. Regarding the Basic block and Bottleneck block structures in Figure 3, there is no additional ReLU activation operation before the Add operation in the official ResNet network.

5. In the explanation of some variables in the formulas, such as in equations (2) and (5), the beginning of the subsequent paragraphs should not be indented. Some formulas do not have punctuation marks after them, such as formulas (1) and (6).

6. The title of Table 1 is left aligned, while the title of Table 2 is centered. The article should use a unified representation method. Adjust Table 5 and place it on the page.


Author Response

[Comment 1:] As readers, we find it a little puzzling that the authors did not explore DCN V3, which uses less memory and processing power. The work appears to be shallow, consisting solely of experiment-based investigation.

[Response 1:] An analysis was performed to decide which DCN version to apply. We decided to include a table in Section 4.1 that shows the accuracy results of three implementations of deformable convolution, which helped us decide on DCNv2. Although DCNv3 appears to be lighter and to capture long-range dependencies and adaptive spatial aggregation [1], we obtained slightly lower accuracy with it. In addition, in the structure of DCNv3 the number of input channels must equal the number of output channels, which limits its use in the small model without adding more layers to increase the number of channels. We also included the following explanation in Section 2.2:

DCNv3 and DCNv4 present a valuable approach to deformable convolution, with promising results in large-scale models. In this study, we decided to use DCNv2 as the main operator instead of DCNv3 and DCNv4, since most of the adjustments made to DCNv2 to extend it to DCNv3 and DCNv4 target massive data and large-scale foundation models. Our study, however, focuses on exploring the performance of deformable convolution in low- and medium-scale models and datasets for image classification and action classification in video, and DCNv2 captures the main idea of deformable convolution.

We conducted a comparative study between DCNv2 and DCNv3 for the image classification task using the Cats & Dogs, EyePacs, Spyder & Chicken, and Shapes datasets, with training sets of between 2.5K and 4K samples, and using a low-scale CNN composed of 4 convolutional layers for feature extraction (the small model). The results are shown in Table 2, where DCNv2 performed better.
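For readers who want to reproduce this kind of comparison, the sketch below shows how a DCNv2-style deformable convolution (offsets plus a modulation mask) can be assembled with torchvision's DeformConv2d operator. It is an illustrative sketch under assumed tooling (PyTorch with torchvision >= 0.9), not the exact code used in the paper; the class and variable names are ours.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d  # modulation mask support requires torchvision >= 0.9

class DeformConvBlock(nn.Module):
    """DCNv2-style block (sketch): offsets and a modulation mask are predicted from
    the input and passed to the deformable convolution."""
    def __init__(self, in_ch, out_ch, k=3, stride=1, padding=1):
        super().__init__()
        # 2 offset values (x, y) plus 1 mask value per kernel position
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, k, stride=stride, padding=padding)
        self.mask_conv = nn.Conv2d(in_ch, k * k, k, stride=stride, padding=padding)
        self.deform_conv = DeformConv2d(in_ch, out_ch, k, stride=stride, padding=padding)

    def forward(self, x):
        offset = self.offset_conv(x)
        mask = torch.sigmoid(self.mask_conv(x))   # modulation scalars in [0, 1]
        return self.deform_conv(x, offset, mask)

x = torch.randn(1, 64, 32, 32)
block = DeformConvBlock(64, 128)
print(block(x).shape)  # torch.Size([1, 128, 32, 32])
```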

[Comment 2:] The introduction does not provide a thorough review to summarize recent research.

[Response 2:] In Section 1 we have updated and enriched our list of references on related work that leads to the gap from which deformable convolutions emerge. We also extended the information on the state of the art.

[Comment 3:] In Section 4.2.1, the authors compare accuracy using ResNet-18 vanilla model and DCNN configurations. Do discontinuous instances (like conv3,5) need to be taken into account?

[Response 3:]

- We have extended the experiments using discontinuous combinations. For the small model we added evaluations at stages stage1,3; stage1,4; and stage2-4.

For the ResNet models we extended the evaluations to stages conv2,4; conv2,5; and conv3,5.

- The results of these new experiments are included in the paper in Tables 5 and 6 for the small model, and in Tables 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, and 19 for the ResNet models.

[Comment 4:] Regarding the Basic block and Bottleneck block structures in Figure 3, there is no additional ReLU activation operation before the Add operation in the official ResNet network.

[Response 4:] We apologize for this mistake and thank the reviewer for the observation. We have corrected the graphical representation of the ResNet model blocks by modifying Figure 3, removing the ReLU activation function before the (residual) summation operation.

[Comment 5:] In the explanation of some variables in the formulas, such as in equations (2) and (5), the beginning of the subsequent paragraphs should not be indented. Some formulas do not have punctuation marks after them, such as formulas (1) and (6).

[Response 5:] The suggested changes were incorporated to improve readability.

[Comment 6:] The title of Table 1 is left aligned, while the title of Table 2 is centered. The article should use a unified representation method. Adjust Table 5 and place it on the page.

[Response 6:] We unified the table formats.

Reviewer 2 Report

Comments and Suggestions for Authors

No new novelty; the same work has already been published on the following website:

1. https://towardsdatascience.com/review-dcn-deformable-convolutional-networks-2nd-runner-up-in-2017-coco-detection-object-14e488efce44

Author Response

[Comment 1:] No new novelty; the same work has already been published on the following website:

1. https://towardsdatascience.com/review-dcn-deformable-convolutional-networks-2nd-runner-up-in-2017-coco-detection-object-14e488efce44

[Response 1:] Thanks for your observation. We consider that our study differs from the shared link. First, we explore all the layers that make up the network, not only layer 5 and part of layer 4. We performed the analysis starting from a simpler model and extended the study to ResNets of different depths, thus avoiding a focus on only the deeper ones, as the related works mentioned in the article do. The novelty of our work lies in analyzing in more detail the behavior and influence of deformable convolutions (and combinations of them). We also looked for the best kernel size. Finally, we applied the study to both images and video.

Reviewer 3 Report

Comments and Suggestions for Authors

The paper aims to clarify the optimal way to replace the standard convolution in a CNN model with its deformable counterpart (DCON) when dealing with binary balanced classes in 2D image classification (represented by public datasets such as Cats & Dogs, EyesPACS, Spyders & Chickens, and Shapes) and 3D data for action recognition (the UCF101 and Human2 datasets). The final declared contribution lies in a guideline to use DCON.

The proposed models are well known in the literature, so the novelty is small. Some aspects have to be clarified with respect to the use and aims of the paper, since it claims to be a guideline, i.e., to provide indications for a future course of action. This is not well highlighted, which heavily affects the quality of the paper; important improvements are needed.

The main suggestions follow:

- The Abstract does not respect the standard (a total of about 200 words maximum, following the style of structured abstracts), and a reconstruction is needed.

- In the Introduction, rows 34-38 require an in-depth analysis, as do rows 51-59, in which the most important papers that inspired this work are reported. In particular, it has to be clear what the results in those works were, with numerical indicators (for instance in “finding that accuracy improves with more DCONs” and so on), and what the main differences and contributions of the proposed procedure are with respect to the others, anticipating something of Paragraph 3.

- Even if the paper is conceived for skilled readers, Sections 2.1, 2.2 and 2.3 require a more extensive presentation of the models, and the figures need better captions and a stricter link with the terms of the presented concepts/equations. For instance, what is MaxPool(2) in Figure 2? Or how were the ReLU activation function and scale transformation realized? In other words, the parameters or configurations necessary to repeat the experiments are to be clearly stated, not hinted at or concealed in some figure in the text. I know very well that a large literature (theory and figures) on these topics already exists, but an effort to give a clear explanation is always useful, without being excessive in writing or distracting the reader from the central ideas.

- What is gradient in row 129?

- In 2.5 and 2.6, the original datasets are described, but the most important characteristics (size, number of images/frames, etc.) are to be summarized in a table, especially for huge datasets. For instance, for Cats & Dogs it is important to declare not only the total number of images but, above all, how many images (training/test) were employed in the experiments.

- In 2.7 a link to the related figure is missing.

- Rows 209-216 merit clarification, since a guideline is addressed, and how “The choice of stride is a trade-off that needs to be carefully considered based on the specific task and dataset” is to be pointed out.

- There is a bit of confusion among “small network”, “small model” and “vanilla (model base)”: what do they consist of? Please explain clearly and refer to them in a unique manner if they are the same thing.

- Instead of “These configurations are shown in Table 1”, write “The results on the considered configurations (insert here the number of the proper Section) are shown in Table 1” (row 245).

- Explain better the sentence in row 246.

- Always insert a numerical indicator in sentences such as “the best accuracy (??) is achieved…”.

- What are str 1 and str 2 in Tables 15-17?

- Rows 431-435 require an in-depth analysis. In other words, the promised guidelines are to be clearly listed and summarized at the end of the Section.

- Rows 451-452 are a bit obscure or simplistic. Please clarify.

- Typos and inhomogeneity in writing the same term need to be corrected.

Comments on the Quality of English Language

Moderate editing of English language required.

Author Response

[Comment 1:] The Abstract does not respect the standard (a total of about 200 words maximum, following the style of structured abstracts) and a reconstruction is needed.

[Response 1:] We shortened the abstract to comply with the 200-word limit.

[Comment 2:] In the Introduction, rows 34-38 require an in-depth analysis, as well as rows 51-59, in which the most important papers that inspired this work are reported. In particular, it has to be clear what were the results in those works, with numerical indicators (for instance in “finding that accuracy improves with more DCONs” and so on) and what are the main differences and contribution of the proposed procedure from the others, anticipating something of Paragraph 3.

[Response 2:] Thanks for the observation. In Section 1 we included information about the convolutional layers. Also, we included numerical indicators about the related work.

[Comment 3:] Even if the paper is conceived for skilled readers, Sections 2.1, 2.2 and 2.3 require a more extensive presentation of the models, and figures have to be a better caption and a more strict link with the terms of the presented concepts/equations. For instance, what is MaxPool(2) in Figure 2? or how ReLU activation function and scale transformation were realized? In other words, parameters or configurations necessary to repeat the experiments are to be clearly stated and not hinted at or concealed in some figure in the text. I know very well that a large literature (theory and figures) on these topics already exist, but an effort in making a clear explanation is always useful, without being excessive in writing or distracting the reader from central ideas.

[Response 3:] We have redefined Sections 2.1, 2.2, and 2.3, where we extended the presentation of the traditional (standard) convolution models, deformable convolution, and the small model used for the experiments. In addition, we added Figures 1 and 2, which graphically represent the standard and deformable convolution operators; and we modified Figure 3, which shows the modules that compose the small model.

[Comment 4:]What is gradient in row 129?

[Response 4:] We specified the gradient definition by adding the following text:

The idea behind the residual block is to modify the flow of the gradient (which updates the weights of the network during training) by adding a connection that allows an optimal gradient flow, reducing the vanishing-gradient problem and establishing a shortcut that avoids losing the input information. This is done by adding the input signal to the output of the last convolutional layer of each residual block. In a residual block, the input is called the residual information.
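As a minimal sketch of this description (not the exact block used in the paper), a ResNet basic block with the shortcut addition and the ReLU placed after the addition, matching the corrected Figure 3, looks roughly like this:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Minimal ResNet basic block (illustrative sketch): the input (residual information)
    is added to the output of the second convolution; ReLU is applied after the addition,
    not before it, as in the official ResNet design."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + x            # shortcut: the gradient can flow directly through this addition
        return self.relu(out)    # activation after the residual sum
```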

[Comment 5:]In 2.5 e 2.6, the original datasets are described, but the most important characteristics (size, number of images/frames, etc.) are to be summarized in a table, especially for huge datasets, for instance cats & Dogs, it is important to declare not only the total number but above all how many images (training/test) were employed in the experiments.

[Response 5:] In Section 2.7, we included a table that contains the total number of samples and how many were used for training and testing. We also report the average number of frames that make up the videos in the 3D datasets.

[Comment 6:] In 2.7 a link to the related figure is missed.

[Response 6:] Thanks for the observation; we added the link.

[Comment 7:] Rows 209-216 merit to be clarified, since a guideline is addressed and how “The choice of stride is a trade-off that needs to be carefully considered based on the specific task and dataset” is to be pointed out.

[Response 7:] We believe the sentence was not described correctly, so we added information for a better understanding:

“…More importantly, a larger stride will capture more global features but can also overlook helpful information. In neural networks, since the stride specifies the number of pixels by which the filter matrix moves across the input matrix, the choice of stride is a trade-off that needs to be carefully considered based on the specific task and dataset. With stride 2 the dimensionality is reduced; we analyzed whether the deformable convolution takes advantage of this.”
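For reference, the arithmetic behind this trade-off follows the standard convolution output-size formula; the 224×224 input below is only an assumed example:

```python
# Output spatial size of a convolution: floor((H + 2*padding - kernel) / stride) + 1
H, k, p = 224, 3, 1                      # assumed input size, kernel size, padding
print((H + 2 * p - k) // 1 + 1)          # stride 1 -> 224 (resolution preserved)
print((H + 2 * p - k) // 2 + 1)          # stride 2 -> 112 (resolution halved)
```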

[Comment 8:] There is a bit of confusion among “small network”, “small model” and “vanilla (model base)”: what do they consist of? Please explain clearly and refer to them in a unique manner if they are the same thing.

[Response 8:] We replaced “small network” with “small model” and added a specification of the vanilla model.

[Comment 9:] Instead of “These configurations are shown in Table 1”, write “The results on the considered configurations (insert here the number of the proper Section) are shown in Table 1” (row 245).

[Response 9:] We added the reference to the section, thus clarifying the configuration.

[Comment 10:] Explain better the sentence in row 246.

[Response 10:] We agree with the reviewer and added an explanation.

“The accuracy below is after applying the scaling operation illustrated in Figure 6; as a result, it reflects how robust the configuration is on the Triangle & Square classes.”

[Comment 11:] Always insert a numerical indicator in sentences such as “the best accuracy (??)is achieved…..”.

[Response 11:] We added numerical indicators throughout the entire document, such as:

“increment by 7.63%”, “increments by 21.55%, and combining two deformable stages in 3 and 4 the quantity of FLOPs reaches an improvement of 38.69% with an accuracy improvement between 0.01 and 0.03”, “by 0.001 to 0.006”, “with 0.951”, “by approximately 0.04”, “with 0.949”.

[Comment 12:] What are str 1 and str 2 in Tables 15-17?

[Response 12:] “Str” was replaced with “Stride” to avoid any confusion.

[Comment 13:] Rows 431- 435 require an in-depth analysis. In other words, the promised guidelines are to be clearly listed and summarized at the end of the Section.

[Response 13:] We complemented the information as follows:

Specifically, in ResNets we highly recommend using two stages of deformable convolution. The fourth stage contains a number of convolutions that increases with the selected depth, so we suggest including the deformable convolution in the third and fifth stages. In other models it will depend on the structure, but we still suggest targeting the later layers. Nonetheless, more studies of the data involved are necessary to understand the influence and needs of the application at hand.
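As a hypothetical illustration of this guideline (reusing the DeformConvBlock sketched in Response 1 to Reviewer 1, and assuming a torchvision ResNet-18 whose layer2 and layer4 correspond to the conv3_x and conv5_x stages), the replacement could be done along these lines; the deformify_stage helper is ours, not part of any library:

```python
import torchvision

def deformify_stage(stage):
    # Replace the second 3x3 convolution of each basic block in a ResNet stage
    # with a deformable block of matching shape (DeformConvBlock from the earlier sketch).
    for block in stage:
        c = block.conv2
        block.conv2 = DeformConvBlock(c.in_channels, c.out_channels, k=3, stride=1, padding=1)

model = torchvision.models.resnet18(weights=None)  # pretrained=False on older torchvision
deformify_stage(model.layer2)   # conv3_x stage
deformify_stage(model.layer4)   # conv5_x stage
```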

[Comment 14:]Rows 451- 452 are a bit obscure or simplistic. Please, clarify.

[Response 14:] We specified the future work with a better description:

we propose to look for a quantitative paradigm that permits measuring datasets to interpret the information learned by the networks and to evaluate the opportunities that deformable networks can offer for action recognition.

[Comment 15:]Typo errors, inhomogeneity in writing the same term need to be corrected.

[Response 15:] We believe these kinds of mistakes have now been corrected.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

No comments

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript has been sufficiently improved to warrant publication.
