Article
Peer-Review Record

A Method for Enhancing the Accuracy of Pet Breeds Identification Model in Complex Environments

Appl. Sci. 2024, 14(16), 6914; https://doi.org/10.3390/app14166914
by Zhonglan Lin, Haiying Xia *, Yan Liu, Yunbai Qin * and Cong Wang
Reviewer 1: Anonymous
Reviewer 3: Anonymous
Submission received: 2 June 2024 / Revised: 27 July 2024 / Accepted: 2 August 2024 / Published: 7 August 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

In this article, the authors present a method that utilizes transfer learning to enhance the accuracy of pet breed classification against complex backgrounds. For the experiments, the authors developed a dataset for pet breed identification in complex backgrounds (the CPI dataset), derived from an existing one (the Oxford-IIIT Pet dataset) by selecting images according to specific criteria (such as only images with complex backgrounds, interference, or redundant information). In the experiments, three models with different architectures, pre-trained on ImageNet, were fine-tuned on the CPI and Oxford-IIIT Pet datasets. The authors evaluated their performance on the same test set of complex images, and the results showed that the Top-1 accuracy of the models fine-tuned on the CPI dataset was higher than that of the models fine-tuned on the Oxford-IIIT Pet dataset, suggesting that the CPI dataset is better suited for models that operate in real-world environments.

As already explained, the results are promising, and the study design and experiments are well presented. However, in Section 4, where the results of the experiments are presented and compared, I suggest the authors better explain what the differences in performance mean. Are these differences large or significant? Where they claim that an improvement is significant, please explain why it should be considered significant, and explain how these differences should be interpreted.

The discussion part is rather short. Although the authors stated that they confirmed the validity of the improvements through ablation studies, I suggest the authors provide a more thorough discussion on the threats to the reliability and validity of the results in all steps of their research. The main limitations of the study must also be identified and discussed.

Author Response

July 27, 2024

Dear reviewer,

Thank you very much for your recognition of our work and your important suggestions. Based on your suggestions, we have made a series of modifications to the paper. Please find the corresponding revisions/corrections highlighted in the resubmitted document.

Comments 1: However, in Section 4, where the results of the experiments are presented and compared, I suggest the authors better explain what the differences in performance mean. Are these differences large or significant? Where they claim that an improvement is significant, please explain why it should be considered significant, and explain how these differences should be interpreted.

Response 1: In Subsection 4.3, we explain the purpose of the comparative experiment between the two datasets and present the results of comparing the improved CPI dataset with the original dataset. The top-1 accuracy of the three models improved by 0.36%, 0.09%, and 1.26%, respectively. Furthermore, the significance of the dataset improvement is also reflected in the improved F1 scores of the three models.

Regarding the significance of the model improvements, we describe in detail the ablation experiments conducted on PBI-EdgeNeXt in Subsection 4.4 and present their results. As mentioned in the article, the ablation results show that each step of our model improvement enhances the model's classification ability to a certain extent.

In addition, as described in Subsection 4.5, we compared the improved model with five other models. We evaluated classification accuracy on the validation set during the training of the six models and then evaluated them on the test set after training. The test results are presented in Table 4, and the accuracy curves during training are displayed in Figure 5. In Subsection 4.5, we use these numerical indicators to analyze the differences between the models in detail and to further verify the importance of the model improvements as a whole.

Comment 2: The discussion part is rather short. Although the authors stated that they confirmed the validity of the improvements through ablation studies, I suggest the authors provide a more thorough discussion on the threats to the reliability and validity of the results in all steps of their research. The main limitations of the study must also be identified and discussed.

Response 2: Based on your suggestion, in the revised manuscript we have provided explanations and introductions for each experiment in the "Discussion" section, while the detailed analysis of the experimental results remains in the "Experiment and Analysis" section. Additionally, the limitations of the proposed model and possible solutions are now discussed in the "Discussion" section.

Thank you again for taking time out of your busy schedule to review this manuscript!

Sincerely yours,

Zhonglan Lin, Ms

School of Electronic and Information Engineering, Guangxi Normal University, Guilin 541004, China

Email: [email protected]

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript is interesting, the methodology is well structured, and the results have strong statistical verification. The models are well justified, and the presentation of the model results is very accurate. The discussion is brief but mentions the importance of the study. The conclusions are well written. The only thing missing is a mention, supported by references, of the real interest or application of this type of classification. I think it could have a very interesting application in identifying diseases in dogs and cats or something similar.

Author Response

July 27, 2024

Dear reviewer,

Thank you very much for your recognition of our work and your important suggestions. Based on your suggestions, we have made a series of modifications to the paper. Please find the corresponding revisions/corrections highlighted in the resubmitted document.

Comment 1: The only thing missing is a mention, supported by references, of the real interest or application of this type of classification. I think it could have a very interesting application in identifying diseases in dogs and cats or something similar.

Response 1: Following your suggestion, we have made the following additions in the Discussion section of the revised manuscript. We describe a scheme that uses an object detection model to identify pets within images; the detected pet regions are then fed into our model as regions of interest (ROIs) for breed classification.

One promising idea is to use a real-time object detection model such as YOLO (You Only Look Once) in conjunction with PBI-EdgeNeXt to build a pipeline that can be deployed on edge devices or mobile phones. Such an application would be both practical and engaging, catering not only to researchers in the field but also to pet enthusiasts. Moving forward, our team may consider implementing these applications on the NVIDIA Jetson Nano or an FPGA.
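For illustration, a minimal sketch of such a detect-then-classify pipeline is given below. It assumes a COCO-pretrained YOLOv8 detector from the ultralytics package and uses a torchvision classifier as a placeholder for PBI-EdgeNeXt; the model names, weight files, and class indices are assumptions for this sketch, not our released implementation.

```python
import torch
from PIL import Image
from torchvision import models, transforms
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")  # assumed COCO-pretrained detector (class 15 = cat, 16 = dog)
classifier = models.mobilenet_v3_small(weights="IMAGENET1K_V1")  # placeholder for PBI-EdgeNeXt
classifier.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def classify_pets(image_path):
    """Detect cats/dogs, crop each region of interest, and classify the crop."""
    image = Image.open(image_path).convert("RGB")
    result = detector(image_path)[0]
    predictions = []
    for box, cls in zip(result.boxes.xyxy.tolist(), result.boxes.cls.tolist()):
        if int(cls) not in (15, 16):          # keep only cat/dog detections
            continue
        x1, y1, x2, y2 = map(int, box)
        roi = preprocess(image.crop((x1, y1, x2, y2))).unsqueeze(0)
        with torch.no_grad():
            breed_id = classifier(roi).argmax(dim=1).item()
        predictions.append(((x1, y1, x2, y2), breed_id))
    return predictions
```

On an edge device, both models could be exported to a lighter runtime (e.g., TensorRT or ONNX), but the overall detect-crop-classify flow would remain the same.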

Additionally, it is worth noting that the research significance of, and interest in, classifying pet cat and dog breeds, as well as animal breeds in general, are discussed in references 23 and 26 of our paper. The authors of reference 23 collected photos of cats and dogs of various breeds through several communication platforms for pet cat and dog enthusiasts, ultimately creating the Oxford-IIIT Pet dataset, which serves as the precursor to the CPI dataset we developed.

Thank you again for taking time out of your busy schedule to review this manuscript!

Sincerely yours,

Zhonglan Lin, Ms

School of Electronic and Information Engineering, Guangxi Normal University, Guilin 541004, China

Email: [email protected]

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript presents a dog breed classification task using convolutional neural network architectures. The paper is well written; here I share a few comments for improvement.

 

1. Add numeric metrics in the abstract. How much do the metrics improve with your architectural changes?

2. You mention training and testing sets. No validation set? Having a validation set is standard practice.

3. Expand the figure captions. They should be self-explanatory.

4. Give more details about the resulting dataset. What categories does it have and how many samples per category?

5. Give more details about the positional encoding. That is an architectural block of transformers used to keep track of the token order. How is it used here? What positions are you using? Do you have position information in your dataset? Please elaborate on how it is used.

6. Give visual examples of where SOTA models fail and your model succeeds.

7. Disclose the use of generative AI.

Author Response

July 27, 2024

Dear reviewer,

We appreciate your recognition of our paper and your important comments. Based on your suggestions, we have made a series of modifications to the paper. Please find the corresponding revisions/corrections highlighted in the resubmitted document.

Comment 1: Add numeric metrics in the abstract. How much do the metrics improve with your architectural changes?

Response 1: We have added numeric metrics to the abstract of the revised manuscript. In the comparative experiment across the two datasets, DenseNet's top-1 accuracy improved from 89.10% to 89.19%, while Swin Transformer's top-1 accuracy increased by 1.26%, the largest improvement. In the comparative experiment involving the proposed model and the five selected models, the proposed model achieved the highest top-1 accuracy of 87.12%.

Comment 2: You mention training and testing sets. No validation set? Having a validation set is standard practice.

Response 2: Following your suggestion, we have supplemented the description of the dataset division in the Experimental Setup section. In fact, we did divide the dataset into training, validation, and test sets: the training set comprises 3552 images, the validation set 888 images, and the test set 1110 images. Moreover, the analysis of Figure 5 in Subsection 4.5 of our paper is based on this validation set.
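For reference, the split corresponds to roughly 64% / 16% / 20% of the 5550 images (3552 / 888 / 1110). A minimal sketch of such a stratified split, assuming scikit-learn, is shown below; it illustrates the proportions rather than the exact script used to build the CPI splits.

```python
from sklearn.model_selection import train_test_split

def split_dataset(image_paths, labels, seed=42):
    # First carve off the 20% test set (1110 of 5550 images), then split the
    # remainder 80/20 so that 64% (3552) is training and 16% (888) is validation.
    trainval_x, test_x, trainval_y, test_y = train_test_split(
        image_paths, labels, test_size=0.20, stratify=labels, random_state=seed)
    train_x, val_x, train_y, val_y = train_test_split(
        trainval_x, trainval_y, test_size=0.20, stratify=trainval_y, random_state=seed)
    return (train_x, train_y), (val_x, val_y), (test_x, test_y)
```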

Comment 3: Expand the figure captions. They should be self-explanatory.

Response 3: We agree with this comment. Based on your suggestion, we have added necessary captions below each figure in the new version of the manuscript. Please refer to our most recent submission for these additions.

Comment 4: Give more details about the resulting dataset. What categories does it have and how many samples per category?

Response 4: In Subsection 3.1, we have added a more detailed introduction to the CPI dataset and included a table describing the 37 categories it covers, along with the number of samples in each category. Please refer to Subsection 3.1 and Table 1 in the new version of the manuscript.

Comment 5: Give more details about the positional encoding. That is an architectural block of transformers used to keep track of the token order. How is it used here? What positions are you using? Do you have position information in your dataset? Please elaborate on how it is used.

Response 5: The original EdgeNeXt is a hybrid architecture that effectively combines the strengths of both CNN and Transformer models. EdgeNeXt retains the Positional Encoding (PE) adopted from ViT, adding it only once before the SDTA block in the second stage to encode spatial location information.

Our PBI-EdgeNeXt still utilizes the PE module. Moreover, as mentioned in section 3.2.2:

(1) Crucial positional information is typically lacking during the initial stages of convolutional neural network training. The lack of explicit PE not only impedes the model's convergence but also potentially impacts its ultimate performance. (In the new version of the manuscript, we have added a theoretical basis for this point in Section 3.3.2, namely the newly added citation 36.)

(2) This paper aims to improve the accuracy of pet breed classification against complex backgrounds. As mentioned in our paper, the images we use have complex backgrounds and low signal-to-noise ratios. Therefore, we additionally add the PE module in stages 3 and 4 of our PBI-EdgeNeXt.

(3) This addition provides additional spatial information for our model. As stated by the authors of reference 3 (citation 3 in our paper), the additional spatial information enhances the model's ability to comprehend the relative and absolute positional relationships of each part of the image.

We not only retain the PE module in stage 2 but also incorporate it into stages 3 and 4. In the ablation experiment section, we demonstrate the necessity and effectiveness of these additions step by step. Additionally, in the comparative experiment section, we compare the enhanced PBI-EdgeNeXt with five models, showcasing the feasibility and effectiveness of our improvements from a holistic perspective.

Our CPI dataset does not contain position information such as bounding boxes for object detection or masks for segmentation. We incorporate positional encoding into the model not to capture such annotations, but to establish the positional relationships among features within the image. For instance, only when the eyes, nose, and mouth are in their expected positions on a face (e.g., eyes below the forehead, lips beneath the nose) can the image be recognized as a face.

Last but not least, the PE we utilize performs calculations as described in Equations (1) and (2) of our paper.
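As a purely illustrative sketch of how explicit position information can be added to token features, the snippet below implements the classic sinusoidal positional encoding; the exact variant used in PBI-EdgeNeXt is the one defined by Equations (1) and (2) of the paper, so the formulation and dimensions here should be treated as assumptions.

```python
import math
import torch

def sinusoidal_positional_encoding(num_tokens, dim):
    """Return a (num_tokens, dim) table: sine on even channels, cosine on odd."""
    position = torch.arange(num_tokens, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / dim))
    pe = torch.zeros(num_tokens, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Adding the encoding to flattened stage features before an attention block:
tokens = torch.randn(1, 196, 192)                  # (batch, H*W tokens, channels)
tokens = tokens + sinusoidal_positional_encoding(196, 192)
```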

Comment 6: Give visual examples of where SOTA models fail and your model succeeds.

Response 6: Following your suggestion, in Subsection 4.6 we have added the results and analysis of an additional comparative experiment involving the proposed model and the five selected models. We selected a variety of complex images, had the six models extract image features, and then used saliency map visualization to show the attention of the six models on different parts of each image. This comparison helps explain why the SOTA models fail and why our model succeeds. We selected three of the most representative results from this experiment to showcase. Please refer to Figure 7 and Subsection 4.6 in our most recently submitted manuscript.
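For context, a minimal sketch of gradient-based saliency for a PyTorch classifier is shown below; the specific visualization technique behind Figure 7 may differ, so this only illustrates the general idea of highlighting the image regions a model attends to.

```python
import torch

def saliency_map(model, image, target_class=None):
    """image: a (1, 3, H, W) normalized tensor; returns an (H, W) importance map."""
    model.eval()
    image = image.clone().requires_grad_(True)
    logits = model(image)
    if target_class is None:
        target_class = logits.argmax(dim=1).item()
    logits[0, target_class].backward()   # gradient of the class score w.r.t. input pixels
    # The maximum absolute gradient over the channel axis gives per-pixel importance.
    return image.grad.abs().max(dim=1).values.squeeze(0)
```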

Comment 7: Disclose the use of generative AI.

Response 7: In fact, we used generative artificial intelligence tools, such as txyz, to help us search for relevant literature. However, we did not use any generative artificial intelligence tools in other aspects, such as the writing of the paper.

Thank you again for taking time out of your busy schedule to review this manuscript!

Sincerely yours,

Zhonglan Lin, Ms

School of Electronic and Information Engineering, Guangxi Normal University, Guilin 541004, China

Email: [email protected]

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have revised the manuscript based on the comments given in the previous review step.
