Peer-Review Record

Efficient Neural Network for Text Recognition in Natural Scenes Based on End-to-End Multi-Scale Attention Mechanism

Electronics 2023, 12(6), 1395; https://doi.org/10.3390/electronics12061395
by Huiling Peng, Jia Yu * and Yalin Nie
Reviewer 2: Anonymous
Reviewer 3:
Reviewer 4: Anonymous
Submission received: 20 February 2023 / Revised: 9 March 2023 / Accepted: 11 March 2023 / Published: 15 March 2023
(This article belongs to the Section Artificial Intelligence)

Round 1

Reviewer 1 Report

The authors have proposed a new efficient neural network for text recognition in natural scenes based on an end-to-end multi-scale attention mechanism. The manuscript is interesting to read but has several technical concerns that limit its presentation and professionalism. The authors must improve the language of the text; there are many grammatical errors that make reading difficult.

State the actual finding of the research in the abstract. Numerical results should also be included in the abstract.

Efficient deep learning network (EE-ACNN), feature pyramid network (FPN): abbreviations should be introduced properly.

Mention the name of the dataset in the abstract section.

The introduction of the paper is lengthy; reduce the content.

We could not find any citations in the introduction section; the authors must include recent papers in the introduction.

Arrange all the references in sequential order.

Overall, state-of-the-art methods and their discussion are missing from the introduction section. The purpose of the end-to-end multi-scale attention model is discussed, but the background of these methods is not discussed well.

The related works section should be rewritten properly with a taxonomy; it is hard to understand the need for the proposed model. The authors should explain the limitations of each paper at the end of the discussion.

Figures 1-6: the image quality is poor; the text portions are not smooth, and the figures are hard to understand. Many images are unexplained (for example, Fig. 2: the convolution should be represented with example numbers; Fig. 4: the connection layers require layer (variable) information). Similarly, check all the figures.

The multi-scale attention mechanism is a well-known and widely used component in CNNs and other deep learning models, so what is the novelty here? Moreover, the end-to-end framework is simply a pathway for performing the deep learning function. The novelty of the paper should therefore be explained properly.

An ablation study is required to understand the importance of the proposed model.

The authors split the overall data 90/10 into training and testing; this is not a fair way to conduct the research. Conduct k-fold cross-validation along with statistical tests (t-test, p-value).

Figures 10 and 11: x-axis information is missing.

Author Response

Response to Reviewers

 

Reviewer 1:

Comments and Suggestions for Authors

 

The authors have proposed a new efficient neural network for text recognition in natural scenes based on an end-to-end multi-scale attention mechanism. The manuscript is interesting to read but has several technical concerns that limit its presentation and professionalism. The authors must improve the language of the text; there are many grammatical errors that make reading difficult.

 

Comments:

 

State the actual finding of the research in the abstract. Numerical results should also be included in the abstract.

 

Response:

 

Thank you very much for your suggestion. We have carefully examined the abstract, found this problem, and now state the main finding together with comparative numerical results for the convenience of readers.

 

Comments:

 

Efficient deep learning network (EE-ACNN), feature pyramid network (FPN): abbreviations should be introduced properly.

 

Response:

 

Thank you very much for your suggestion. We have carefully examined the full text, especially the places where technical terms appear. We now define these terms on first use and attach their abbreviations for the convenience of readers.

 

 

Comments:

 

Mention the name of the dataset in the abstract section.

 

Response:

 

Thank you very much for your suggestion. The dataset used in this article was collected by us independently in natural scenes; it is a mixed text dataset and does not have a fixed name. We mention this in the abstract, and in the final experiment section we explain it further and show samples from the dataset for the convenience of readers.

 

 

Comments:

 

The introduction of the paper is lengthy; reduce the content.

 

Response:

 

Thank you very much for your suggestion. We have carefully checked the introduction, kept the material needed to motivate the paper, and deleted redundant parts to make the article smoother and easier for readers to read.

 

Comments:

 

We could not find any citations in the introduction section; the authors must include recent papers in the introduction.

 

Response:

 

Thank you very much for your suggestion. We have carefully examined the introduction and found this problem. We have supplemented the introduction with a discussion of recent papers and cited them.

 

Comments:

 

Arrange all the references in sequential order.

 

Response:

 

Thank you very much for your suggestion. We have carefully checked the full text, especially the citations, and have arranged all references in sequential order.

 

Comments:

 

Overall, state-of-the-art methods and their discussion are missing from the introduction section. The purpose of the end-to-end multi-scale attention model is discussed, but the background of these methods is not discussed well.

 

Response:

 

Thank you very much for your suggestion. We have carefully examined the introduction and found this problem. We have removed some older methods, supplemented the discussion with state-of-the-art methods from recent years, and introduced the background of these methods.

 

Comments:

 

The related works section should be rewritten properly with a taxonomy; it is hard to understand the need for the proposed model. The authors should explain the limitations of each paper at the end of the discussion.

 

Response:

 

Thank you very much for your suggestion. We have carefully examined the related works section and found this problem. We will describe the state of the research accurately, discuss recent papers, and summarize the limitations of each at the end of the discussion.

 

Comments:

 

Figures 1-6: the image quality is poor; the text portions are not smooth, and the figures are hard to understand. Many images are unexplained (for example, Fig. 2: the convolution should be represented with example numbers; Fig. 4: the connection layers require layer (variable) information). Similarly, check all the figures.

 

Response:

 

Thank you very much for your suggestion. We have carefully examined these figures and found this problem. We have re-described some figures in detail, added missing details, and redrew some figures to make them clearer and easier to read.

Comments:

 

The multi-scale attention mechanism is a well-known and widely used component in CNNs and other deep learning models, so what is the novelty here? Moreover, the end-to-end framework is simply a pathway for performing the deep learning function. The novelty of the paper should therefore be explained properly.

 

Response:

 

Thank you very much for your suggestion. The novelty of this article lies in combining the advantages of three components and integrating them into a single whole. First, the end-to-end framework connects the input of the text directly to the output of the model, rather than passing the input data through multiple intermediate modules; this avoids complex and redundant intermediate processing, removes the need to combine multiple modules, simplifies the construction and implementation of the whole model, and reduces implementation complexity. The framework also learns the mapping from the raw data to the target results, which reduces the dependence on prior knowledge and improves the accuracy of the model, and it processes the training and test data uniformly, which enhances generalization. Because the whole end-to-end process is continuous, with no intermediate hand-offs, it reduces computation and training time and improves the speed of training and inference. Second, the integrated multi-scale attention mechanism distributes attention across features of different scales according to learned weights; this improves the model's attention to features at each scale, improves detection accuracy, reduces missed and false detections, and increases robustness. Finally, the CNN extracts text features through convolution and pooling layers, which provide a degree of invariance to the position and size of the text, so the model detects text regions of different positions and sizes more reliably; the multi-layer convolution and pooling operations extract text features layer by layer, improving the model's feature extraction ability.
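To make the multi-scale attention idea concrete, the following is a minimal, hypothetical PyTorch sketch of one way such a block can weight features at several scales; the class name, scale set, and layer choices are illustrative assumptions, not the exact EE-ACNN design from the paper.

```python
# Illustrative sketch only: one possible multi-scale attention block.
# Names, scales, and layer choices are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAttention(nn.Module):
    def __init__(self, channels: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # One 1x1 conv per scale produces a spatial score map for that scale.
        self.score = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in scales]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        feats, scores = [], []
        for s, conv in zip(self.scales, self.score):
            # Pool to a coarser resolution, then upsample back to input size.
            f = F.avg_pool2d(x, kernel_size=s) if s > 1 else x
            f = F.interpolate(f, size=(h, w), mode="bilinear",
                              align_corners=False)
            feats.append(f)
            scores.append(conv(f))
        # Softmax across scales: each location gets one weight per scale.
        weights = torch.softmax(torch.stack(scores, dim=0), dim=0)
        return sum(wgt * f for wgt, f in zip(weights, feats))

# Example: fuse a 256-channel feature map of a 32x128 text image.
fused = MultiScaleAttention(256)(torch.randn(1, 256, 32, 128))
```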

 

Comments:

 

An ablation study is required to understand the importance of the proposed model.

Response:

 

Thank you very much for your suggestion. We have carefully examined the experimental part and found this problem. We will add an ablation experiment that compares each single module against the full combination, in order to judge each module's impact on the algorithm in this paper.

 

Comments:

 

The authors split the overall data 90/10 into training and testing; this is not a fair way to conduct the research. Conduct k-fold cross-validation along with statistical tests (t-test, p-value).

 

Response:

 

Thank you very much for your suggestion. Before submission we split the data in several different ways and repeated the experiments to verify the results; the outcomes were very similar and did not change the direction of the study. Limited by the length of the article, we chose one split to display, and the other results are available at the link. We also considered cross-validation, but in the specific experiments it introduced some deviations in the model and required more computing resources, which considerably lengthened training. Weighing experiment time against effect, we kept the method described in this paper.
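As a generic, self-contained illustration of the requested protocol (not the authors' code; the data and classifiers below are stand-ins for the text recognizer), k-fold cross-validation with a paired t-test across folds might look like:

```python
# Generic k-fold cross-validation with a paired t-test between two models.
# The dataset and classifiers are placeholders, not the paper's recognizer.
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data

def kfold_scores(make_model, k=5, seed=42):
    """Accuracy on each of k folds, retraining a fresh model per fold."""
    scores = []
    for tr, te in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        model = make_model().fit(X[tr], y[tr])
        scores.append(model.score(X[te], y[te]))
    return np.array(scores)

scores_a = kfold_scores(lambda: LogisticRegression(max_iter=1000))
scores_b = kfold_scores(lambda: LogisticRegression(C=0.01, max_iter=1000))
# Paired t-test over per-fold scores: the statistical check the reviewer asks for.
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
print(f"mean diff = {(scores_a - scores_b).mean():.4f}, p = {p_value:.4f}")
```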

 

 

Comments:

 

Figures 10 and 11: x-axis information is missing.

 

Response:

 

Thank you very much for your suggestion. We have carefully checked the figures, found this problem, and corrected them.

 

Reviewer 2 Report

In this work, "Efficient Neural Network for Text Recognition in Natural Scenes Based on End-to-End Multi-Scale Attention Mechanism", Peng and co-workers introduce an end-to-end workflow that realizes text recognition in natural scenes with high inference speed. The workflow uses the MORN network of MORAN to correct distorted raw images and then performs recognition on the corrected images with a CNN-based network combined with an FPN and MSA. This work innovatively integrates multiple existing methods to better solve a practical problem and will help future researchers in this area. Thus, I recommend publishing this work after a revision focusing on these points:

1. The English in this manuscript is not very easy to understand. It will be ideal if the authors would like to polish it with a native English speaker or equivalent.

2. The authors should explain why the test loss (Figure 11) is smaller than the train loss (Figure 12).

Minor ones:

3. In section 4.1, the author should specify the following information: the deep learning framework and version; the dataset used in this work.

4. In section 5, multiple works (EAST, PAVNet, etc.) are mentioned but not cited.

5. There are a lot of unmatched abbreviation-full version pairs, for example, the "efficient deep learning network" vs. "EE-ACNN" in line 13 page 1. Please correct all of them into the corresponding EXACT abbreviation-full version pairs.

Author Response

Reviewer 2:

Comments and Suggestions for Authors

 

 

In this work, "Efficient Neural Network for Text Recognition in Natural Scenes Based on End-to-End Multi-Scale Attention Mechanism", Peng and co-workers introduce an end-to-end workflow that realizes text recognition in natural scenes with high inference speed. The workflow uses the MORN network of MORAN to correct distorted raw images and then performs recognition on the corrected images with a CNN-based network combined with an FPN and MSA. This work innovatively integrates multiple existing methods to better solve a practical problem and will help future researchers in this area. Thus, I recommend publishing this work after a revision focusing on these points:

 

Comments:

 

  1. The English in this manuscript is not very easy to understand. It will be ideal if the authors would like to polish it with a native English speaker or equivalent.

 

Response:

 

Thank you very much for your suggestion. We have carefully checked the full text, focusing on grammar problems in the key chapters, corrected them, and invited professors in relevant fields of computer science to check and polish the article for the convenience of readers.

 

Comments:

 

  2. The authors should explain why the test loss (Figure 11) is smaller than the train loss (Figure 12).

 

Response:

 

Thank you very much for your question. We have carefully examined this part and found the problem, and we have added an explanation of the loss curves to make the article easier to read. In our model, the gap between the test-set and training-set losses tends to be very small and converges towards zero, which is the desired behavior. If the loss on the training set is small but the loss on the test set is large, the model is over-fitting: it fits the training data but cannot generalize to new data. Conversely, if the losses on both the training set and the test set are large, the model is under-fitting: it does not fit the data well.
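As general background (not a claim from the paper), test loss can legitimately dip below train loss when regularization such as dropout or data augmentation is applied only during training. A toy sketch of the fitting diagnosis described above, with invented thresholds and example numbers:

```python
# Toy illustration of the train/test loss diagnosis described above.
# The thresholds and example losses are invented, not the paper's values.
def diagnose(train_loss: float, test_loss: float,
             gap_tol: float = 0.05, high: float = 1.0) -> str:
    gap = test_loss - train_loss
    if train_loss > high and test_loss > high:
        return "under-fitting: both losses are large"
    if gap > gap_tol:
        return "possible over-fitting: test loss well above train loss"
    if gap < -gap_tol:
        return "test loss below train loss (common when dropout/augmentation is train-only)"
    return "converged: train and test losses are close"

print(diagnose(0.12, 0.10))  # -> "converged: train and test losses are close"
```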

 

 

Minor ones:

 

Comments:

 

  3. In section 4.1, the author should specify the following information: the deep learning framework and version; the dataset used in this work.

 

Response:

 

Thank you very much for your suggestion. We have carefully examined the experimental platform and setup in Section 4.1 and added the missing experimental details. As for the dataset, it consists of text images collected under natural conditions, filtered and sorted into a mixed-type dataset; we also introduce it in the following section for readers.

 

 

Comments:

 

  4. In section 5, multiple works (EAST, PAVNet, etc.) are mentioned but not cited.

 

Response:

 

Thank you very much for your suggestion. We have carefully examined this part, found this problem, and added relevant references.

 

 

Comments:

 

  5. There are a lot of unmatched abbreviation-full version pairs, for example, the "efficient deep learning network" vs. "EE-ACNN" in line 13 page 1. Please correct all of them into the corresponding EXACT abbreviation-full version pairs.

 

Response:

 

Thank you very much for your suggestion. We have carefully examined this part, found this problem, and corrected the abbreviation-full form pairs throughout the text for the convenience of readers.

 

Reviewer 3 Report

Overall, the authors have attempted good work on text recognition. However, there are some major issues in the manuscript that need to be addressed.

 

1. The abstract is not well written; it is quite ambiguous and needs to be more precise and describe the actual contributions.

2. The introduction lacks references. If the authors believe no references are required in the introduction, this puts a question mark on the information provided and the issues highlighted there.

3. The contributions presented in the introduction lack cohesiveness, and all three contributions overlap with each other. These contributions should be revised for better clarity.

4. Although the authors have attempted to propose a good method and achieved better accuracy, I still do not understand the method's contribution in terms of novelty.

5. The authors have mentioned the different types of layers in the CNN; however, important information such as the number of layers, the number of filters and their sizes, and the pooling sizes is missing.

6. Other important information, the hyperparameters and the number of trainable parameters, is completely missing.

7. In the conclusion section, future work needs to be further elaborated with some more future directions.

 

 

Author Response

Reviewer 3:

Comments and Suggestions for Authors

 

Overall, the authors have attempted good work on text recognition. However, there are some major issues in the manuscript that need to be addressed.

 

Comments:

 

  1. The abstract is not well written; it is quite ambiguous and needs to be more precise and describe the actual contributions.

 

Response:

 

Thank you very much for your suggestion. We have carefully examined the abstract, strengthened the elaboration of details, and stated the actual contributions precisely for the convenience of readers.

 

Comments:

 

  2. The introduction lacks references. If the authors believe no references are required in the introduction, this puts a question mark on the information provided and the issues highlighted there.

 

Response:

 

Thank you very much for your suggestion. We have carefully examined the introduction and found this problem. We have supplemented the relevant parts of the introduction in detail and added references.

 

 

Comments:

 

  3. The contributions presented in the introduction lack cohesiveness, and all three contributions overlap with each other. These contributions should be revised for better clarity.

 

Response:

 

Thank you very much for your suggestion. We have carefully examined the introduction and found this problem, and we have restated the three contributions so that they are distinct. First, the end-to-end framework connects the input of the text directly to the output of the model, rather than passing the input data through multiple intermediate modules; this avoids complex and redundant intermediate processing, removes the need to combine multiple modules, simplifies the construction and implementation of the whole model, and reduces implementation complexity. Second, the multi-scale attention mechanism distributes attention across features of different scales according to learned weights, which improves the model's attention to each scale, improves detection accuracy, reduces missed and false detections, and increases robustness. Third, the CNN extracts text features through convolution and pooling layers, which provide a degree of invariance to the position and size of the text, so the model detects text regions of different positions and sizes more reliably; the multi-layer convolution and pooling operations extract text features layer by layer, improving the model's feature extraction ability.
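For the CNN point, a minimal sketch of layer-by-layer text-feature extraction with convolution and pooling; the channel counts and depth here are illustrative assumptions, not the paper's exact backbone.

```python
# Illustrative conv/pool stack; channel counts and depth are assumptions.
import torch
import torch.nn as nn

backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),   # halves H and W; gives local shift invariance
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),   # deeper layers cover larger text regions
    nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
)

features = backbone(torch.randn(1, 3, 32, 128))  # -> shape (1, 256, 8, 32)
```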

 

Comments:

 

  4. Although the authors have attempted to propose a good method and achieved better accuracy, I still do not understand the method's contribution in terms of novelty.

 

Response:

 

Thank you very much for your suggestion. The novelty of the method lies in integrating three components into one whole so that their advantages reinforce each other. First, the end-to-end framework connects the input of the text directly to the output of the model instead of routing the data through multiple intermediate modules; this avoids complex and redundant intermediate processing, simplifies the construction and implementation of the model, and reduces implementation complexity. The framework learns the mapping from raw data to target results, which reduces the dependence on prior knowledge and improves accuracy, and it treats training and test data uniformly, which enhances generalization; because the whole pipeline is continuous, without intermediate hand-offs, it also reduces computation and training time and speeds up training and inference. Second, the multi-scale attention mechanism distributes attention across features of different scales according to learned weights, improving the model's attention to each scale, improving detection accuracy, reducing missed and false detections, and increasing robustness. Finally, the CNN extracts text features through convolution and pooling layers, which provide a degree of invariance to the position and size of the text, so text regions of different positions and sizes are detected more reliably; multi-layer convolution and pooling extract text features layer by layer, improving the model's feature extraction ability.
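To illustrate the end-to-end point in code, here is a hedged sketch of single-chain wiring in which every stage is a placeholder; the class and stage names are hypothetical, not the paper's implementation.

```python
# Hypothetical wiring sketch: one module chain from raw image to output,
# trained jointly with a single loss; every stage name is a placeholder.
import torch
import torch.nn as nn

class EndToEndRecognizer(nn.Module):
    def __init__(self, rectifier, backbone, attention, decoder):
        super().__init__()
        # Rectification -> CNN features -> multi-scale attention -> decoding,
        # with no hand-off between separately trained stages.
        self.stages = nn.Sequential(rectifier, backbone, attention, decoder)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.stages(image)

# Smoke test with identity placeholders for all four stages.
model = EndToEndRecognizer(*[nn.Identity() for _ in range(4)])
out = model(torch.randn(1, 3, 32, 128))
```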

 

Comments:

 

  5. The authors have mentioned the different types of layers in the CNN; however, important information such as the number of layers, the number of filters and their sizes, and the pooling sizes is missing.

 

Response:

 

Thank you very much for your suggestion. We have carefully examined this part and added these details to the figures, making the article clearer and more convenient for readers.

 

Comments:

 

  6. Other important information, the hyperparameters and the number of trainable parameters, is completely missing.

 

Response:

 

Thank you very much for your suggestion. We have carefully examined this part and added the corresponding information to the experimental section for the convenience of readers.

 

Comments:

 

  7. In the conclusion section, future work needs to be further elaborated with some more future directions.

 

Response:

 

Thank you very much for your suggestion. We have carefully examined the conclusion and found this problem. We will elaborate the future work and directions in the conclusion for the convenience of readers.

Reviewer 4 Report

Dear colleagues,

Congratulations on the effort to write this paper. It is a well-documented article, and the described algorithm is exciting. The form of presentation of the information in your work is structured and well documented. However, I have one tiny observation that I am sure will be easy to correct:

  - please pay attention to the bibliography elements. There are some items older than five years, so please resolve this.

Author Response

Reviewer 4:

Comments and Suggestions for Authors

 

Dear colleagues,

 

Congratulations on the effort to write this paper. It is a well-documented article, and the described algorithm is exciting. The form of presentation of the information in your work is structured and well documented. However, I have one tiny observation that I am sure will be easy to correct:

 

Comments:

 

- please pay attention to the bibliography elements. There are some items older than five years, so please resolve this.

 

Response:

 

Thank you very much for your suggestion. We have deleted some older references and added recent articles for the convenience of readers.

Round 2

Reviewer 1 Report

The revised version is satisfactory.

Reviewer 3 Report

Authors have addressed most of my comments. 
