Article
Peer-Review Record

Collaborative Encoding Method for Scene Text Recognition in Low Linguistic Resources: The Uyghur Language Case Study

Appl. Sci. 2024, 14(5), 1707; https://doi.org/10.3390/app14051707
by Miaomiao Xu 1,2,3, Jiang Zhang 1, Lianghui Xu 1, Wushour Silamu 1,2,3,* and Yanbing Li 1,2,3
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 17 December 2023 / Revised: 14 February 2024 / Accepted: 14 February 2024 / Published: 20 February 2024
(This article belongs to the Special Issue Deep Learning and Computer Vision for Object Recognition)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper contains problems (grammatical errors and typos) regarding the English language. Please correct all the typographical errors. Please avoid the use of many adjectives. Please avoid writing repetitive sentences. All the terms in equations must be explained.

The number of the introduction section must be one, not zero.

The first paragraph of the introduction must include many references. Generally, most of the arguments in the introduction must be supported by high-quality references.

Please insert a paragraph at the end of the introduction to describe the paper's organization.

The introduction must be rewritten. Please include context and clearly explain the problem to be solved. How is the problem currently solved? Why are the current techniques not good? What is the aim of the paper?

Please include a paragraph at the beginning of the related work section to explain the section's organization; this will help readers understand the sequence of the information presented. Generally, the information presented in Section II must be erased. Only vague information was presented. The authors did not give an analysis of each work. Furthermore, the information is not used for comparative purposes. Authors are recommended to ask for help from someone with experience to explain how to write the related works section. Please highlight the differences between the proposed work and the works in the literature.

The methodology section is the poorest of the entire paper. The lack of scientific justification characterizes the section. The authors do not justify the selection of each method used; it seems that magic worked out everything. All the explanations are superficial; more details must be included.

Please explain why, and to what extent, the dual-branch encoding method is more efficient. At how many different levels does the filter module acquire features? To what extent is the interference of noise reduced?

How does the Transformer preserve the local detail information? How is the model's understanding of global images enhanced? Why was a ResNet network employed? What other architectures were considered? Why were the other architectures discarded?

Why was a six-layer ViT configuration employed?

Please insert details about the screening, labeling, cropping, and calibration tasks. Please insert details about the data augmentation stage.

Please explain how the values of all the network hyperparameters were determined. Please check the data shown in Table 1. From my point of view, when the data is so different, it is not worth comparing. Please demonstrate that the differences between data are significant. Please erase the data from the raw column.

Please explain the process for selecting the methods employed for comparison. Were the experiments conducted in the same conditions? What other methods were considered for comparisons? Why were the other methods discarded?

Please insert more details about the failure cases. Please insert a results discussion subsection.

Please rewrite the conclusions. The information presented must be derived from the data shown in the body of the paper. Moreover, information regarding further work must be inserted.

 

All the references must be exhaustively reviewed to present them in a complete format.

Comments on the Quality of English Language

An exhaustive review of the English must be performed.

Author Response

Dear Expert

 

Thank you for your comprehensive review and valuable feedback on our submitted paper. Your professional insights have offered crucial guidance to our research, significantly influencing the enhancement of the paper's quality. We have diligently considered your suggestions and implemented corresponding revisions in the final version.

Once again, we appreciate your dedication and time in reviewing our work. We eagerly anticipate receiving your guidance and suggestions in the future.

 

Best regards

 

Reviewer#1, Concern # 1: The paper contains problems (grammatical errors and typos) regarding the English language. Please correct all the typographical errors. Please avoid the use of many adjectives. Please avoid writing repetitive sentences. All the terms in equations must be explained.

Author response: Dear expert, I sincerely apologize for this situation. We have reexamined the paper and corrected the relevant errors. Once again, I would like to express our heartfelt apologies for any inconvenience caused.

 

Reviewer#1, Concern # 2: The number of the introduction section must be one, not zero.

Author response: Dear expert, thank you for your careful guidance. We have corrected the issue in question.

 

Reviewer#1, Concern # 3: The first paragraph of the introduction must include many references. Generally, most of the arguments in the introduction must be supported by high-quality references. Please insert a paragraph at the end of the introduction to describe the paper's organization. The introduction must be rewritten. Please include context and clearly explain the problem to be solved. How is the problem currently solved? Why are the current techniques not good? What is the aim of the paper?

Author response: Dear expert, thank you for your professional advice. We have reworked the introduction section, offering a detailed exposition of the research problem. We emphasize the solution to the current issue and clearly articulate why existing technologies fall short. The paper's objectives are now well-defined, and we have optimized the overall structure.

 

Reviewer#1, Concern # 4: Please include a paragraph at the beginning of the related work section to explain the section's organization; this will help readers understand the sequence of the information presented. Generally, the information presented in Section II must be erased. Only vague information was presented. The authors did not give an analysis of each work. Furthermore, the information is not used for comparative purposes. Authors are recommended to ask for help from someone with experience to explain how to write the related works section. Please highlight the differences between the proposed work and the works in the literature.

Author response: Dear expert, thank you for your professional advice. We have rewritten the related work section, merging the second part with the first and adding a description of the section's organization.

 

Reviewer#1, Concern # 5: The methodology section is the poorest of the entire paper. The lack of scientific justification characterizes the section. The authors do not justify the selection of each method used; it seems that magic worked out everything. All the explanations are superficial; more details must be included.

Author response: Dear expert, thank you for your careful guidance. We have improved the writing of the methodology section, providing more comprehensive discussions for each part and adding detailed information.

Reviewer#1, Concern # 6: Please explain why, and to what extent, the dual-branch encoding method is more efficient. At how many different levels does the filter module acquire features? To what extent is the interference of noise reduced?

Author response: Dear expert, thank you for your constructive feedback. I would like to respond to your suggestions.

 

We will address the effectiveness of dual-branch encoding from the following aspects. (1) In the convolutional operation of a CNN, each kernel slides over the image, focusing on small regions the size of its receptive field. This characteristic enables the CNN to detect local features, such as edges and textures, at various positions, facilitating the capture of distinguishing features between similar characters in Uyghur text. (2) The Transformer's self-attention mechanism establishes correlations between positions, enhancing the model's awareness of contextual information. However, Transformers lack certain inductive biases inherent to CNNs, such as translation equivariance and locality. Consequently, when trained with limited data, they may not generalize well. Therefore, adopting the dual-branch encoding approach was deemed essential for the low-resource Uyghur dataset.
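
For illustration only, the following is a minimal PyTorch sketch of this dual-branch idea; the layer sizes, patch size, ViT depth, and the omission of positional embeddings are simplifications for exposition, not the configuration used in the paper.

```python
import torch.nn as nn

class DualBranchEncoder(nn.Module):
    def __init__(self, dim=256, patch=8, vit_layers=6):
        super().__init__()
        # CNN branch: small receptive fields capture stroke-level detail.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Transformer branch: patch embedding + self-attention for context
        # (positional embeddings omitted here for brevity).
        self.patch_embed = nn.Conv2d(3, dim, patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=vit_layers)

    def forward(self, x):                                      # x: (B, 3, H, W)
        f_c = self.cnn(x).flatten(2).transpose(1, 2)           # (B, N_c, dim)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        f_t = self.transformer(tokens)                         # (B, N_t, dim)
        return f_c, f_t
```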

 

According to the experimental data presented in this paper, when the training set comprises 260,000 images or fewer, placing the filter module before the entire dual-branch encoding module yields a noteworthy 3.9% improvement in accuracy compared to placing it solely before the CNN branch (refer to Table 3).

 

Reviewer#1, Concern # 7: How does the Transformer preserve the local detail information? How is the model's understanding of global images enhanced? Why was a ResNet network employed? What other architectures were considered? Why were the other architectures discarded?

Author response: Dear expert, thank you for your valuable suggestions. We have improved the model's awareness of local information by incorporating ResNet and the CBAM attention mechanism into the CNN branch. Drawing inspiration from the article titled "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," we noted that in scenarios with limited data resources, ResNet tends to outperform Transformers. Based on this observation, we chose ResNet for the CNN branch.

 

Reviewer#1, Concern # 8: Why was a six-layer ViT configuration employed?

Author response: Dear expert, we noticed that increasing the complexity of the encoding layers did not improve experimental results in situations with limited data resources. Consequently, we set the number of ViT encoding layers to 6. Table 4 presents a corresponding ablation study on the encoding layers.

 

Reviewer#1, Concern # 9: Please insert details about the screening, labeling, cropping, and calibration tasks. Please insert details about the data augmentation stage.

Author response: Dear expert, thank you for your professional suggestions. We have added more detailed explanations in the paper as per your advice.

 

Reviewer#1, Concern # 10: Please explain how the values of all the network hyperparameters were determined. Please check the data shown in Table 1. From my point of view, when the data is so different, it is not worth comparing. Please demonstrate that the differences between data are significant. Please erase the data from the raw column.

Author response: Dear expert, in response to your constructive feedback, we have added introductions to the relevant hyperparameters for each method in the manuscript and removed the experimental data columns related to the RAW dataset. Throughout the experimental validation process, we observed significant disparities among the data, most prominently in the performance of ViTSTR and MGP. This can be attributed to the fact that, as indicated in the original literature, both methods were trained using pre-training data from ViT. Consequently, in the context of scene text recognition for the Uyghur language, the shift in language resulted in suboptimal performance. Additionally, their use of a pure Transformer architecture, coupled with the relatively small size of our dataset, contributed to these experimental outcomes.

 

On the other hand, TRBA, a CNN-based feature extraction method, exhibited notable performance when the training data was limited. Likewise, CDistNet, which employs joint encoding of CNN and Transformer, yielded superior results compared to other self-attention encoding methods under the same conditions.

 

Reviewer#1, Concern # 11: Please explain the process for selecting the methods employed for comparison. Were the experiments conducted in the same conditions? What other methods were considered for comparisons? Why were the other methods discarded?

Author response: Dear expert, to ensure fairness in these comparisons, our selected methods encompass various aspects, including those solely based on visual Transformers, such as ViTSTR and MGP-tiny-char. Classic methods for feature extraction using CNN, such as CRNN and TRBA, were also included. Additionally, we incorporated the CDistNet method, which combines CNN and Transformer for joint feature extraction. To serve as the baseline for this study, we introduced the PARSeq method, involving joint learning of internal language models. The batch size for all methods in the experiment is fixed at 224, with a training batch count of 20. All experiments were conducted in an environment with 6 NVIDIA Tesla V100 GPUs, utilizing Python 3.9.18 and Torch 1.13.1. The parameter settings for different methods in the experiment are outlined in Table 1.

Some of the most recent advanced methods leverage externally pre-trained large language models. However, these methods are primarily tailored for English, prompting us to exclude them from the experimental comparisons. Initially, we contemplated conducting comparative experiments using ALBERT, independently constructing a training set of 26,613,556 words and a test set of 50,001 Uyghur words for training the language model. Unfortunately, the model performed significantly poorly throughout the training process, prompting us to abandon this comparative approach.

 

Reviewer#1, Concern # 12: Please insert more details about the failure cases. Please insert a results discussion subsection.

Author response: Dear expert, thank you for your valuable feedback. We have reorganized the failure cases in accordance with your suggestions.

 

Reviewer#1, Concern # 13: Please rewrite the conclusions. The information presented must be derived from the data shown in the body of the paper. Moreover, information regarding further work must be inserted.

Author response: Dear expert, thank you for your professional suggestions. Based on your feedback, we have revised the conclusions and included plans for future work.

 

Reviewer#1, Concern # 14: All the references must be exhaustively reviewed to present them in a complete format.

Author response: Dear expert, thank you for your valuable feedback. We have reexamined and properly cited the references.

Reviewer 2 Report

Comments and Suggestions for Authors

Happy New Year. I am happy to present the review of this high-quality manuscript in the new year. The manuscript entitled “Collaborative encoding method for scene text recognition in low-resource Uyghur language” has been thoroughly reviewed. This research introduces a Collaborative Encoding Method (CEM) for scene text recognition in the low-resource Uyghur language, achieving state-of-the-art results. Some obvious strengths include the construction of a real-scene Uyghur text recognition dataset, innovative encoding modules, and the use of data augmentation. The CEM combines Transformer and CNN encoding, addressing the challenges of low-resource languages. Beyond these, limitations may arise from the specific focus on the Uyghur language, potentially limiting the generalizability of the method to other low-resource languages. Additionally, the performance evaluation primarily focuses on Uyghur, warranting further validation across a broader spectrum of low-resource languages. The reviewer's major concerns are listed below.

1. How does Equation (2) contribute to the dual-branch feature extraction module of the CEM? What is the significance of the "softmax" function in Equation (2)? How does the "softmax" function in Equation (2) help in the feature selection process? Can you explain in detail the role of the "W" matrix in Equation (2) and how it affects the feature selection process?

2. Can you explain more about the role of the "f_c" and "f_t" features in Equation (4) and how they are combined to produce the final output feature "f_o"?

3. How were the training and testing sets selected to ensure the representativeness of model training and evaluation?

4. Can you elaborate on the meticulous and systematic tasks performed during the data processing stage to enhance the credibility of the experimental data?

5. Can you provide insights into the process of corrections for severely distorted images and how it contributed to the dataset's credibility and usability in your experiments?

6. What considerations were made to address potential bias in model performance evaluation, particularly in the context of Uyghur language concentration, and its impact on assessing the model's effectiveness in recognizing text from other low-resource languages?

7. The failure cases in Section 3.6 need more clarification to demonstrate that the model performance does not fully depend on image quality.

8. Can you explain how the experimental design accounted for the generalizability of the proposed method to other low-resource languages, and the potential challenges specific to diverse linguistic contexts? For example, while the CEM achieves state-of-the-art results in Uyghur scene text recognition, there is a need for further validation across a broader spectrum of low-resource languages. The lack of comprehensive validation across multiple low-resource languages may not yield the same conclusions as to the method's performance in diverse linguistic settings. Please address this conclusion in the paper, including the dissimilarities of multiple low-resource languages, to show that CEM performs well in Uyghur and might do the same for other languages, although some dissimilarities may lead to different performance.

Comments on the Quality of English Language

The language is precise and employs domain-specific terminology related to machine learning, model architectures, and experimental methodologies.

Author Response

Dear Expert

 

Thank you for your comprehensive review and valuable feedback on our submitted paper. Your professional insights have offered crucial guidance to our research, significantly influencing the enhancement of the paper's quality. We have diligently considered your suggestions and implemented corresponding revisions in the final version.

Once again, we appreciate your dedication and time in reviewing our work. We eagerly anticipate receiving your guidance and suggestions in the future.

 

Best regards

 

Reviewer#2, Concern # 1: How does Equation (2) contribute to the dual-branch feature extraction module of the CEM? What is the significance of the "softmax" function in Equation (2)? How does the "softmax" function in Equation (2) help in the feature selection process? Can you explain the role of the "W" matrix in Equation (2) and how it affects the feature selection process?

Author response: Dear expert, thank you for your careful guidance. The CNN feature extraction branch comprises a ResNet feature extraction network and CBAM (Convolutional Block Attention Module). CBAM introduces channel and spatial attention mechanisms, empowering the network to learn and capture key features selectively. In Equation 2, we employ the sigmoid function to generate spatial attention weights. These weights enable the network to express varying levels of attention to different spatial locations, thereby more effectively selecting and focusing on crucial features during the feature extraction process. The parameter 'w' in Formula 1 represents the weights learned through training. The model autonomously learns and adjusts these parameters during training, optimizing sensitivity to key features.
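
As a concrete reading of this step, here is a minimal sketch of CBAM-style spatial attention in PyTorch; it follows the common CBAM formulation (channel-wise average and max pooling, a 7x7 convolution, and a sigmoid), which we assume here rather than reproducing the paper's exact code.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                      # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)      # channel-wise average pool
        mx, _ = x.max(dim=1, keepdim=True)     # channel-wise max pool
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w                           # re-weight spatial positions
```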

 

Reviewer#2, Concern # 2: Can you explain more about the role of the "f_c" and "f_t" features in Equation (4) and how they are combined to produce the final output feature "f_o"?

Author response: Dear expert, thank you for your careful guidance. We employed two features from the CNN and Transformer branches, Fc and Ft, respectively, and fused them through a dynamic fusion module. Firstly, we utilized the Cat operation to concatenate features from these two distinct branches. Subsequently, we activated the concatenated features using the Sigmoid activation function to obtain attention weights. Afterward, we multiplied the obtained weights with features from different branches individually. Finally, the multiplied results were fused using the Add operation.
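
A minimal sketch of these four steps (Cat, Sigmoid, multiply, Add) follows; the linear projection and the complementary (w, 1 - w) gating are our illustrative reading, not necessarily the exact formulation in the paper.

```python
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    """Fuse CNN features f_c and Transformer features f_t into f_o."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)  # assumed mixing layer

    def forward(self, f_c, f_t):             # both: (B, N, dim)
        # Cat -> Sigmoid gives per-position attention weights.
        w = torch.sigmoid(self.proj(torch.cat([f_c, f_t], dim=-1)))
        # Multiply each branch by its weight, then Add.
        return w * f_c + (1 - w) * f_t        # f_o
```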

 

Reviewer#2, Concern # 3: How were the training and testing sets selected to ensure the representativeness of model training and evaluation?

Author response: Dear expert, to ensure the rationality of the training and testing datasets during partitioning, we split the samples of most qualifying categories in an 8:2 ratio. For categories that did not meet this requirement, such as those with fewer than 2 samples, we randomly assigned their data to either the training or the testing set.
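
A sketch of this partitioning logic follows; the function and variable names are hypothetical, and the singleton-class handling mirrors the description above.

```python
import random

def split_dataset(samples_by_class, ratio=0.8, seed=0):
    """Per-class 8:2 split; classes with fewer than 2 samples are
    assigned wholly to one side at random."""
    rng = random.Random(seed)
    train, test = [], []
    for items in samples_by_class.values():
        items = list(items)
        rng.shuffle(items)
        if len(items) < 2:
            (train if rng.random() < ratio else test).extend(items)
            continue
        # Keep at least one sample on each side of the split.
        k = min(len(items) - 1, max(1, round(len(items) * ratio)))
        train.extend(items[:k])
        test.extend(items[k:])
    return train, test
```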

 

Reviewer#2, Concern # 4: Can you elaborate on the meticulous and systematic tasks performed during the data processing stage to enhance the credibility of the experimental data?

Author response: Dear expert, thank you for your professional advice. We have incorporated relevant discussions into the manuscript. We excluded heavily occluded, extremely distorted, and repetitive data from this study during the screening process. Following this, we annotated the screened data. We conducted re-cropping in instances of unreasonable cropping, such as images with excessive blank spaces or those cropped to include interfering words. A cross-check of the annotated data was performed to ensure consistency between labels and image content. To bolster the credibility of the data, the authors of this paper undertook three rounds of calibration work on the scene recognition images and their corresponding labels.

 

Reviewer#2, Concern # 5: Can you provide insights into the corrections process for severely distorted images and how it contributed to the dataset's credibility and usability in your experiments?

Author response: Dear expert, thank you for your professional advice. In situations with abundant training data, employing data augmentation techniques can enhance the model's performance, generalization capabilities, and robustness. However, in cases where resources are constrained within real-world datasets, data augmentation becomes instrumental in expanding the dataset. Nevertheless, excessive application of data augmentation to extend the training set can result in significant distortion and deformation, introducing a certain level of data noise.

 

Reviewer#2, Concern # 6: What considerations were made to address potential bias in model performance evaluation, particularly in the context of Uyghur language concentration, and its impact on assessing the model's effectiveness in recognizing text from other low-resource languages?

Author response: Dear expert, in the absence of real experimental validation, we cannot demonstrate whether our approach equally applies to other low-resource languages. However, concerning the low-resource Uyghur language dataset, the combined encoding approach of CNN and Transformer proves significantly superior to methods relying solely on CNN or Transformer. This serves as an excellent example for other low-resource language processing techniques.

 

Reviewer#2, Concern # 7: The failure cases in Section 3.6 need more clarification to demonstrate that the model performance does not fully depend on image quality.

Author response: Dear expert, thank you for your professional advice. We have reclassified the failed cases, categorizing the reasons for failure into three types: low image quality, confusion in character features within the image, and noise interference.

 

Reviewer#2, Concern # 8: Can you explain how the experimental design accounted for the generalizability of the proposed method to other low-resource languages, and the potential challenges specific to diverse linguistic contexts? For example, while the CEM achieves state-of-the-art results in Uyghur scene text recognition, there is a need for further validation across a broader spectrum of low-resource languages. The lack of comprehensive validation across multiple low-resource languages may not yield the same conclusions as to the method's performance in diverse linguistic settings. Please address this conclusion in the paper, including the dissimilarities of multiple low-resource languages, to show that CEM performs well in Uyghur and might do the same for other languages, although some dissimilarities may lead to different performance.

Author response: Dear expert, thank you for your professional advice. In the absence of experimental validation, demonstrating that the CEM method can achieve outstanding results compared to other methods is challenging. Therefore, we have incorporated relevant discussions in the conclusion, highlighting this study's exemplary significance in the low-resource language research field. However, due to the unique characteristics of different languages, further demonstrations and validations are necessary to determine whether this method can deliver excellent performance in other low-resource languages.

Reviewer 3 Report

Comments and Suggestions for Authors

The article with the title “Collaborative Encoding Approach for Scene Text Recognition in Low-Resource Uyghur Language” addresses many important issues for Uyghur language scene text recognition.

The main strengths of this paper are listed in contributions (1) to (4) (lines 75-78).

There are some methodological inaccuracies:

- change the title to: “Collaborative Encoding Approach for Scene Text Recognition in Low Linguistic Resources: the Uyghur Language Case Study”,

- the interested reader must clearly see the benefits of the author(s)’ approach and possibly apply it within his (her) research or use it as a reference, respectively. For that purpose, the “further study” from Sec. 5. Conclusions (lines 448-449) “to validate the applicability of our method through experiments, assessing its suitability for other low-resource languages” should be moved to a separate section. This section should clearly list the set of “...other low level languages.” There are three groups of languages: dead languages, live languages (for example, Arabic-like languages) and artificial languages (like Klingon, https://en.wikipedia.org/wiki/Klingon_language). I leave the final choice of languages to the authors. As this is (possibly) outside the scope of the article, do not perform experiments; just elaborate the possible suitability for other low-resource languages of your choice,

- What is the source of the 7,267 scene images? Are they books, newspapers, etc.? Give a more elaborate insight into the sources used,

- comment on the choice of spelling: Uyghur vs. Ujghur vs. Uighur; use the same spelling throughout the article (see line 449),

- comment on how to perform romanization and how to build a translator from Uyghur to other language(s),

- how can users outside the academic community and outside the author's scope use your results in everyday life? Are there any attempts in this direction?

Here are minor changes that should receive the author(s)' comments as well as feedback:

- line 74 change “.” after “follows” to “:”,

- change “study” to “article”,

- line 318: cite reference for “Adam optimizer”,

- line 166, Sec. 3.1.2 “Dual…”: add additional comments on HOW the equations are derived. Clearly separate your own scientific results from known facts. Note that a wider audience should have more insight into these equations.

I would like to see this article published, so carefully prepare your feedback as soon as possible following the Editor's rules.

Author Response

Reviewer#3, Concern # 1: The article with the title “Collaborative Encoding Approach for Scene Text Recognition in Low-Resource Uyghur Language” addresses many important issues for Uyghur language scene text recognition. The main strengths of this paper are listed in contributions (1) to (4) (lines 75-78). There are some methodological inaccuracies: change the title to ‘Collaborative Encoding Approach for Scene Text Recognition in Low Linguistic Resources: the Uyghur Language Case Study'.

Author response: Dear expert, thank you for your professional guidance. Based on your advice and the existing descriptions, we've revised the title to "Collaborative Encoding Method for Scene Text Recognition in Low Linguistic Resources: The Uyghur Language Case Study".

Reviewer#3, Concern # 2: The interested reader must clearly see the benefits of the author(s)’ approach and possibly apply it within his (her) research or use it as a reference, respectively. For that purpose, the “further study” from Sec. 5. Conclusions (lines 448-449) “to validate the applicability of our method through experiments, assessing its suitability for other low-resource languages” should be moved to a separate section. This section should clearly list the set of “...other low level languages.” There are three groups of languages: dead languages, live languages (for example, Arabic-like languages) and artificial languages (like Klingon, https://en.wikipedia.org/wiki/Klingon_language). I leave the final choice of languages to the authors. As this is (possibly) outside the scope of the article, do not perform experiments; just elaborate the possible suitability for other low-resource languages of your choice.

Author response: Dear expert, thank you for your professional advice. We've revised the "Conclusions" section as per your suggestions. We've replaced 'to validate the applicability of our method through experiments, assessing its suitability for other low-resource languages' with 'we are preparing to shift our next research objectives to other low-resource languages, such as Mongolian and Kazakh.' (page 14, lines 486-488)

Additionally, following your guidance, we've provided examples of other low-resource languages in this section, specifically: Mongolian, Kazakh, and Kyrgyz, which lack sufficient authentic training data. (page 14, lines 489-492)

Reviewer#3, Concern # 3: What is the source of the 7,267 scene images? Are they books, newspapers, etc.? Give a more elaborate insight into the sources used.

Author response: Dear expert, thank you for your professional advice. We have introduced the source of 7,267 scene images. ‘The raw dataset consists of street text images, such as road signs, billboards, book titles, museum exhibit labels, architectural signage, and other typical textual elements.’ (page 7, lines 283-285).

Reviewer#3, Concern # 4: Comment on the choice of spelling: Uyghur vs. Ujghur vs. Uighur; use the same spelling throughout the article (see line 449).

Author response: Dear expert, thank you for your professional guidance. We have standardized the spelling of Uyghur and will consistently use it throughout the article.

Reviewer#3, Concern # 5: Comment on how to perform romanization and how to build a translator from Uyghur to other language(s).

Author response: Dear expert, our research doesn't involve conversion to other languages. Instead, each character from the Uyghur character set was directly converted into an identifier that the model can process. During the decoding process, we compare the decoded results with the corresponding identifiers to determine whether the model has recognized them correctly.
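
A toy sketch of this identifier mapping follows; the character set shown is a tiny illustrative subset, not the full Uyghur character set used in the paper.

```python
# Toy character-identifier mapping; charset is an illustrative subset.
charset = ["ئ", "ا", "ب", "پ", "ت"]
char2id = {c: i for i, c in enumerate(charset)}
id2char = {i: c for c, i in char2id.items()}

def encode(text):
    return [char2id[c] for c in text]        # characters -> identifiers

def decode(ids):
    return "".join(id2char[i] for i in ids)  # identifiers -> characters

# Recognition counts as correct when decoded identifiers match the label's.
assert decode(encode("ئا")) == "ئا"
```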

Reviewer#3, Concern # 6: How can users outside the academic community and outside the author's scope use your results in everyday life? Are there any attempts in this direction?

Author response: Dear expert, our research team focuses on Uyghur language speech recognition, Uyghur language translation (Uyghur to Chinese, Uyghur to English), and Uyghur language scene text recognition. We aim to develop versatile software integrating translation, speech, and text recognition for applications in translation services, security monitoring, and office automation (preliminary testing of the internal beta version has been conducted). We also plan to consider open-sourcing the dataset and relevant code for the scene text recognition research presented in this paper. The dataset used in this study consists of a small amount of real-world Uyghur language data. In our other research endeavors (synthetic Uyghur language dataset), we have already processed the relevant data for open access, thus enhancing the quality of service provision.

Reviewer#3, Concern # 7: Here are minor changes that should receive the author(s)' comments as well as feedback:

- line 74 change “.” after “follows” to “:”,

- change “study” to “article”,

- line 318: cite reference for “Adam optimizer”,

- line 166, Sec. 3.1.2 “Dual…”: add additional comments on HOW the equations are derived. Clearly separate your own scientific results from known facts. Note that a wider audience should have more insight into these equations.

Author response: Dear expert, thank you for your professional advice. We have incorporated your suggestions as follows:

- Replaced "." with ":" after "follows" (page 2, line 80).

- Changed "study" to "article" throughout the text.

- Cited references for the "Adam optimizer" (page 8, line 331).

- Section 3.1.2: added a relevant introduction to Formula 3 (page 6, lines 228-239).

We appreciate your expertise in referencing known facts in the article.

Thank you once again for your guidance. It is through your direction that my writing has become more professional and standardized. 

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The paper was improved. However, the current version is unsuitable for publication. Many of my past suggestions were not addressed. The authors must explain why a suggestion was not considered.

The authors must prepare a point-by-point response explaining where and how the reviewer's suggestions were addressed. The details about the page and line(s) where the changes were conducted must be inserted.

In my past review, many questions were included in each paragraph; therefore, the authors must respond to each suggestion, not each paragraph.

Please support the arguments in the introduction with high-quality references. The first paragraph of the introduction must include many references.

The paper still contains many exaggerations (overuse of adjectives). For example, the fourth conclusion in the introduction. In the related work, "has led to significant advancements." What is a significant advancement?

Please include a reference to support the idea that text recognition methods are categorized into three groups. The information shown in Section II is still vague, incomplete, and useless. No analysis of the related work was presented.

What is CEM? Please demonstrate that the model results in a more comprehensive feature representation and encoding.

The paper still contains typographical errors. Observe the title of subsection 3.1. In the filter module, to what extent is the noise reduced? Please demonstrate that the model can capture key information more accurately. What is key information?

The paper still contains many exaggerations and arguments not supported by scientific evidence. Please demonstrate that "the most effective augmentation was conducted."

 

What is a more robust experimental comparison? 

Comments on the Quality of English Language

The quality of English must be extensively improved.

Author Response

Reviewer#1, Concern # 1: The paper was improved. However, the current version is unsuitable for publication. Many of my past suggestions were not addressed. The authors must explain why a suggestion was not considered. The authors must prepare a point-by-point response explaining where and how the reviewer's suggestions were addressed. The details about the page and line(s) where the changes were conducted must be inserted. In my past review, many questions were included in each paragraph; therefore, the authors must respond to each suggestion, not each paragraph.

Author response: Dear expert, thank you for your careful guidance. I apologize for the oversight. I didn't fully grasp your suggestions in the past and failed to incorporate them effectively into the paper. Thank you for giving me another chance to make things right. In previous revisions, I didn't properly highlight responses to each suggestion as many paragraphs were rewritten, resulting in a lack of clarity. I'll address each point individually in this response.

Reviewer#1, Concern # 2: Please support the arguments in the introduction with high-quality references. The first paragraph of the introduction must include many references.

Author response: Dear expert, thank you for your careful guidance. I apologize for not fully grasping your suggestions in previous drafts. Following your advice, we have incorporated relevant references supporting our arguments into the first paragraph of the introduction (page 1, lines 24-30, References [1-12]).

Reviewer#1, Concern # 3: The paper still contains many exaggerations (overuse of adjectives). For example, the fourth conclusion in the introduction. In the related work, "has led to significant advancements." What is a significant advancement?

Author response: (1) Dear expert, thank you for your guidance. We've reviewed our writing, replacing inappropriate adjectives and rephrasing sentences as needed.

(2) The fourth conclusion in the introduction has been changed to ‘This article provides reference examples for other low-resource languages lacking real-world scene text recognition data, such as Mongolian, Kazakh, and Kyrgyz.’ (page 3, lines 95-96)

(3) Using pre-trained large language models, existing scene text recognition methods achieve an average accuracy of over 97% on six commonly used English benchmark datasets, with the accuracy on the IC13 857 dataset exceeding 99%. Therefore, the previous description referred to this significant progress.

Following your professional guidance, we have reorganized the description of the related work, improving the unreasonable statements. We have changed "has led to significant advancements" to "In recent years, large-scale pre-trained language models have demonstrated powerful language perception capabilities across various computer vision tasks". Thank you for your guidance. (page 3, lines 103-104)

Reviewer#1, Concern # 4: Please include a reference to support the idea that text recognition methods are categorized into three groups. The information shown in Section II is still vague, incomplete, and useless. No analysis of the related work was presented.

Author response: Dear expert, thank you for your professional advice. We have reorganized the relevant sections and added some literature to support the categorization of text recognition methods into three types (page 3, lines 103-110, References [17,19,27-35]). Additionally, we've included an analysis of the related work (pages 3-4, lines 114-116, 119-127, 138-142, 144-147, 154-155, 157-161).

Reviewer#1, Concern # 5: What is CEM? Please demonstrate that the model results in a more comprehensive feature representation and encoding.

Author response: (1) Dear expert, CEM stands for the Collaborative Encoding Method proposed in this paper (page 4, lines 169-170).

(2) CEM achieves optimal recognition accuracy for low-resource Uyghur scene text recognition by employing CNN-Transformer dual-branch collaborative encoding on the raw+Aug1 and raw+Aug1+Aug2 training sets, outperforming the baseline method PARSeq, which uses only visual Transformer encoding, by 14.1% (page 10, Table 2, line 358). This further validates the effectiveness of collaborative encoding. Additionally, a filter+Transformer encoding model designed in this study achieves an accuracy of only 79.2% (page 12, Table 3, row 13), significantly lower than the 94.1% of the CEM method. This further illustrates that the CEM method achieves a more comprehensive feature representation and encoding.

Reviewer#1, Concern # 6: The paper still contains typographical errors. Observe the title of subsection 3.1. In the filter module, to what extent is the noise reduced? Please demonstrate that the model can capture key information more accurately. What is key information?

Author response: (1) Dear expert, thank you for your constructive suggestions. We have corrected the relevant errors in the title. (page 4, line 175)

(2) Dear expert, this paper employs the Filter module to extract multi-level low-level features, emphasizing textual information in images while reducing irrelevant background noise. To better demonstrate the effectiveness of the Filter module, we have included additional experiments in Table 3 (page 12, Table 3, rows 1 and 2, 5 and 6, 9 and 10).

Row 2 represents experiments with our lightweight model without the Filter module. The data in rows 1 and 2 show that the model utilizing the Filter module achieves an accuracy improvement of 8.2% compared to the model without it. Row 6 represents experiments with our base model without the Filter module. The data in rows 5 and 6 show that the model utilizing the Filter module achieves an accuracy improvement of 0.6% compared to the model without it.

Based on the data in rows 1, 2, 5, 6, 9, and 10, it can be observed that the Filter module has the greatest impact when the CNN branch employs only 3 convolution operations. When the CNN branch uses ResNet17, its effect is reduced, but the accuracy is still 0.6% higher than that of models without the Filter module. However, the data in rows 9 and 10 indicate that enhancing the CNN branch cannot fully replace the role of the Filter module: even though the CNN branch adopts the ResNet33 architecture, models using the Filter module still achieve an accuracy improvement of 0.8%. This further demonstrates that the Filter module helps the model better extract text features and reduces the influence of background noise. (pages 11-12, lines 407-432)

To better demonstrate the filtering effect of the Filter module, we present images showcasing its before-and-after effects (not included in the paper; submitted through a cover letter).

(3) By applying the attention mechanism to channel and spatial positions through CBAM, features across different channels and spatial locations can be selectively emphasized or suppressed, allowing the model to focus more on the beneficial channel and spatial features during the recognition process. These beneficial channel and spatial features constitute the key information.

(4) To demonstrate the enhanced capability of CBAM in capturing crucial information, we designed and conducted additional experiments (page 12, Table 3, rows 1 and 3, 5 and 7).

In row 3, experiments with our lightweight model without a CBAM module are presented. The data in rows 1 and 3 indicate a 0.2% improvement in accuracy when employing the CBAM module.

In row 7, experiments with our base model without the CBAM module are outlined. The data in rows 5 and 7 illustrate a 0.9% increase in accuracy with the utilization of the CBAM module.

Reviewer#1, Concern # 7: The paper still contains many exaggerations and arguments not supported by scientific evidence. Please demonstrate that "the most effective augmentation was conducted."

Author response: Dear expert, thank you very much for your professional advice. Our previous research involved a data augmentation strategy where we selected the most effective methods from various augmentation strategies. Specifically, in the first round of augmentation, we utilized Google's default set of 38 augmentation methods, and in the second round, we refined our selection to 12 specific methods, including Curve, Distort, Stretch, Rotate, Perspective, Shrink, TranslateX, TranslateY, Contrast, Brightness, JpegCompression, and Pixelate. This strategy yielded superior results compared to alternative augmentation strategies in the second round.

We have revised this description based on your professional advice. It has been replaced with 'Our data augmentation strategy is grounded in prior research. The approach utilized in earlier research involved employing Google's default set of 38 augmentation methods in the initial round of data augmentation. Following that, in the second round, a refined selection was made, consisting of 12 methods, namely Curve, Distort, Stretch, Rotate, Perspective, Shrink, TranslateX, TranslateY, Contrast, Brightness, JpegCompression, and Pixelate.' (page 8, lines 309-314)
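
For illustration, a sketch of the round-two selection strategy follows; the operation callables are hypothetical stand-ins (a scene-text augmentation library would supply implementations matching the names above), not the authors' actual pipeline.

```python
import random

# The twelve round-two operation names quoted above; `ops` passed to
# augment() would be callables implementing them (hypothetical here).
OP_NAMES = ["Curve", "Distort", "Stretch", "Rotate", "Perspective",
            "Shrink", "TranslateX", "TranslateY", "Contrast",
            "Brightness", "JpegCompression", "Pixelate"]

def augment(image, ops, rng=random):
    """Apply one randomly chosen augmentation callable to an image."""
    return rng.choice(ops)(image)  # each op maps an image to an augmented image
```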

Reviewer#1, Concern # 8: What is a more robust experimental comparison?

Author response: Dear expert, apologies for any previous inaccuracies. Following your professional guidance, we have revised it to "To demonstrate the effectiveness of our approach, we conducted a series of experiments on the Raw, Aug1, and Aug2 datasets." (page 10, line 342)

Thank you once again for your guidance. It is through your direction that my writing has become more professional and standardized. 

Reviewer 2 Report

Comments and Suggestions for Authors

Thanks for this new revision. It answered my questions.

Author Response

Thank you once again for your guidance. It is through your direction that my writing has become more professional and standardized.

Round 3

Reviewer 1 Report

Comments and Suggestions for Authors

The paper has been improved, and it is almost ready for publication.

Please conduct a final exhaustive review to correct minor errors (grammatical, typos...)

Comments on the Quality of English Language

Minor revision.

Author Response

Dear Expert

Thank you for your comprehensive review and valuable feedback on our submitted paper. Your professional insights have offered crucial guidance to our research, significantly influencing the enhancement of the paper's quality. We have diligently considered your suggestions and implemented corresponding revisions in the final version. Once again, we appreciate your dedication and time in reviewing our work. 

 

Best regards

 

 

Reviewer#1, Concern # 1: The paper has been improved, and it is almost ready for publication. Please conduct a final exhaustive review to correct minor errors (grammatical, typos...).

Author response: Dear expert, based on your professional advice, we have made some improvements to the grammar in the paper. Thank you very much for your guidance.

page 2, lines 61-62

Original sentence: Raw dataset comprises 9082 street scene images

Modified sentence: The Raw dataset comprises 9082 street scene images

 

page 2, lines 67-68

Original sentence: Data augmentation techniques can help alleviate the problem of insufficient training data for low-resource languages to some extent.

Modified sentence: Data augmentation techniques can help somewhat alleviate the problem of insufficient training data for low-resource languages.

 

page 2, lines 73-74

Original sentence: better capturing the differential features between similar characters in Uyghur text, and reducing reliance on a large amount of training data.

Modified sentence: better capture the differential features between similar characters in Uyghur text and reduce reliance on a large amount of training data.

 

page 3, lines 126-128

Original sentence: These methods all utilize ViT encoders for feature extraction, followed by mapping these features to category labels through fully connected layers, achieving a certain degree of success.

Modified sentence: These methods utilize ViT encoders for feature extraction, then map these features to category labels through fully connected layers, achieving a certain degree of success.

 

page 4, lines 154-156

Original sentence: Existing typically internal language joint learning methods employ the Transformer architecture and integrate corresponding language branches into it.

Modified sentence: Existing internal language joint learning methods typically employ the Transformer architecture and integrate corresponding language branches.

 

page 4, lines174-175

Original sentence: The encoder primarily comprises three components: the Filter Module, the DBFE Module, and the DF Module.

Modified sentence: The encoder primarily comprises the Filter Module, the DBFE Module, and the DF Module.

 

page 6, lines 239-240

Original sentence: To facilitate the integration of features from the CNN and Transformer branches, we have introduced a dynamic fusion module.

Modified sentence: We have introduced a dynamic fusion module to facilitate the integration of features from the CNN and Transformer branches.

 

page 7, lines 285-287

Original sentence: During the screening process, we excluded some heavily occluded, extremely distorted, and repetitive data in this article.

Modified sentence: We excluded heavily occluded, extremely distorted, and repetitive data in this article during the screening process.

 

page 12, lines 424-426

Original sentence: Through the analysis of the data in the 1st and 3rd rows, it was found that the accuracy of the model improved by 0.2% when the CBAM module was incorporated into the CNN branch using a lightweight architecture.

Modified sentence: The analysis of the data in the 1st and 3rd rows found that the accuracy of the model improved by 0.2% when the CBAM module was incorporated into the CNN branch using a lightweight architecture.

 

page 12, lines 449-450

Original sentence: Data from the 1st, 5th, and final rows of the experiment indicate that the model performs poorly when the CNN branch is not utilized.

Modified sentence: Data from the experiment's 1st, 5th, and final rows indicate that the model performs poorly when the CNN branch is not utilized.
