Article
Peer-Review Record

WCC-JC 2.0: A Web-Crawled and Manually Aligned Parallel Corpus for Japanese-Chinese Neural Machine Translation

Electronics 2023, 12(5), 1140; https://doi.org/10.3390/electronics12051140
by Jinyi Zhang 1,*, Ye Tian 2, Jiannan Mao 3, Mei Han 4, Feng Wen 1, Cong Guo 1, Zhonghui Gao 1 and Tadahiro Matsumoto 3
Reviewer 1: Anonymous
Submission received: 6 February 2023 / Revised: 19 February 2023 / Accepted: 24 February 2023 / Published: 26 February 2023
(This article belongs to the Special Issue Natural Language Processing and Information Retrieval)

Round 1

Reviewer 1 Report

The manuscript has both strengths and weaknesses.

 

Strengths:

The aim of the study is clearly stated and the research is focused on addressing an important problem in the field, which is the language barrier between China and Japan.

The introduction and related works section provide a comprehensive and informative background on the subject, the challenges faced in the field, and the importance of the study.

The methods used in the paper are appropriate for the goal of the study, which is to validate the efficacy of a newly constructed Japanese-Chinese parallel corpus for machine translation.

The paper presents the WCC-JC 2.0 corpus, a Japanese-Chinese spoken language corpus, which is one of the world's largest publicly available Japanese-Chinese bilingual corpora.

The paper provides a detailed description of the experiments conducted and the results obtained, which include both quantitative and qualitative evaluations.

 

Weaknesses:

The abstract does not provide a detailed description of the method used to construct the WCC-JC 2.0 corpus, which could be a limitation for readers.

The conclusion does not provide a clear explanation of how the research contributes to addressing knowledge gaps in the field, nor does it address any limitations of the study that could impact future research.

The paper could benefit from more thorough proofreading to address minor grammatical errors and typos.

Overall, the manuscript appears to be well-structured and focused on an important research problem, but could benefit from some improvements in terms of clarity and completeness of the conclusions.

Author Response

Response to Reviewer 1 Comments

 

Manuscript Title: WCC-JC 2.0: A Web-Crawled and Manually Aligned Parallel Corpus for Japanese-Chinese Neural Machine Translation

 

Authors: Jinyi Zhang, Ye Tian, Jiannan Mao, Mei Han, Feng Wen, Cong Guo, Zhonghui Gao, Tadahiro Matsumoto

 

Reviews response:

First of all, we would like to express our sincere gratitude to the anonymous reviewers for their critical reading and enlightening suggestions for improving this manuscript.

Our point-by-point replies to the reviewers’ comments follow.

(Please note the reviewers’ comments are written in blue italics)

Point 1:

The abstract does not provide a detailed description of the method used to construct the WCC-JC 2.0 corpus, which could be a limitation for readers.

 

 

Response:

Thank you for your comments. The part you mentioned has been rewritten as follows:

To address these shortcomings, we thoroughly analyzed the issues associated with the construction of WCC-JC 1.0 and constructed the WCC-JC 2.0 corpus by first collecting subtitle data from movie and TV series websites. Then, we manually aligned a large number of high-quality sentence pairs. Our efforts resulted in a new corpus that includes about 1.4 million sentence pairs, an 87% increase compared to WCC-JC 1.0. As a result, WCC-JC 2.0 is now one of the largest publicly available Japanese-Chinese bilingual corpora in the world.

 

Point 2:

The conclusion does not provide a clear explanation of how the research contributes to addressing knowledge gaps in the field, nor does it address any limitations of the study that could impact future research.

 

Response:

Thank you very much for your positive comments. The part you mentioned has been rewritten as follows:

The WCC-JC 2.0 corpus presented in this study is a significant contribution to the field of Japanese-Chinese spoken language corpora. The corpus, which comprises approximately 1.4 million Japanese-Chinese sentence pairs, was constructed through a large-scale collection of bilingual sentences from subtitles and subsequent manual alignment. This makes the WCC-JC 2.0 corpus one of the largest publicly available Japanese-Chinese bilingual corpora, and unique in that it is predominantly based on spoken language data derived from subtitle files.

 

The effectiveness of the WCC-JC 2.0 corpus was evaluated through Japanese-Chinese translation experiments and manual evaluations. Future work will focus on unifying the WCC-JC series of corpora and exploring data augmentation techniques, which could further enhance the quality of the corpus and provide more opportunities for research in the field. The WCC-JC 2.0 corpus and the advancements made in this study have the potential to significantly reduce the knowledge gap in Japanese-Chinese NMT and further drive the development of this field.

 

Point 3:

The paper could benefit from more thorough proofreading to address minor grammatical errors and typos.

 

Response:

Thank you for pointing them out. The resubmitted paper has been grammar checked and proofread.

 


Reviewer 2 Report

A parallel corpus is presented for neural machine translation (NMT) between Japanese and Chinese spoken transcripts. It is a challenging task given the more agrammatic nature of spoken utterances. This corpus is a second improved version. A transformer model is proposed for the corpus evaluation task.

Important results are presented combining human and machine translation. The BLEU results, although impressive for spoken transcripts, still fall considerably below the level of an understandable translation, which is usually associated with a score above 30.

Therefore, the authors are expected to discuss the translation quality results, particularly the BLEU scores, in more detail and in context.

 

Author Response

Response to Reviewer 2 Comments

 


Point 1:

Important results are presented combining human and machine translation. The BLEU results, although impressive for spoken transcripts, still fall considerably below the level of an understandable translation, which is usually associated with a score above 30.

Therefore, the authors are expected to discuss the translation quality results, particularly the BLEU scores, in more detail and in context.

 

Response:

Thank you for your comments.

There are many possible reasons for the low BLEU values. The most likely cause is that a single Japanese sentence has often been translated into many different Chinese sentences, a one-to-many situation.

We investigated the duplicate Japanese sentences in WCC-JC, that is, Japanese sentences that are paired with more than one Chinese translation.

The following table lists the top 10 duplicate Japanese sentences.

As we can see, these sentences are all very common everyday utterances in spoken language, and they are translated differently in different scenarios, that is, one-to-many.
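The kind of duplicate analysis described above can be sketched as follows. This is only an illustration under our own assumptions: the function name `one_to_many_sources`, the ranking by number of distinct translations, and the toy sentence pairs are hypothetical and are not taken from WCC-JC or from the authors' tooling.

```python
from collections import defaultdict


def one_to_many_sources(pairs, top_k=10):
    """Return up to top_k source sentences that have more than one distinct translation.

    pairs: iterable of (japanese, chinese) sentence pairs.
    Ranking by the number of distinct translations is an assumption of this sketch.
    """
    targets_by_source = defaultdict(set)
    for ja, zh in pairs:
        targets_by_source[ja.strip()].add(zh.strip())

    # Keep only sources that map to at least two different target sentences.
    multi = {ja: zhs for ja, zhs in targets_by_source.items() if len(zhs) > 1}

    # Most translations first; ties are broken by the source text for determinism.
    ranked = sorted(multi.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(ja, sorted(zhs)) for ja, zhs in ranked[:top_k]]


# Toy data invented for illustration only (not drawn from the corpus).
pairs = [
    ("ありがとう", "谢谢"),
    ("ありがとう", "谢谢你"),
    ("ありがとう", "多谢"),
    ("はい", "是的"),
    ("はい", "好的"),
    ("行こう", "走吧"),
]
result = one_to_many_sources(pairs)
```

On the toy data, only the two sentences with several distinct translations are returned; the sentence with a single translation is filtered out.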

 

The next table shows the results of one-to-many translations (only ten kinds are shown here); the corresponding English translations show that these are short colloquial sentences.

Although the characters differ, the meanings are very similar: the spoken wording varies slightly from one scenario to another, yet the English translations are essentially the same.

However, when evaluating the translation results, we cannot include every valid reference translation. This is a problem that is difficult to avoid with automatic evaluation metrics for machine translation.
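To illustrate why a missing reference lowers the score, the sketch below computes clipped character n-gram precision, the core quantity inside BLEU, against one reference and then against two. It is deliberately not full BLEU (no brevity penalty and no geometric mean over n-gram orders), and the function name and example sentences are our own assumptions rather than the evaluation setup used in the paper.

```python
from collections import Counter


def clipped_precision(hypothesis, references, n=1):
    """Clipped character n-gram precision of a hypothesis against one or more references."""
    def ngrams(text):
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    hyp_counts = ngrams(hypothesis)
    if not hyp_counts:
        return 0.0

    # For each n-gram, take the maximum count observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref).items():
            max_ref[gram] = max(max_ref[gram], count)

    # Each hypothesis n-gram is credited at most as often as it appears in a reference.
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


# Invented example: a valid translation that differs from the only available reference.
hypothesis = "多谢"
single = clipped_precision(hypothesis, ["谢谢你"])
multi = clipped_precision(hypothesis, ["谢谢你", "多谢"])
```

With only one reference, the hypothesis is penalised even though it is an acceptable translation; adding the second reference removes the penalty. In a one-to-many corpus the test set typically holds just one of the valid references, which pushes automatic scores down.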

