Peer-Review Record

Authorship Attribution in Less-Resourced Languages: A Hybrid Transformer Approach for Romanian

Appl. Sci. 2024, 14(7), 2700; https://doi.org/10.3390/app14072700
by Melania Nitu 1 and Mihai Dascalu 1,2,3,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 18 February 2024 / Revised: 14 March 2024 / Accepted: 21 March 2024 / Published: 23 March 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper presents a novel approach to authorship attribution for Romanian texts using a hybrid Transformer model that integrates handcrafted linguistic features with contextualized embeddings. It is well-structured and offers a comprehensive study of authorship attribution for Romanian texts through a novel hybrid Transformer model. It significantly contributes to computational linguistics by integrating linguistic features with advanced machine-learning techniques, especially for less-resourced languages. However, several areas could benefit from further clarification or improvement:

1. The study's focus on Romanian texts is a strength, but it also limits the generalizability of its findings.

2. The paper acknowledges the dataset's size and diversity limitations. Expanding the dataset to include more authors, genres, and styles could enhance the model's robustness and ability to generalize across different writing styles.

3. While the paper includes comparisons with existing methodologies, it could benefit from a more detailed discussion of why specific models were chosen as benchmarks and how the proposed model's performance compares in a broader context.

4. A discussion on potential biases in the dataset and how they might affect the model's predictions would be valuable, along with considerations on how to mitigate these biases.

5. While the paper mentions the choice of the dataset and codebase, additional details about the model's implementation, parameter tuning, and computational resources required could help researchers replicate the study's results more effectively.

The paper has no significant flaws. Therefore, it can be published in its present form, possibly with minor revisions.

Author Response

The paper presents a novel approach to authorship attribution for Romanian texts using a hybrid Transformer model that integrates handcrafted linguistic features with contextualized embeddings. It is well-structured and offers a comprehensive study of authorship attribution for Romanian texts through a novel hybrid Transformer model. It significantly contributes to computational linguistics by integrating linguistic features with advanced machine-learning techniques, especially for less-resourced languages.
Response: Thank you kindly for your warm comments and thorough review.

However, several areas could benefit from further clarification or improvement:
1. The study's focus on Romanian texts is a strength, but it also limits the generalizability of its findings.
Response: Even though the immediate focus is on Romanian, the methodologies applied in this study can be readily extended to other less-resourced languages. The introduction of the revised manuscript has been updated to reflect this.

2. The paper acknowledges the dataset's size and diversity limitations. Expanding the dataset to include more authors, genres, and styles could enhance the model's robustness and ability to generalize across different writing styles.
Response: Our paper indeed acknowledges the limitations in the dataset's size and diversity. We have made efforts to expand the dataset to improve the model's robustness and generalization across various writing styles. However, expanding the dataset posed challenges due to copyright concerns: as most books are under copyright, we could only use the full text of publicly available Romanian books. Other researchers acknowledge this limitation as well [Avram, S.-M.; Oltean, M. A Comparison of Several AI Techniques for Authorship Attribution on Romanian Texts. Mathematics 2022, 10, 4589]. Despite these challenges, we managed to triple the dataset's size compared to previously published datasets [https://www.kaggle.com/datasets/sandamariaavram/rost-romanian-stories-and-other-texts].

3. While the paper includes comparisons with existing methodologies, it could benefit from a more detailed discussion of why specific models were chosen as benchmarks and how the proposed model's performance compares in a broader context.
Response: The selection of our models was motivated by the objective of establishing a baseline for comparison with prior research in Romanian authorship attribution. Considering the limited existing studies on the topic (only two known to date), our goal was to fill this gap by presenting a comprehensive analysis. Unlike previous approaches, which relied primarily on standalone machine learning (ML) models and a basic BERT-based approach, we introduced novel methodologies that combine the strengths of seven ML models and introduce a hybrid BERT-based model. Our hybrid model not only uses text input but also incorporates linguistic features for its predictions; the two representations are fused before classification, as sketched below. As a result, our proposed approach surpasses the state-of-the-art performance of prior studies.
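
For illustration only, here is a minimal sketch of such a hybrid head, in which a Transformer's pooled embedding is concatenated with handcrafted linguistic features before classification. The encoder name, feature dimension, and layer sizes are our own assumptions for this sketch, not the exact configuration from the paper:

    import torch
    import torch.nn as nn
    from transformers import AutoModel

    class HybridAuthorshipClassifier(nn.Module):
        """Fuses contextual embeddings with handcrafted linguistic features."""

        def __init__(self, num_authors, num_linguistic_features,
                     encoder_name="dumitrescustefan/bert-base-romanian-cased-v1"):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(encoder_name)
            hidden = self.encoder.config.hidden_size
            # Classification head over the [CLS] embedding plus handcrafted features.
            self.head = nn.Sequential(
                nn.Linear(hidden + num_linguistic_features, 256),
                nn.ReLU(),
                nn.Dropout(0.1),
                nn.Linear(256, num_authors),
            )

        def forward(self, input_ids, attention_mask, linguistic_features):
            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
            cls_embedding = out.last_hidden_state[:, 0]        # [batch, hidden]
            combined = torch.cat([cls_embedding, linguistic_features], dim=-1)
            return self.head(combined)                         # [batch, num_authors]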

4. A discussion on potential biases in the dataset and how they might affect the model's predictions would be valuable, along with considerations on how to mitigate these biases.
Response: Discussing potential biases within our dataset and their impact on the model's predictions is indeed essential. Our dataset exhibits a substantial imbalance in the distribution of texts, with certain authors disproportionately represented. For instance, in the PP set, we observe 2,982 paragraphs for Nicolae Iorga compared to only 68 paragraphs for Mihai Oltean. Similarly, in the FT set, there are only 20 stories from Panait Istrati compared to 132 from Anton Bacalbasa. Such imbalance can lead to skewed outcomes or diminished accuracy, potentially biasing predictions against under-represented authors. To mitigate this bias, our methodologies incorporate techniques such as weighted loss functions and stratified label splits, sketched below.
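
As a minimal, self-contained sketch of these two mitigation techniques (toy data and standard scikit-learn/PyTorch utilities; the variable names are ours, not the paper's code):

    import numpy as np
    import torch
    import torch.nn as nn
    from sklearn.model_selection import train_test_split
    from sklearn.utils.class_weight import compute_class_weight

    # Toy label distribution echoing the imbalance described above.
    labels = np.array([0] * 100 + [1] * 8)        # author 0 dominates author 1
    texts = [f"paragraph {i}" for i in range(len(labels))]

    # Stratified split: each author keeps the same proportion in both partitions.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=42)

    # "balanced" weights are inversely proportional to class frequency, so
    # misclassifying the under-represented author costs more during training.
    weights = compute_class_weight("balanced", classes=np.unique(labels), y=labels)
    criterion = nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))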

5. While the paper mentions the choice of the dataset and codebase, additional details about the model's implementation, parameter tuning, and computational resources required could help researchers replicate the study's results more effectively.
Response: In response to this concern, we have addressed the parameter tuning process extensively in Section 3.4, providing detailed insights into our approach. We have included a comprehensive table summarizing the parametrization of the machine learning models employed in our study, supplemented with details on the computational resources required for our experiments.

The paper has no significant flaws. Therefore, it can be published in its present form, possibly with minor revisions.
Response: Thank you kindly again.

Reviewer 2 Report

Comments and Suggestions for Authors

Overall, the paper is generally well written and seems to contain a decent contribution, but I have some reservations.

The paper contains a large amount of text that seems to be copy-pasted from other sources. There are signs indicating that the text is taken almost verbatim from the respective sources. This is not what we expect from a scientific paper.

The iThenticate score of over 10% (16%) indicates that either significant parts of the paper were generated with ChatGPT or the authors themselves took text from the abstracts of various papers. For example, Section 2.3 (page 5) contains the following passage, taken straight from the abstract of https://www.mdpi.com/2076-3417/12/15/7518:

"Post-authorship attribution [30] is the process of using stylometric features to identify the writer of an online text snippet, such as an email, blog, forum post, or chat log. It has useful applications in manifold domains, for instance, in a verification process to proactively detect misogynistic, misandrist, xenophobic, and abusive posts on the internet or social networks. However, defining an appropriate characterization of text to capture the unique writing style of an author is a complex endeavor."

A citation is given, but copying text from other works is not the norm in Computer Science. Half a page is essentially copied over; in two cases, more than 200 words are taken verbatim.

This needs to be fixed before any reviewer will look at the contribution.

This is regrettable, as the experimental section of the paper seems sound and there appears to be a decent contribution in it. Feel free to discuss with the editors whether you are allowed to resubmit. For now, the paper is rejected.

Author Response

We are deeply sorry for our oversight in this matter. We have reviewed the identified paragraph and rewritten and expanded it to ensure originality and compliance with academic standards. We appreciate you highlighting this issue, and we assure you that we are committed to maintaining the integrity of our research.

We appreciate the opportunity to resubmit a revised version of our manuscript.

Reviewer 3 Report

Comments and Suggestions for Authors

The paper introduces and evaluates a hybrid Transformer model for authorship attribution in Romanian texts. The evaluation is performed against alternative approaches on a publicly available, enhanced corpus of Romanian stories created by the authors. The authors state that their evaluation demonstrates their approach outperforms the alternatives.

(1) The paper would be improved by highlighting the universal benefit of addressing the problem of authorship attribution in Romanian. While the authors state that "the linguistic complexity of Romanian presents unique challenges on which out-of-the-box solutions do not work", the impact of the paper would be strengthened by highlighting additional areas of research that would benefit from improved approaches (like the one described by the authors) to authorship attribution in Romanian texts.

(2) The evaluation in the paper would be improved by testing for statistically significant differences among the methods shown in Table 3 for the F1 score and error rate. More specifically, Table 3 presents an analysis of six algorithms. However, this approach only identifies that BERT Embeddings x RBI was the most accurate and had the highest F1; it does not establish whether there are any statistically significant differences in the level of performance of these methods. The paper would be improved by performing tests for statistical significance among the compared methods to highlight which of the approaches, with respect to accuracy and F1, are materially different from one another.

(3) In addition, it is noteworthy that only two techniques in the evaluation are not provided by the authors. Addressing why this is the case, and highlighting that an exhaustive search found only these two alternative approaches (outside of the authors' own), would improve the paper.

(4) The authors are to be commended for a discussion section highlighting the subtleties of their results. However, a limitation discovered by Lynch et al. should be highlighted as well. Specifically, when transformer models misclassify or misgenerate text, there can be specific patterns in the types of errors they make. It is paramount to understand the patterns of these misclassifications and misgenerations so that users can be aware of the circumstances that are likely to result in errors.

Lynch, Christopher J., et al. "A Structured Narrative Prompt for Prompting Narratives from Large Language Models: Sentiment Assessment of ChatGPT-Generated Narratives and Real Tweets." Future Internet 15.12 (2023): 375.

(5) The authors are to be commended for including the data for the evaluation and the source code. However, the README file provided with the source code is insufficient for repeatability and replication of the experiments. Additional instructions need to be supplied so that readers can generate the same results appearing in the paper.

(6) The presentation of Tables 2 and 8 would be improved if the numeric data in the tables were right-justified. That would align significant digits in the data, enabling readers to easily compare results among rows.

Author Response

The paper introduces and evaluates a hybrid Transformer model for authorship attribution in Romanian texts. The evaluation is performed against alternative approaches on a publicly available, enhanced corpus of Romanian stories created by the authors. The authors state that their evaluation demonstrates their approach outperforms the alternatives.

Response: Thank you kindly for your thorough review.

(1) The paper would be improved by highlighting the universal benefit of addressing the problem of authorship attribution in Romanian. While the authors state that "the linguistic complexity of Romanian presents unique challenges on which out-of-the-box solutions do not work", the impact of the paper would be strengthened by highlighting additional areas of research that would benefit from improved approaches (like the one described by the authors) to authorship attribution in Romanian texts.

Response: We appreciate your suggestion to highlight the universal benefits of addressing the problem of authorship attribution in Romanian texts. Based on your recommendation, we have expanded the conclusions section of our manuscript to underscore the broader implications of our research. In the revised text, we emphasized that our work not only enhances the understanding of computational linguistics in multilingual environments but also has practical applications in forensic linguistics, historical document analysis, and literary studies.

(2) The evaluation in the paper would be improved by testing for statistically significant differences among the methods shown in Table 3 for the F1 score and error rate. More specifically, Table 3 presents an analysis of six algorithms. However, this approach only identifies that BERT Embeddings x RBI was the most accurate and had the highest F1; it does not establish whether there are any statistically significant differences in the level of performance of these methods. The paper would be improved by performing tests for statistical significance among the compared methods to highlight which of the approaches, with respect to accuracy and F1, are materially different from one another.

Response: Thank you for the suggestion. In the revised paper, we have addressed this by introducing McNemar and Cochran's Q statistical tests to assess performance differences among the top three models. This enabled us to determine whether there are statistically significant differences in the correctness of these methods' predictions, enhancing the evaluation and providing insights into their comparative performance.
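
A minimal sketch of how such tests can be run over per-sample correctness indicators, using statsmodels (the arrays below are synthetic stand-ins, not the actual model predictions):

    import numpy as np
    from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

    rng = np.random.default_rng(0)
    correct = rng.integers(0, 2, size=(200, 3))   # rows: test samples; cols: models A, B, C

    # Cochran's Q: is there any significant difference in correctness across all three?
    print(cochrans_q(correct))

    # McNemar between models A and B: 2x2 table of (dis)agreement counts.
    a, b = correct[:, 0].astype(bool), correct[:, 1].astype(bool)
    table = [[np.sum(a & b), np.sum(a & ~b)],
             [np.sum(~a & b), np.sum(~a & ~b)]]
    print(mcnemar(table, exact=True))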

(3) In addition, it is noteworthy that only two techniques in the evaluation are not provided by the authors. Addressing why this is the case, and highlighting that an exhaustive search found only these two alternative approaches (outside of the authors' own), would improve the paper.

Response: We appreciate your suggestion and have updated the manuscript accordingly (Section 5.1) to explain why only two techniques in our evaluation are not our own. Our focus was specifically on methodologies used in Romanian authorship attribution studies, in order to establish a fair baseline for comparison. As mentioned in our paper, different languages present unique linguistic complexities, necessitating language-specific methodologies. To our knowledge, there are currently only two existing studies in this domain, which we have outlined in Section 2 of our manuscript. By emphasizing the limited number of existing approaches in Romanian authorship attribution research, we provide context for our evaluation and enhance the clarity of our findings.

(4) The authors are to be commended for a discussion section highlighting the subtleties of their results. However, a limitation discovered by Lynch et al. should be highlighted as well. Specifically, when transformer models misclassify or misgenerate text, there can be specific patterns in the types of errors they make. It is paramount to understand the patterns of these misclassifications and misgenerations so that users can be aware of the circumstances that are likely to result in errors.

Lynch, Christopher J., et al. "A Structured Narrative Prompt for Prompting Narratives from Large Language Models: Sentiment Assessment of ChatGPT-Generated Narratives and Real Tweets." Future Internet 15.12 (2023): 375.

Response: Thank you for your suggestion. In the revised manuscript, we added a new section (5.3) focusing on analyzing misclassification trends for our top-performing model. We cross-correlated misclassifications with our linguistic analysis (i.e., the top features per author) to reveal potential causes and underlying patterns.
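
For illustration, one simple way to surface such patterns is to rank the off-diagonal entries of the confusion matrix; the data below is synthetic, and the names merely echo authors mentioned earlier in this response:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    authors = ["Iorga", "Istrati", "Bacalbasa"]
    y_true = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 0])   # gold author ids
    y_pred = np.array([0, 1, 1, 1, 2, 0, 0, 2, 2, 0])   # model predictions

    cm = confusion_matrix(y_true, y_pred, labels=range(len(authors)))
    errors = cm.copy()
    np.fill_diagonal(errors, 0)                          # keep only misclassifications

    # Rank the most frequently confused (true, predicted) author pairs.
    pairs = sorted(((errors[i, j], authors[i], authors[j])
                    for i in range(len(authors))
                    for j in range(len(authors)) if errors[i, j] > 0),
                   reverse=True)
    for count, true_author, predicted in pairs:
        print(f"{true_author} misread as {predicted}: {count}")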

(5) The authors are to be commended for including the data for the evaluation and the source code. However, the README file provided with the source code is insufficient for repeatability and replication of the experiments. Additional instructions need to be supplied so that readers can generate the same results appearing in the paper.

Response: We have updated the README file with a few additional details. Instructions are also included within each file, while the fine-tuning and parameterizations are described in the paper for reference and reproducibility.

(6) The presentation of Tables 2 and 8 would be improved if the numeric data in the tables were right-justified. That would align significant digits in the data, enabling readers to easily compare results among rows.

Response: Table 2 has been updated. However, in Table 12 (formerly Table 8), right-justifying the numbers would only affect the "No. Docs" column, which would appear inconsistent with the rest of the table; we therefore kept the alignment as it was.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

While it seems the paper was reworked, the problems still persist.

The plagiarism score is even higher than last time (31% compared to 16%). As far as I can tell, part of the increase is a mistake caused by the bibliography being included in the scan as well. However, even after removing the bibliography, the score would still be higher than last time, which is problematic.

Before Turnitin was introduced, it was acceptable to reuse up to half the material in a new publication, provided there was sufficient new material. The report, however, indicates that around 14% of the text was taken or paraphrased from a single source, which is not acceptable.

Until the situation is clarified, the paper remains rejected. Please clarify with the editors why you have obtained those scores. I would recommend that you do not resubmit the paper immediately and take your time to properly fix these issues.

It is really a pity, as the experimental section seems to be okay and otherwise it might have been an acceptable paper.

For the future, I would recommend checking each paper with Turnitin before submitting it to any journal.

Author Response

We thank you for your rigor and appreciate all your insights.

We actually ran Turnitin before our previous resubmission. We were shocked by the reported percentages and have just now rerun the check; please find our report attached.

For us, Turnitin reports 15%, but this figure is inflated: everything after the Author Contributions section at the end matches our previous study, and our affiliations and the parameters reported in Table 4 are also flagged as copied, although we cannot change any of these.

Otherwise, only small, common phrases are flagged, and there is nothing problematic on our side.

Author Response File: Author Response.pdf
