Peer-Review Record

Benchmarking Psychological Lexicons and Large Language Models for Emotion Detection in Brazilian Portuguese

by Thales David Domingues Aparecido 1,2,3,†, Alexis Carrillo 2,†, Chico Q. Camargo 3,4,5 and Massimo Stella 2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 17 July 2025 / Revised: 9 August 2025 / Accepted: 16 September 2025 / Published: 1 October 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This study presents a comprehensive benchmark of emotion detection methods in Brazilian Portuguese by evaluating the performance of two Large Language Models (Mistral-24B and DeepSeek-8B) against the interpretable lexicon-based EmoAtlas across multiple emotion-labeled corpora. The research also introduces a novel “emotional fingerprinting” methodology, using LLM-generated datasets to explore how models encode and recognize emotional content, contributing to multilingual NLP and model interpretability.

Here are my comments:

The performance differences between models are mostly presented as accuracy/precision/recall values, without statistical tests to confirm whether the observed differences are significant; no confidence intervals or p-values are provided. I therefore recommend performing McNemar's test or computing bootstrapped confidence intervals for the classification comparisons.

The paper lacks a qualitative or quantitative analysis of misclassified examples. You should include an error typology, or at least exemplar misclassifications per emotion and model.

Only Mistral and EmoAtlas are compared; other strong baselines, such as multilingual BERT variants, are excluded.

The introduction lacks a review of recent studies, such as "Predicting flow status of a flexible rectifier using cognitive computing" and "Large language models for human–robot interaction: A review".

There is no analysis of model confidence or calibration. You should include calibration plots or Expected Calibration Error (ECE) metrics, especially for the LLMs.

No analysis is provided on how these models may handle gendered, racial, or political language—this is a major oversight, especially when applying emotion detection in social or political contexts.

Author Response

Thank you for your valuable feedback. We have revised the manuscript to address all of your comments and suggestions.

Comment: The performance differences between models are mostly presented as accuracy/precision/recall values, without statistical tests to confirm whether the observed differences are significant; no confidence intervals or p-values are provided. I therefore recommend performing McNemar's test or computing bootstrapped confidence intervals for the classification comparisons.

Response: We did calculate 95% confidence intervals for all reported metrics using a bootstrap method, as detailed in the Confidence Interval Calculation section of the Methods. The values are presented in the results tables in the format value (lower bound, upper bound), and non-overlapping confidence intervals indicate a significant difference between models, allowing for robust comparisons. We chose not to use McNemar's test because it is designed for comparing only two models at a time; given that our study involves four different models (BERTimbau, the stochastic baseline, EmoAtlas, and Mistral), the bootstrap method with confidence intervals is a more comprehensive approach for comparisons across all models and metrics.
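For illustration, the following is a minimal sketch of the percentile-bootstrap procedure described above. The function and variable names are ours for this example, and macro-F1 is just one of the metrics it can be applied to; it does not reproduce our exact implementation.

import numpy as np
from sklearn.metrics import f1_score

def bootstrap_ci(y_true, y_pred, metric, n_boot=10000, alpha=0.05, seed=42):
    # Percentile bootstrap: resample examples with replacement, recompute the metric,
    # and take the empirical (alpha/2, 1 - alpha/2) quantiles as the interval.
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    scores = [metric(y_true[idx], y_pred[idx])
              for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))]
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return metric(y_true, y_pred), lo, hi

macro_f1 = lambda t, p: f1_score(t, p, average="macro")
# point_a, lo_a, hi_a = bootstrap_ci(gold_labels, predictions_model_a, macro_f1)
# point_b, lo_b, hi_b = bootstrap_ci(gold_labels, predictions_model_b, macro_f1)
# Non-overlapping intervals suggest a significant difference between the two models.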

 

Comment: The paper lacks a qualitative or quantitative analysis of misclassified examples. You should include an error typology, or at least exemplar misclassifications per emotion and model.

Response: We have added new content to the Discussion section that includes a detailed qualitative and quantitative analysis of misclassified examples. This new section provides a deeper understanding of the models' specific failure points, including an error typology for certain emotions and datasets.

 

Comment: Only Mistral and EmoAtlas are compared; other strong baselines, such as multilingual BERT variants, are excluded.

Response: We agree that including a fine-tuned BERT variant is crucial for a comprehensive benchmark. We have now incorporated the BERTimbau-Large model into our analysis. The updated text and tables present BERTimbau's results alongside those of Mistral, EmoAtlas, and the stochastic baseline, providing a more robust comparison.
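For illustration, the sketch below shows how a BERTimbau-Large emotion classifier can be set up with the Hugging Face transformers library. The checkpoint name is the public BERTimbau-Large model; the label set, the assumption that the classification head has already been fine-tuned, and all hyperparameters are illustrative and do not reproduce our exact training configuration.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

EMOTIONS = ["joy", "trust", "fear", "surprise", "sadness", "disgust", "anger", "anticipation"]

tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-large-portuguese-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "neuralmind/bert-large-portuguese-cased",  # public BERTimbau-Large checkpoint
    num_labels=len(EMOTIONS),
)

def predict_emotion(text):
    # Tokenize one Brazilian Portuguese sentence and return the highest-scoring label
    # (meaningful only after the classification head has been fine-tuned on labeled data).
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    return EMOTIONS[int(logits.argmax(dim=-1))]

# predict_emotion("Estou muito feliz com os resultados de hoje!")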

Comment: The introduction lacks a review of recent studies, such as "Predicting flow status of a flexible rectifier using cognitive computing" and "Large language models for human–robot interaction: A review".

Response: We have reviewed the papers you recommended. While the topics of cognitive computing and human-robot interaction are highly relevant to the broader field of AI, their specific focus on engineering and multimodal applications falls outside the scope of our study on text-based emotion detection. We have maintained our original literature review, which is directly relevant to our specific task and methodologies.

 

Comment: There is no analysis of model confidence or calibration. You should include calibration plots or Expected Calibration Error (ECE) metrics, especially for the LLMs.

Response: We agree that an analysis of model confidence is valuable. However, the implementations of Mistral and EmoAtlas do not provide probability values, making a complete and fair calibration analysis across all models impossible. We chose to focus on the performance metrics that could be consistently calculated for every model.
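For reference, here is a minimal sketch of the Expected Calibration Error computation the reviewer refers to, assuming per-example confidence scores (e.g., maximum softmax probabilities) and correctness indicators were available; as explained above, such probabilities are not exposed by the Mistral and EmoAtlas setups used here, so this is illustrative only.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # ECE: weighted average, over equal-width confidence bins, of the absolute gap
    # between the bin's empirical accuracy and its mean confidence.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece

# expected_calibration_error([0.95, 0.60, 0.80], [1, 0, 1])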

 

Comment: No analysis is provided on how these models may handle gendered, racial, or political language—this is a major oversight, especially when applying emotion detection in social or political contexts.

Response: We have addressed this important point by adding new content to the Discussion section. We highlight the unique capabilities of EmoAtlas's semantic frame analysis to identify and quantify biases in text, demonstrating how this method can reveal how gendered, racial, or political terms are emotionally portrayed. This offers a valuable tool for responsible emotion detection that complements the performance of the black-box models.
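As a purely illustrative sketch of this kind of bias probing, the snippet below contrasts the emotional z-scores of two minimally different sentences using the EmoScores interface described in the EmoAtlas documentation. The exact class and method names, the language argument, and the availability of a Portuguese lexicon should be checked against the installed version; the example sentences are invented for illustration and are not drawn from our corpora.

from emoatlas import EmoScores

# Assumption: a Portuguese lexicon is available, as in the setup used in the paper.
emos = EmoScores(language="portuguese")

# Contrast how the same behaviour is emotionally framed for a gendered pair of sentences.
sentence_a = "A candidata foi descrita como agressiva durante o debate."
sentence_b = "O candidato foi descrito como assertivo durante o debate."

print(emos.zscores(sentence_a))  # dict mapping each Plutchik emotion to a z-score
print(emos.zscores(sentence_b))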

Reviewer 2 Report

Comments and Suggestions for Authors

This paper proposes a benchmark for emotion detection in Brazilian Portuguese. Here are some considerations.

1. The introduction clearly states the problem and the rationale. However, the innovation is not clear. The authors state that they are using an existing tool, EmoAtlas, for emotion profiling; this appears to be an application and benchmark of the tool. Therefore, more focus should be put on explaining and underlining the innovation provided by the proposal.
2. Related works are well written and properly discuss the problem under investigation. However, the whole proposal still appears as a use case of the tool EmoAtlas. 
3. The experimental methodology is well discussed and provides relevant details for reproducibility.
4. Overall, the real point of innovation is the "emotional fingerprinting". This is extremely interesting, as it is posed by the authors as a novel "emotional benchmark". However, the main suggestion is that the authors provide benchmarks on additional use cases; applying the benchmark to a single use case is somewhat limited and does not provide the basis for it to be accepted in a "wider sense" by the research community.

Therefore, the approach is extremely interesting and may provide a different perspective on a topic relevant to AI alignment, namely recognising human emotions. However, the suggestion is to shift the narrative to focus further on the emotional profiling and to broaden the analysis by including other languages and/or models, where possible.

Author Response

Thank you for your thoughtful and valuable feedback. We have revised the manuscript to address all of your comments and suggestions.

Comment: 1. The introduction clearly states the problem and the rationale. However, the innovation is not clear. The authors state that they are using an existing tool, EmoAtlas, for emotion profiling; this appears to be an application and benchmark of the tool. Therefore, more focus should be put on explaining and underlining the innovation provided by the proposal.

Comment: 2. Related works are well written and properly discuss the problem under investigation. However, the whole proposal still appears as a use case of the tool EmoAtlas. 

Response to comments 1 and 2: We have explicitly reframed the paper's narrative to focus on two core innovations: a comprehensive, multi-model benchmark for emotion detection in Brazilian Portuguese, and a novel "emotional fingerprinting" methodology. Our work is a new framework for evaluation, providing a foundational basis for wider adoption by the research community.

Comment: 3. The experimental methodology is well discussed and provides relevant details for reproducibility.

Response: We appreciate your positive comments on our methodology. 

Comment: 4. Overall, the real point of innovation is the "emotional fingerprinting". This is extremely interesting, as it is posed by the authors as a novel "emotional benchmark". However, the main suggestion is that the authors provide benchmarks on additional use cases; applying the benchmark to a single use case is somewhat limited and does not provide the basis for it to be accepted in a "wider sense" by the research community.

Therefore, the approach is extremely interesting and may provide a different perspective on a topic relevant to AI alignment, namely recognising human emotions. However, the suggestion is to shift the narrative to focus further on the emotional profiling and to broaden the analysis by including other languages and/or models, where possible.

Response: We agree that a comprehensive benchmark requires a broader analysis. We have now incorporated the BERTimbau-Large model into our study, and the updated results are presented alongside those of Mistral, EmoAtlas, and the stochastic baseline. We also added a section on future work that explicitly addresses the need to expand this benchmarking to other models and languages. In addition, we have added new content to the Discussion section that includes both qualitative and quantitative analysis of misclassified examples, providing an error typology for certain emotions and datasets, and that highlights the unique capabilities of EmoAtlas's semantic frame analysis, demonstrating how our methodology can be used to identify and quantify biases in text and thus offering a valuable tool for responsible emotion detection.

Reviewer 3 Report

Comments and Suggestions for Authors

This paper addresses the important and underexplored area of emotion detection in Brazilian Portuguese, specifically benchmarking lexicon-based EmoAtlas against state-of-the-art Large Language Models (LLMs). While the effort to provide the first quantitative benchmark for interpretable emotion detection in Brazilian Portuguese is commendable, several major issues require attention before publication:

1. Limited Scope of LLM Evaluation: The study primarily benchmarks only two specific state-of-the-art LLMs, Mistral 24B and DeepSeek r1 8B (DeepSeek was used for data generation, not direct benchmarking for emotion detection). While these are powerful models, relying on such a narrow selection limits the generalizability of the findings regarding LLM performance in Brazilian Portuguese emotion detection. A more comprehensive benchmarking would include a wider variety of LLM architectures and sizes, as well as more LLMs directly compared to EmoAtlas for emotion detection.

2. Omission of Brazilian Portuguese-Specific LLMs: The paper acknowledges that most NLP studies focus on English, and that "LLMs can be repurposed for various NLP tasks without requiring additional training". However, for a study specifically focused on Brazilian Portuguese, the absence of an evaluation of LLMs (or transformer models like BERT) that have been specifically pre-trained or fine-tuned for this language (e.g., BERTimbau) is a significant gap. While the paper mentions BERTimbau as a "heavy emotional profiling tool," it is not directly included in the benchmark. Including such models would provide a more relevant comparison and stronger insights into the performance of language-specific models versus more general multilingual LLMs for this task.

3. Potential for Bias in Evaluation and Presentation: The stated motivation of the work is to "validate EmoAtlas as an innovative and cost-effective alternative", and the paper frequently highlights EmoAtlas's interpretability, efficiency, and specific strengths, even when LLMs show higher overall accuracy. While these are valid points, the consistent framing and the selective highlighting of EmoAtlas's performance in specific cases could be perceived as a positive bias towards the traditional method. The authors should strive for a more neutral and objective presentation of results, focusing on a balanced comparison of strengths and weaknesses across all benchmarked models without appearing to advocate for one over the other.

4. Reliance on LLM-Translated and Generated Datasets: A significant portion of the evaluation relies on datasets either translated by LLMs (e.g., GoEmotions translated by Gemma 3 27B and Mistral Small 24B) or entirely generated by an LLM (DeepSeek R1 8B for news headlines). This introduces a potential circularity and raises concerns about how well these datasets truly represent natural human-generated Brazilian Portuguese text and its emotional nuances. The "noise introduced by translation processes" is acknowledged, but the full implications for benchmarking are not thoroughly discussed. Future revisions should explore using more natively human-annotated Brazilian Portuguese datasets to strengthen the validity of the benchmark.

Author Response

1. Limited Scope of LLM Evaluation: The study primarily benchmarks only two specific state-of-the-art LLMs, Mistral 24B and DeepSeek r1 8B (DeepSeek was used for data generation, not direct benchmarking for emotion detection). While these are powerful models, relying on such a narrow selection limits the generalizability of the findings regarding LLM performance in Brazilian Portuguese emotion detection. A more comprehensive benchmarking would include a wider variety of LLM architectures and sizes, as well as more LLMs directly compared to EmoAtlas for emotion detection.

Response: We agree that a more comprehensive evaluation is necessary. To address this, we have now included BERTimbau, a widely used and language-specific transformer model, in our direct benchmarking for emotion classification. Its performance is now directly compared against Mistral and EmoAtlas, which offers a more robust analysis of different model architectures.

2. Omission of Brazilian Portuguese-Specific LLMs: The paper acknowledges that most NLP studies focus on English, and that "LLMs can be repurposed for various NLP tasks without requiring additional training". However, for a study specifically focused on Brazilian Portuguese, the absence of an evaluation of LLMs (or transformer models like BERT) that have been specifically pre-trained or fine-tuned for this language (e.g., BERTimbau) is a significant gap. While the paper mentions BERTimbau as a "heavy emotional profiling tool," it is not directly included in the benchmark. Including such models would provide a more relevant comparison and stronger insights into the performance of language-specific models versus more general multilingual LLMs for this task.

Response: As noted above, we have addressed this point by incorporating the BERTimbau model. This addition provides a more relevant comparison and offers stronger insights into the performance differences between a general multilingual LLM, a language-specific transformer, and a lexicon-based approach for emotion detection in Brazilian Portuguese.

3. Potential for Bias in Evaluation and Presentation: The stated motivation of the work is to "validate EmoAtlas as an innovative and cost-effective alternative", and the paper frequently highlights EmoAtlas's interpretability, efficiency, and specific strengths, even when LLMs show higher overall accuracy. While these are valid points, the consistent framing and the selective highlighting of EmoAtlas's performance in specific cases could be perceived as a positive bias towards the traditional method. The authors should strive for a more neutral and objective presentation of results, focusing on a balanced comparison of strengths and weaknesses across all benchmarked models without appearing to advocate for one over the other.

Response: We have revised the manuscript to ensure a more neutral and objective presentation of our findings. The updated version frames the work as a balanced comparison of all three models, acknowledging the distinct strengths and weaknesses of each without advocating for a single tool. We now focus on the key trade-offs between performance, computational cost, and interpretability.

4. Reliance on LLM-Translated and Generated Datasets: A significant portion of the evaluation relies on datasets either translated by LLMs (e.g., GoEmotions translated by Gemma 3 27B and Mistral Small 24B) or entirely generated by an LLM (DeepSeek R1 8B for news headlines). This introduces a potential circularity and raises concerns about how well these datasets truly represent natural human-generated Brazilian Portuguese text and its emotional nuances. The "noise introduced by translation processes" is acknowledged, but the full implications for benchmarking are not thoroughly discussed. Future revisions should explore using more natively human-annotated Brazilian Portuguese datasets to strengthen the validity of the benchmark.

Response: We have revised the manuscript to address the concerns regarding our use of LLM-translated and generated datasets. Our evaluation includes two natively human-annotated datasets: one of 4,000 stock-market tweets and another of 1,000 news headlines. These corpora serve as a benchmark of real-world human-generated text. We acknowledge the potential for "noise introduced by translation processes". However, the approach of translating the GoEmotions dataset is a well-established method in the literature for extending emotion analysis to low-resource languages. Our findings show that model performance remained consistent across datasets translated by different LLMs, which supports the viability of this strategy for comparative benchmarking.  Finally, the LLM-generated dataset was not used for general performance evaluation. Instead, it was specifically created for our novel "emotional fingerprinting" methodology, which, by design, requires synthetically generated data to analyze the internal representations of the models themselves.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

ok

Reviewer 2 Report

Comments and Suggestions for Authors

The authors provided a complete revision of their previous work, addressing raised concerns and providing more perspectives on the proposed novelty. Therefore, the paper can be considered, in my opinion, for publication.
