Next Article in Journal
Decoupling Rainfall and Surface Runoff Effects Based on Spatio-Temporal Spectra of Wireless Channel State Information
Previous Article in Journal
Routing Technologies for 6G Low-Power and Lossy Networks
 
 
Article
Peer-Review Record

Zero-Shot Classification of Illicit Dark Web Content with Commercial LLMs: A Comparative Study on Accuracy, Human Consistency, and Inter-Model Agreement

Electronics 2025, 14(20), 4101; https://doi.org/10.3390/electronics14204101
by Víctor-Pablo Prado-Sánchez *, Adrián Domínguez-Díaz, Luis De-Marcos and José-Javier Martínez-Herráiz
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Electronics 2025, 14(20), 4101; https://doi.org/10.3390/electronics14204101
Submission received: 17 September 2025 / Revised: 15 October 2025 / Accepted: 17 October 2025 / Published: 19 October 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper benchmarks eight commercial LLMs on zero-shot classification of Dark Web content using the CoDA dataset, showing strong overall performance and high agreement with humans, but highlighting persistent challenges in ambiguous categories.

1.The results tables would benefit from clearer visual emphasis. In particular, highlighting the best results in bold black font (especially Table 4) would greatly improve readability and allow for easier comparison across models. 

2. Missing Baselines; while the paper provides a comprehensive comparison across leading commercial LLMs and includes human annotation consistency, it would be valuable to also benchmark against traditional machine learning, deep learning models and fine-tuned models such as DarkBERT. Such baselines are already discussed in the related work section but are not quantitatively included in the results. This provides an answer to how LLMs compare not only with humans but also with established supervised approaches trained directly on CoDA.

3. Although the study highlights the strong zero-shot performance of commercial LLMs, it does not fully address computational cost, latency, or accessibility compared to traditional ML models or variability of such factors across used commercial LLMs. Since such applications often operate under resource constraints, this would be an important dimension to discuss.
   
4. The paper focuses on macro-averaged scores but could provide deeper analysis of the most challenging categories (e.g., Violence, Electronic, Financial)*. More qualitative insights into typical misclassifications would strengthen the discussion.

5. Finally, while zero-shot performance is quite good to start with, the absence of few-shot or prompt-tuning comparisons leaves open the question of whether relatively minor adaptations could significantly close the gap with human-level annotation.




Author Response

RESPONSE LETTER

RESPONSE: We would like to sincerely thank the Editor and the reviewers for their constructive and detailed comments on our manuscript. Their thoughtful feedback across multiple rounds of review has been extremely valuable in helping us improve both the rigor and the clarity of the study. In what follows, we provide a point-by-point response to all the comments raised, describing the changes made in the revised version of the manuscript.

 

RESPONSE TO REVIEWER #1

RESPONSE: We would like to express our gratitude to reviewer No. 1 for the careful and thorough evaluation of our work. Thanks to their detailed comments and suggestions, we have been able to substantially improve the quality, accuracy, and scientific rigour of the manuscript. Below, we address each point in detail and explain the modifications made in this revision.

 

1.The results tables would benefit from clearer visual emphasis. In particular, highlighting the best results in bold black font (especially Table 4) would greatly improve readability and allow for easier comparison across models. 

RESPONSE: We appreciate the reviewer’s suggestion regarding the visual emphasis of the results tables. Following this recommendation, we have updated the manuscript to highlight the best results using bold black font in the majority of tables, including Table 4. This change improves readability and facilitates comparison across models. We thank the reviewer for helping us enhance the clarity of our presentation.

 

  1. Missing Baselines; while the paper provides a comprehensive comparison across leading commercial LLMs and includes human annotation consistency, it would be valuable to also benchmark against traditional machine learning, deep learning models and fine-tuned models such as DarkBERT. Such baselines are already discussed in the related work section but are not quantitatively included in the results. This provides an answer to how LLMs compare not only with humans but also with established supervised approaches trained directly on CoDA.

 RESPONSE: We appreciate the reviewer’s insightful observation regarding the inclusion of baseline models. In response, we have added a new results table (Table 12) that presents a comparative analysis between traditional supervised approaches (SVM, CNN, and a fine-tuned BERT model) trained directly on the CoDA dataset and the best-performing LLM in our study (DeepSeek Chat). This addition enables a more complete assessment of the relative performance of commercial LLMs compared to established supervised baselines. We thank the reviewer for this valuable suggestion, which has strengthened the robustness of our evaluation.

 

  1. Although the study highlights the strong zero-shot performance of commercial LLMs, it does not fully address computational cost, latency, or accessibility compared to traditional ML models or variability of such factors across used commercial LLMs. Since such applications often operate under resource constraints, this would be an important dimension to discuss.

RESPONSE: Thank you for raising this important point. To address the computational cost, latency, and accessibility considerations, we have incorporated a new table (Table 3) that provides a detailed comparison of the commercial LLMs used in our study. The table includes information on architecture type, number of parameters, training data sources, API provider, cost per 1,000 tokens (input/output), and estimated inference latency. This addition allows readers to better understand the practical implications and deployment feasibility of each model in resource-constrained environments. We appreciate the reviewer’s suggestion, which has enhanced the completeness and applicability of our study.

 

  1. The paper focuses on macro-averaged scores but could provide deeper analysis of the most challenging categories (e.g., Violence, Electronic, Financial)*. More qualitative insights into typical misclassifications would strengthen the discussion.

RESPONSE: We appreciate the reviewer’s suggestion to provide more insight into the most challenging categories and misclassifications. In response, we have added a new table (Table 5) that presents a representative example of misclassification made by one of the evaluated LLMs (GPT‑3.5 Turbo). The sample, originally labeled as Others, was misclassified as Drugs due to the presence of ambiguous terms such as "darkmarket" and references to marketplaces. This example illustrates how certain lexical cues may lead models to erroneous predictions in categories with semantic overlap. We believe this addition strengthens the discussion by highlighting concrete challenges LLMs face in distinguishing between subtly different illicit content types.

 

  1. Finally, while zero-shot performance is quite good to start with, the absence of few-shot or prompt-tuning comparisons leaves open the question of whether relatively minor adaptations could significantly close the gap with human-level annotation.

RESPONSE: We thank the reviewer for this valuable observation. We have addressed this point in the revised manuscript by expanding the limitations section. Specifically, we now clarify that our study employed a strict zero-shot setting using a fixed prompt that mirrors the original category definitions provided to the human annotators of the CoDA dataset. While this setup ensures comparability with the human-labeled ground truth, it does not explore the potential performance gains from few-shot examples or prompt-tuning techniques. We acknowledge that future work should examine whether minor adaptations can further improve model alignment with human annotations.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors
  1. You built this entire comparison on one specific prompt. How can you be sure you're not just ranking how well each model understands your particular wording, rather than their true capability for the task? A different prompt could easily shuffle this leaderboard.
  2. How was the 0.95 cosine similarity threshold for deduplication justified (Section 3.1), and why was no sensitivity analysis conducted to assess its impact on class imbalance, especially for underrepresented categories like "Violence" (421 documents post-deduplication, Table 1)?
  3. Given the deterministic querying with temperature=0 via APIs in March-April 2025 (Section 3.2), how does the study address potential non-reproducibility from API updates or model stochasticity in practical applications?
  4. Why are differences in macro-averaged F1-scores (e.g., DeepSeek Chat at 87.01% vs. GPT-3.5 Turbo at 83.11%, Table 3) reported without statistical tests like McNemar's or bootstrapping to confirm if they are significant rather than random variations?
  5. Although error patterns are briefly mentioned (Section 4), why is there no inclusion of confusion matrices or qualitative examples to analyze misclassifications, especially in lower-performing categories like "Violence" (F1-scores often <80%, Table 4)?
  6. Despite the sensitive nature of illicit Dark Web content, why are ethical considerations—such as risks of misclassification amplifying harm or biases in the CoDA dataset—omitted, and how was compliance with data handling policies ensured (Section 3.1)?
  7. The study reports high human alignment (weighted Cohen’s Kappa >0.84 for top models, Table 5), but why was the original CoDA dataset's inter-annotator agreement not validated or discussed, particularly for potential biases in ambiguous classes (Section 5.2)? 

Author Response

RESPONSE LETTER

RESPONSE: We would like to sincerely thank the Editor and the reviewers for their constructive and detailed comments on our manuscript. Their thoughtful feedback across multiple rounds of review has been extremely valuable in helping us improve both the rigor and the clarity of the study. In what follows, we provide a point-by-point response to all the comments raised, describing the changes made in the revised version of the manuscript.

 

RESPONSE TO REVIEWER #2

RESPONSE: We would like to express our gratitude to reviewer No. 2 for the careful and thorough evaluation of our work. Thanks to their detailed comments and suggestions, we have been able to substantially improve the quality, accuracy, and scientific rigour of the manuscript. Below, we address each point in detail and explain the modifications made in this revision.

 

  1. You built this entire comparison on one specific prompt. How can you be sure you're not just ranking how well each model understands your particular wording, rather than their true capability for the task? A different prompt could easily shuffle this leaderboard.

RESPONSE: We appreciate this important observation regarding the influence of prompt formulation on model performance. This point has been explicitly acknowledged and discussed in the revised Limitations section of the manuscript. As noted, we intentionally employed a single, fixed prompt that mirrors the category definitions originally used by the human annotators during the construction of the CoDA dataset. This design choice was made to ensure fair comparability between LLM predictions and the gold-standard human annotations.

However, we agree that using a single prompt constrains the generalizability of the findings. Different phrasings could indeed shift model rankings, as LLMs vary in their sensitivity to prompt wording. While our goal was to control this variable for consistency across models, future work should explore prompt variation or optimization techniques (e.g., few-shot prompting or instruction tuning) to assess model robustness and to verify whether leaderboard positions remain stable under different prompt formulations.

 

  1. How was the 0.95 cosine similarity threshold for deduplication justified (Section 3.1), and why was no sensitivity analysis conducted to assess its impact on class imbalance, especially for underrepresented categories like "Violence" (421 documents post-deduplication, Table 1)?

RESPONSE: Thank you for your comment regarding the justification for the cosine similarity threshold of 0.95 for deduplication and the lack of a sensitivity analysis.

As clarified in section 3.1, the threshold of 0.95 was selected based on manual inspection of borderline document pairs. We reviewed a sample of document pairs with cosine similarity values close to the threshold and confirmed that those exceeding 0.95 were nearly identical, often with differences only in minor metadata such as contact email addresses or marketplace URLs. This practical validation ensured that meaningful content was retained and redundancy was minimised.

Additionally, while a formal sensitivity analysis was not conducted due to resource limitations, we acknowledge this limitation in the discussion. However, Table 2 (titled “Category Distribution Before and After Deduplication”) provides a transparent account of the pre- and post-deduplication document counts across all categories, allowing readers to assess the impact. It is also clarified in the manuscript that the column “Pre-Deduplication Count” corresponds to the original distribution of the CoDA dataset prior to any processing.

 

  1. Given the deterministic querying with temperature=0 via APIs in March-April 2025 (Section 3.2), how does the study address potential non-reproducibility from API updates or model stochasticity in practical applications?

RESPONSE: We thank the reviewer for raising this important point regarding reproducibility and model behavior over time. In response, we have explicitly acknowledged this limitation in the updated version of the manuscript (see Section 6: Limitations). Although all models were queried deterministically (e.g., temperature = 0) through official APIs between March and April 2025, we recognize that API-based models are subject to periodic updates and underlying version changes, which can lead to non-reproducibility even when identical prompts and parameters are used.

This addition acknowledges that model outputs may change due to stochastic processes during inference or silent versioning updates, even with fixed parameters. By limiting the data collection period and fixing prompt templates and temperature, we aimed to minimize these effects. Nonetheless, we fully agree with the reviewer that long-term reproducibility remains a challenge in evaluating API-based LLMs, and we have made this clear in the manuscript.

 

  1. Why are differences in macro-averaged F1-scores (e.g., DeepSeek Chat at 87.01% vs. GPT-3.5 Turbo at 83.11%, Table 3) reported without statistical tests like McNemar's or bootstrapping to confirm if they are significant rather than random variations?

RESPONSE: We appreciate the reviewer’s insightful comment regarding the need for statistical validation when reporting differences in macro-averaged F1-scores across models. In response, we have addressed this concern by reporting 95% confidence intervals (CIs) for all evaluation metrics, including precision, recall, F1-score, and inter-rater agreement metrics (see Tables 6–11). These CIs were computed using nonparametric bootstrap resampling.

This approach enables the reader to visually and quantitatively assess the overlap (or lack thereof) between model performance scores and draw reasonable conclusions about their significance without relying solely on pairwise hypothesis tests such as McNemar’s. For example, the top-performing model, DeepSeek Chat, achieved an F1-score of 87.01% [95% CI: 86.23–87.80], while GPT-3.5 Turbo scored 83.11% [95% CI: 82.30–83.91], with non-overlapping intervals, supporting the claim of a statistically meaningful difference in macro-level performance.

While we recognize the value of additional tests such as McNemar’s for fine-grained comparisons on binary classification outcomes, our study deals with multi-class classification, and the bootstrapped CIs offer a more general and model-agnostic method for evaluating significance across a range of metrics and tasks.

We have clarified this point in the revised manuscript to better reflect the statistical rigor applied in the evaluation process.

 

  1. Although error patterns are briefly mentioned (Section 4), why is there no inclusion of confusion matrices or qualitative examples to analyze misclassifications, especially in lower-performing categories like "Violence" (F1-scores often <80%, Table 4)?

RESPONSE: We thank the reviewer for highlighting the importance of analyzing error patterns, particularly for lower-performing categories such as Violence, where F1-scores often fell below 80% (see updated Table 8). In response to this valuable suggestion, we have now incorporated both qualitative and quantitative analyses of misclassifications in the revised manuscript:

  • Qualitative Example (Table 5): We added Table 5 titled "Example of Misclassification by a Commercial LLM: Original vs. Predicted Category on a CoDA Document". This provides a concrete illustration of a real CoDA sample that was misclassified by GPT-3.5 Turbo, which erroneously labeled a document originally tagged as Others as Drugs. This example was chosen specifically to reflect the kind of lexical ambiguity and domain-specific noise that commonly leads to confusion between categories, especially in anonymized or obfuscated Dark Web content.
  • Confusion Matrix (Figure 1): Additionally, we have included a confusion matrix (Figure 1) for DeepSeek Chat, the top-performing model in our evaluation. This matrix provides a class-level breakdown of model predictions, clearly illustrating frequent misclassification patterns, including the overlap between categories like Violence, Drugs, and Others, which lack sharp lexical boundaries. The inclusion of this matrix allows for a visual and interpretable understanding of which classes are most commonly confused, and to what extent.

These additions now directly address the reviewer’s request and provide deeper insights into model behavior beyond aggregate scores, strengthening the interpretability and completeness of our evaluation framework. The relevant updates appear in Section 4 (Evaluation Protocol) and are referenced in Section 5.1 and 5.2 of the Discussion for interpretative analysis.

 

  1. Despite the sensitive nature of illicit Dark Web content, why are ethical considerations—such as risks of misclassification amplifying harm or biases in the CoDA dataset—omitted, and how was compliance with data handling policies ensured (Section 3.1)?

RESPONSE: We sincerely appreciate the reviewer’s observation regarding the importance of addressing the ethical implications of using and classifying illicit Dark Web content. In response, we have added a dedicated subsection (Section 5.4: Ethical Considerations and Dataset Bias) to explicitly reflect on these issues.

This new section discusses the following key aspects:

  1. Risk of Harm through Misclassification: We acknowledge that incorrect categorization—particularly false positives in sensitive classes like Violence or Extremism—could potentially lead to disproportionate actions or unjust consequences if applied in real-world moderation or investigative contexts. This is especially critical in automated systems. We emphasize that the models are intended for research purposes only and not for direct deployment without human oversight.
  2. Dataset Bias and Annotation Subjectivity: The CoDA dataset, while carefully constructed, lacks reported inter-annotator agreement, making it difficult to assess label consistency or potential subjective biases in edge cases. We now discuss this limitation and its implications for evaluating model alignment with human judgment, particularly in semantically ambiguous categories like Others or Financial.
  3. Data Handling and Compliance: The CoDA dataset contains only anonymized textual excerpts without any personally identifiable information (PII). Nevertheless, we note that we followed strict ethical handling procedures and ensured full compliance with the dataset's usage terms and general research policies related to Dual Use Research of Concern (DURC).

This added subsection reinforces the responsible framing of our study and ensures that the potential societal and operational implications of our findings are transparently addressed. The updated content now aligns with best practices in AI ethics, as also highlighted by prior work on model deployment in high-risk domains.

 

  1. The study reports high human alignment (weighted Cohen’s Kappa >0.84 for top models, Table 5), but why was the original CoDA dataset's inter-annotator agreement not validated or discussed, particularly for potential biases in ambiguous classes (Section 5.2)?

RESPONSE: We appreciate the reviewer’s important observation regarding the absence of inter-annotator agreement metrics for the CoDA dataset. As correctly noted, we do not include any empirical validation of inter-annotator agreement in our study, and we would like to clarify the reason behind this decision.

Unfortunately, the original CoDA dataset (Jin et al., 2022) does not publicly provide inter-annotator agreement scores (such as Cohen’s Kappa or Krippendorff’s Alpha) for its human-labeled categories. The dataset documentation does not report information about:

  • The number of annotators per document,
  • Any adjudication process in cases of disagreement,
  • Or the agreement levels for ambiguous categories.

Because this information is not available in the original publication or in the dataset release, we were unable to reproduce or verify inter-annotator consistency on the original labels, which limits our ability to assess potential human bias or inconsistency in category boundaries.

Nonetheless, we have now explicitly acknowledged this limitation in Section 5.2 of the revised manuscript. In particular, we note that the absence of inter-annotator metrics may obscure potential subjectivity in the labeling process—especially in semantically diffuse classes such as Others or Financial, where lexical ambiguity is common. We also discuss how this limitation may affect the interpretation of model alignment with human annotations.

We agree with the reviewer that future work would benefit from accessing or reconstructing such agreement data to more thoroughly understand human labeling behavior and its influence on downstream model evaluations.

 

 

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This paper presents a timely and well-structured comparative evaluation of eight commercial large language models (LLMs) in the context of zero-shot classification of illicit Dark Web content. The authors demonstrate strong methodological rigor, particularly in the use of a standardized prompt template, consistent evaluation metrics, and a comprehensive dataset (CoDA). However, the manuscript requires revision before it can be considered for publication.

  1. The paper lack a detailed error analysis or qualitative examples of misclassifications.
  2. The authors do not explore how prompt variations might affect model performance.
  3. The authors claim that models like Grok and DeepSeek Chat outperform GPT-4o in some categories, but do not statistically validate these differences or provide confidence intervals.
  4. The study does not address the potential impact of model size, training data, or architecture on performance, which limits the interpretability of the comparative results.
  5. The dataset impact on class distribution and model performance is not analyzed in detail.
  6. The reproducibility of the study could be improved by providing access to the Python scripts.
  7. The libraries used for evaluation (scikit-learn and statsmodels) should be versioned to ensure reproducibility and compatibility.
  8. The paper does not report basic statistics about the dataset such as average document length.
  9. The results are not compared with prior work using the same CoDA dataset, which limits the reader’s ability to assess progress over existing benchmarks.
  10. The attempt to explain why certain models performed better is insufficient and lacks depth, especially given the architectural and training differences among the models.

Author Response

RESPONSE LETTER

RESPONSE: We would like to sincerely thank the Editor and the reviewers for their constructive and detailed comments on our manuscript. Their thoughtful feedback across multiple rounds of review has been extremely valuable in helping us improve both the rigor and the clarity of the study. In what follows, we provide a point-by-point response to all the comments raised, describing the changes made in the revised version of the manuscript.

 

RESPONSE TO REVIEWER #3

RESPONSE: We would like to express our gratitude to reviewer No. 3 for the careful and thorough evaluation of our work. Thanks to their detailed comments and suggestions, we have been able to substantially improve the quality, accuracy, and scientific rigour of the manuscript. Below, we address each point in detail and explain the modifications made in this revision.

 

  1. The paper lack a detailed error analysis or qualitative examples of misclassifications.

RESPONSE: We thank the reviewer for pointing out the importance of including more detailed error analysis and qualitative examples. In response to this valuable suggestion, we have incorporated a new table (Table 5) in the revised manuscript, specifically addressing this gap.

  • Table 5, titled Example of Misclassification by a Commercial LLM: Original vs. Predicted Category on a CoDA Document, presents a real-world case where GPT-3.5 Turbo misclassified a document originally labeled as Others into the Drugs category.
  • The document is a complex, multilingual snippet involving obfuscated language and marketplace references — highlighting the kinds of lexical and contextual ambiguities that often challenge zero-shot classification.
  • This example supports the discussion in Section 5.1 and Section 5.2, where we analyze how models, especially under zero-shot prompting conditions, are prone to conservative or overgeneralized category assignments.

This qualitative addition not only provides insight into model behavior but also strengthens our error analysis by illustrating the challenges posed by semantic noise and overlap in categories from the CoDA dataset.

We appreciate the reviewer’s suggestion, which has helped improve the interpretability and depth of the discussion around model limitations.

 

  1. The authors do not explore how prompt variations might affect model performance.

RESPONSE: We appreciate the reviewer’s insightful comment regarding the role of prompt formulation in shaping model performance. In the revised version of the manuscript, we have explicitly acknowledged this point in the Limitations section.

As explained in the updated discussion, our evaluation employed a single standardized prompt, derived directly from the original CoDA dataset’s category definitions—the same descriptions used by human annotators during dataset construction. This design choice was intentional, as it ensured a fair and direct comparison between model outputs and the human-labeled ground truth, under zero-shot prompting conditions.

However, we fully recognize that different prompt phrasings or alternative formulations could influence model behavior and potentially reorder the performance rankings. As such, we now note this in the Limitations (Section 5.5), emphasizing that the results may be partially prompt-sensitive, and that future research should explore prompt engineering or prompt ensembling strategies to test the robustness of model outputs across varied instructions.

This revision strengthens the transparency of our methodology and clearly frames the limitations associated with relying on a single prompt formulation. We thank the reviewer again for highlighting this important consideration.

 

  1. The authors claim that models like Grok and DeepSeek Chat outperform GPT-4o in some categories, but do not statistically validate these differences or provide confidence intervals.

RESPONSE: We thank the reviewer for highlighting the importance of statistically validating model performance differences.

In response, we have added 95% confidence intervals (CIs) to all reported performance metrics across tables, including macro-averaged F1-scores and per-category results. These intervals were computed using nonparametric bootstrap resampling (1,000 iterations) to ensure robust statistical estimation and enable more reliable cross-model comparisons.

This information has been integrated throughout the results section (see Tables 4, 6, 8, and 9), allowing readers to visually and quantitatively assess whether the differences between models, including those between Grok, DeepSeek Chat, and GPT-4o, are likely statistically meaningful or fall within overlapping intervals.

This revision directly addresses the concern and strengthens the statistical rigor of the paper. We are grateful for the suggestion.

 

  1. The study does not address the potential impact of model size, training data, or architecture on performance, which limits the interpretability of the comparative results.

RESPONSE: We appreciate the reviewer’s insightful observation regarding the need to account for model size, architecture, and training data in interpreting comparative results.

To address this point, we have incorporated a new comparative table (Table 3) that summarizes key architectural and operational characteristics for each evaluated LLM. Specifically, Table 3 includes:

  • Model architecture type (e.g., Transformer, Mixture of Experts, Multimodal),
  • Estimated number of parameters,
  • Training data sources (e.g., web-scale, code, multimodal),
  • API provider,
  • Cost per 1K input/output tokens, and
  • Estimated inference latency.

This addition provides essential context for interpreting model performance and reinforces the discussion around trade-offs between performance, accessibility, and computational efficiency (see revised Section 3.2 and Discussion 5.3).

By integrating these characteristics, the reader can better understand how factors like architecture (e.g., GPT-4o’s multimodality), scale (e.g., 236B parameters in DeepSeek Reasoner), and training philosophy may influence outcomes beyond raw F1-scores.

We thank the reviewer for highlighting this important point, which has helped strengthen the interpretability and completeness of our comparative analysis.

 

  1. The dataset impact on class distribution and model performance is not analyzed in detail.

RESPONSE: We thank the reviewer for pointing out the importance of examining the dataset’s influence on class distribution and model performance.

In the revised manuscript, we have addressed this concern through several key additions:

  • Section 3.1 now explicitly details the class distribution in the CoDA dataset (n = 10,000 documents), supported by Table 1, which presents the number of documents per category after deduplication. This allows readers to understand the degree of class imbalance, particularly for underrepresented categories such as Violence (n = 421), Electronics (n = 488), and Extremism (n = 470).
  • We further assess the impact of class distribution on model performance in Section 4.2, where Table 8 provides per-class F1-scores across all models. This table clearly shows that performance tends to be lower in categories with fewer samples and greater semantic overlap, such as Violence, Electronics, and Financial, confirming a correlation between class imbalance and reduced classification accuracy.
  • Additionally, the Discussion section (5.1) elaborates on these findings, noting that models consistently achieve lower F1-scores in minority or lexically ambiguous categories, reflecting both dataset-induced imbalance and task-specific difficulty.
  • Finally, to support deeper understanding, we included qualitative analysis and misclassification examples in Table 5, and a confusion matrix for the top-performing model (Figure 1), both of which highlight typical errors and class confusion patterns exacerbated by class imbalance.

These additions comprehensively address the reviewer’s concern by linking dataset properties to model outcomes and reinforcing the need for nuanced evaluation beyond aggregate metrics.

 

  1. The reproducibility of the study could be improved by providing access to the Python scripts.

RESPONSE: We appreciate the reviewer’s suggestion regarding reproducibility.

To address this, we have included the full Python script used for interacting with the evaluated commercial LLMs in Appendix A of the revised manuscript. This script reproduces the exact classification pipeline described in Section 3.2 and supports all tested models under zero-shot prompting conditions.

The script is written in a modular and provider-agnostic way, allowing researchers to specify their API keys, model names, and endpoints to replicate the same process across any supported platform (OpenAI, Google, Anthropic, xAI, etc.). We also provide template prompts, input preprocessing logic, and result parsing instructions, ensuring end-to-end reproducibility of our evaluation setup.

This inclusion enhances the transparency and replicability of our experiments, and supports broader adoption of the evaluation protocol presented in this study.

 

  1. The libraries used for evaluation (scikit-learn and statsmodels) should be versioned to ensure reproducibility and compatibility.

RESPONSE: We thank the reviewer for pointing out the importance of library versioning to ensure full reproducibility.

To address this, we have explicitly listed the Python version (3.10) and the exact versions of all key libraries used in the evaluation pipeline within Appendix A of the revised manuscript. Specifically, the appendix now details:

  • pandas = 2.2.1 – used for reading, preprocessing, and saving datasets,
  • openai = 1.3.5 – employed for API communication, including compatibility with DeepSeek's OpenAI-compatible endpoint,
  • time – standard Python library for managing optional request delays.

Additionally, the code uses scikit-learn and statsmodels internally for metric calculations and statistical analysis, and their versions have been explicitly included to ensure compatibility and replicability across environments.

These additions provide a clearer specification of the computational environment and facilitate the reproducibility of results by third-party researchers.

 

  1. The paper does not report basic statistics about the dataset such as average document length.

RESPONSE: Thank you for your valuable comment. In response, the final version of the manuscript now includes comprehensive dataset statistics to improve transparency and reproducibility.

Specifically:

  • Table 1 has been added to provide a detailed breakdown of the language distribution within the CoDA dataset, covering over 40 languages. This helps readers understand the dataset's multilingual composition, which may influence classification performance.
  • Table 2 has been substantially expanded and updated to include document counts and average character lengths per category, both before and after deduplication using TF-IDF and cosine similarity. This table offers an overview of how text length varies across categories and how deduplication affected the dataset size and content structure. Importantly, the column "Pre-Deduplication Count" refers to the original document totals in CoDA prior to applying any duplicate filtering, ensuring traceability.

These additions address your concern by offering meaningful quantitative insights into the dataset's size, structure, and language diversity, key aspects for understanding the models’ performance under zero-shot classification conditions.

 

  1. The results are not compared with prior work using the same CoDA dataset, which limits the reader’s ability to assess progress over existing benchmarks.

RESPONSE: We appreciate the reviewer’s suggestion to include direct comparisons with existing benchmarks using the CoDA dataset.

In response, the revised manuscript now includes Table 12, which provides a quantitative comparison between traditional supervised models (namely SVM, CNN, and fine-tuned BERT) as reported in prior studies using CoDA, and the best-performing commercial LLM from our zero-shot evaluation (DeepSeek Chat). This table has been added in Section 4 (Results) to facilitate a more meaningful assessment of the progress achieved by recent LLMs over previously established baselines.

Specifically:

  • The table reports macro-averaged F1-scores from earlier supervised experiments (as cited in the original CoDA paper and related literature), alongside the zero-shot performance of DeepSeek Chat using our unified prompt strategy.
  • This addition bridges the gap between past benchmark efforts and our current LLM-centric evaluation, allowing readers to appreciate both the relative improvement and the robustness of state-of-the-art models even in the absence of task-specific fine-tuning.

This update strengthens the study’s contribution by situating our findings within the broader landscape of CoDA-based research.

 

  1. The attempt to explain why certain models performed better is insufficient and lacks depth, especially given the architectural and training differences among the models.

RESPONSE: Thank you for pointing out the need for a more in-depth explanation of why certain models outperformed others.

In response, we have added Table 3 to the manuscript, which presents a detailed comparison of the commercial LLMs evaluated, including key architectural and operational characteristics such as:

  • Architecture Type (e.g., Transformer-based decoder or encoder-decoder),
  • Number of Parameters,
  • Training Data Size and Type (when publicly available),
  • API Provider,
  • Cost per 1K Tokens (USD) for input and output,
  • Inference Latency.

This expanded technical context has been integrated into the Discussion (Section 5.1) to provide a clearer rationale for observed performance trends. For instance:

  • DeepSeek Chat’s superior macro-F1 score is now discussed in relation to its balance of parameter count, training scale, and token-level alignment.
  • Performance differences between GPT-4o and its mini variant are framed in light of model size, inference latency, and alignment strategies.
  • The risk-aversion behavior observed in Gemini and GPT models is interpreted in part through their conservative prompting alignment tuning, as explained in Section 5.2.

These enhancements aim to offer a deeper and more structured interpretation of the performance variation across models, addressing the reviewer’s concern.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

  

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have comprehensively addressed all comments, significantly enhancing the manuscript's statistical rigor, reproducibility, and contextual analysis through the proposed additions.

Back to TopTop